fix: error on CREATE EXTERNAL TABLE with no files and no explicit schema #21965

Open
adriangb wants to merge 2 commits into apache:main from pydantic:empty-dir-schema-error

@adriangb adriangb commented Apr 30, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

When you point CREATE EXTERNAL TABLE at an empty directory (or one that does not exist yet) without specifying an explicit column list, DataFusion silently creates a table with 0 columns. Any query against that table then fails with a confusing "column not found" / "no such column" error that gives no hint that the underlying issue is actually that schema inference had nothing to look at.

This is the same root cause as the discussion on #21806 (comment) — that thread covered it from the angle of benchmark runners hitting it, but the confusion is not specific to benchmarks. Failing at CREATE EXTERNAL TABLE time with a clear, actionable message seemed like the right fix overall.

What changes are included in this PR?

ListingOptions::infer_schema now returns a Plan error when the location yields no files (after the existing 0-byte filter), telling the user to either add data files or declare an explicit schema:

Error during planning: No files found at file:///tmp/empty_dir/. Cannot infer schema from an empty location; either add data files or declare an explicit schema for the table.

Pre-declaring an empty table with an explicit schema (e.g. CREATE EXTERNAL TABLE t(x int) STORED AS PARQUET LOCATION '...' for later INSERT) still works — the inference path is only triggered when no schema is provided.
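
To make the control flow above concrete, here is a minimal sketch in plain Rust. It is not DataFusion code: resolve_schema, TableSchema, and the placeholder inference step are hypothetical names invented for illustration, and only the error text is taken from this description. It assumes the behavior described here, namely that inference only runs when no column list is given and that an empty file listing is rejected.

```rust
// Hypothetical, simplified model of the behavior described above.
// None of these names are DataFusion APIs.

#[derive(Debug, PartialEq)]
enum TableSchema {
    /// Columns declared by the user in CREATE EXTERNAL TABLE t(...).
    Explicit(Vec<String>),
    /// Columns derived from the files found at the location.
    Inferred(Vec<String>),
}

/// Resolve a table schema from an optional column list and the listed file paths.
fn resolve_schema(
    location: &str,
    declared_columns: Option<Vec<String>>,
    listed_files: &[&str],
) -> Result<TableSchema, String> {
    // An explicit column list skips inference entirely,
    // so an empty location is acceptable in that case.
    if let Some(columns) = declared_columns {
        return Ok(TableSchema::Explicit(columns));
    }

    // No schema and nothing to infer from: fail early with an actionable message.
    if listed_files.is_empty() {
        return Err(format!(
            "Error during planning: No files found at {}. Cannot infer schema \
             from an empty location; either add data files or declare an \
             explicit schema for the table.",
            location
        ));
    }

    // Stand-in for real inference: pretend every file has a single column "x".
    Ok(TableSchema::Inferred(vec!["x".to_string()]))
}
```

In this model, calling resolve_schema("file:///tmp/empty_dir/", None, &[]) yields the planning error, while passing Some(vec!["x".to_string()]) for the same empty location succeeds.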

Are these changes tested?

Yes. New cases in datafusion/sqllogictest/test_files/ddl.slt cover:

  • Parquet, CSV, and JSON over an empty location without an explicit schema → all return the new Plan error.
  • An empty location with an explicit schema → still works and queries cleanly.
  • Schema inference still succeeds once files exist at the location, so the new check does not regress the happy path.

Are there any user-facing changes?

Yes — CREATE EXTERNAL TABLE ... LOCATION '<empty-dir>' without an explicit schema now errors at planning time instead of creating a 0-column table. Anyone relying on the previous behavior must add an explicit schema declaration. The error message tells them how.

Use of AI

This code was written fully by AI. @adriangb gave it a detailed plan and reviewed the code by hand once this PR was opened and CI was green.

Pointing CREATE EXTERNAL TABLE at an empty (or non-existent) location
without an explicit column list previously produced a 0-column table.
Subsequent queries against that table failed with a confusing
"column not found" error far from the real cause.

Now ListingOptions::infer_schema returns a clear Plan error when the
location yields no files, instructing the user to either add data files
or declare an explicit schema. The existing behavior of pre-declaring
an empty table with an explicit schema (for later INSERT) still works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and catalog (Related to the catalog crate) labels on Apr 30, 2026
Narrows the schema-inference error to the case the user actually
encounters confusion in: an empty or non-existent directory that
returns zero files from list_all_files. Locations that contain files
which all happen to be 0-byte continue to produce an empty inferred
schema as before, preserving the "0-byte files don't crash reads"
behavior that several existing tests depend on.

Also updates a few tests in datafusion/core that previously relied on
empty fixture directories producing a 0-column table:

- listing_table_factory tests now write a 0-byte placeholder file
  matching the format extension so the glob/extension assertions still
  exercise the inference code path.
- read_dummy_folder and the empty-folder branch of
  read_from_different_file_extension now assert the new error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the core (Core DataFusion crate) label on Apr 30, 2026
adriangb marked this pull request as ready for review on April 30, 2026 23:58
@adriangb
Contributor Author

Related context: #21806 (comment) — that thread surfaced the same root cause from the angle of benchmark runners hitting it. The reasoning for fixing it here at the planning layer rather than in the runner is that the confusion isn't specific to benchmarks, so erroring on CREATE EXTERNAL TABLE over an empty location seemed better as a general fix.

adriangb requested a review from alamb on May 1, 2026 01:55
Comment on lines -286 to -287
// Empty files cannot affect schema but may throw when trying to read for it
.try_filter(|object_meta| future::ready(object_meta.size > 0))
Contributor Author

Removing the filter here just means we keep the ObjectMeta of zero-sized files in memory until a couple of lines later. I don't think that is a big problem.

The alternative is to error even when files are present but all of them are 0 bytes. I think that's an interesting discussion: for example, a completely empty data.csv, or hive-partitioned directories with no data. I think all of these should still require an explicit schema or error, but there are tests that check the opposite behavior:

  • test_csv_empty_file — registers tests/data/empty_0_byte.csv (0 bytes, no header, no data) and runs SELECT * FROM empty.
  • test_csv_multiple_empty_files — folder of 0-byte CSVs. Same situation.
  • it_can_read_empty_ndjson — 0-byte JSON file. Same.
  • test_read_empty_parquet — 0-byte parquet file. Same.
  • test_read_partitioned_empty_parquet — partition dir with a 0-byte parquet.
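
The two policies weighed in this comment can be sketched as follows. This is a hypothetical, std-only model, not the actual infer_schema code: FileMeta stands in for ObjectMeta, and EmptyPolicy and files_for_inference are made-up names. It assumes the emptiness check runs before the 0-byte filter, as the second commit describes.

```rust
// Hypothetical model; names and types are illustrative, not DataFusion APIs.

/// Minimal stand-in for an object-store listing entry.
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    path: String,
    size: u64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum EmptyPolicy {
    /// Error only when the listing itself returns nothing (what this PR does).
    ErrorOnNoFiles,
    /// Also error when every listed file is 0 bytes (the alternative discussed).
    ErrorOnNoData,
}

/// Returns the files that would be read for schema inference, or an error.
fn files_for_inference(
    listed: Vec<FileMeta>,
    policy: EmptyPolicy,
) -> Result<Vec<FileMeta>, String> {
    // Check emptiness on the raw listing, before any size filtering.
    if listed.is_empty() {
        return Err("no files found; cannot infer schema".to_string());
    }

    // 0-byte files cannot contribute to the schema, so drop them afterwards.
    let non_empty: Vec<FileMeta> = listed.into_iter().filter(|f| f.size > 0).collect();

    // Under the stricter policy, a location holding only 0-byte files is also an error.
    if non_empty.is_empty() && policy == EmptyPolicy::ErrorOnNoData {
        return Err("only 0-byte files found; cannot infer schema".to_string());
    }

    Ok(non_empty)
}
```

Under ErrorOnNoFiles, a directory containing a single 0-byte data.csv passes through with nothing left to infer from, which is the situation the tests listed above rely on. Under ErrorOnNoData the same input is rejected.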

Labels

  • catalog: Related to the catalog crate
  • core: Core DataFusion crate
  • sqllogictest: SQL Logic Tests (.slt)
