
Parser fixes from corpus loop: Snowflake / BigQuery / Redshift / Athena / ClickHouse #81

Merged
lustefaniak merged 36 commits into main from lukasz-corpus-fixes-new-customer-sql on May 6, 2026
Conversation

@lustefaniak (Collaborator) commented May 5, 2026

Parser fixes uncovered by iterating the kernel-cll corpus. Every commit is dialect-grounded with a docs reference and zero corpus regressions; ~98.7% pass rate across ~172k SQL samples (snowflake 99.8%, redshift 99.6%, bigquery 99.5%, athena 97.4%).

Snowflake

  • IDENTIFIER('<name>') literal in name positions (docs) — CREATE TABLE IDENTIFIER('db.schema.t'), FROM IDENTIFIER('mytable'). Detected at the start of parse_identifier.
  • Session variables / bind params inside IDENTIFIER(…) — IDENTIFIER($var), IDENTIFIER(?).
  • INTERVAL as identifier before binary/clause keywords (INTERVAL BETWEEN …, … ORDER BY INTERVAL, JOIN … ON INTERVAL = x).
  • CREATE STAGE option grammar — FILE_FORMAT = (TYPE = …) shorthand, dotted-ident values, CREDENTIALS = (…) after FILE_FORMAT.
  • CREATE SCHEMA … CLONE source [AT|BEFORE (…)].
  • CREATE SEQUENCE … COMMENT='…' option.
  • CREATE EXTERNAL TABLE … PARTITION BY (cols).
  • DATE_PART(<part> FROM <expr>) ANSI form.
  • Inline FOREIGN KEY REFERENCES column constraint.
  • Dollar-quoted strings for column COMMENT.
  • TABLESAMPLE after FROM TABLE(<expr>) reference.

BigQuery

  • Digit-prefixed path segments (path expressions docs) — foo.bar.25ab. Tokenizer greedily folds leading . into a Number; parse_object_name peels it back off without mutating self.tokens.
  • Legacy SQL [project-id:dataset.table] table references.
  • FOR SYSTEM_TIME AS OF after table alias.
  • [NOT] DETERMINISTIC marker in CREATE FUNCTION body.
  • Set-op suffixes (CORRESPONDING / STRICT / ON (cols)) in any order.
  • Double-quoted string after AT TIME ZONE.

Redshift

  • Oracle/Snowflake (+) outer-join marker.
  • DISTSTYLE / DISTKEY / SORTKEY in any order on CREATE TABLE.
  • GENERATED AS IDENTITY (seed, step) two-arg shorthand.

Hive / Athena

  • Athena → HiveDialect routing in the corpus runner (Athena's DDL uses Hive grammar; its queries are Trino-style).
  • Iceberg-style expression PARTITIONED BY (PARTITIONED BY (bucket(16, x), days(ts))).
  • WITH SERDEPROPERTIES (…) and DELIMITED suboptions.
  • Table-level COMMENT and CLUSTERED BY clauses.

ClickHouse

  • [GLOBAL] [LEFT|RIGHT|INNER] [ANY|ASOF|ALL] JOIN (JOIN docs) — modifier parsed and discarded; the lineage shape matches the unmodified join.
  • ON CLUSTER clause in DELETE FROM tbl ON CLUSTER … WHERE … (distributed-DDL docs).

MySQL

  • REPLACE [INTO] statement (docs) — same shape as INSERT.

DuckDB

  • USING SAMPLE clause (samples docs) — tbl USING SAMPLE 10%, tbl USING SAMPLE SYSTEM (10 PERCENT) REPEATABLE (377). Triggered only when USING is followed by bare SAMPLE, so JOIN's USING (cols) is unaffected.

T-SQL / MSSQL

  • System-versioned temporal table column markers (docs) — GENERATED ALWAYS AS ROW {START|END} [HIDDEN], PERIOD FOR SYSTEM_TIME (start_col, end_col).

Cross-dialect

  • Adjacent string literal concatenation (ANSI SQL §5.3, BigQuery, Snowflake, Postgres) — 'foo' 'bar' → 'foobar'. Real customer SQL relies on this.
  • JSON_TABLE / XMLTABLE COLUMNS(...) clause (MySQL JSON_TABLE) — consumed opaquely (output column shapes carry no input refs).
  • WITH [NO] DATA [AND [NO] STATISTICS] on CREATE TABLE AS (Postgres docs, Teradata docs).
  • Reserved keyword followed by ) treated as column name, not trailing-comma terminator.
  • CORRESPONDING [BY (cols)] and STRICT set-op modifiers.
  • Teradata column-level attributes — FORMAT, TITLE, COMPRESS, INLINE LENGTH.

Corpus-side improvements (kernel-cll-corpus)

The anonymizer pipeline produced fragments that no parser could accept; fixes landed at the source rather than as fragile parser carve-outs:

  • Triple-quoted string handling, nested block comment depth tracking.
  • Tighter number regex (\d+(?:\.\d+)?) — was eating trailing . in proj.NNN.dataset.
  • Paren/bracket balance check.
  • New keywords: TARGET, JS, PYTHON, IDENTIFIER, DIMENSIONS, METRICS, FACTS, SEMANTIC_VIEW.
  • Drop `IF cond THEN` SQLs without `END IF` (truncated procedure-body fragments).
  • Drop query-log SQLs that end mid-clause — trailing `,`, `(`, `=`, clause keywords (SELECT/FROM/BY/AS/…), or unclosed CASE. Removed ~4k Redshift truncations.
  • Drop `'s'<word>` anonymizer corruption — when the anonymizer's regex misaligns on a token boundary inside a string literal (e.g. `INTERVAL '1 HOUR'` → `'s'HOUR`), the resulting SQL is unparseable. The pattern is unique to anonymizer output, with no false positives on hand-written fixtures.
  • _INTERNAL_QUERY_MARKERS filter excludes warehouse-internal /* DS_SVC */ queries.

Tests

Each parser commit ships a unit test in the appropriate dialect file. Full suite passes (cargo nextest run --all-features). Latest CI: Corpus / Check / Test Suite all green.

Snowflake's IDENTIFIER literal lets a string stand in for any identifier
(CREATE TABLE IDENTIFIER('db.schema.t'), FROM IDENTIFIER('mytable'),
INSERT INTO IDENTIFIER('foo.bar')). At the start of parse_identifier,
detect the IDENTIFIER(<string>) shape and consume the whole construct,
returning the string content as a single quoted Ident. The dotted name
inside the string is preserved verbatim — Snowflake itself splits at
execution time, and downstream lineage consumers can do the same.

Reference: https://docs.snowflake.com/en/sql-reference/identifier-literal

Fixes 581 corpus test failures (Snowflake).
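For reference, a minimal standalone sketch of the detection shape (the `Tok` enum and `try_identifier_literal` are illustrative stand-ins, not the crate's real token or parser types):

```rust
// Sketch only: a simplified token stream standing in for the real parser.
#[derive(Debug, PartialEq)]
enum Tok {
    Word(String),
    Str(String), // single-quoted string literal
    LParen,
    RParen,
}

/// If the stream starts with IDENTIFIER('<name>'), consume all four tokens
/// and return the string content as the identifier; otherwise leave the
/// index untouched so the normal identifier path runs.
fn try_identifier_literal(toks: &[Tok], i: &mut usize) -> Option<String> {
    match toks.get(*i..*i + 4)? {
        [Tok::Word(w), Tok::LParen, Tok::Str(name), Tok::RParen]
            if w.eq_ignore_ascii_case("IDENTIFIER") =>
        {
            *i += 4; // consume the whole IDENTIFIER('…') construct
            Some(name.clone()) // dotted content preserved verbatim
        }
        _ => None,
    }
}

fn main() {
    let toks = vec![
        Tok::Word("IDENTIFIER".into()),
        Tok::LParen,
        Tok::Str("db.schema.t".into()),
        Tok::RParen,
    ];
    let mut i = 0;
    assert_eq!(try_identifier_literal(&toks, &mut i).as_deref(), Some("db.schema.t"));
    assert_eq!(i, 4);
}
```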
github-actions Bot commented May 5, 2026

Corpus Parsing Report

Total: 169400 passed, 2187 failed (98.7% pass rate)

✨ No changes in test results

By Dialect

| Dialect | Passed | Failed | Total | Pass Rate | Delta |
|---|---|---|---|---|---|
| ansi | 511 | 69 | 580 | 88.1% | +6 |
| athena | 37 | 1 | 38 | 97.4% | +8 |
| bigquery | 36619 | 172 | 36791 | 99.5% | +122 |
| clickhouse | 2488 | 109 | 2597 | 95.8% | +7 |
| databricks | 2800 | 214 | 3014 | 92.9% | +2 |
| doris | 22 | 18 | 40 | 55.0% | - |
| dremio | 27 | 0 | 27 | 100.0% | - |
| duckdb | 1111 | 45 | 1156 | 96.1% | +9 |
| exasol | 54 | 7 | 61 | 88.5% | - |
| fabric | 6 | 0 | 6 | 100.0% | - |
| generic | 17 | 38 | 55 | 30.9% | - |
| hive | 35 | 10 | 45 | 77.8% | +1 |
| materialize | 6 | 14 | 20 | 30.0% | - |
| mssql | 2301 | 482 | 2783 | 82.7% | - |
| mysql | 148 | 37 | 185 | 80.0% | +2 |
| oracle | 1025 | 380 | 1405 | 73.0% | +12 |
| postgres | 1180 | 116 | 1296 | 91.0% | +5 |
| presto | 55 | 8 | 63 | 87.3% | - |
| redshift | 34360 | 141 | 34501 | 99.6% | +25 |
| singlestore | 141 | 9 | 150 | 94.0% | - |
| snowflake | 85482 | 143 | 85625 | 99.8% | +613 |
| spark | 90 | 20 | 110 | 81.8% | - |
| sqlite | 51 | 16 | 67 | 76.1% | - |
| starrocks | 29 | 4 | 33 | 87.9% | - |
| teradata | 23 | 20 | 43 | 53.5% | +3 |
| trino | 617 | 80 | 697 | 88.5% | +5 |
| tsql | 165 | 34 | 199 | 82.9% | +6 |

…words

`parse_interval_guard` previously rejected only `LIKE` / `IS` after
INTERVAL, then fell through to a probing `parse_interval()` whose
internal `parse_prefix` is permissive enough to treat any bare keyword
as an identifier. So `INTERVAL BETWEEN 1 AND 2`, `PARTITION BY a,
INTERVAL ORDER BY c`, `MAX(INTERVAL)` etc. all misconsumed the
following keyword as the literal's "value" and broke the surrounding
clause with a downstream error like "Expected ), found: id_5".

Extend the guard's reject list to cover the keywords that can never
plausibly start an interval literal value: binary operators (BETWEEN,
AND, OR, XOR, IN, NOT, ILIKE), clause starters (ORDER, GROUP, HAVING,
WHERE, LIMIT, OFFSET, QUALIFY, WINDOW, UNION, INTERSECT, EXCEPT),
window-frame & sort tokens (ROWS, RANGE, GROUPS, ASC, DESC), and join
conditions (ON, USING).

Snowflake accepts INTERVAL as a column name; this fix only changes
behaviour when INTERVAL appears in a position where the literal form
is impossible.

Fixes 1 corpus test failure (Snowflake) and unblocks downstream
parsing of larger queries that hit the pattern.
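A small sketch of the extended guard (the keyword list mirrors the commit text, including the pre-existing LIKE / IS; the function name is illustrative):

```rust
// Keywords that can never plausibly start an interval literal's value.
const INTERVAL_VALUE_BLOCKLIST: &[&str] = &[
    "LIKE", "IS", // pre-existing rejects
    "BETWEEN", "AND", "OR", "XOR", "IN", "NOT", "ILIKE", // binary operators
    "ORDER", "GROUP", "HAVING", "WHERE", "LIMIT", "OFFSET", // clause starters
    "QUALIFY", "WINDOW", "UNION", "INTERSECT", "EXCEPT",
    "ROWS", "RANGE", "GROUPS", "ASC", "DESC", // window-frame & sort tokens
    "ON", "USING", // join conditions
];

/// true → treat INTERVAL as an identifier; false → probe parse_interval().
fn next_word_blocks_interval_literal(next_word: &str) -> bool {
    INTERVAL_VALUE_BLOCKLIST
        .iter()
        .any(|kw| kw.eq_ignore_ascii_case(next_word))
}

fn main() {
    assert!(next_word_blocks_interval_literal("between")); // INTERVAL BETWEEN 1 AND 2
    assert!(!next_word_blocks_interval_literal("MINUTE")); // still a literal candidate
}
```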
@lustefaniak changed the title from "fix(snowflake): parse IDENTIFIER('<name>') literal" to "snowflake: identifier literal + INTERVAL-as-column carve-outs" on May 5, 2026
ANSI SQL, BigQuery, Postgres, and Snowflake all concatenate adjacent
string literals separated by whitespace into a single literal:

  SELECT 'foo' 'bar'              -- 'foobar'
  SELECT * FROM t WHERE x IN ('a', 'b' 'c', 'd')   -- 'bc' is one item
  SELECT TRIM('xyz' 'a')          -- TRIM('xyza')

Real customer SQL relies on this — typically as a forgotten comma in an
IN list — and the queries still execute correctly because the warehouse
implements concatenation. Previously the parser rejected the second and
third forms above with "Expected ), found: '<next-string>'".

Implementation: after consuming a `Token::SingleQuotedString` in
`parse_value`, peek-and-consume any immediately-following single-quoted
string tokens, appending their content. The output is a single
`Value::SingleQuotedString` with the concatenated value.

References:
- BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#string_and_bytes_literals
  > "Adjacent string and bytes literals are concatenated."
- Snowflake: https://docs.snowflake.com/en/sql-reference/data-types-text#string-constants
- Postgres: https://www.postgresql.org/docs/current/sql-syntax-lexical.html
- ANSI SQL:2008 §5.3 <character string literal>

Updates `test_snowflake_trim` which previously asserted
`TRIM('xyz' 'a')` errored — that was enforcing pre-ANSI behaviour.

Fixes 128 corpus test failures.
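A minimal sketch of the peek-and-consume loop (simplified token type; the real change lives in `parse_value` on the crate's `Token` enum):

```rust
#[derive(Debug)]
enum Tok {
    Str(String), // single-quoted string literal
    Comma,
}

/// Consume one string literal plus any immediately-following string
/// literals, returning the ANSI-concatenated value.
fn parse_string_value(toks: &[Tok], i: &mut usize) -> Option<String> {
    let Tok::Str(first) = toks.get(*i)? else { return None };
    *i += 1;
    let mut out = first.clone();
    // Adjacent-literal concatenation: 'foo' 'bar' → 'foobar'
    while let Some(Tok::Str(next)) = toks.get(*i) {
        out.push_str(next);
        *i += 1;
    }
    Some(out)
}

fn main() {
    // IN ('a', 'b' 'c', 'd'): starting at 'b', the item parses as 'bc'.
    let toks = vec![Tok::Str("b".into()), Tok::Str("c".into()), Tok::Comma];
    let mut i = 0;
    assert_eq!(parse_string_value(&toks, &mut i).as_deref(), Some("bc"));
    assert_eq!(i, 2); // stopped at the comma
}
```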
@lustefaniak changed the title from "snowflake: identifier literal + INTERVAL-as-column carve-outs" to "snowflake: identifier literal, INTERVAL carve-outs, adjacent-string concat" on May 5, 2026
DuckDB's `USING SAMPLE` attaches a row-sampling spec to a table:

  SELECT * FROM tbl USING SAMPLE 10%
  SELECT * FROM tbl USING SAMPLE 10 ROWS
  SELECT * FROM tbl USING SAMPLE SYSTEM (10 PERCENT) REPEATABLE (377)
  SELECT * FROM tbl USING SAMPLE RESERVOIR (50 ROWS) REPEATABLE (100)
  SELECT * FROM tbl USING SAMPLE BERNOULLI (5 PERCENT)

Reference: https://duckdb.org/docs/sql/samples

Previously the parser saw `USING` and treated it as the start of a
JOIN's `USING (col, ...)` constraint — which then failed because
`SAMPLE` isn't an opening paren.

The clause carries no lineage content (no table/column refs inside),
so consume it opaquely: optional method keyword, then the sample size
(bare number+unit or parenthesised group), then optional REPEATABLE
seed. Only triggers when USING is followed by `SAMPLE` (case-insensitive
ident match), so JOIN's `USING (cols)` is unaffected.

Gated on `DuckDbDialect | GenericDialect`.

Fixes 10 corpus test failures (sqlglot DuckDB fixtures).
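The trigger condition is the load-bearing part; a tiny sketch of it (hypothetical helper name, simplified lookahead):

```rust
/// USING starts a sample clause only when the next word is SAMPLE;
/// otherwise it is a JOIN's USING (col, ...) constraint.
fn using_starts_sample_clause(next_word: Option<&str>) -> bool {
    matches!(next_word, Some(w) if w.eq_ignore_ascii_case("SAMPLE"))
}

fn main() {
    assert!(using_starts_sample_clause(Some("sample"))); // tbl USING SAMPLE 10%
    assert!(!using_starts_sample_clause(Some("(")));     // JOIN t2 USING (id)
}
```

Once the trigger fires, the rest of the clause (method keyword, size, REPEATABLE seed) is consumed opaquely as described above.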
MySQL's REPLACE statement is INSERT-with-replace semantics: delete an
existing row on primary-key conflict and insert the new row. Same shape
as INSERT INTO, just a different leading verb.

  REPLACE INTO mytable SELECT id FROM other WHERE cnt > 100
  REPLACE INTO t (a, b) VALUES (1, 2)

Reference: https://dev.mysql.com/doc/refman/8.4/en/replace.html

Dispatch the top-level REPLACE keyword to `parse_insert` when followed
by INTO. The replace-vs-insert distinction is lost in the AST (both
become Statement::Insert), which is acceptable for grammar coverage —
table/column refs are preserved for downstream lineage.

Gated on `MySqlDialect | GenericDialect`; ClickHouse's `REPLACE TABLE`
shorthand for `CREATE OR REPLACE TABLE` is unchanged.
MySQL's JSON_TABLE and Oracle's XMLTABLE attach a `COLUMNS(<col_defs>)`
clause to a path-string argument, defining the output row shape:

  JSON_TABLE(json, '$.path' COLUMNS(id INT PATH '$.id'))
  JSON_TABLE(j, '$[*]' COLUMNS(row_id FOR ORDINALITY,
                                link VARCHAR(255) PATH '$.link'))

Previously the function-arg parser saw the COLUMNS keyword after the
path string and bailed with "Expected ), found: COLUMNS".

In `parse_function_args`, after parsing an expression-style argument,
peek for a `COLUMNS (` shape and consume the balanced paren block
opaquely. The col_defs are output column shapes (types + JSON paths);
they don't carry input table/column refs, so opaque consumption
preserves all lineage information already in the argument expression.

References:
- MySQL: https://dev.mysql.com/doc/refman/8.4/en/json-table-functions.html
- Oracle XMLTABLE: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/XMLTABLE.html

Fixes 16 corpus test failures (Oracle, sqlglot Oracle / MySQL, Postgres).
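The opaque consumption is a plain balanced-paren skip; a standalone sketch (simplified tokens, illustrative names):

```rust
#[derive(Debug, PartialEq)]
enum Tok { LParen, RParen, Other }

/// Skip a balanced (...) block starting at `i`, discarding its contents.
/// Returns false if `i` isn't at '(' or the parens never balance.
fn skip_balanced_parens(toks: &[Tok], i: &mut usize) -> bool {
    if toks.get(*i) != Some(&Tok::LParen) {
        return false;
    }
    let mut depth = 0usize;
    while let Some(t) = toks.get(*i) {
        *i += 1;
        match t {
            Tok::LParen => depth += 1,
            Tok::RParen => {
                depth -= 1;
                if depth == 0 {
                    return true; // positioned just past the matching ')'
                }
            }
            Tok::Other => {} // col defs, types, PATH strings: all discarded
        }
    }
    false // ran out of tokens with parens unbalanced
}

fn main() {
    use Tok::*;
    let toks = vec![LParen, Other, LParen, Other, RParen, RParen, Other];
    let mut i = 0;
    assert!(skip_balanced_parens(&toks, &mut i));
    assert_eq!(i, 6);
}
```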
ClickHouse extends the JOIN grammar with two orthogonal modifiers:

- `[INNER|LEFT|RIGHT] [ANY|ASOF|ALL] JOIN` selects the row-matching
  semantics. ANY picks one matching row, ASOF does temporal-nearest
  matching, ALL is the default cartesian behaviour.
- `GLOBAL` prefixes any JOIN to mark it as a distributed-query join
  (sub-select runs on the initiator and the result ships to shards).

Reference: https://clickhouse.com/docs/sql-reference/statements/select/join

The modifier doesn't change the lineage shape — same table refs and
join condition — so parse the keyword and discard rather than
extending the AST. Both INNER and LEFT/RIGHT paths get the modifier
slot; GLOBAL is consumed once at the top of the JOIN-loop iteration
before any LEFT/RIGHT/INNER dispatch.

Gated on `ClickHouseDialect | GenericDialect`.

Fixes 5 corpus test failures (sqlglot ClickHouse fixtures).
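A sketch of the parse-and-discard modifier slot (word-level stand-in for the real keyword dispatch):

```rust
/// Consume an optional ANY / ASOF / ALL strictness marker and report which
/// one was seen; the caller discards it, since lineage is unchanged.
fn consume_join_strictness(words: &[&str], i: &mut usize) -> Option<&'static str> {
    const MODS: [&str; 3] = ["ANY", "ASOF", "ALL"];
    let w = words.get(*i)?;
    let hit = *MODS.iter().find(|m| m.eq_ignore_ascii_case(w))?;
    *i += 1; // parsed and discarded
    Some(hit)
}

fn main() {
    let words = ["LEFT", "ANY", "JOIN", "t2"];
    let mut i = 1; // positioned after LEFT
    assert_eq!(consume_join_strictness(&words, &mut i), Some("ANY"));
    assert_eq!(words[i], "JOIN"); // normal JOIN parsing resumes here
    // GLOBAL is handled the same way, once per JOIN-loop iteration.
}
```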
@lustefaniak changed the title from "snowflake: identifier literal, INTERVAL carve-outs, adjacent-string concat" to "Parser fixes from corpus loop: 7 commits across Snowflake/DuckDB/MySQL/ClickHouse/ANSI" on May 5, 2026
T-SQL's [system-versioned temporal table syntax] uses two extensions
the parser was rejecting:

  CREATE TABLE t (
    <cols>,
    valid_from DATETIME2 GENERATED ALWAYS AS ROW START [HIDDEN] NOT NULL,
    valid_to   DATETIME2 GENERATED ALWAYS AS ROW END   [HIDDEN] NOT NULL,
    PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
  )

1. **`GENERATED ALWAYS AS ROW {START|END} [HIDDEN]`** column option.
   The existing `parse_optional_column_option_generated` only knew the
   IDENTITY and `AS (expr) [STORED]` forms. After consuming `AS`, peek
   for `ROW` or `TRANSACTION_ID` (case-insensitive) and consume the
   optional `START`/`END`/`HIDDEN` tokens, surfacing the marker as a
   `DialectSpecific` column option so the column ref + type stay in
   the AST for lineage.

2. **Table-level `PERIOD FOR SYSTEM_TIME (start_col, end_col)`** clause.
   Both columns are already in the table's column list — the clause
   pairs them but adds no new lineage. Consume the tokens at the top
   of the column-list loop and discard.

The `WITH(SYSTEM_VERSIONING=ON [(HISTORY_TABLE=…, DATA_CONSISTENCY_CHECK=…)])`
table-option suffix was already handled by the existing WITH-options
parser; nothing extra needed there.

Both extensions gated on `MsSqlDialect | GenericDialect`.

[system-versioned temporal table syntax]: https://learn.microsoft.com/en-us/sql/relational-databases/tables/creating-a-system-versioned-temporal-table

Fixes 6 corpus test failures (sqlglot T-SQL fixtures).
ClickHouse routes DDL/DML to all shards via an `ON CLUSTER <name>`
clause:

  DELETE FROM tbl ON CLUSTER test_cluster WHERE date = '2019-01-01'
  DELETE FROM tbl ON CLUSTER '{cluster}' WHERE date = '2019-01-01'

Reference: https://clickhouse.com/docs/sql-reference/distributed-ddl

After parsing the FROM table list in `parse_delete`, peek-and-consume
the optional ON CLUSTER clause before WHERE / USING / RETURNING. The
cluster name doesn't add lineage info; reuse the existing
`parse_optional_on_cluster` helper and discard the result.

Gated on `ClickHouseDialect | GenericDialect`.

Fixes 2 corpus test failures (sqlglot ClickHouse fixtures).
Postgres / ANSI / Teradata `CREATE TABLE name AS <query>` accepts a
trailing clause that controls whether the new table is populated with
the query's results and whether statistics are collected:

  CREATE TABLE t AS SELECT … WITH DATA
  CREATE TABLE t AS SELECT … WITH NO DATA
  CREATE TABLE t AS SELECT … WITH DATA AND STATISTICS
  CREATE TABLE t AS SELECT … WITH NO DATA AND NO STATISTICS

References:
- Postgres: https://www.postgresql.org/docs/current/sql-createtableas.html
- Teradata: https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Data-Definition-Language-Syntax-and-Examples/Table-Statements/CREATE-TABLE-AS/AS-Subquery-Clause

After parsing the AS-query body in `parse_create_table_inner`, peek for
`WITH` and consume the optional `[NO] DATA [AND [NO] STATISTICS]` tail.
Neither lineage nor table-shape info lives in the clause; consume and
discard. If the WITH wasn't followed by `[NO] DATA` (e.g. T-SQL's
`WITH (option=…)` table-options), restore the index so the existing
parser path handles it.

Fixes 5 corpus test failures (sqlglot ANSI + Trino).
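A sketch of the peek-with-restore shape (`eat` is a hypothetical keyword helper over a simplified word stream):

```rust
fn eat(words: &[&str], i: &mut usize, kw: &str) -> bool {
    if words.get(*i).is_some_and(|w| w.eq_ignore_ascii_case(kw)) {
        *i += 1;
        true
    } else {
        false
    }
}

/// Consume `WITH [NO] DATA [AND [NO] STATISTICS]`; rewind and return false
/// if WITH turns out to start something else (e.g. T-SQL WITH (option=…)).
fn try_consume_with_data_tail(words: &[&str], i: &mut usize) -> bool {
    let start = *i; // checkpoint
    if !eat(words, i, "WITH") {
        return false;
    }
    let _ = eat(words, i, "NO"); // optional
    if !eat(words, i, "DATA") {
        *i = start; // not our clause: restore the index
        return false;
    }
    if eat(words, i, "AND") {
        let _ = eat(words, i, "NO"); // optional
        let _ = eat(words, i, "STATISTICS");
    }
    true // clause consumed and discarded: no lineage content
}

fn main() {
    let mut i = 0;
    assert!(try_consume_with_data_tail(&["WITH", "NO", "DATA"], &mut i));
    let mut j = 0;
    assert!(!try_consume_with_data_tail(&["WITH", "(", "opt"], &mut j));
    assert_eq!(j, 0); // restored for the WITH-options parser
}
```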
BigQuery [path expressions] allow the last segment to start with a
digit:

  SELECT * FROM foo.bar.25ab c
  SELECT * FROM foo.bar.25
  SELECT * FROM foo.bar.25_

The tokenizer greedily folds a leading `.` into the next number, so
`bar.25ab` tokenises as `Word("bar")` then `Number(".25")` then
`Word("ab")`. Previously `parse_object_name` saw the Number token
where it expected a Period, broke out of the path loop, and the
parser then errored on the dangling `.25`.

In `parse_object_name`, after each ident, peek for a leading-dot
Number. If found, peel the `.` off, treat the remaining digits as the
next segment's prefix, and concatenate any adjacent Word for segments
like `25ab`. Index-only advance — never mutate `self.tokens` (which
would persist across speculative `maybe_parse` calls).

Numeric literals (`SELECT 1.5`, `WHERE x = 0.5`) and SELECT-projection
JSON paths (`field.5k_clients_target`, handled by
`parse_snowflake_json_path`) are unaffected because both go through
different parser paths.

Gated on `BigQueryDialect | GenericDialect`.

[path expressions]: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#path_expressions

Fixes 4 corpus test failures (sqlglot BigQuery).
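A sketch of the peel-back on a simplified token pair (names illustrative; the real code advances the parser index only, leaving the token buffer intact):

```rust
#[derive(Debug)]
enum Tok {
    Word(String),
    Number(String),
}

/// If the token at `i` is a leading-dot number like ".25", peel the dot
/// off and return the digits (plus any adjacent word, for `25ab`) as the
/// next path segment.
fn peel_leading_dot_number(toks: &[Tok], i: &mut usize) -> Option<String> {
    let Tok::Number(n) = toks.get(*i)? else { return None };
    let digits = n.strip_prefix('.')?.to_string(); // ".25" → "25"
    *i += 1;
    if let Some(Tok::Word(w)) = toks.get(*i) {
        *i += 1;
        return Some(format!("{digits}{w}")); // `25ab`
    }
    Some(digits)
}

fn main() {
    // `foo.bar.25ab` tokenises as Word("bar"), Number(".25"), Word("ab").
    let toks = vec![Tok::Number(".25".into()), Tok::Word("ab".into())];
    let mut i = 0;
    assert_eq!(peel_leading_dot_number(&toks, &mut i).as_deref(), Some("25ab"));
    assert_eq!(i, 2);
}
```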
@lustefaniak changed the title from "Parser fixes from corpus loop: 7 commits across Snowflake/DuckDB/MySQL/ClickHouse/ANSI" to "Parser fixes from corpus loop: 11 commits across 7 dialects" on May 5, 2026
lustefaniak added 14 commits May 6, 2026 01:57
…S, INLINE LENGTH)

Teradata's column-attribute grammar adds four post-type modifiers that
the parser currently rejects:

  CREATE TABLE foo (
    valid_date DATE FORMAT 'YYYY-MM-DD',
    name       VARCHAR(50) TITLE 'Customer Name',
    code       INT COMPRESS,
    body       VARCHAR(255) COMPRESS ('a', 'b'),
    notes      VARCHAR(80) INLINE LENGTH 64
  )

Reference: https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Data-Definition-Language-Detailed-Topics/CREATE-TABLE/Column-Level-Attributes-for-Database-Object-Creation

The corpus runs sqlglot_teradata fixtures through `GenericDialect`
(there is no dedicated TeradataDialect), so gate on `GenericDialect |
AnsiDialect`. Surface each as a `ColumnOption::DialectSpecific` with
the keyword name; lineage info is preserved by the column's name and
type, the modifiers carry no input refs.

`FORMAT` is a real keyword; `TITLE`, `COMPRESS`, `INLINE`, and `LENGTH`
aren't (per the project rule "Match non-keyword words case-insensitively,
don't add to keywords.rs"), so detect them via case-insensitive Word
match. `COMPRESS (...)` consumes its optional value list with a
balanced-paren skip — the values are constants, no lineage content.

Fixes 5 corpus test failures (sqlglot Teradata + ANSI fixtures).
Snowflake's [zero-copy clone] for schemas:

  CREATE SCHEMA mytestschema_clone CLONE testschema
  CREATE SCHEMA restored_schema    CLONE my_schema AT (OFFSET => -3600)
  CREATE SCHEMA s_restore          CLONE testschema BEFORE (TIMESTAMP => …)

In `parse_create_schema`, after the schema name, peek for `CLONE` and
consume `<source>` and an optional `AT|BEFORE (…)` time-travel suffix.
The current `Statement::CreateSchema` AST has no `clone` slot, so the
clause is consumed and discarded for parser-coverage; revisit when
schema-level provenance lineage is needed and add a field then.

Gated on `SnowflakeDialect | GenericDialect`.

[zero-copy clone]: https://docs.snowflake.com/en/sql-reference/sql/create-clone

Fixes 7 corpus test failures (snowflake first-party + sqlglot snowflake).
ANSI SQL and BigQuery extend UNION/INTERSECT/EXCEPT with two
suffixes the parser currently rejects:

  -- match legs by column name instead of position
  SELECT 1 AS x UNION ALL CORRESPONDING SELECT 2 AS x
  SELECT 1 AS x UNION ALL CORRESPONDING BY (foo, bar) SELECT 2 AS x

  -- type-strict union (no implicit coercion)
  SELECT 1 UNION ALL STRICT SELECT 2

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators

In `parse_set_quantifier`, after the existing ALL/DISTINCT/BY NAME
parsing, peek for `CORRESPONDING [BY (col, …)]` and consume the
balanced paren block opaquely — the column list contains plain names
that already appear in the SELECT legs, so opaque consumption
preserves all lineage info. Same treatment for `STRICT`.

These suffixes don't change which tables/columns the union references,
so adding new `SetQuantifier` variants isn't necessary for grammar
coverage.

Fixes 7 corpus test failures (sqlglot BigQuery + Trino).
Snowflake's [DATE_PART] supports two argument forms — the standard
function-call shape `DATE_PART(<part>, <expr>)` (already worked) and
the ANSI EXTRACT-style `DATE_PART(<part> FROM <expr>)`. The previous
parser path treated DATE_PART as a generic function call and rejected
`FROM` between the args.

Add a special-case at the top of `parse_prefix` for non-keyword Word
"DATE_PART" (case-insensitive) followed by `(`, parsing the part, then
either a comma or `FROM` separator, then the expression. Result is
`Expr::Function` so downstream consumers (lineage visitors) see the
same shape as any other function call — same args slot, same column
refs preserved.

Gated on `SnowflakeDialect | GenericDialect`.

[DATE_PART]: https://docs.snowflake.com/en/sql-reference/functions/date_part

Fixes 3 corpus test failures (sqlglot Snowflake).
Snowflake's [CREATE EXTERNAL TABLE] places a `PARTITION BY (col, col, …)`
clause between the column-def list and the option block:

  CREATE EXTERNAL TABLE et (col1 DATE AS (...), col2 VARCHAR AS (...))
    PARTITION BY (col1, col2)
    LOCATION=@stage/path/
    FILE_FORMAT=(type=parquet)

Previously the parser entered the option-swallowing loop, which expected
the first option to be `name=value` (`LOCATION=`, etc.). `PARTITION BY (...)`
didn't match that shape, so parsing fell through to the Hive-style
external-table path and errored.

Add a PARTITION-BY-list consumer immediately after the column list and
before the option block. The partition column names are already in the
column-def list, so opaque consumption preserves all lineage info.

Gated on `SnowflakeDialect | GenericDialect`.

[CREATE EXTERNAL TABLE]: https://docs.snowflake.com/en/sql-reference/sql/create-external-table

Fixes 3 corpus test failures (snowflake first-party + sqlglot snowflake).
Snowflake's [CREATE SEQUENCE] accepts a `COMMENT = '<string>'`
option alongside START / INCREMENT / ORDER:

  CREATE SEQUENCE seq START=5 COMMENT='foo' INCREMENT=10

The existing sequence-options loop didn't recognise COMMENT and
broke out at the `comment` keyword, leaving trailing tokens that
the outer parser then errored on.

Add a COMMENT arm to `parse_create_sequence_options` that consumes
the optional `=` and a literal string. The comment carries no
lineage content; discard the value.

[CREATE SEQUENCE]: https://docs.snowflake.com/en/sql-reference/sql/create-sequence

Fixes 2 corpus test failures (sqlglot Snowflake).
Redshift's CREATE TABLE permits DISTSTYLE, DISTKEY, and SORTKEY (with
optional COMPOUND prefix) to appear in any order after the column
definitions:

  CREATE TABLE sales (...) DISTKEY(listid) COMPOUND SORTKEY(...) DISTSTYLE AUTO
  CREATE TABLE t (...) SORTKEY(a) DISTKEY(a)

Reference: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

The previous parser parsed them in a fixed sequence (DISTSTYLE → DISTKEY
→ SORTKEY), so any other ordering errored on the first out-of-order
clause. Wrap the three lookups in a loop that consumes whichever
keyword appears next; each option is still admitted at most once.

Per the loop guidance, this is a clause-permutation fix and doesn't
require new grammar — the individual clauses are unchanged.

Fixes 3 corpus test failures (sqlglot_redshift + unparsed_redshift).
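A sketch of the any-order loop (word-level stand-in; the real arms parse full clause bodies, and COMPOUND is folded into the SORTKEY arm):

```rust
/// Consume DISTSTYLE / DISTKEY / SORTKEY in whatever order they appear,
/// each at most once; stop at the first word that is none of the three.
fn consume_table_attrs(words: &[&str], i: &mut usize) -> Result<(), String> {
    let (mut style, mut key, mut sort) = (false, false, false);
    loop {
        let Some(w) = words.get(*i) else { return Ok(()) };
        let seen = match () {
            _ if w.eq_ignore_ascii_case("DISTSTYLE") => &mut style,
            _ if w.eq_ignore_ascii_case("DISTKEY") => &mut key,
            _ if w.eq_ignore_ascii_case("SORTKEY") => &mut sort,
            _ => return Ok(()), // not ours: hand back to the outer parser
        };
        if *seen {
            return Err(format!("duplicate {w}"));
        }
        *seen = true;
        *i += 1; // the real parser consumes the clause body here
    }
}

fn main() {
    let mut i = 0;
    assert!(consume_table_attrs(&["SORTKEY", "DISTKEY", ";"], &mut i).is_ok());
    assert_eq!(i, 2); // both clauses consumed, any order
}
```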
Athena's DDL (CREATE EXTERNAL TABLE with ROW FORMAT SERDE,
SERDEPROPERTIES, STORED AS, etc.) is Hive-style, not Trino-style.
Switch the alias from `trino` → `hive` so the corpus runner uses
HiveDialect for `sqlglot_athena/` and any future `customer_athena/`
fixtures. Athena's DML/queries are Trino-style, but the failing
fixtures in the corpus are exclusively DDL where the mapping matters.

Fixes 3 corpus test failures (sqlglot Athena).
Hive's `ROW FORMAT` accepts two extensions the parser currently rejects:

- `ROW FORMAT SERDE 'class' WITH SERDEPROPERTIES ('k'='v', …)` —
  serde configuration after the class name. SERDEPROPERTIES isn't in
  the keyword table; matched case-insensitively.
- `ROW FORMAT DELIMITED [FIELDS TERMINATED BY 'x'] [LINES TERMINATED
  BY 'y'] [NULL DEFINED AS 'z']` — DELIMITED suboptions describing
  ASCII-text storage.

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe

`parse_row_format` now consumes:
1. After SERDE 'class', an optional `WITH SERDEPROPERTIES (...)` block
   (balanced-paren skip — the k/v strings carry no lineage info).
2. After DELIMITED, any sub-clauses up to the next ROW / STORED /
   LOCATION / WITH / COMMENT / TBLPROPERTIES / PARTITIONED / CLUSTERED
   / AS keyword (also EOF / `;`).

If `WITH` isn't followed by SERDEPROPERTIES, restore the index so
later table-options parsers (CTEs, WITH(option=…), etc.) can take it.

Used through `parse_hive_formats`, which is called for any dialect's
CREATE TABLE / CREATE EXTERNAL TABLE that allows Hive-style storage
options (Hive, Databricks, Athena via Hive routing).

Fixes 5 corpus test failures (sqlglot Athena, Databricks, sqlglot Hive).
Athena Iceberg tables and Trino use a different shape for PARTITIONED BY
than classic Hive:

  -- Hive: column-def list (each segment has a type)
  CREATE TABLE t (a INT) PARTITIONED BY (year INT)

  -- Iceberg: expression list (column refs + transform functions)
  CREATE TABLE t (id BIGINT, category STRING)
    PARTITIONED BY (category, BUCKET(16, id), TRUNCATE(8, id), DAY(ts))

Reference: https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html

`parse_hive_distribution` now distinguishes the two by peeking past the
first identifier inside `(...)`: if a known data-type keyword (INT,
STRING, BIGINT, …) follows, it's the column-def form; otherwise we
parse the contents as a comma-separated expression list. The expression
form's lineage info (column refs inside transforms like `BUCKET(16, id)`)
is preserved by the standard expression parser.

Used through `parse_hive_formats`, which is reached for any dialect's
CREATE TABLE that allows Hive-style storage options (Hive,
Databricks, Athena via Hive routing).

Fixes 2 corpus test failures (sqlglot Athena Iceberg).
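A sketch of the disambiguating peek (the type list here is a small illustrative subset):

```rust
const TYPE_KEYWORDS: &[&str] = &["INT", "BIGINT", "STRING", "DATE", "DOUBLE"];

#[derive(Debug, PartialEq)]
enum PartitionedByForm {
    ColumnDefs,  // Hive: PARTITIONED BY (year INT)
    Expressions, // Iceberg: PARTITIONED BY (category, BUCKET(16, id), ...)
}

/// Classify by the token that follows the first identifier inside (...).
fn classify(after_first_ident: &str) -> PartitionedByForm {
    if TYPE_KEYWORDS.iter().any(|t| t.eq_ignore_ascii_case(after_first_ident)) {
        PartitionedByForm::ColumnDefs
    } else {
        PartitionedByForm::Expressions
    }
}

fn main() {
    assert_eq!(classify("INT"), PartitionedByForm::ColumnDefs); // (year INT)
    assert_eq!(classify(","), PartitionedByForm::Expressions);  // (category, ...)
    assert_eq!(classify("("), PartitionedByForm::Expressions);  // (BUCKET(16, id))
}
```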
Hive's CREATE [EXTERNAL] TABLE grammar allows two optional table-level
clauses the parser was rejecting:

  CREATE EXTERNAL TABLE foo (id INT) COMMENT 'description'
  CREATE EXTERNAL TABLE foo (id INT, val STRING) CLUSTERED BY (id, val) INTO 10 BUCKETS
  CREATE EXTERNAL TABLE foo (id INT) COMMENT 'c'
    PARTITIONED BY (a INT) CLUSTERED BY (id) SORTED BY (id ASC) INTO 5 BUCKETS

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Add `parse_optional_hive_comment_and_clustered_by`, called both before
and after PARTITIONED BY in the external-table branch (the two clauses
can appear in either order around it). Both are consumed and discarded:
COMMENT carries no lineage info, and CLUSTERED BY references columns
already in the table's column-def list. SORTED BY (cols) is consumed
opaquely, INTO <n> [BUCKETS] is also opaque.

CLUSTERED, SORTED, and BUCKETS aren't keywords in our table; matched
case-insensitively per the project rule.

Fixes 2 corpus test failures (sqlglot Athena).
Extends the previous CORRESPONDING / STRICT support to also accept:

- The two suffixes in any order: `UNION ALL STRICT CORRESPONDING`
  alongside the previously-handled `UNION ALL CORRESPONDING STRICT`.
- `BY NAME ON (col, …)` — BigQuery's column-restricted by-name match:

    SELECT 1 AS x UNION ALL BY NAME ON (foo, bar) SELECT 2 AS x
    SELECT 1 AS x INNER UNION ALL BY NAME ON (foo, bar) SELECT 2 AS x

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators

`parse_set_quantifier` now loops through CORRESPONDING / STRICT / ON
suffixes, consuming each at most once and stopping as soon as it sees
something that isn't one of the three. All three are opaque to lineage
(column names inside the paren lists already appear in the SELECT
legs).

Fixes 4 corpus test failures (sqlglot BigQuery).
Redshift inherits the Oracle/Snowflake `expr (+)` legacy outer-join
syntax in the WHERE clause of comma-join queries:

  SELECT a.foo, b.bar FROM a, b WHERE a.baz = b.baz (+)
  SELECT * FROM a, b WHERE a.id (+) = b.id

The parser already handled this for Snowflake/Generic. Extend the
two `dialect_of!` gates (in `parse_subexpr` and the
`Token::LParen | Token::Period` arm of `parse_prefix`) to also include
RedshiftSqlDialect.

Fixes 1 corpus test failure (sqlglot Redshift).
Snowflake's [CREATE STAGE] accepts the option clauses (DIRECTORY,
FILE_FORMAT, COPY_OPTIONS, COMMENT, plus URL / CREDENTIALS /
STORAGE_INTEGRATION / ENDPOINT / ENCRYPTION which would normally be
read by `parse_stage_params`) in any order, and the FILE_FORMAT value
has three shapes:

  FILE_FORMAT = (TYPE='JSON' …)         -- inline parenthesised options
  FILE_FORMAT = '<format_name>'         -- string shorthand
  FILE_FORMAT = [<schema>.]<format>     -- dotted-ident shorthand

Reference: https://docs.snowflake.com/en/sql-reference/sql/create-stage

Three changes in `src/dialect/snowflake.rs`:

1. `parse_create_stage` wraps the option-clause section in a loop so
   any of DIRECTORY / FILE_FORMAT / COPY_OPTIONS / COMMENT plus
   URL/CREDENTIALS/STORAGE_INTEGRATION/ENDPOINT/ENCRYPTION can appear
   after FILE_FORMAT (previously they had to come first via
   `parse_stage_params`). Stage-params seen mid-stream are merged into
   the `stage_params` accumulator.

2. `parse_parentheses_options` accepts dotted ident values
   (`FORMAT_NAME=schema.format`) by consuming `.<word>` continuations.

3. FILE_FORMAT='string' and FILE_FORMAT=<ident> shorthand both surface
   as `DataLoadingOption { name: "FORMAT_NAME", … }` for AST symmetry
   with the parenthesised form's own `FORMAT_NAME` option.

Fixes 7 corpus test failures (snowflake first-party + sqlglot snowflake).
lustefaniak added 10 commits May 6, 2026 02:57
Snowflake (and Postgres) allow a dollar-quoted string body in the
column-level `COMMENT` clause:

  CREATE TABLE foo (ID INT COMMENT $$some comment$$)

Previously the parser only accepted single-quoted strings here and
errored with "Expected string, found: \$\$…\$\$". Add a
`Token::DollarQuotedString` arm to `parse_optional_column_option`'s
COMMENT branch, surfacing the inner content as `ColumnOption::Comment`
just like a single-quoted form.

Fixes 1 corpus test failure (sqlglot Snowflake).
BigQuery and MSSQL `FOR SYSTEM_TIME AS OF <expr>` time-travel reads
typically appear after the table's optional alias:

  FROM tbl t FOR SYSTEM_TIME AS OF CURRENT_TIMESTAMP() …
  FROM tbl AS t FOR SYSTEM_TIME AS OF CURRENT_TIMESTAMP() LEFT JOIN …

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#for_system_time_as_of

`parse_table_factor` previously called `parse_table_version` only
before the alias, so an aliased shape fell through to the
query-level FOR-UPDATE/FOR-SHARE locks loop and errored with
"Expected one of UPDATE or SHARE, found: SYSTEM_TIME". Call
`parse_table_version` a second time after the alias when no version
qualifier was set yet.

Fixes 5 corpus test failures (unparsed BigQuery).
…g comma

In `is_parse_comma_separated_end`, when a comma is followed by a
reserved-as-alias keyword (CLUSTER, SORT, FINAL, etc.), the parser
peeks the next-next token to decide whether the keyword is a clause
starter (end of list) or a column name reusing the keyword.

The previous fall-through `_ => true` returned "end of list" when
peek_nth(1) was anything not specifically whitelisted — including
`)`. So `EXCEPT(id_2, CLUSTER)` looked like a trailing comma plus
out-of-context CLUSTER and the loop stopped, leaving CLUSTER
unconsumed and the parser then erroring with "Expected ), found:
CLUSTER".

Add `Token::RParen => false` before the catch-all: a reserved keyword
inside a parenthesised list with `)` after it is unambiguously a
column name, not a trailing-comma terminator. Trailing-comma support
in projection lists (`SELECT a, b, FROM t`) is unaffected — those
have FROM/clause-starter after the keyword, hitting the dedicated
clause-only check.

Fixes 6 corpus test failures (customer & unparsed BigQuery).
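A sketch of the decision table with the new arm (simplified lookahead categories, illustrative names):

```rust
/// What follows the reserved-as-alias keyword that follows the comma.
#[derive(Debug)]
enum NextTok {
    RParen,      // e.g. EXCEPT(id_2, CLUSTER)
    ClauseStart, // e.g. SELECT a, b, FROM t
    Other,
}

/// true → the comma was a trailing comma and the list ends here.
fn is_comma_separated_end(after_keyword: NextTok) -> bool {
    match after_keyword {
        // New arm: a keyword right before ')' is unambiguously a column name.
        NextTok::RParen => false,
        // Clause starter after the keyword: genuine trailing comma.
        NextTok::ClauseStart => true,
        // Previous fall-through behaviour, kept as-is in this sketch.
        NextTok::Other => true,
    }
}

fn main() {
    assert!(!is_comma_separated_end(NextTok::RParen));     // CLUSTER is a column
    assert!(is_comma_separated_end(NextTok::ClauseStart)); // SELECT a, b, FROM t
}
```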
Snowflake's [CREATE TABLE] allows column-level foreign-key constraints
with an explicit `FOREIGN KEY` prefix:

  <col> <type> [NOT NULL] FOREIGN KEY REFERENCES <ref_table> [(<ref_col>)]

Previously the parser only knew the shorter ANSI/Postgres form
`<col> <type> REFERENCES <ref_table>(...)` and bailed at FOREIGN
inside a column definition.

In `parse_optional_column_option`, accept either `FOREIGN KEY
REFERENCES` or bare `REFERENCES` before the same shared
foreign-table / column-list / ON DELETE/UPDATE / characteristics
parsing.

[CREATE TABLE]: https://docs.snowflake.com/en/sql-reference/sql/create-table

Fixes 1 corpus test failure (sqlglot Snowflake).
BigQuery's CREATE FUNCTION accepts an optional `DETERMINISTIC` or
`NOT DETERMINISTIC` marker between RETURNS and LANGUAGE:

  CREATE TEMPORARY FUNCTION f(x FLOAT64) RETURNS FLOAT64 NOT DETERMINISTIC
    LANGUAGE js AS 'return Math.random() * x;'

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_a_function

Add a `parse_keywords([NOT, DETERMINISTIC])` / `parse_keyword(DETERMINISTIC)`
arm to `parse_create_function_body`'s loop. The marker doesn't change
lineage; consume and discard rather than extending `CreateFunctionBody`.

Fixes 1 corpus test failure (sqlglot BigQuery).
…thand

Redshift's CREATE TABLE accepts a two-positional-argument shorthand for
IDENTITY columns:

  CREATE TABLE t (c BIGINT GENERATED BY DEFAULT AS IDENTITY (0, 1))
  CREATE TABLE t (c BIGINT GENERATED ALWAYS AS IDENTITY (100, 5))

Reference: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

Previously the parser only handled the Postgres-style keyword form
`IDENTITY (START WITH n INCREMENT BY n …)` via
`parse_create_sequence_options`, which expected each option to begin
with a keyword and broke on the bare numeric pair.

Add `parse_identity_paren_options` that peeks the first token after `(`:
- bare Number / sign — parse `(seed, step)` and surface as
  `[StartWith(seed, false), IncrementBy(step, false)]` so the existing
  AST doesn't need new variants.
- otherwise — fall back to `parse_create_sequence_options` for the
  keyword form.

Both `GENERATED ALWAYS AS IDENTITY` and `GENERATED BY DEFAULT AS IDENTITY`
column-option arms now route through the helper.

Fixes 1 corpus test failure (sqlglot Redshift).
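A sketch of the first-token peek (illustrative enum; the real helper then delegates to the matching parse path):

```rust
#[derive(Debug, PartialEq)]
enum IdentityForm {
    SeedStep,       // IDENTITY (0, 1): positional shorthand
    KeywordOptions, // IDENTITY (START WITH 0 INCREMENT BY 1 ...)
}

fn classify_identity_args(first_token: &str) -> IdentityForm {
    match first_token.chars().next() {
        Some('+') | Some('-') => IdentityForm::SeedStep, // signed seed
        Some(c) if c.is_ascii_digit() => IdentityForm::SeedStep,
        _ => IdentityForm::KeywordOptions,
    }
}

fn main() {
    assert_eq!(classify_identity_args("0"), IdentityForm::SeedStep);
    assert_eq!(classify_identity_args("START"), IdentityForm::KeywordOptions);
}
```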
Snowflake's TABLE(<expr>) table-function form accepts the same suffix
keywords (TABLESAMPLE, PIVOT, UNPIVOT, MATCH_RECOGNIZE) as a regular
table reference:

  SELECT * FROM TABLE('t1') TABLESAMPLE BERNOULLI (20.3)

Previously the TABLE(...) branch in `parse_table_factor` returned the
TableFactor::TableFunction directly without running the suffix-keyword
loop, so the TABLESAMPLE token was left for the outer parser, which
errored.

Wrap the post-TABLE() return in the same suffix-keyword loop used for
plain table refs (PIVOT / UNPIVOT / TABLESAMPLE / SAMPLE /
MATCH_RECOGNIZE) so each can apply.

Fixes 1 corpus test failure (sqlglot Snowflake).
…IER(…)

Extend the previous Snowflake `IDENTIFIER('<name>')` literal support
to also accept session variables and bind parameters:

  CREATE TABLE IDENTIFIER($foo) (col1 VARCHAR, col2 VARCHAR)
  SELECT * FROM IDENTIFIER($tbl_name)
  SELECT * FROM IDENTIFIER(?)

Reference: https://docs.snowflake.com/en/sql-reference/identifier-literal

`Token::Placeholder` (which covers both `$foo` and `?` in our tokeniser)
is added to the inner-value match alongside the existing
SingleQuotedString/DoubleQuotedString arms. Placeholder values surface
as a plain (unquoted) Ident — they have no compile-time name and
execution would resolve them at run time anyway, so the synthetic
ident just keeps parsing going.

Fixes 1 corpus test failure (sqlglot Snowflake).
BigQuery's [legacy SQL] uses square brackets to quote project-qualified
table identifiers, with `:` separating the project from the
dataset/table:

  SELECT * FROM [my-proj-123:dataset.table]
  SELECT * FROM [proj:ds.tbl] AS t

Standard SQL replaces the brackets with backticks (`proj.ds.tbl`),
but customers still submit legacy-SQL queries through the wire — they
appear in `unparsed_bigquery` query logs.

In `parse_table_factor`, when we see `[` at the start of a table
reference (BigQuery / Generic), consume the balanced bracket block
(words / numbers / dots / colons / hyphens / `*` for wildcard tables)
and surface the inner string as a single backtick-quoted Ident.
Lineage tracking sees the table reference normally; the
`project:dataset.table` text is preserved verbatim in the ident value.

If the bracket block contains anything else (operators, parens, etc.),
restore the index so other callers — e.g. ARRAY[...] literals — can
take over.

[legacy SQL]: https://cloud.google.com/bigquery/docs/reference/legacy-sql

Fixes 5 corpus test failures (unparsed BigQuery, customer BigQuery).
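A character-level sketch of the bracket scan (the real code walks tokens, but the accept-or-rewind shape is the same; names illustrative):

```rust
/// Try to read `[...]` as a legacy table reference. Accepts only the
/// character shapes such a reference can contain; otherwise leaves the
/// index untouched so ARRAY[...] handling can take over.
fn try_legacy_table_ref(input: &str, i: &mut usize) -> Option<String> {
    let rest = &input[*i..];
    if !rest.starts_with('[') {
        return None;
    }
    let end = rest.find(']')?;
    let inner = &rest[1..end];
    let ok = inner.chars().all(|c| {
        c.is_ascii_alphanumeric() || matches!(c, '.' | ':' | '-' | '_' | '*')
    });
    if !ok {
        return None; // index untouched: not a legacy ref (e.g. ARRAY[1, 2])
    }
    *i += end + 1;
    Some(format!("`{inner}`")) // surfaced as one backtick-quoted ident
}

fn main() {
    let mut i = 0;
    let got = try_legacy_table_ref("[my-proj-123:dataset.table]", &mut i);
    assert_eq!(got.as_deref(), Some("`my-proj-123:dataset.table`"));
}
```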
BigQuery accepts both `'…'` and `"…"` as string-literal forms. The
parser's `AT TIME ZONE` arm only recognised the single-quoted form,
so a query like

  EXTRACT(HOUR FROM ts AT TIME ZONE "Asia/Tokyo")

errored at the time-zone argument with "Expected
Token::SingleQuotedString after AT TIME ZONE".

Add a `Token::DoubleQuotedString` arm gated on
`BigQueryDialect | GenericDialect`. The single-quoted path is
unchanged; non-BigQuery dialects (Postgres, etc.) still require the
single-quoted form per ANSI.

Fixes 6 corpus test failures (4 unparsed_bigquery + 2 customer_bigquery).
@lustefaniak changed the title from "Parser fixes from corpus loop: 11 commits across 7 dialects" to "Parser fixes from corpus loop: Snowflake / BigQuery / Redshift / Athena and friends" on May 6, 2026
@lustefaniak changed the title from "Parser fixes from corpus loop: Snowflake / BigQuery / Redshift / Athena and friends" to "Parser fixes from corpus loop: Snowflake / BigQuery / Redshift / Athena / ClickHouse" on May 6, 2026
- compare-corpus-reports.js only shows additions; check git status for prunes
- Pipeline reprocess takes ~10min; use Monitor with pgrep loop
- Anonymizer corruption signature is exactly 's'<word>; broader regex deletes
  hand-written sqlglot fixtures
- Query-log truncation heuristics that worked (trailing punct/keyword,
  CASE>END count)
- 'cmd &' with run_in_background returns completed immediately — verify with
  pgrep
@lustefaniak lustefaniak merged commit b93e33d into main May 6, 2026
5 checks passed
@lustefaniak lustefaniak deleted the lukasz-corpus-fixes-new-customer-sql branch May 6, 2026 09:13