
Parser fixes from corpus loop: Snowflake / BigQuery / Redshift / Athena / ClickHouse #81

Merged
lustefaniak merged 36 commits into main from lukasz-corpus-fixes-new-customer-sql on May 6, 2026
Conversation

@lustefaniak (Collaborator) commented May 5, 2026

Parser fixes uncovered by iterating the kernel-cll corpus. Every commit is dialect-grounded with a docs reference and zero corpus regressions; ~98.7% pass rate across ~172k SQL samples (snowflake 99.8%, redshift 99.6%, bigquery 99.5%, athena 97.4%).

Snowflake

  • IDENTIFIER('<name>') literal in name positions (docs) — CREATE TABLE IDENTIFIER('db.schema.t'), FROM IDENTIFIER('mytable'). Detected at the start of parse_identifier.
  • Session variables / bind params inside IDENTIFIER(…) — IDENTIFIER($var), IDENTIFIER(?).
  • INTERVAL as identifier before binary/clause keywords (INTERVAL BETWEEN …, … ORDER BY INTERVAL, JOIN … ON INTERVAL = x).
  • CREATE STAGE option grammar — FILE_FORMAT = (TYPE = …) shorthand, dotted-ident values, CREDENTIALS = (…) after FILE_FORMAT.
  • CREATE SCHEMA … CLONE source [AT|BEFORE (…)].
  • CREATE SEQUENCE … COMMENT='…' option.
  • CREATE EXTERNAL TABLE … PARTITION BY (cols).
  • DATE_PART(<part> FROM <expr>) ANSI form.
  • Inline FOREIGN KEY REFERENCES column constraint.
  • Dollar-quoted strings for column COMMENT.
  • TABLESAMPLE after FROM TABLE(<expr>) reference.

BigQuery

  • Digit-prefixed path segments (path expressions docs) — foo.bar.25ab. Tokenizer greedily folds leading . into a Number; parse_object_name peels it back off without mutating self.tokens.
  • Legacy SQL [project-id:dataset.table] table references.
  • FOR SYSTEM_TIME AS OF after table alias.
  • [NOT] DETERMINISTIC marker in CREATE FUNCTION body.
  • Set-op suffixes (CORRESPONDING / STRICT / ON (cols)) in any order.
  • Double-quoted string after AT TIME ZONE.

Redshift

  • Oracle/Snowflake (+) outer-join marker.
  • DISTSTYLE / DISTKEY / SORTKEY in any order on CREATE TABLE.
  • GENERATED AS IDENTITY (seed, step) two-arg shorthand.

Hive / Athena

  • Athena → HiveDialect routing in the corpus runner (Athena's DDL uses Hive grammar; its queries are Trino-style).
  • Iceberg-style expression PARTITIONED BY (PARTITIONED BY (bucket(16, x), days(ts))).
  • WITH SERDEPROPERTIES (…) and DELIMITED suboptions.
  • Table-level COMMENT and CLUSTERED BY clauses.

ClickHouse

  • [GLOBAL] [LEFT|RIGHT|INNER] [ANY|ASOF|ALL] JOIN (JOIN docs) — modifier parsed and discarded; the lineage shape matches the unmodified join.
  • ON CLUSTER clause in DELETE FROM tbl ON CLUSTER … WHERE … (distributed-DDL docs).

MySQL

  • REPLACE [INTO] statement (docs) — same shape as INSERT.

DuckDB

  • USING SAMPLE clause (samples docs) — tbl USING SAMPLE 10%, tbl USING SAMPLE SYSTEM (10 PERCENT) REPEATABLE (377). Triggered only when USING is followed by bare SAMPLE, so JOIN's USING (cols) is unaffected.

T-SQL / MSSQL

  • System-versioned temporal table column markers (docs) — GENERATED ALWAYS AS ROW {START|END} [HIDDEN], PERIOD FOR SYSTEM_TIME (start_col, end_col).

Cross-dialect

  • Adjacent string literal concatenation (ANSI SQL §5.3, BigQuery, Snowflake, Postgres) — 'foo' 'bar' → 'foobar'. Real customer SQL relies on this.
  • JSON_TABLE / XMLTABLE COLUMNS(...) clause (MySQL JSON_TABLE) — consumed opaquely (output column shapes carry no input refs).
  • WITH [NO] DATA [AND [NO] STATISTICS] on CREATE TABLE AS (Postgres docs, Teradata docs).
  • Reserved keyword followed by ) treated as column name, not trailing-comma terminator.
  • CORRESPONDING [BY (cols)] and STRICT set-op modifiers.
  • Teradata column-level attributes — FORMAT, TITLE, COMPRESS, INLINE LENGTH.

Corpus-side improvements (kernel-cll-corpus)

The anonymizer pipeline produced fragments that no parser could accept; fixes landed at the source rather than as fragile parser carve-outs:

  • Triple-quoted string handling, nested block comment depth tracking.
  • Tighter number regex (\d+(?:\.\d+)?) — was eating trailing . in proj.NNN.dataset.
  • Paren/bracket balance check.
  • New keywords: TARGET, JS, PYTHON, IDENTIFIER, DIMENSIONS, METRICS, FACTS, SEMANTIC_VIEW.
  • Drop `IF cond THEN` SQLs without `END IF` (truncated procedure-body fragments).
  • Drop query-log SQLs that end mid-clause — trailing `,`, `(`, `=`, clause keywords (SELECT/FROM/BY/AS/…), or unclosed CASE. Removed ~4k Redshift truncations.
  • Drop `'s'<word>` anonymizer corruption — when the anonymizer's regex misaligns on a token boundary inside a string literal (e.g. `INTERVAL '1 HOUR'` → `'s'HOUR`), the resulting SQL is unparseable. The pattern is unique to anonymizer output, with no false positives on hand-written fixtures.
  • _INTERNAL_QUERY_MARKERS filter excludes warehouse-internal /* DS_SVC */ queries.

Tests

Each parser commit ships a unit test in the appropriate dialect file. Full suite passes (cargo nextest run --all-features). Latest CI: Corpus / Check / Test Suite all green.

Snowflake's IDENTIFIER literal lets a string stand in for any identifier
(CREATE TABLE IDENTIFIER('db.schema.t'), FROM IDENTIFIER('mytable'),
INSERT INTO IDENTIFIER('foo.bar')). At the start of parse_identifier,
detect the IDENTIFIER(<string>) shape and consume the whole construct,
returning the string content as a single quoted Ident. The dotted name
inside the string is preserved verbatim — Snowflake itself splits at
execution time, and downstream lineage consumers can do the same.

Reference: https://docs.snowflake.com/en/sql-reference/identifier-literal

Fixes 581 corpus test failures (Snowflake).
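For reference, a minimal standalone sketch of the detection shape (the `Tok` enum and `try_identifier_literal` are illustrative stand-ins, not the crate's real token or parser types):

```rust
// Sketch only: a simplified token stream standing in for the real parser.
#[derive(Debug, PartialEq)]
enum Tok {
    Word(String),
    Str(String), // single-quoted string literal
    LParen,
    RParen,
}

/// If the stream starts with IDENTIFIER('<name>'), consume all four tokens
/// and return the string content as the identifier; otherwise leave the
/// index untouched so the normal identifier path runs.
fn try_identifier_literal(toks: &[Tok], i: &mut usize) -> Option<String> {
    match toks.get(*i..*i + 4)? {
        [Tok::Word(w), Tok::LParen, Tok::Str(name), Tok::RParen]
            if w.eq_ignore_ascii_case("IDENTIFIER") =>
        {
            *i += 4; // consume the whole IDENTIFIER('…') construct
            Some(name.clone()) // dotted content preserved verbatim
        }
        _ => None,
    }
}

fn main() {
    let toks = vec![
        Tok::Word("IDENTIFIER".into()),
        Tok::LParen,
        Tok::Str("db.schema.t".into()),
        Tok::RParen,
    ];
    let mut i = 0;
    assert_eq!(try_identifier_literal(&toks, &mut i).as_deref(), Some("db.schema.t"));
    assert_eq!(i, 4);
}
```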
github-actions Bot commented May 5, 2026

Corpus Parsing Report

Total: 169400 passed, 2187 failed (98.7% pass rate)

✨ No changes in test results

By Dialect

| Dialect | Passed | Failed | Total | Pass Rate | Delta |
|---|---|---|---|---|---|
| ansi | 511 | 69 | 580 | 88.1% | +6 |
| athena | 37 | 1 | 38 | 97.4% | +8 |
| bigquery | 36619 | 172 | 36791 | 99.5% | +122 |
| clickhouse | 2488 | 109 | 2597 | 95.8% | +7 |
| databricks | 2800 | 214 | 3014 | 92.9% | +2 |
| doris | 22 | 18 | 40 | 55.0% | - |
| dremio | 27 | 0 | 27 | 100.0% | - |
| duckdb | 1111 | 45 | 1156 | 96.1% | +9 |
| exasol | 54 | 7 | 61 | 88.5% | - |
| fabric | 6 | 0 | 6 | 100.0% | - |
| generic | 17 | 38 | 55 | 30.9% | - |
| hive | 35 | 10 | 45 | 77.8% | +1 |
| materialize | 6 | 14 | 20 | 30.0% | - |
| mssql | 2301 | 482 | 2783 | 82.7% | - |
| mysql | 148 | 37 | 185 | 80.0% | +2 |
| oracle | 1025 | 380 | 1405 | 73.0% | +12 |
| postgres | 1180 | 116 | 1296 | 91.0% | +5 |
| presto | 55 | 8 | 63 | 87.3% | - |
| redshift | 34360 | 141 | 34501 | 99.6% | +25 |
| singlestore | 141 | 9 | 150 | 94.0% | - |
| snowflake | 85482 | 143 | 85625 | 99.8% | +613 |
| spark | 90 | 20 | 110 | 81.8% | - |
| sqlite | 51 | 16 | 67 | 76.1% | - |
| starrocks | 29 | 4 | 33 | 87.9% | - |
| teradata | 23 | 20 | 43 | 53.5% | +3 |
| trino | 617 | 80 | 697 | 88.5% | +5 |
| tsql | 165 | 34 | 199 | 82.9% | +6 |

…words

`parse_interval_guard` previously rejected only `LIKE` / `IS` after
INTERVAL, then fell through to a probing `parse_interval()` whose
internal `parse_prefix` is permissive enough to treat any bare keyword
as an identifier. So `INTERVAL BETWEEN 1 AND 2`, `PARTITION BY a,
INTERVAL ORDER BY c`, `MAX(INTERVAL)` etc. all misconsumed the
following keyword as the literal's "value" and broke the surrounding
clause with a downstream error like "Expected ), found: id_5".

Extend the guard's reject list to cover the keywords that can never
plausibly start an interval literal value: binary operators (BETWEEN,
AND, OR, XOR, IN, NOT, ILIKE), clause starters (ORDER, GROUP, HAVING,
WHERE, LIMIT, OFFSET, QUALIFY, WINDOW, UNION, INTERSECT, EXCEPT),
window-frame & sort tokens (ROWS, RANGE, GROUPS, ASC, DESC), and join
conditions (ON, USING).

Snowflake accepts INTERVAL as a column name; this fix only changes
behaviour when INTERVAL appears in a position where the literal form
is impossible.

Fixes 1 corpus test failure (Snowflake) and unblocks downstream
parsing of larger queries that hit the pattern.
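A small sketch of the extended guard (the keyword list mirrors the commit text, including the pre-existing LIKE / IS; the function name is illustrative):

```rust
// Keywords that can never plausibly start an interval literal's value.
const INTERVAL_VALUE_BLOCKLIST: &[&str] = &[
    "LIKE", "IS", // pre-existing rejects
    "BETWEEN", "AND", "OR", "XOR", "IN", "NOT", "ILIKE", // binary operators
    "ORDER", "GROUP", "HAVING", "WHERE", "LIMIT", "OFFSET", // clause starters
    "QUALIFY", "WINDOW", "UNION", "INTERSECT", "EXCEPT",
    "ROWS", "RANGE", "GROUPS", "ASC", "DESC", // window-frame & sort tokens
    "ON", "USING", // join conditions
];

/// true → treat INTERVAL as an identifier; false → probe parse_interval().
fn next_word_blocks_interval_literal(next_word: &str) -> bool {
    INTERVAL_VALUE_BLOCKLIST
        .iter()
        .any(|kw| kw.eq_ignore_ascii_case(next_word))
}

fn main() {
    assert!(next_word_blocks_interval_literal("between")); // INTERVAL BETWEEN 1 AND 2
    assert!(!next_word_blocks_interval_literal("MINUTE")); // still a literal candidate
}
```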
@lustefaniak changed the title from "fix(snowflake): parse IDENTIFIER('<name>') literal" to "snowflake: identifier literal + INTERVAL-as-column carve-outs" on May 5, 2026
ANSI SQL, BigQuery, Postgres, and Snowflake all concatenate adjacent
string literals separated by whitespace into a single literal:

  SELECT 'foo' 'bar'              -- 'foobar'
  SELECT * FROM t WHERE x IN ('a', 'b' 'c', 'd')   -- 'bc' is one item
  SELECT TRIM('xyz' 'a')          -- TRIM('xyza')

Real customer SQL relies on this — typically as a forgotten comma in an
IN list — and the queries still execute correctly because the warehouse
implements concatenation. Previously the parser rejected the second and
third forms above with "Expected ), found: '<next-string>'".

Implementation: after consuming a `Token::SingleQuotedString` in
`parse_value`, peek-and-consume any immediately-following single-quoted
string tokens, appending their content. The output is a single
`Value::SingleQuotedString` with the concatenated value.

References:
- BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#string_and_bytes_literals
  > "Adjacent string and bytes literals are concatenated."
- Snowflake: https://docs.snowflake.com/en/sql-reference/data-types-text#string-constants
- Postgres: https://www.postgresql.org/docs/current/sql-syntax-lexical.html
- ANSI SQL:2008 §5.3 <character string literal>

Updates `test_snowflake_trim` which previously asserted
`TRIM('xyz' 'a')` errored — that was enforcing pre-ANSI behaviour.

Fixes 128 corpus test failures.
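A minimal sketch of the peek-and-consume loop (simplified token type; the real change lives in `parse_value` on the crate's `Token` enum):

```rust
#[derive(Debug)]
enum Tok {
    Str(String), // single-quoted string literal
    Comma,
}

/// Consume one string literal plus any immediately-following string
/// literals, returning the ANSI-concatenated value.
fn parse_string_value(toks: &[Tok], i: &mut usize) -> Option<String> {
    let Tok::Str(first) = toks.get(*i)? else { return None };
    *i += 1;
    let mut out = first.clone();
    // Adjacent-literal concatenation: 'foo' 'bar' → 'foobar'
    while let Some(Tok::Str(next)) = toks.get(*i) {
        out.push_str(next);
        *i += 1;
    }
    Some(out)
}

fn main() {
    // IN ('a', 'b' 'c', 'd'): starting at 'b', the item parses as 'bc'.
    let toks = vec![Tok::Str("b".into()), Tok::Str("c".into()), Tok::Comma];
    let mut i = 0;
    assert_eq!(parse_string_value(&toks, &mut i).as_deref(), Some("bc"));
    assert_eq!(i, 2); // stopped at the comma
}
```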
@lustefaniak changed the title from "snowflake: identifier literal + INTERVAL-as-column carve-outs" to "snowflake: identifier literal, INTERVAL carve-outs, adjacent-string concat" on May 5, 2026
DuckDB's `USING SAMPLE` attaches a row-sampling spec to a table:

  SELECT * FROM tbl USING SAMPLE 10%
  SELECT * FROM tbl USING SAMPLE 10 ROWS
  SELECT * FROM tbl USING SAMPLE SYSTEM (10 PERCENT) REPEATABLE (377)
  SELECT * FROM tbl USING SAMPLE RESERVOIR (50 ROWS) REPEATABLE (100)
  SELECT * FROM tbl USING SAMPLE BERNOULLI (5 PERCENT)

Reference: https://duckdb.org/docs/sql/samples

Previously the parser saw `USING` and treated it as the start of a
JOIN's `USING (col, ...)` constraint — which then failed because
`SAMPLE` isn't an opening paren.

The clause carries no lineage content (no table/column refs inside),
so consume it opaquely: optional method keyword, then the sample size
(bare number+unit or parenthesised group), then optional REPEATABLE
seed. Only triggers when USING is followed by `SAMPLE` (case-insensitive
ident match), so JOIN's `USING (cols)` is unaffected.

Gated on `DuckDbDialect | GenericDialect`.

Fixes 10 corpus test failures (sqlglot DuckDB fixtures).
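The trigger condition is the load-bearing part; a tiny sketch of it (hypothetical helper name, simplified lookahead):

```rust
/// USING starts a sample clause only when the next word is SAMPLE;
/// otherwise it is a JOIN's USING (col, ...) constraint.
fn using_starts_sample_clause(next_word: Option<&str>) -> bool {
    matches!(next_word, Some(w) if w.eq_ignore_ascii_case("SAMPLE"))
}

fn main() {
    assert!(using_starts_sample_clause(Some("sample"))); // tbl USING SAMPLE 10%
    assert!(!using_starts_sample_clause(Some("(")));     // JOIN t2 USING (id)
}
```

Once the trigger fires, the rest of the clause (method keyword, size, REPEATABLE seed) is consumed opaquely as described above.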
MySQL's REPLACE statement is INSERT-with-replace semantics: delete an
existing row on primary-key conflict and insert the new row. Same shape
as INSERT INTO, just a different leading verb.

  REPLACE INTO mytable SELECT id FROM other WHERE cnt > 100
  REPLACE INTO t (a, b) VALUES (1, 2)

Reference: https://dev.mysql.com/doc/refman/8.4/en/replace.html

Dispatch the top-level REPLACE keyword to `parse_insert` when followed
by INTO. The replace-vs-insert distinction is lost in the AST (both
become Statement::Insert), which is acceptable for grammar coverage —
table/column refs are preserved for downstream lineage.

Gated on `MySqlDialect | GenericDialect`; ClickHouse's `REPLACE TABLE`
shorthand for `CREATE OR REPLACE TABLE` is unchanged.
MySQL's JSON_TABLE and Oracle's XMLTABLE attach a `COLUMNS(<col_defs>)`
clause to a path-string argument, defining the output row shape:

  JSON_TABLE(json, '$.path' COLUMNS(id INT PATH '$.id'))
  JSON_TABLE(j, '$[*]' COLUMNS(row_id FOR ORDINALITY,
                                link VARCHAR(255) PATH '$.link'))

Previously the function-arg parser saw the COLUMNS keyword after the
path string and bailed with "Expected ), found: COLUMNS".

In `parse_function_args`, after parsing an expression-style argument,
peek for a `COLUMNS (` shape and consume the balanced paren block
opaquely. The col_defs are output column shapes (types + JSON paths);
they don't carry input table/column refs, so opaque consumption
preserves all lineage information already in the argument expression.

References:
- MySQL: https://dev.mysql.com/doc/refman/8.4/en/json-table-functions.html
- Oracle XMLTABLE: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/XMLTABLE.html

Fixes 16 corpus test failures (Oracle, sqlglot Oracle / MySQL, Postgres).
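The opaque consumption is a plain balanced-paren skip; a standalone sketch (simplified tokens, illustrative names):

```rust
#[derive(Debug, PartialEq)]
enum Tok { LParen, RParen, Other }

/// Skip a balanced (...) block starting at `i`, discarding its contents.
/// Returns false if `i` isn't at '(' or the parens never balance.
fn skip_balanced_parens(toks: &[Tok], i: &mut usize) -> bool {
    if toks.get(*i) != Some(&Tok::LParen) {
        return false;
    }
    let mut depth = 0usize;
    while let Some(t) = toks.get(*i) {
        *i += 1;
        match t {
            Tok::LParen => depth += 1,
            Tok::RParen => {
                depth -= 1;
                if depth == 0 {
                    return true; // positioned just past the matching ')'
                }
            }
            Tok::Other => {} // col defs, types, PATH strings: all discarded
        }
    }
    false // ran out of tokens with parens unbalanced
}

fn main() {
    use Tok::*;
    let toks = vec![LParen, Other, LParen, Other, RParen, RParen, Other];
    let mut i = 0;
    assert!(skip_balanced_parens(&toks, &mut i));
    assert_eq!(i, 6);
}
```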
ClickHouse extends the JOIN grammar with two orthogonal modifiers:

- `[INNER|LEFT|RIGHT] [ANY|ASOF|ALL] JOIN` selects the row-matching
  semantics. ANY picks one matching row, ASOF does temporal-nearest
  matching, ALL is the default cartesian behaviour.
- `GLOBAL` prefixes any JOIN to mark it as a distributed-query join
  (sub-select runs on the initiator and the result ships to shards).

Reference: https://clickhouse.com/docs/sql-reference/statements/select/join

The modifier doesn't change the lineage shape — same table refs and
join condition — so parse the keyword and discard rather than
extending the AST. Both INNER and LEFT/RIGHT paths get the modifier
slot; GLOBAL is consumed once at the top of the JOIN-loop iteration
before any LEFT/RIGHT/INNER dispatch.

Gated on `ClickHouseDialect | GenericDialect`.

Fixes 5 corpus test failures (sqlglot ClickHouse fixtures).
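A sketch of the parse-and-discard modifier slot (word-level stand-in for the real keyword dispatch):

```rust
/// Consume an optional ANY / ASOF / ALL strictness marker and report which
/// one was seen; the caller discards it, since lineage is unchanged.
fn consume_join_strictness(words: &[&str], i: &mut usize) -> Option<&'static str> {
    const MODS: [&str; 3] = ["ANY", "ASOF", "ALL"];
    let w = words.get(*i)?;
    let hit = *MODS.iter().find(|m| m.eq_ignore_ascii_case(w))?;
    *i += 1; // parsed and discarded
    Some(hit)
}

fn main() {
    let words = ["LEFT", "ANY", "JOIN", "t2"];
    let mut i = 1; // positioned after LEFT
    assert_eq!(consume_join_strictness(&words, &mut i), Some("ANY"));
    assert_eq!(words[i], "JOIN"); // normal JOIN parsing resumes here
    // GLOBAL is handled the same way, once per JOIN-loop iteration.
}
```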
@lustefaniak changed the title from "snowflake: identifier literal, INTERVAL carve-outs, adjacent-string concat" to "Parser fixes from corpus loop: 7 commits across Snowflake/DuckDB/MySQL/ClickHouse/ANSI" on May 5, 2026
T-SQL's [system-versioned temporal table syntax] uses two extensions
the parser was rejecting:

  CREATE TABLE t (
    <cols>,
    valid_from DATETIME2 GENERATED ALWAYS AS ROW START [HIDDEN] NOT NULL,
    valid_to   DATETIME2 GENERATED ALWAYS AS ROW END   [HIDDEN] NOT NULL,
    PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
  )

1. **`GENERATED ALWAYS AS ROW {START|END} [HIDDEN]`** column option.
   The existing `parse_optional_column_option_generated` only knew the
   IDENTITY and `AS (expr) [STORED]` forms. After consuming `AS`, peek
   for `ROW` or `TRANSACTION_ID` (case-insensitive) and consume the
   optional `START`/`END`/`HIDDEN` tokens, surfacing the marker as a
   `DialectSpecific` column option so the column ref + type stay in
   the AST for lineage.

2. **Table-level `PERIOD FOR SYSTEM_TIME (start_col, end_col)`** clause.
   Both columns are already in the table's column list — the clause
   pairs them but adds no new lineage. Consume the tokens at the top
   of the column-list loop and discard.

The `WITH(SYSTEM_VERSIONING=ON [(HISTORY_TABLE=…, DATA_CONSISTENCY_CHECK=…)])`
table-option suffix was already handled by the existing WITH-options
parser; nothing extra needed there.

Both extensions gated on `MsSqlDialect | GenericDialect`.

[system-versioned temporal table syntax]: https://learn.microsoft.com/en-us/sql/relational-databases/tables/creating-a-system-versioned-temporal-table

Fixes 6 corpus test failures (sqlglot T-SQL fixtures).
ClickHouse routes DDL/DML to all shards via an `ON CLUSTER <name>`
clause:

  DELETE FROM tbl ON CLUSTER test_cluster WHERE date = '2019-01-01'
  DELETE FROM tbl ON CLUSTER '{cluster}' WHERE date = '2019-01-01'

Reference: https://clickhouse.com/docs/sql-reference/distributed-ddl

After parsing the FROM table list in `parse_delete`, peek-and-consume
the optional ON CLUSTER clause before WHERE / USING / RETURNING. The
cluster name doesn't add lineage info; reuse the existing
`parse_optional_on_cluster` helper and discard the result.

Gated on `ClickHouseDialect | GenericDialect`.

Fixes 2 corpus test failures (sqlglot ClickHouse fixtures).
Postgres / ANSI / Teradata `CREATE TABLE name AS <query>` accepts a
trailing clause that controls whether the new table is populated with
the query's results and whether statistics are collected:

  CREATE TABLE t AS SELECT … WITH DATA
  CREATE TABLE t AS SELECT … WITH NO DATA
  CREATE TABLE t AS SELECT … WITH DATA AND STATISTICS
  CREATE TABLE t AS SELECT … WITH NO DATA AND NO STATISTICS

References:
- Postgres: https://www.postgresql.org/docs/current/sql-createtableas.html
- Teradata: https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Data-Definition-Language-Syntax-and-Examples/Table-Statements/CREATE-TABLE-AS/AS-Subquery-Clause

After parsing the AS-query body in `parse_create_table_inner`, peek for
`WITH` and consume the optional `[NO] DATA [AND [NO] STATISTICS]` tail.
Neither lineage nor table-shape info lives in the clause; consume and
discard. If the WITH wasn't followed by `[NO] DATA` (e.g. T-SQL's
`WITH (option=…)` table-options), restore the index so the existing
parser path handles it.

Fixes 5 corpus test failures (sqlglot ANSI + Trino).
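A sketch of the peek-with-restore shape (`eat` is a hypothetical keyword helper over a simplified word stream):

```rust
fn eat(words: &[&str], i: &mut usize, kw: &str) -> bool {
    if words.get(*i).is_some_and(|w| w.eq_ignore_ascii_case(kw)) {
        *i += 1;
        true
    } else {
        false
    }
}

/// Consume `WITH [NO] DATA [AND [NO] STATISTICS]`; rewind and return false
/// if WITH turns out to start something else (e.g. T-SQL WITH (option=…)).
fn try_consume_with_data_tail(words: &[&str], i: &mut usize) -> bool {
    let start = *i; // checkpoint
    if !eat(words, i, "WITH") {
        return false;
    }
    let _ = eat(words, i, "NO"); // optional
    if !eat(words, i, "DATA") {
        *i = start; // not our clause: restore the index
        return false;
    }
    if eat(words, i, "AND") {
        let _ = eat(words, i, "NO"); // optional
        let _ = eat(words, i, "STATISTICS");
    }
    true // clause consumed and discarded: no lineage content
}

fn main() {
    let mut i = 0;
    assert!(try_consume_with_data_tail(&["WITH", "NO", "DATA"], &mut i));
    let mut j = 0;
    assert!(!try_consume_with_data_tail(&["WITH", "(", "opt"], &mut j));
    assert_eq!(j, 0); // restored for the WITH-options parser
}
```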
BigQuery [path expressions] allow the last segment to start with a
digit:

  SELECT * FROM foo.bar.25ab c
  SELECT * FROM foo.bar.25
  SELECT * FROM foo.bar.25_

The tokenizer greedily folds a leading `.` into the next number, so
`bar.25ab` tokenises as `Word("bar")` then `Number(".25")` then
`Word("ab")`. Previously `parse_object_name` saw the Number token
where it expected a Period, broke out of the path loop, and the
parser then errored on the dangling `.25`.

In `parse_object_name`, after each ident, peek for a leading-dot
Number. If found, peel the `.` off, treat the remaining digits as the
next segment's prefix, and concatenate any adjacent Word for segments
like `25ab`. Index-only advance — never mutate `self.tokens` (which
would persist across speculative `maybe_parse` calls).

Numeric literals (`SELECT 1.5`, `WHERE x = 0.5`) and SELECT-projection
JSON paths (`field.5k_clients_target`, handled by
`parse_snowflake_json_path`) are unaffected because both go through
different parser paths.

Gated on `BigQueryDialect | GenericDialect`.

[path expressions]: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#path_expressions

Fixes 4 corpus test failures (sqlglot BigQuery).
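A sketch of the peel-back on a simplified token pair (names illustrative; the real code advances the parser index only, leaving the token buffer intact):

```rust
#[derive(Debug)]
enum Tok {
    Word(String),
    Number(String),
}

/// If the token at `i` is a leading-dot number like ".25", peel the dot
/// off and return the digits (plus any adjacent word, for `25ab`) as the
/// next path segment.
fn peel_leading_dot_number(toks: &[Tok], i: &mut usize) -> Option<String> {
    let Tok::Number(n) = toks.get(*i)? else { return None };
    let digits = n.strip_prefix('.')?.to_string(); // ".25" → "25"
    *i += 1;
    if let Some(Tok::Word(w)) = toks.get(*i) {
        *i += 1;
        return Some(format!("{digits}{w}")); // `25ab`
    }
    Some(digits)
}

fn main() {
    // `foo.bar.25ab` tokenises as Word("bar"), Number(".25"), Word("ab").
    let toks = vec![Tok::Number(".25".into()), Tok::Word("ab".into())];
    let mut i = 0;
    assert_eq!(peel_leading_dot_number(&toks, &mut i).as_deref(), Some("25ab"));
    assert_eq!(i, 2);
}
```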
@lustefaniak changed the title from "Parser fixes from corpus loop: 7 commits across Snowflake/DuckDB/MySQL/ClickHouse/ANSI" to "Parser fixes from corpus loop: 11 commits across 7 dialects" on May 5, 2026
lustefaniak added 14 commits May 6, 2026 01:57
…S, INLINE LENGTH)

Teradata's column-attribute grammar adds four post-type modifiers that
the parser currently rejects:

  CREATE TABLE foo (
    valid_date DATE FORMAT 'YYYY-MM-DD',
    name       VARCHAR(50) TITLE 'Customer Name',
    code       INT COMPRESS,
    body       VARCHAR(255) COMPRESS ('a', 'b'),
    notes      VARCHAR(80) INLINE LENGTH 64
  )

Reference: https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Data-Definition-Language-Detailed-Topics/CREATE-TABLE/Column-Level-Attributes-for-Database-Object-Creation

The corpus runs sqlglot_teradata fixtures through `GenericDialect`
(there is no dedicated TeradataDialect), so gate on `GenericDialect |
AnsiDialect`. Surface each as a `ColumnOption::DialectSpecific` with
the keyword name; lineage info is preserved by the column's name and
type, the modifiers carry no input refs.

`FORMAT` is a real keyword; `TITLE`, `COMPRESS`, `INLINE`, and `LENGTH`
aren't (per the project rule "Match non-keyword words case-insensitively,
don't add to keywords.rs"), so detect them via case-insensitive Word
match. `COMPRESS (...)` consumes its optional value list with a
balanced-paren skip — the values are constants, no lineage content.

Fixes 5 corpus test failures (sqlglot Teradata + ANSI fixtures).
Snowflake's [zero-copy clone] for schemas:

  CREATE SCHEMA mytestschema_clone CLONE testschema
  CREATE SCHEMA restored_schema    CLONE my_schema AT (OFFSET => -3600)
  CREATE SCHEMA s_restore          CLONE testschema BEFORE (TIMESTAMP => …)

In `parse_create_schema`, after the schema name, peek for `CLONE` and
consume `<source>` and an optional `AT|BEFORE (…)` time-travel suffix.
The current `Statement::CreateSchema` AST has no `clone` slot, so the
clause is consumed and discarded for parser-coverage; revisit when
schema-level provenance lineage is needed and add a field then.

Gated on `SnowflakeDialect | GenericDialect`.

[zero-copy clone]: https://docs.snowflake.com/en/sql-reference/sql/create-clone

Fixes 7 corpus test failures (snowflake first-party + sqlglot snowflake).
ANSI SQL and BigQuery extend UNION/INTERSECT/EXCEPT with two
suffixes the parser currently rejects:

  -- match legs by column name instead of position
  SELECT 1 AS x UNION ALL CORRESPONDING SELECT 2 AS x
  SELECT 1 AS x UNION ALL CORRESPONDING BY (foo, bar) SELECT 2 AS x

  -- type-strict union (no implicit coercion)
  SELECT 1 UNION ALL STRICT SELECT 2

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators

In `parse_set_quantifier`, after the existing ALL/DISTINCT/BY NAME
parsing, peek for `CORRESPONDING [BY (col, …)]` and consume the
balanced paren block opaquely — the column list contains plain names
that already appear in the SELECT legs, so opaque consumption
preserves all lineage info. Same treatment for `STRICT`.

These suffixes don't change which tables/columns the union references,
so adding new `SetQuantifier` variants isn't necessary for grammar
coverage.

Fixes 7 corpus test failures (sqlglot BigQuery + Trino).
Snowflake's [DATE_PART] supports two argument forms — the standard
function-call shape `DATE_PART(<part>, <expr>)` (already worked) and
the ANSI EXTRACT-style `DATE_PART(<part> FROM <expr>)`. The previous
parser path treated DATE_PART as a generic function call and rejected
`FROM` between the args.

Add a special-case at the top of `parse_prefix` for non-keyword Word
"DATE_PART" (case-insensitive) followed by `(`, parsing the part, then
either a comma or `FROM` separator, then the expression. Result is
`Expr::Function` so downstream consumers (lineage visitors) see the
same shape as any other function call — same args slot, same column
refs preserved.

Gated on `SnowflakeDialect | GenericDialect`.

[DATE_PART]: https://docs.snowflake.com/en/sql-reference/functions/date_part

Fixes 3 corpus test failures (sqlglot Snowflake).
Snowflake's [CREATE EXTERNAL TABLE] places a `PARTITION BY (col, col, …)`
clause between the column-def list and the option block:

  CREATE EXTERNAL TABLE et (col1 DATE AS (...), col2 VARCHAR AS (...))
    PARTITION BY (col1, col2)
    LOCATION=@stage/path/
    FILE_FORMAT=(type=parquet)

Previously the parser entered the option-swallowing loop, which expected
the first option to be `name=value` (`LOCATION=`, etc.). `PARTITION BY (...)`
didn't match that shape, so parsing fell through to the Hive-style
external-table path and errored.

Add a PARTITION-BY-list consumer immediately after the column list and
before the option block. The partition column names are already in the
column-def list, so opaque consumption preserves all lineage info.

Gated on `SnowflakeDialect | GenericDialect`.

[CREATE EXTERNAL TABLE]: https://docs.snowflake.com/en/sql-reference/sql/create-external-table

Fixes 3 corpus test failures (snowflake first-party + sqlglot snowflake).
Snowflake's [CREATE SEQUENCE] accepts a `COMMENT = '<string>'`
option alongside START / INCREMENT / ORDER:

  CREATE SEQUENCE seq START=5 COMMENT='foo' INCREMENT=10

The existing sequence-options loop didn't recognise COMMENT and
broke out at the `comment` keyword, leaving trailing tokens that
the outer parser then errored on.

Add a COMMENT arm to `parse_create_sequence_options` that consumes
the optional `=` and a literal string. The comment carries no
lineage content; discard the value.

[CREATE SEQUENCE]: https://docs.snowflake.com/en/sql-reference/sql/create-sequence

Fixes 2 corpus test failures (sqlglot Snowflake).
Redshift's CREATE TABLE permits DISTSTYLE, DISTKEY, and SORTKEY (with
optional COMPOUND prefix) to appear in any order after the column
definitions:

  CREATE TABLE sales (...) DISTKEY(listid) COMPOUND SORTKEY(...) DISTSTYLE AUTO
  CREATE TABLE t (...) SORTKEY(a) DISTKEY(a)

Reference: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

The previous parser parsed them in a fixed sequence (DISTSTYLE → DISTKEY
→ SORTKEY), so any other ordering errored on the first out-of-order
clause. Wrap the three lookups in a loop that consumes whichever
keyword appears next; each option is still admitted at most once.

Per the loop guidance, this is a clause-permutation fix and doesn't
require new grammar — the individual clauses are unchanged.

Fixes 3 corpus test failures (sqlglot_redshift + unparsed_redshift).
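A sketch of the any-order loop (word-level stand-in; the real arms parse full clause bodies, and COMPOUND is folded into the SORTKEY arm):

```rust
/// Consume DISTSTYLE / DISTKEY / SORTKEY in whatever order they appear,
/// each at most once; stop at the first word that is none of the three.
fn consume_table_attrs(words: &[&str], i: &mut usize) -> Result<(), String> {
    let (mut style, mut key, mut sort) = (false, false, false);
    loop {
        let Some(w) = words.get(*i) else { return Ok(()) };
        let seen = match () {
            _ if w.eq_ignore_ascii_case("DISTSTYLE") => &mut style,
            _ if w.eq_ignore_ascii_case("DISTKEY") => &mut key,
            _ if w.eq_ignore_ascii_case("SORTKEY") => &mut sort,
            _ => return Ok(()), // not ours: hand back to the outer parser
        };
        if *seen {
            return Err(format!("duplicate {w}"));
        }
        *seen = true;
        *i += 1; // the real parser consumes the clause body here
    }
}

fn main() {
    let mut i = 0;
    assert!(consume_table_attrs(&["SORTKEY", "DISTKEY", ";"], &mut i).is_ok());
    assert_eq!(i, 2); // both clauses consumed, any order
}
```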
Athena's DDL (CREATE EXTERNAL TABLE with ROW FORMAT SERDE,
SERDEPROPERTIES, STORED AS, etc.) is Hive-style, not Trino-style.
Switch the alias from `trino` → `hive` so the corpus runner uses
HiveDialect for `sqlglot_athena/` and any future `customer_athena/`
fixtures. Athena's DML/queries are Trino-style, but the failing
fixtures in the corpus are exclusively DDL where the mapping matters.

Fixes 3 corpus test failures (sqlglot Athena).
Hive's `ROW FORMAT` accepts two extensions the parser currently rejects:

- `ROW FORMAT SERDE 'class' WITH SERDEPROPERTIES ('k'='v', …)` —
  serde configuration after the class name. SERDEPROPERTIES isn't in
  the keyword table; matched case-insensitively.
- `ROW FORMAT DELIMITED [FIELDS TERMINATED BY 'x'] [LINES TERMINATED
  BY 'y'] [NULL DEFINED AS 'z']` — DELIMITED suboptions describing
  ASCII-text storage.

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe

`parse_row_format` now consumes:
1. After SERDE 'class', an optional `WITH SERDEPROPERTIES (...)` block
   (balanced-paren skip — the k/v strings carry no lineage info).
2. After DELIMITED, any sub-clauses up to the next ROW / STORED /
   LOCATION / WITH / COMMENT / TBLPROPERTIES / PARTITIONED / CLUSTERED
   / AS keyword (also EOF / `;`).

If `WITH` isn't followed by SERDEPROPERTIES, restore the index so
later table-options parsers (CTEs, WITH(option=…), etc.) can take it.

Used through `parse_hive_formats`, which is called for any dialect's
CREATE TABLE / CREATE EXTERNAL TABLE that allows Hive-style storage
options (Hive, Databricks, Athena via Hive routing).

Fixes 5 corpus test failures (sqlglot Athena, Databricks, sqlglot Hive).
Athena Iceberg tables and Trino use a different shape for PARTITIONED BY
than classic Hive:

  -- Hive: column-def list (each segment has a type)
  CREATE TABLE t (a INT) PARTITIONED BY (year INT)

  -- Iceberg: expression list (column refs + transform functions)
  CREATE TABLE t (id BIGINT, category STRING)
    PARTITIONED BY (category, BUCKET(16, id), TRUNCATE(8, id), DAY(ts))

Reference: https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html

`parse_hive_distribution` now distinguishes the two by peeking past the
first identifier inside `(...)`: if a known data-type keyword (INT,
STRING, BIGINT, …) follows, it's the column-def form; otherwise we
parse the contents as a comma-separated expression list. The expression
form's lineage info (column refs inside transforms like `BUCKET(16, id)`)
is preserved by the standard expression parser.

Used through `parse_hive_formats`, which is reached for any dialect's
CREATE TABLE that allows Hive-style storage options (Hive,
Databricks, Athena via Hive routing).

Fixes 2 corpus test failures (sqlglot Athena Iceberg).
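A sketch of the disambiguating peek (the type list here is a small illustrative subset):

```rust
const TYPE_KEYWORDS: &[&str] = &["INT", "BIGINT", "STRING", "DATE", "DOUBLE"];

#[derive(Debug, PartialEq)]
enum PartitionedByForm {
    ColumnDefs,  // Hive: PARTITIONED BY (year INT)
    Expressions, // Iceberg: PARTITIONED BY (category, BUCKET(16, id), ...)
}

/// Classify by the token that follows the first identifier inside (...).
fn classify(after_first_ident: &str) -> PartitionedByForm {
    if TYPE_KEYWORDS.iter().any(|t| t.eq_ignore_ascii_case(after_first_ident)) {
        PartitionedByForm::ColumnDefs
    } else {
        PartitionedByForm::Expressions
    }
}

fn main() {
    assert_eq!(classify("INT"), PartitionedByForm::ColumnDefs); // (year INT)
    assert_eq!(classify(","), PartitionedByForm::Expressions);  // (category, ...)
    assert_eq!(classify("("), PartitionedByForm::Expressions);  // (BUCKET(16, id))
}
```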
Hive's CREATE [EXTERNAL] TABLE grammar allows two optional table-level
clauses the parser was rejecting:

  CREATE EXTERNAL TABLE foo (id INT) COMMENT 'description'
  CREATE EXTERNAL TABLE foo (id INT, val STRING) CLUSTERED BY (id, val) INTO 10 BUCKETS
  CREATE EXTERNAL TABLE foo (id INT) COMMENT 'c'
    PARTITIONED BY (a INT) CLUSTERED BY (id) SORTED BY (id ASC) INTO 5 BUCKETS

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Add `parse_optional_hive_comment_and_clustered_by`, called both before
and after PARTITIONED BY in the external-table branch (the two clauses
can appear in either order around it). Both are consumed and discarded:
COMMENT carries no lineage info, and CLUSTERED BY references columns
already in the table's column-def list. SORTED BY (cols) is consumed
opaquely, INTO <n> [BUCKETS] is also opaque.

CLUSTERED, SORTED, and BUCKETS aren't keywords in our table; matched
case-insensitively per the project rule.

Fixes 2 corpus test failures (sqlglot Athena).
Extends the previous CORRESPONDING / STRICT support to also accept:

- The two suffixes in any order: `UNION ALL STRICT CORRESPONDING`
  alongside the previously-handled `UNION ALL CORRESPONDING STRICT`.
- `BY NAME ON (col, …)` — BigQuery's column-restricted by-name match:

    SELECT 1 AS x UNION ALL BY NAME ON (foo, bar) SELECT 2 AS x
    SELECT 1 AS x INNER UNION ALL BY NAME ON (foo, bar) SELECT 2 AS x

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators

`parse_set_quantifier` now loops through CORRESPONDING / STRICT / ON
suffixes, consuming each at most once and stopping as soon as it sees
something that isn't one of the three. All three are opaque to lineage
(column names inside the paren lists already appear in the SELECT
legs).

Fixes 4 corpus test failures (sqlglot BigQuery).
Redshift inherits the Oracle/Snowflake `expr (+)` legacy outer-join
syntax in the WHERE clause of comma-join queries:

  SELECT a.foo, b.bar FROM a, b WHERE a.baz = b.baz (+)
  SELECT * FROM a, b WHERE a.id (+) = b.id

The parser already handled this for Snowflake/Generic. Extend the
two `dialect_of!` gates (in `parse_subexpr` and the
`Token::LParen | Token::Period` arm of `parse_prefix`) to also include
RedshiftSqlDialect.

Fixes 1 corpus test failure (sqlglot Redshift).
Snowflake's [CREATE STAGE] accepts the option clauses (DIRECTORY,
FILE_FORMAT, COPY_OPTIONS, COMMENT, plus URL / CREDENTIALS /
STORAGE_INTEGRATION / ENDPOINT / ENCRYPTION which would normally be
read by `parse_stage_params`) in any order, and the FILE_FORMAT value
has three shapes:

  FILE_FORMAT = (TYPE='JSON' …)         -- inline parenthesised options
  FILE_FORMAT = '<format_name>'         -- string shorthand
  FILE_FORMAT = [<schema>.]<format>     -- dotted-ident shorthand

Reference: https://docs.snowflake.com/en/sql-reference/sql/create-stage

Three changes in `src/dialect/snowflake.rs`:

1. `parse_create_stage` wraps the option-clause section in a loop so
   any of DIRECTORY / FILE_FORMAT / COPY_OPTIONS / COMMENT plus
   URL/CREDENTIALS/STORAGE_INTEGRATION/ENDPOINT/ENCRYPTION can appear
   after FILE_FORMAT (previously they had to come first via
   `parse_stage_params`). Stage-params seen mid-stream are merged into
   the `stage_params` accumulator.

2. `parse_parentheses_options` accepts dotted ident values
   (`FORMAT_NAME=schema.format`) by consuming `.<word>` continuations.

3. FILE_FORMAT='string' and FILE_FORMAT=<ident> shorthand both surface
   as `DataLoadingOption { name: "FORMAT_NAME", … }` for AST symmetry
   with the parenthesised form's own `FORMAT_NAME` option.

Fixes 7 corpus test failures (snowflake first-party + sqlglot snowflake).
lustefaniak added 10 commits May 6, 2026 02:57
Snowflake (and Postgres) allow a dollar-quoted string body in the
column-level `COMMENT` clause:

  CREATE TABLE foo (ID INT COMMENT $$some comment$$)

Previously the parser only accepted single-quoted strings here and
errored with "Expected string, found: \$\$…\$\$". Add a
`Token::DollarQuotedString` arm to `parse_optional_column_option`'s
COMMENT branch, surfacing the inner content as `ColumnOption::Comment`
just like a single-quoted form.

Fixes 1 corpus test failure (sqlglot Snowflake).
BigQuery and MSSQL `FOR SYSTEM_TIME AS OF <expr>` time-travel reads
typically appear after the table's optional alias:

  FROM tbl t FOR SYSTEM_TIME AS OF CURRENT_TIMESTAMP() …
  FROM tbl AS t FOR SYSTEM_TIME AS OF CURRENT_TIMESTAMP() LEFT JOIN …

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#for_system_time_as_of

`parse_table_factor` previously called `parse_table_version` only
before the alias, so an aliased shape fell through to the
query-level FOR-UPDATE/FOR-SHARE locks loop and errored with
"Expected one of UPDATE or SHARE, found: SYSTEM_TIME". Call
`parse_table_version` a second time after the alias when no version
qualifier was set yet.

Fixes 5 corpus test failures (unparsed BigQuery).
…g comma

In `is_parse_comma_separated_end`, when a comma is followed by a
reserved-as-alias keyword (CLUSTER, SORT, FINAL, etc.), the parser
peeks the next-next token to decide whether the keyword is a clause
starter (end of list) or a column name reusing the keyword.

The previous fall-through `_ => true` returned "end of list" when
peek_nth(1) was anything not specifically whitelisted — including
`)`. So `EXCEPT(id_2, CLUSTER)` looked like a trailing comma plus
out-of-context CLUSTER and the loop stopped, leaving CLUSTER
unconsumed and the parser then erroring with "Expected ), found:
CLUSTER".

Add `Token::RParen => false` before the catch-all: a reserved keyword
inside a parenthesised list with `)` after it is unambiguously a
column name, not a trailing-comma terminator. Trailing-comma support
in projection lists (`SELECT a, b, FROM t`) is unaffected — those
have FROM/clause-starter after the keyword, hitting the dedicated
clause-only check.

Fixes 6 corpus test failures (customer & unparsed BigQuery).
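A sketch of the decision table with the new arm (simplified lookahead categories, illustrative names):

```rust
/// What follows the reserved-as-alias keyword that follows the comma.
#[derive(Debug)]
enum NextTok {
    RParen,      // e.g. EXCEPT(id_2, CLUSTER)
    ClauseStart, // e.g. SELECT a, b, FROM t
    Other,
}

/// true → the comma was a trailing comma and the list ends here.
fn is_comma_separated_end(after_keyword: NextTok) -> bool {
    match after_keyword {
        // New arm: a keyword right before ')' is unambiguously a column name.
        NextTok::RParen => false,
        // Clause starter after the keyword: genuine trailing comma.
        NextTok::ClauseStart => true,
        // Previous fall-through behaviour, kept as-is in this sketch.
        NextTok::Other => true,
    }
}

fn main() {
    assert!(!is_comma_separated_end(NextTok::RParen));     // CLUSTER is a column
    assert!(is_comma_separated_end(NextTok::ClauseStart)); // SELECT a, b, FROM t
}
```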
Snowflake's [CREATE TABLE] allows column-level foreign-key constraints
with an explicit `FOREIGN KEY` prefix:

  <col> <type> [NOT NULL] FOREIGN KEY REFERENCES <ref_table> [(<ref_col>)]

Previously the parser only knew the shorter ANSI/Postgres form
`<col> <type> REFERENCES <ref_table>(...)` and bailed at FOREIGN
inside a column definition.

In `parse_optional_column_option`, accept either `FOREIGN KEY
REFERENCES` or bare `REFERENCES` before the same shared
foreign-table / column-list / ON DELETE/UPDATE / characteristics
parsing.

[CREATE TABLE]: https://docs.snowflake.com/en/sql-reference/sql/create-table

Fixes 1 corpus test failure (sqlglot Snowflake).
BigQuery's CREATE FUNCTION accepts an optional `DETERMINISTIC` or
`NOT DETERMINISTIC` marker between RETURNS and LANGUAGE:

  CREATE TEMPORARY FUNCTION f(x FLOAT64) RETURNS FLOAT64 NOT DETERMINISTIC
    LANGUAGE js AS 'return Math.random() * x;'

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_a_function

Add a `parse_keywords([NOT, DETERMINISTIC])` / `parse_keyword(DETERMINISTIC)`
arm to `parse_create_function_body`'s loop. The marker doesn't change
lineage; consume and discard rather than extending `CreateFunctionBody`.

Fixes 1 corpus test failure (sqlglot BigQuery).
…thand

Redshift's CREATE TABLE accepts a two-positional-argument shorthand for
IDENTITY columns:

  CREATE TABLE t (c BIGINT GENERATED BY DEFAULT AS IDENTITY (0, 1))
  CREATE TABLE t (c BIGINT GENERATED ALWAYS AS IDENTITY (100, 5))

Reference: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

Previously the parser only handled the Postgres-style keyword form
`IDENTITY (START WITH n INCREMENT BY n …)` via
`parse_create_sequence_options`, which expected each option to begin
with a keyword and broke on the bare numeric pair.

Add `parse_identity_paren_options` that peeks the first token after `(`:
- bare Number / sign — parse `(seed, step)` and surface as
  `[StartWith(seed, false), IncrementBy(step, false)]` so the existing
  AST doesn't need new variants.
- otherwise — fall back to `parse_create_sequence_options` for the
  keyword form.

Both `GENERATED ALWAYS AS IDENTITY` and `GENERATED BY DEFAULT AS IDENTITY`
column-option arms now route through the helper.

Fixes 1 corpus test failure (sqlglot Redshift).
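A sketch of the first-token peek (illustrative enum; the real helper then delegates to the matching parse path):

```rust
#[derive(Debug, PartialEq)]
enum IdentityForm {
    SeedStep,       // IDENTITY (0, 1): positional shorthand
    KeywordOptions, // IDENTITY (START WITH 0 INCREMENT BY 1 ...)
}

fn classify_identity_args(first_token: &str) -> IdentityForm {
    match first_token.chars().next() {
        Some('+') | Some('-') => IdentityForm::SeedStep, // signed seed
        Some(c) if c.is_ascii_digit() => IdentityForm::SeedStep,
        _ => IdentityForm::KeywordOptions,
    }
}

fn main() {
    assert_eq!(classify_identity_args("0"), IdentityForm::SeedStep);
    assert_eq!(classify_identity_args("START"), IdentityForm::KeywordOptions);
}
```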
Snowflake's TABLE(<expr>) table-function form accepts the same suffix
keywords (TABLESAMPLE, PIVOT, UNPIVOT, MATCH_RECOGNIZE) as a regular
table reference:

  SELECT * FROM TABLE('t1') TABLESAMPLE BERNOULLI (20.3)

Previously the TABLE(...) branch in `parse_table_factor` returned the
TableFactor::TableFunction directly without running the suffix-keyword
loop, so the TABLESAMPLE token was left for the outer parser, which
errored.

Wrap the post-TABLE() return in the same suffix-keyword loop used for
plain table refs (PIVOT / UNPIVOT / TABLESAMPLE / SAMPLE /
MATCH_RECOGNIZE) so each can apply.

Fixes 1 corpus test failure (sqlglot Snowflake).
…IER(…)

Extend the previous Snowflake `IDENTIFIER('<name>')` literal support
to also accept session variables and bind parameters:

  CREATE TABLE IDENTIFIER($foo) (col1 VARCHAR, col2 VARCHAR)
  SELECT * FROM IDENTIFIER($tbl_name)
  SELECT * FROM IDENTIFIER(?)

Reference: https://docs.snowflake.com/en/sql-reference/identifier-literal

`Token::Placeholder` (which covers both `$foo` and `?` in our tokeniser)
is added to the inner-value match alongside the existing
SingleQuotedString/DoubleQuotedString arms. Placeholder values surface
as a plain (unquoted) Ident — they have no compile-time name and
execution would resolve them at run time anyway, so the synthetic
ident just keeps parsing going.

Fixes 1 corpus test failure (sqlglot Snowflake).
BigQuery's [legacy SQL] uses square brackets to quote project-qualified
table identifiers, with `:` separating the project from the
dataset/table:

  SELECT * FROM [my-proj-123:dataset.table]
  SELECT * FROM [proj:ds.tbl] AS t

Standard SQL replaces the brackets with backticks (`proj.ds.tbl`),
but customers still submit legacy-SQL queries through the wire — they
appear in `unparsed_bigquery` query logs.

In `parse_table_factor`, when we see `[` at the start of a table
reference (BigQuery / Generic), consume the balanced bracket block
(words / numbers / dots / colons / hyphens / `*` for wildcard tables)
and surface the inner string as a single backtick-quoted Ident.
Lineage tracking sees the table reference normally; the
`project:dataset.table` text is preserved verbatim in the ident value.

If the bracket block contains anything else (operators, parens, etc.),
restore the index so other callers — e.g. ARRAY[...] literals — can
take over.

[legacy SQL]: https://cloud.google.com/bigquery/docs/reference/legacy-sql

Fixes 5 corpus test failures (unparsed BigQuery, customer BigQuery).
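A character-level sketch of the bracket scan (the real code walks tokens, but the accept-or-rewind shape is the same; names illustrative):

```rust
/// Try to read `[...]` as a legacy table reference. Accepts only the
/// character shapes such a reference can contain; otherwise leaves the
/// index untouched so ARRAY[...] handling can take over.
fn try_legacy_table_ref(input: &str, i: &mut usize) -> Option<String> {
    let rest = &input[*i..];
    if !rest.starts_with('[') {
        return None;
    }
    let end = rest.find(']')?;
    let inner = &rest[1..end];
    let ok = inner.chars().all(|c| {
        c.is_ascii_alphanumeric() || matches!(c, '.' | ':' | '-' | '_' | '*')
    });
    if !ok {
        return None; // index untouched: not a legacy ref (e.g. ARRAY[1, 2])
    }
    *i += end + 1;
    Some(format!("`{inner}`")) // surfaced as one backtick-quoted ident
}

fn main() {
    let mut i = 0;
    let got = try_legacy_table_ref("[my-proj-123:dataset.table]", &mut i);
    assert_eq!(got.as_deref(), Some("`my-proj-123:dataset.table`"));
}
```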
BigQuery accepts both `'…'` and `"…"` as string-literal forms. The
parser's `AT TIME ZONE` arm only recognised the single-quoted form,
so a query like

  EXTRACT(HOUR FROM ts AT TIME ZONE "Asia/Tokyo")

errored at the time-zone argument with "Expected
Token::SingleQuotedString after AT TIME ZONE".

Add a `Token::DoubleQuotedString` arm gated on
`BigQueryDialect | GenericDialect`. The single-quoted path is
unchanged; non-BigQuery dialects (Postgres, etc.) still require the
single-quoted form per ANSI.

Fixes 6 corpus test failures (4 unparsed_bigquery + 2 customer_bigquery).
@lustefaniak changed the title from "Parser fixes from corpus loop: 11 commits across 7 dialects" to "Parser fixes from corpus loop: Snowflake / BigQuery / Redshift / Athena and friends" on May 6, 2026
@lustefaniak changed the title from "Parser fixes from corpus loop: Snowflake / BigQuery / Redshift / Athena and friends" to "Parser fixes from corpus loop: Snowflake / BigQuery / Redshift / Athena / ClickHouse" on May 6, 2026
- compare-corpus-reports.js only shows additions; check git status for prunes
- Pipeline reprocess takes ~10min; use Monitor with pgrep loop
- Anonymizer corruption signature is exactly 's'<word>; broader regex deletes
  hand-written sqlglot fixtures
- Query-log truncation heuristics that worked (trailing punct/keyword,
  CASE>END count)
- 'cmd &' with run_in_background returns completed immediately — verify with
  pgrep
@lustefaniak lustefaniak merged commit b93e33d into main May 6, 2026
5 checks passed
@lustefaniak lustefaniak deleted the lukasz-corpus-fixes-new-customer-sql branch May 6, 2026 09:13