Parser fixes from corpus loop: Snowflake / BigQuery / Redshift / Athena / ClickHouse#81
Merged
lustefaniak merged 36 commits into main on May 6, 2026
Conversation
Snowflake's IDENTIFIER literal lets a string stand in for any identifier
(CREATE TABLE IDENTIFIER('db.schema.t'), FROM IDENTIFIER('mytable'),
INSERT INTO IDENTIFIER('foo.bar')). At the start of parse_identifier,
detect the IDENTIFIER(<string>) shape and consume the whole construct,
returning the string content as a single quoted Ident. The dotted name
inside the string is preserved verbatim — Snowflake itself splits at
execution time, and downstream lineage consumers can do the same.
Reference: https://docs.snowflake.com/en/sql-reference/identifier-literal
Fixes 581 corpus test failures (Snowflake).
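The detection shape can be sketched as a standalone lookahead over a simplified token model (the `Tok` enum and function name here are hypothetical stand-ins, not the crate's real `Token` type or API):

```rust
// Hypothetical, simplified token model for illustration only.
enum Tok {
    Word(String),
    LParen,
    RParen,
    SingleQuoted(String),
}

/// If the stream at `i` matches IDENTIFIER('<string>'), return the quoted
/// content and the index just past the closing paren; otherwise None.
fn parse_identifier_literal(toks: &[Tok], i: usize) -> Option<(String, usize)> {
    match (toks.get(i), toks.get(i + 1), toks.get(i + 2), toks.get(i + 3)) {
        (
            Some(Tok::Word(w)),
            Some(Tok::LParen),
            Some(Tok::SingleQuoted(s)),
            Some(Tok::RParen),
        ) if w.eq_ignore_ascii_case("identifier") => Some((s.clone(), i + 4)),
        _ => None,
    }
}

fn main() {
    let toks = vec![
        Tok::Word("IDENTIFIER".into()),
        Tok::LParen,
        Tok::SingleQuoted("db.schema.t".into()),
        Tok::RParen,
    ];
    // The dotted name stays verbatim in one quoted Ident.
    assert_eq!(
        parse_identifier_literal(&toks, 0),
        Some(("db.schema.t".to_string(), 4))
    );
}
```

The sketch only shows the four-token lookahead for the single-quoted form; the commit also preserves the string content verbatim rather than splitting on dots.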
Corpus Parsing Report: 169400 passed, 2187 failed (98.7% pass rate). ✨ No changes in test results.
…words

`parse_interval_guard` previously rejected only `LIKE` / `IS` after INTERVAL, then fell through to a probing `parse_interval()` whose internal `parse_prefix` is permissive enough to treat any bare keyword as an identifier. So `INTERVAL BETWEEN 1 AND 2`, `PARTITION BY a, INTERVAL ORDER BY c`, `MAX(INTERVAL)`, etc. all misconsumed the following keyword as the literal's "value" and broke the surrounding clause with a downstream error like "Expected ), found: id_5".

Extend the guard's reject list to cover the keywords that can never plausibly start an interval literal value: binary operators (BETWEEN, AND, OR, XOR, IN, NOT, ILIKE), clause starters (ORDER, GROUP, HAVING, WHERE, LIMIT, OFFSET, QUALIFY, WINDOW, UNION, INTERSECT, EXCEPT), window-frame and sort tokens (ROWS, RANGE, GROUPS, ASC, DESC), and join conditions (ON, USING).

Snowflake accepts INTERVAL as a column name; this fix only changes behaviour when INTERVAL appears in a position where the literal form is impossible. Fixes 1 corpus test failure (Snowflake) and unblocks downstream parsing of larger queries that hit the pattern.
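A minimal sketch of the extended guard, using plain string comparison rather than the crate's real `Keyword` enum (names are illustrative):

```rust
/// Keywords that can never plausibly start an INTERVAL literal's value.
/// Seeing one means INTERVAL is being used as a plain identifier.
fn interval_guard_rejects(next_kw: &str) -> bool {
    const REJECT: &[&str] = &[
        "LIKE", "IS", // previously the only two rejections
        // binary operators
        "BETWEEN", "AND", "OR", "XOR", "IN", "NOT", "ILIKE",
        // clause starters
        "ORDER", "GROUP", "HAVING", "WHERE", "LIMIT", "OFFSET",
        "QUALIFY", "WINDOW", "UNION", "INTERSECT", "EXCEPT",
        // window-frame and sort tokens
        "ROWS", "RANGE", "GROUPS", "ASC", "DESC",
        // join conditions
        "ON", "USING",
    ];
    REJECT.iter().any(|k| k.eq_ignore_ascii_case(next_kw))
}

fn main() {
    assert!(interval_guard_rejects("BETWEEN")); // INTERVAL BETWEEN 1 AND 2
    assert!(interval_guard_rejects("order"));   // ... INTERVAL ORDER BY c
    assert!(!interval_guard_rejects("DAY"));    // INTERVAL '5' DAY still probes
}
```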
ANSI SQL, BigQuery, Postgres, and Snowflake all concatenate adjacent
string literals separated by whitespace into a single literal:
SELECT 'foo' 'bar' -- 'foobar'
SELECT * FROM t WHERE x IN ('a', 'b' 'c', 'd') -- 'bc' is one item
SELECT TRIM('xyz' 'a') -- TRIM('xyza')
Real customer SQL relies on this — typically as a forgotten comma in an
IN list — and the queries still execute correctly because the warehouse
implements concatenation. Previously the parser rejected the second and
third forms above with "Expected ), found: '<next-string>'".
Implementation: after consuming a `Token::SingleQuotedString` in
`parse_value`, peek-and-consume any immediately-following single-quoted
string tokens, appending their content. The output is a single
`Value::SingleQuotedString` with the concatenated value.
References:
- BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#string_and_bytes_literals
> "Adjacent string and bytes literals are concatenated."
- Snowflake: https://docs.snowflake.com/en/sql-reference/data-types-text#string-constants
- Postgres: https://www.postgresql.org/docs/current/sql-syntax-lexical.html
- ANSI SQL:2008 §5.3 <character string literal>
Updates `test_snowflake_trim` which previously asserted
`TRIM('xyz' 'a')` errored — that was enforcing pre-ANSI behaviour.
Fixes 128 corpus test failures.
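The fold-adjacent-strings step can be sketched as a standalone pass over a simplified token stream (the `Tok` type is a hypothetical stand-in for the crate's tokenizer output):

```rust
// Hypothetical token model for illustration.
#[derive(Debug)]
enum Tok {
    SingleQuoted(String),
    Comma,
    RParen,
}

/// After seeing one single-quoted string at `i`, fold any immediately
/// following single-quoted strings into it, returning the merged literal
/// and the index of the first non-string token.
fn concat_adjacent_strings(toks: &[Tok], i: usize) -> Option<(String, usize)> {
    let Tok::SingleQuoted(first) = toks.get(i)? else { return None };
    let mut out = first.clone();
    let mut j = i + 1;
    while let Some(Tok::SingleQuoted(s)) = toks.get(j) {
        out.push_str(s);
        j += 1;
    }
    Some((out, j))
}

fn main() {
    // IN ('a', 'b' 'c', 'd') -- the middle item tokenises as two strings.
    let toks = vec![
        Tok::SingleQuoted("b".into()),
        Tok::SingleQuoted("c".into()),
        Tok::Comma,
        Tok::SingleQuoted("d".into()),
    ];
    assert_eq!(concat_adjacent_strings(&toks, 0), Some(("bc".to_string(), 2)));
}
```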
DuckDB's `USING SAMPLE` attaches a row-sampling spec to a table:

SELECT * FROM tbl USING SAMPLE 10%
SELECT * FROM tbl USING SAMPLE 10 ROWS
SELECT * FROM tbl USING SAMPLE SYSTEM (10 PERCENT) REPEATABLE (377)
SELECT * FROM tbl USING SAMPLE RESERVOIR (50 ROWS) REPEATABLE (100)
SELECT * FROM tbl USING SAMPLE BERNOULLI (5 PERCENT)

Reference: https://duckdb.org/docs/sql/samples

Previously the parser saw `USING` and treated it as the start of a JOIN's `USING (col, ...)` constraint — which then failed because `SAMPLE` isn't an opening paren. The clause carries no lineage content (no table/column refs inside), so consume it opaquely: optional method keyword, then the sample size (bare number+unit or parenthesised group), then optional REPEATABLE seed. Only triggers when USING is followed by `SAMPLE` (case-insensitive ident match), so JOIN's `USING (cols)` is unaffected. Gated on `DuckDbDialect | GenericDialect`. Fixes 10 corpus test failures (sqlglot DuckDB fixtures).
MySQL's REPLACE statement is INSERT-with-replace semantics: delete an existing row on primary-key conflict and insert the new row. Same shape as INSERT INTO, just a different leading verb.

REPLACE INTO mytable SELECT id FROM other WHERE cnt > 100
REPLACE INTO t (a, b) VALUES (1, 2)

Reference: https://dev.mysql.com/doc/refman/8.4/en/replace.html

Dispatch the top-level REPLACE keyword to `parse_insert` when followed by INTO. The replace-vs-insert distinction is lost in the AST (both become Statement::Insert), which is acceptable for grammar coverage — table/column refs are preserved for downstream lineage. Gated on `MySqlDialect | GenericDialect`; ClickHouse's `REPLACE TABLE` shorthand for `CREATE OR REPLACE TABLE` is unchanged.
MySQL's JSON_TABLE and Oracle's XMLTABLE attach a `COLUMNS(<col_defs>)`
clause to a path-string argument, defining the output row shape:
JSON_TABLE(json, '$.path' COLUMNS(id INT PATH '$.id'))
JSON_TABLE(j, '$[*]' COLUMNS(row_id FOR ORDINALITY,
link VARCHAR(255) PATH '$.link'))
Previously the function-arg parser saw the COLUMNS keyword after the
path string and bailed with "Expected ), found: COLUMNS".
In `parse_function_args`, after parsing an expression-style argument,
peek for a `COLUMNS (` shape and consume the balanced paren block
opaquely. The col_defs are output column shapes (types + JSON paths);
they don't carry input table/column refs, so opaque consumption
preserves all lineage information already in the argument expression.
References:
- MySQL: https://dev.mysql.com/doc/refman/8.4/en/json-table-functions.html
- Oracle XMLTABLE: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/XMLTABLE.html
Fixes 16 corpus test failures (Oracle, sqlglot Oracle / MySQL, Postgres).
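The opaque consumption relies on a balanced-paren skip. A minimal standalone sketch, using characters as stand-ins for tokens (the real implementation walks the crate's token stream):

```rust
/// Skip a balanced parenthesis block whose opening '(' sits at `i`,
/// returning the index just past the matching ')'. Returns None if `i`
/// is not an opening paren or the input is unbalanced.
fn skip_balanced_parens(toks: &[char], i: usize) -> Option<usize> {
    if toks.get(i) != Some(&'(') {
        return None;
    }
    let mut depth = 0usize;
    for (j, t) in toks.iter().enumerate().skip(i) {
        match t {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    return Some(j + 1);
                }
            }
            _ => {}
        }
    }
    None // unbalanced input
}

fn main() {
    // COLUMNS(a(b),c) -- nested parens inside the block are handled.
    let toks: Vec<char> = "COLUMNS(a(b),c)x".chars().collect();
    assert_eq!(skip_balanced_parens(&toks, 7), Some(15));
    assert_eq!(toks[15], 'x'); // parsing resumes after the block
}
```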
ClickHouse extends the JOIN grammar with two orthogonal modifiers:

- `[INNER|LEFT|RIGHT] [ANY|ASOF|ALL] JOIN` selects the row-matching semantics. ANY picks one matching row, ASOF does temporal-nearest matching, ALL is the default cartesian behaviour.
- `GLOBAL` prefixes any JOIN to mark it as a distributed-query join (the sub-select runs on the initiator and the result ships to shards).

Reference: https://clickhouse.com/docs/sql-reference/statements/select/join

The modifier doesn't change the lineage shape — same table refs and join condition — so parse the keyword and discard rather than extending the AST. Both the INNER and LEFT/RIGHT paths get the modifier slot; GLOBAL is consumed once at the top of the JOIN-loop iteration before any LEFT/RIGHT/INNER dispatch. Gated on `ClickHouseDialect | GenericDialect`. Fixes 5 corpus test failures (sqlglot ClickHouse fixtures).
T-SQL's [system-versioned temporal table syntax] uses two extensions
the parser was rejecting:
CREATE TABLE t (
<cols>,
valid_from DATETIME2 GENERATED ALWAYS AS ROW START [HIDDEN] NOT NULL,
valid_to DATETIME2 GENERATED ALWAYS AS ROW END [HIDDEN] NOT NULL,
PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
)
1. **`GENERATED ALWAYS AS ROW {START|END} [HIDDEN]`** column option.
The existing `parse_optional_column_option_generated` only knew the
IDENTITY and `AS (expr) [STORED]` forms. After consuming `AS`, peek
for `ROW` or `TRANSACTION_ID` (case-insensitive) and consume the
optional `START`/`END`/`HIDDEN` tokens, surfacing the marker as a
`DialectSpecific` column option so the column ref + type stay in
the AST for lineage.
2. **Table-level `PERIOD FOR SYSTEM_TIME (start_col, end_col)`** clause.
Both columns are already in the table's column list — the clause
pairs them but adds no new lineage. Consume the tokens at the top
of the column-list loop and discard.
The `WITH(SYSTEM_VERSIONING=ON [(HISTORY_TABLE=…, DATA_CONSISTENCY_CHECK=…)])`
table-option suffix was already handled by the existing WITH-options
parser; nothing extra needed there.
Both extensions gated on `MsSqlDialect | GenericDialect`.
[system-versioned temporal table syntax]: https://learn.microsoft.com/en-us/sql/relational-databases/tables/creating-a-system-versioned-temporal-table
Fixes 6 corpus test failures (sqlglot T-SQL fixtures).
ClickHouse routes DDL/DML to all shards via an `ON CLUSTER <name>`
clause:
DELETE FROM tbl ON CLUSTER test_cluster WHERE date = '2019-01-01'
DELETE FROM tbl ON CLUSTER '{cluster}' WHERE date = '2019-01-01'
Reference: https://clickhouse.com/docs/sql-reference/distributed-ddl
After parsing the FROM table list in `parse_delete`, peek-and-consume
the optional ON CLUSTER clause before WHERE / USING / RETURNING. The
cluster name doesn't add lineage info; reuse the existing
`parse_optional_on_cluster` helper and discard the result.
Gated on `ClickHouseDialect | GenericDialect`.
Fixes 2 corpus test failures (sqlglot ClickHouse fixtures).
Postgres / ANSI / Teradata `CREATE TABLE name AS <query>` accepts a trailing clause that controls whether the new table is populated with the query's results and whether statistics are collected:

CREATE TABLE t AS SELECT … WITH DATA
CREATE TABLE t AS SELECT … WITH NO DATA
CREATE TABLE t AS SELECT … WITH DATA AND STATISTICS
CREATE TABLE t AS SELECT … WITH NO DATA AND NO STATISTICS

References:
- Postgres: https://www.postgresql.org/docs/current/sql-createtableas.html
- Teradata: https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Data-Definition-Language-Syntax-and-Examples/Table-Statements/CREATE-TABLE-AS/AS-Subquery-Clause

After parsing the AS-query body in `parse_create_table_inner`, peek for `WITH` and consume the optional `[NO] DATA [AND [NO] STATISTICS]` tail. Neither lineage nor table-shape info lives in the clause; consume and discard. If the WITH wasn't followed by `[NO] DATA` (e.g. T-SQL's `WITH (option=…)` table-options), restore the index so the existing parser path handles it. Fixes 5 corpus test failures (sqlglot ANSI + Trino).
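The restore-on-mismatch pattern can be sketched with a toy cursor (the `P` struct and keyword-as-string model are hypothetical, not the crate's parser):

```rust
/// Toy parser cursor: a keyword stream plus an index.
struct P {
    toks: Vec<String>,
    i: usize,
}

impl P {
    /// Consume `kw` if it is the next token (case-insensitive).
    fn eat(&mut self, kw: &str) -> bool {
        if self.toks.get(self.i).map_or(false, |t| t.eq_ignore_ascii_case(kw)) {
            self.i += 1;
            true
        } else {
            false
        }
    }

    /// Speculatively consume `WITH [NO] DATA [AND [NO] STATISTICS]`.
    /// If WITH isn't followed by [NO] DATA, restore the index so the
    /// WITH(option=...) table-options path can run instead.
    fn maybe_with_data_tail(&mut self) -> bool {
        let start = self.i;
        if !self.eat("WITH") {
            return false;
        }
        self.eat("NO");
        if !self.eat("DATA") {
            self.i = start; // restore: not our clause
            return false;
        }
        if self.eat("AND") {
            self.eat("NO");
            self.eat("STATISTICS");
        }
        true
    }
}

fn main() {
    let mk = |ws: &[&str]| ws.iter().map(|w| w.to_string()).collect::<Vec<String>>();

    let mut p = P { toks: mk(&["WITH", "NO", "DATA"]), i: 0 };
    assert!(p.maybe_with_data_tail());
    assert_eq!(p.i, 3); // tail fully consumed

    let mut q = P { toks: mk(&["WITH", "(", "OPTION"]), i: 0 };
    assert!(!q.maybe_with_data_tail());
    assert_eq!(q.i, 0); // index restored for the other parser path
}
```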
BigQuery [path expressions] allow the last segment to start with a
digit:
SELECT * FROM foo.bar.25ab c
SELECT * FROM foo.bar.25
SELECT * FROM foo.bar.25_
The tokenizer greedily folds a leading `.` into the next number, so
`bar.25ab` tokenises as `Word("bar")` then `Number(".25")` then
`Word("ab")`. Previously `parse_object_name` saw the Number token
where it expected a Period, broke out of the path loop, and the
parser then errored on the dangling `.25`.
In `parse_object_name`, after each ident, peek for a leading-dot
Number. If found, peel the `.` off, treat the remaining digits as the
next segment's prefix, and concatenate any adjacent Word for segments
like `25ab`. Index-only advance — never mutate `self.tokens` (which
would persist across speculative `maybe_parse` calls).
Numeric literals (`SELECT 1.5`, `WHERE x = 0.5`) and SELECT-projection
JSON paths (`field.5k_clients_target`, handled by
`parse_snowflake_json_path`) are unaffected because both go through
different parser paths.
Gated on `BigQueryDialect | GenericDialect`.
[path expressions]: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#path_expressions
Fixes 4 corpus test failures (sqlglot BigQuery).
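The peel step can be sketched as a standalone check on the Number token's text (a simplified model of the fix; the real code advances the token index rather than working on strings):

```rust
/// The tokenizer folds `bar.25ab` into Word("bar"), Number(".25"), Word("ab").
/// Given the Number token's text, peel the leading dot and return the digit
/// run that becomes the next path segment's prefix; None for ordinary
/// numeric literals like "1.5" that don't start with '.'.
fn peel_leading_dot_number(num: &str) -> Option<&str> {
    num.strip_prefix('.')
        .filter(|rest| !rest.is_empty() && rest.chars().all(|c| c.is_ascii_digit()))
}

fn main() {
    // foo.bar.25ab -> segment prefix "25", then concatenate the adjacent Word.
    let prefix = peel_leading_dot_number(".25").unwrap();
    assert_eq!(format!("{prefix}ab"), "25ab");

    // Plain numeric literals are untouched: "1.5" has no leading dot.
    assert_eq!(peel_leading_dot_number("1.5"), None);
}
```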
…S, INLINE LENGTH)
Teradata's column-attribute grammar adds four post-type modifiers that
the parser currently rejects:
CREATE TABLE foo (
valid_date DATE FORMAT 'YYYY-MM-DD',
name VARCHAR(50) TITLE 'Customer Name',
code INT COMPRESS,
body VARCHAR(255) COMPRESS ('a', 'b'),
notes VARCHAR(80) INLINE LENGTH 64
)
Reference: https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Data-Definition-Language-Detailed-Topics/CREATE-TABLE/Column-Level-Attributes-for-Database-Object-Creation
The corpus runs sqlglot_teradata fixtures through `GenericDialect`
(there is no dedicated TeradataDialect), so gate on `GenericDialect |
AnsiDialect`. Surface each as a `ColumnOption::DialectSpecific` with
the keyword name; lineage info is preserved by the column's name and
type, the modifiers carry no input refs.
`FORMAT` is a real keyword; `TITLE`, `COMPRESS`, `INLINE`, and `LENGTH`
aren't (per the project rule "Match non-keyword words case-insensitively,
don't add to keywords.rs"), so detect them via case-insensitive Word
match. `COMPRESS (...)` consumes its optional value list with a
balanced-paren skip — the values are constants, no lineage content.
Fixes 5 corpus test failures (sqlglot Teradata + ANSI fixtures).
Snowflake's [zero-copy clone] for schemas:

CREATE SCHEMA mytestschema_clone CLONE testschema
CREATE SCHEMA restored_schema CLONE my_schema AT (OFFSET => -3600)
CREATE SCHEMA s_restore CLONE testschema BEFORE (TIMESTAMP => …)

In `parse_create_schema`, after the schema name, peek for `CLONE` and consume `<source>` and an optional `AT|BEFORE (…)` time-travel suffix. The current `Statement::CreateSchema` AST has no `clone` slot, so the clause is consumed and discarded for parser coverage; revisit when schema-level provenance lineage is needed and add a field then. Gated on `SnowflakeDialect | GenericDialect`.
[zero-copy clone]: https://docs.snowflake.com/en/sql-reference/sql/create-clone
Fixes 7 corpus test failures (snowflake first-party + sqlglot snowflake).
ANSI SQL and BigQuery extend UNION/INTERSECT/EXCEPT with two suffixes the parser currently rejects:

-- match legs by column name instead of position
SELECT 1 AS x UNION ALL CORRESPONDING SELECT 2 AS x
SELECT 1 AS x UNION ALL CORRESPONDING BY (foo, bar) SELECT 2 AS x
-- type-strict union (no implicit coercion)
SELECT 1 UNION ALL STRICT SELECT 2

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators

In `parse_set_quantifier`, after the existing ALL/DISTINCT/BY NAME parsing, peek for `CORRESPONDING [BY (col, …)]` and consume the balanced paren block opaquely — the column list contains plain names that already appear in the SELECT legs, so opaque consumption preserves all lineage info. Same treatment for `STRICT`. These suffixes don't change which tables/columns the union references, so adding new `SetQuantifier` variants isn't necessary for grammar coverage. Fixes 7 corpus test failures (sqlglot BigQuery + Trino).
Snowflake's [DATE_PART] supports two argument forms — the standard function-call shape `DATE_PART(<part>, <expr>)` (already worked) and the ANSI EXTRACT-style `DATE_PART(<part> FROM <expr>)`. The previous parser path treated DATE_PART as a generic function call and rejected `FROM` between the args.

Add a special case at the top of `parse_prefix` for a non-keyword Word "DATE_PART" (case-insensitive) followed by `(`, parsing the part, then either a comma or `FROM` separator, then the expression. The result is `Expr::Function`, so downstream consumers (lineage visitors) see the same shape as any other function call — same args slot, same column refs preserved. Gated on `SnowflakeDialect | GenericDialect`.
[DATE_PART]: https://docs.snowflake.com/en/sql-reference/functions/date_part
Fixes 3 corpus test failures (sqlglot Snowflake).
Snowflake's [CREATE EXTERNAL TABLE] places a `PARTITION BY (col, col, …)`
clause between the column-def list and the option block:
CREATE EXTERNAL TABLE et (col1 DATE AS (...), col2 VARCHAR AS (...))
PARTITION BY (col1, col2)
LOCATION=@stage/path/
FILE_FORMAT=(type=parquet)
Previously the parser entered the option-swallowing loop, which expected
the first option to be `name=value` (`LOCATION=`, etc.). `PARTITION BY (...)`
didn't match that shape, so parsing fell through to the Hive-style
external-table path and errored.
Add a PARTITION-BY-list consumer immediately after the column list and
before the option block. The partition column names are already in the
column-def list, so opaque consumption preserves all lineage info.
Gated on `SnowflakeDialect | GenericDialect`.
[CREATE EXTERNAL TABLE]: https://docs.snowflake.com/en/sql-reference/sql/create-external-table
Fixes 3 corpus test failures (snowflake first-party + sqlglot snowflake).
Snowflake's [CREATE SEQUENCE] accepts a `COMMENT = '<string>'` option alongside START / INCREMENT / ORDER:

CREATE SEQUENCE seq START=5 COMMENT='foo' INCREMENT=10

The existing sequence-options loop didn't recognise COMMENT and broke out at the `comment` keyword, leaving trailing tokens that the outer parser then errored on. Add a COMMENT arm to `parse_create_sequence_options` that consumes the optional `=` and a literal string. The comment carries no lineage content; discard the value.
[CREATE SEQUENCE]: https://docs.snowflake.com/en/sql-reference/sql/create-sequence
Fixes 2 corpus test failures (sqlglot Snowflake).
Redshift's CREATE TABLE permits DISTSTYLE, DISTKEY, and SORTKEY (with optional COMPOUND prefix) to appear in any order after the column definitions:

CREATE TABLE sales (...) DISTKEY(listid) COMPOUND SORTKEY(...) DISTSTYLE AUTO
CREATE TABLE t (...) SORTKEY(a) DISTKEY(a)

Reference: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

The previous parser parsed them in a fixed sequence (DISTSTYLE → DISTKEY → SORTKEY), so any other ordering errored on the first out-of-order clause. Wrap the three lookups in a loop that consumes whichever keyword appears next; each option is still admitted at most once. Per the loop guidance, this is a clause-permutation fix and doesn't require new grammar — the individual clauses are unchanged. Fixes 3 corpus test failures (sqlglot_redshift + unparsed_redshift).
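The any-order loop with at-most-once admission can be sketched standalone (clause keywords stand in for the token stream; the function name is illustrative):

```rust
/// Consume DISTSTYLE / DISTKEY / SORTKEY in whatever order they appear,
/// admitting each at most once. Returns the clauses seen, uppercased,
/// or an error on a duplicate or unknown clause.
fn consume_table_attrs(clauses: &[&str]) -> Result<Vec<String>, String> {
    let (mut dist_style, mut dist_key, mut sort_key) = (false, false, false);
    let mut seen = Vec::new();
    for c in clauses {
        let slot = match c.to_ascii_uppercase().as_str() {
            "DISTSTYLE" => &mut dist_style,
            "DISTKEY" => &mut dist_key,
            "SORTKEY" => &mut sort_key,
            other => return Err(format!("unexpected clause: {other}")),
        };
        if *slot {
            return Err(format!("duplicate clause: {c}"));
        }
        *slot = true;
        seen.push(c.to_ascii_uppercase());
    }
    Ok(seen)
}

fn main() {
    // Any ordering is accepted...
    assert!(consume_table_attrs(&["SORTKEY", "DISTKEY", "DISTSTYLE"]).is_ok());
    // ...but each clause at most once.
    assert!(consume_table_attrs(&["DISTKEY", "DISTKEY"]).is_err());
}
```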
Athena's DDL (CREATE EXTERNAL TABLE with ROW FORMAT SERDE, SERDEPROPERTIES, STORED AS, etc.) is Hive-style, not Trino-style. Switch the alias from `trino` → `hive` so the corpus runner uses HiveDialect for `sqlglot_athena/` and any future `customer_athena/` fixtures. Athena's DML/queries are Trino-style, but the failing fixtures in the corpus are exclusively DDL where the mapping matters. Fixes 3 corpus test failures (sqlglot Athena).
Hive's `ROW FORMAT` accepts two extensions the parser currently rejects:
- `ROW FORMAT SERDE 'class' WITH SERDEPROPERTIES ('k'='v', …)` —
serde configuration after the class name. SERDEPROPERTIES isn't in
the keyword table; matched case-insensitively.
- `ROW FORMAT DELIMITED [FIELDS TERMINATED BY 'x'] [LINES TERMINATED
BY 'y'] [NULL DEFINED AS 'z']` — DELIMITED suboptions describing
ASCII-text storage.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe
`parse_row_format` now consumes:
1. After SERDE 'class', an optional `WITH SERDEPROPERTIES (...)` block
(balanced-paren skip — the k/v strings carry no lineage info).
2. After DELIMITED, any sub-clauses up to the next ROW / STORED /
LOCATION / WITH / COMMENT / TBLPROPERTIES / PARTITIONED / CLUSTERED
/ AS keyword (also EOF / `;`).
If `WITH` isn't followed by SERDEPROPERTIES, restore the index so
later table-options parsers (CTEs, WITH(option=…), etc.) can take it.
Used through `parse_hive_formats`, which is called for any dialect's
CREATE TABLE / CREATE EXTERNAL TABLE that allows Hive-style storage
options (Hive, Databricks, Athena via Hive routing).
Fixes 5 corpus test failures (sqlglot Athena, Databricks, sqlglot Hive).
Athena Iceberg tables and Trino use a different shape for PARTITIONED BY
than classic Hive:
-- Hive: column-def list (each segment has a type)
CREATE TABLE t (a INT) PARTITIONED BY (year INT)
-- Iceberg: expression list (column refs + transform functions)
CREATE TABLE t (id BIGINT, category STRING)
PARTITIONED BY (category, BUCKET(16, id), TRUNCATE(8, id), DAY(ts))
Reference: https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html
`parse_hive_distribution` now distinguishes the two by peeking past the
first identifier inside `(...)`: if a known data-type keyword (INT,
STRING, BIGINT, …) follows, it's the column-def form; otherwise we
parse the contents as a comma-separated expression list. The expression
form's lineage info (column refs inside transforms like `BUCKET(16, id)`)
is preserved by the standard expression parser.
Used through `parse_hive_formats`, which is reached for any dialect's
CREATE TABLE that allows Hive-style storage options (Hive,
Databricks, Athena via Hive routing).
Fixes 2 corpus test failures (sqlglot Athena Iceberg).
Hive's CREATE [EXTERNAL] TABLE grammar allows two optional table-level
clauses the parser was rejecting:
CREATE EXTERNAL TABLE foo (id INT) COMMENT 'description'
CREATE EXTERNAL TABLE foo (id INT, val STRING) CLUSTERED BY (id, val) INTO 10 BUCKETS
CREATE EXTERNAL TABLE foo (id INT) COMMENT 'c'
PARTITIONED BY (a INT) CLUSTERED BY (id) SORTED BY (id ASC) INTO 5 BUCKETS
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Add `parse_optional_hive_comment_and_clustered_by`, called both before
and after PARTITIONED BY in the external-table branch (the two clauses
can appear in either order around it). Both are consumed and discarded:
COMMENT carries no lineage info, and CLUSTERED BY references columns
already in the table's column-def list. SORTED BY (cols) is consumed
opaquely, INTO <n> [BUCKETS] is also opaque.
CLUSTERED, SORTED, and BUCKETS aren't keywords in our table; matched
case-insensitively per the project rule.
Fixes 2 corpus test failures (sqlglot Athena).
Extends the previous CORRESPONDING / STRICT support to also accept:
- The two suffixes in any order: `UNION ALL STRICT CORRESPONDING`
alongside the previously-handled `UNION ALL CORRESPONDING STRICT`.
- `BY NAME ON (col, …)` — BigQuery's column-restricted by-name match:
SELECT 1 AS x UNION ALL BY NAME ON (foo, bar) SELECT 2 AS x
SELECT 1 AS x INNER UNION ALL BY NAME ON (foo, bar) SELECT 2 AS x
Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators
`parse_set_quantifier` now loops through CORRESPONDING / STRICT / ON
suffixes, consuming each at most once and stopping as soon as it sees
something that isn't one of the three. All three are opaque to lineage
(column names inside the paren lists already appear in the SELECT
legs).
Fixes 4 corpus test failures (sqlglot BigQuery).
Redshift inherits the Oracle/Snowflake `expr (+)` legacy outer-join syntax in the WHERE clause of comma-join queries:

SELECT a.foo, b.bar FROM a, b WHERE a.baz = b.baz (+)
SELECT * FROM a, b WHERE a.id (+) = b.id

The parser already handled this for Snowflake/Generic. Extend the two `dialect_of!` gates (in `parse_subexpr` and the `Token::LParen | Token::Period` arm of `parse_prefix`) to also include RedshiftSqlDialect. Fixes 1 corpus test failure (sqlglot Redshift).
Snowflake's [CREATE STAGE] accepts the option clauses (DIRECTORY, FILE_FORMAT, COPY_OPTIONS, COMMENT, plus URL / CREDENTIALS / STORAGE_INTEGRATION / ENDPOINT / ENCRYPTION, which would normally be read by `parse_stage_params`) in any order, and the FILE_FORMAT value has three shapes:

FILE_FORMAT = (TYPE='JSON' …)      -- inline parenthesised options
FILE_FORMAT = '<format_name>'      -- string shorthand
FILE_FORMAT = [<schema>.]<format>  -- dotted-ident shorthand

Reference: https://docs.snowflake.com/en/sql-reference/sql/create-stage

Three changes in `src/dialect/snowflake.rs`:
1. `parse_create_stage` wraps the option-clause section in a loop so any of DIRECTORY / FILE_FORMAT / COPY_OPTIONS / COMMENT plus URL / CREDENTIALS / STORAGE_INTEGRATION / ENDPOINT / ENCRYPTION can appear after FILE_FORMAT (previously they had to come first via `parse_stage_params`). Stage params seen mid-stream are merged into the `stage_params` accumulator.
2. `parse_parentheses_options` accepts dotted ident values (`FORMAT_NAME=schema.format`) by consuming `.<word>` continuations.
3. FILE_FORMAT='string' and FILE_FORMAT=<ident> shorthand both surface as `DataLoadingOption { name: "FORMAT_NAME", … }` for AST symmetry with the parenthesised form's own `FORMAT_NAME` option.

Fixes 7 corpus test failures (snowflake first-party + sqlglot snowflake).
Snowflake (and Postgres) allow a dollar-quoted string body in the column-level `COMMENT` clause:

CREATE TABLE foo (ID INT COMMENT $$some comment$$)

Previously the parser only accepted single-quoted strings here and errored with "Expected string, found: $$…$$". Add a `Token::DollarQuotedString` arm to `parse_optional_column_option`'s COMMENT branch, surfacing the inner content as `ColumnOption::Comment` just like the single-quoted form. Fixes 1 corpus test failure (sqlglot Snowflake).
BigQuery and MSSQL `FOR SYSTEM_TIME AS OF <expr>` time-travel reads typically appear after the table's optional alias:

FROM tbl t FOR SYSTEM_TIME AS OF CURRENT_TIMESTAMP() …
FROM tbl AS t FOR SYSTEM_TIME AS OF CURRENT_TIMESTAMP() LEFT JOIN …

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#for_system_time_as_of

`parse_table_factor` previously called `parse_table_version` only before the alias, so the aliased shape fell through to the query-level FOR-UPDATE/FOR-SHARE locks loop and errored with "Expected one of UPDATE or SHARE, found: SYSTEM_TIME". Call `parse_table_version` a second time after the alias when no version qualifier has been set yet. Fixes 5 corpus test failures (unparsed BigQuery).
…g comma

In `is_parse_comma_separated_end`, when a comma is followed by a reserved-as-alias keyword (CLUSTER, SORT, FINAL, etc.), the parser peeks the next-next token to decide whether the keyword is a clause starter (end of list) or a column name reusing the keyword. The previous fall-through `_ => true` returned "end of list" when peek_nth(1) was anything not specifically whitelisted — including `)`. So `EXCEPT(id_2, CLUSTER)` looked like a trailing comma plus an out-of-context CLUSTER, the loop stopped, CLUSTER was left unconsumed, and the parser then errored with "Expected ), found: CLUSTER".

Add `Token::RParen => false` before the catch-all: a reserved keyword inside a parenthesised list with `)` after it is unambiguously a column name, not a trailing-comma terminator. Trailing-comma support in projection lists (`SELECT a, b, FROM t`) is unaffected — those have FROM or another clause starter after the keyword, hitting the dedicated clause-only check. Fixes 6 corpus test failures (customer & unparsed BigQuery).
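The decision can be sketched with a toy token type (names mirror, but are not, the crate's real `Token` enum; only the peek_nth(1) branch is modelled):

```rust
/// Toy stand-in for the tokens that can follow `, <reserved-keyword>`.
enum Tok {
    RParen,
    From,
    Word(&'static str),
}

/// After `, <reserved-as-alias keyword>`, does the comma-separated list end?
fn list_ends_after_reserved_word(next_next: &Tok) -> bool {
    match next_next {
        // EXCEPT(a, CLUSTER) -- ')' right after the keyword: it's a column
        // name, not a trailing-comma terminator. This arm is the fix.
        Tok::RParen => false,
        // SELECT a, b, FROM t -- clause starter: genuine trailing comma.
        Tok::From => true,
        // conservative catch-all, unchanged from before
        _ => true,
    }
}

fn main() {
    assert!(!list_ends_after_reserved_word(&Tok::RParen));
    assert!(list_ends_after_reserved_word(&Tok::From));
    assert!(list_ends_after_reserved_word(&Tok::Word("x")));
}
```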
Snowflake's [CREATE TABLE] allows column-level foreign-key constraints with an explicit `FOREIGN KEY` prefix:

<col> <type> [NOT NULL] FOREIGN KEY REFERENCES <ref_table> [(<ref_col>)]

Previously the parser only knew the shorter ANSI/Postgres form `<col> <type> REFERENCES <ref_table>(...)` and bailed at FOREIGN inside a column definition. In `parse_optional_column_option`, accept either `FOREIGN KEY REFERENCES` or bare `REFERENCES` before the same shared foreign-table / column-list / ON DELETE/UPDATE / characteristics parsing.
[CREATE TABLE]: https://docs.snowflake.com/en/sql-reference/sql/create-table
Fixes 1 corpus test failure (sqlglot Snowflake).
BigQuery's CREATE FUNCTION accepts an optional `DETERMINISTIC` or
`NOT DETERMINISTIC` marker between RETURNS and LANGUAGE:
CREATE TEMPORARY FUNCTION f(x FLOAT64) RETURNS FLOAT64 NOT DETERMINISTIC
LANGUAGE js AS 'return Math.random() * x;'
Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_a_function
Add a `parse_keywords([NOT, DETERMINISTIC])` / `parse_keyword(DETERMINISTIC)`
arm to `parse_create_function_body`'s loop. The marker doesn't change
lineage; consume and discard rather than extending `CreateFunctionBody`.
Fixes 1 corpus test failure (sqlglot BigQuery).
…thand

Redshift's CREATE TABLE accepts a two-positional-argument shorthand for IDENTITY columns:

CREATE TABLE t (c BIGINT GENERATED BY DEFAULT AS IDENTITY (0, 1))
CREATE TABLE t (c BIGINT GENERATED ALWAYS AS IDENTITY (100, 5))

Reference: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

Previously the parser only handled the Postgres-style keyword form `IDENTITY (START WITH n INCREMENT BY n …)` via `parse_create_sequence_options`, which expected each option to begin with a keyword and broke on the bare numeric pair. Add `parse_identity_paren_options` that peeks the first token after `(`:
- bare Number / sign — parse `(seed, step)` and surface as `[StartWith(seed, false), IncrementBy(step, false)]` so the existing AST doesn't need new variants.
- otherwise — fall back to `parse_create_sequence_options` for the keyword form.

Both `GENERATED ALWAYS AS IDENTITY` and `GENERATED BY DEFAULT AS IDENTITY` column-option arms now route through the helper. Fixes 1 corpus test failure (sqlglot Redshift).
Snowflake's TABLE(<expr>) table-function form accepts the same suffix
keywords (TABLESAMPLE, PIVOT, UNPIVOT, MATCH_RECOGNIZE) as a regular
table reference:
SELECT * FROM TABLE('t1') TABLESAMPLE BERNOULLI (20.3)
Previously the TABLE(...) branch in `parse_table_factor` returned the
TableFactor::TableFunction directly without running the suffix-keyword
loop, so the TABLESAMPLE token was left for the outer parser, which
errored.
Wrap the post-TABLE() return in the same suffix-keyword loop used for
plain table refs (PIVOT / UNPIVOT / TABLESAMPLE / SAMPLE /
MATCH_RECOGNIZE) so each can apply.
Fixes 1 corpus test failure (sqlglot Snowflake).
…IER(…)
Extend the previous Snowflake `IDENTIFIER('<name>')` literal support
to also accept session variables and bind parameters:
CREATE TABLE IDENTIFIER($foo) (col1 VARCHAR, col2 VARCHAR)
SELECT * FROM IDENTIFIER($tbl_name)
SELECT * FROM IDENTIFIER(?)
Reference: https://docs.snowflake.com/en/sql-reference/identifier-literal
`Token::Placeholder` (which covers both `$foo` and `?` in our tokeniser)
is added to the inner-value match alongside the existing
SingleQuotedString/DoubleQuotedString arms. Placeholder values surface
as a plain (unquoted) Ident — they have no compile-time name and
execution would resolve them at run time anyway, so the synthetic
ident just keeps parsing going.
Fixes 1 corpus test failure (sqlglot Snowflake).
BigQuery's [legacy SQL] uses square brackets to quote project-qualified table identifiers, with `:` separating the project from the dataset/table:

SELECT * FROM [my-proj-123:dataset.table]
SELECT * FROM [proj:ds.tbl] AS t

Standard SQL replaces the brackets with backticks (`proj.ds.tbl`), but customers still submit legacy-SQL queries through the wire — they appear in `unparsed_bigquery` query logs.

In `parse_table_factor`, when we see `[` at the start of a table reference (BigQuery / Generic), consume the balanced bracket block (words / numbers / dots / colons / hyphens / `*` for wildcard tables) and surface the inner string as a single backtick-quoted Ident. Lineage tracking sees the table reference normally; the `project:dataset.table` text is preserved verbatim in the ident value. If the bracket block contains anything else (operators, parens, etc.), restore the index so other callers — e.g. ARRAY[...] literals — can take over.
[legacy SQL]: https://cloud.google.com/bigquery/docs/reference/legacy-sql
Fixes 5 corpus test failures (unparsed BigQuery, customer BigQuery).
BigQuery accepts both `'…'` and `"…"` as string-literal forms. The parser's `AT TIME ZONE` arm only recognised the single-quoted form, so a query like

EXTRACT(HOUR FROM ts AT TIME ZONE "Asia/Tokyo")

errored at the time-zone argument with "Expected Token::SingleQuotedString after AT TIME ZONE". Add a `Token::DoubleQuotedString` arm gated on `BigQueryDialect | GenericDialect`. The single-quoted path is unchanged; non-BigQuery dialects (Postgres, etc.) still require the single-quoted form per ANSI. Fixes 6 corpus test failures (4 unparsed_bigquery + 2 customer_bigquery).
- compare-corpus-reports.js only shows additions; check git status for prunes
- Pipeline reprocess takes ~10 min; use Monitor with a pgrep loop
- Anonymizer corruption signature is exactly 's'<word>; a broader regex deletes hand-written sqlglot fixtures
- Query-log truncation heuristics that worked: trailing punct/keyword, CASE>END count
- `cmd &` with run_in_background returns completed immediately — verify with pgrep
jakubjasinsky
approved these changes
May 6, 2026
zdenal
approved these changes
May 6, 2026
Parser fixes uncovered by iterating on the kernel-cll corpus. Every commit is dialect-grounded with a docs reference and zero corpus regressions; the overall pass rate is ~98.7% across ~172k SQL samples (Snowflake 99.8%, Redshift 99.6%, BigQuery 99.5%, Athena 97.4%).
Snowflake
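The INTERVAL guard described in the commit message above reduces to a keyword reject list: if the word after INTERVAL can never start an interval literal's value, INTERVAL must be a plain identifier. A standalone sketch with the list copied from the commit message:

```rust
/// Keywords after INTERVAL that can never start an interval literal's
/// value (list taken from the commit message in this PR).
const INTERVAL_STOP_KEYWORDS: &[&str] = &[
    "LIKE", "IS", // original guard
    "BETWEEN", "AND", "OR", "XOR", "IN", "NOT", "ILIKE", // binary operators
    "ORDER", "GROUP", "HAVING", "WHERE", "LIMIT", "OFFSET", // clause starters
    "QUALIFY", "WINDOW", "UNION", "INTERSECT", "EXCEPT",
    "ROWS", "RANGE", "GROUPS", "ASC", "DESC", // window-frame & sort tokens
    "ON", "USING", // join conditions
];

/// True when INTERVAL in this position must be treated as an identifier.
fn interval_is_identifier_here(next_keyword: &str) -> bool {
    INTERVAL_STOP_KEYWORDS
        .iter()
        .any(|k| k.eq_ignore_ascii_case(next_keyword))
}
```

So `INTERVAL BETWEEN 1 AND 2` takes the identifier path, while `INTERVAL '1 HOUR'` (next token is a string, not a listed keyword) still parses as a literal.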
- `IDENTIFIER('<name>')` literal in name positions (docs) — `CREATE TABLE IDENTIFIER('db.schema.t')`, `FROM IDENTIFIER('mytable')`. Detected at the start of `parse_identifier`.
- Non-string arguments to `IDENTIFIER(…)` — `IDENTIFIER($var)`, `IDENTIFIER(?)`.
- `INTERVAL` as identifier before binary/clause keywords (`INTERVAL BETWEEN …`, `… ORDER BY INTERVAL`, `JOIN … ON INTERVAL = x`).
- `CREATE STAGE` option grammar — `FILE_FORMAT = (TYPE = …)` shorthand, dotted-ident values, `CREDENTIALS = (…)` after `FILE_FORMAT`.
- `CREATE SCHEMA … CLONE source [AT|BEFORE (…)]`.
- `CREATE SEQUENCE … COMMENT='…'` option.
- `CREATE EXTERNAL TABLE … PARTITION BY (cols)`.
- `DATE_PART(<part> FROM <expr>)` ANSI form.
- `FOREIGN KEY REFERENCES` column constraint.
- `COMMENT`.
- `TABLESAMPLE` after `FROM TABLE(<expr>)` reference.

BigQuery
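One BigQuery fix rebuilds dotted names like `foo.bar.25ab`, where the tokenizer greedily folds a leading `.` into a Number token (`foo`, `.`, `bar`, `.25`, `ab`). A standalone sketch under assumed token shapes (the real fix peels the dot inside `parse_object_name` without mutating the token stream):

```rust
/// Hypothetical sketch: rebuild the parts of a dotted object name from
/// raw token texts, peeling a greedily-consumed leading `.` off number
/// tokens. Assumes the slice covers exactly one dotted name.
fn rebuild_object_name(raw: &[&str]) -> Vec<String> {
    let mut parts: Vec<String> = Vec::new();
    let mut glue = false; // next word extends the current part
    for t in raw {
        if *t == "." {
            glue = false; // explicit separator: next token starts a new part
        } else if let Some(rest) = t.strip_prefix('.') {
            // Greedy number like ".25": the dot is really a separator,
            // so start a new part with the bare number text.
            parts.push(rest.to_string());
            glue = true; // an adjacent word completes it ("25" + "ab")
        } else if glue {
            parts.last_mut().unwrap().push_str(t);
        } else {
            parts.push(t.to_string());
            glue = true;
        }
    }
    parts
}
```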
- Dotted names with digit-leading segments like `foo.bar.25ab`. Tokenizer greedily folds the leading `.` into a Number; `parse_object_name` peels it back off without mutating `self.tokens`.
- `[project-id:dataset.table]` table references.
- `FOR SYSTEM_TIME AS OF` after table alias.
- `[NOT] DETERMINISTIC` marker in `CREATE FUNCTION` body.
- Set-operation modifiers (`CORRESPONDING` / `STRICT` / `ON (cols)`) in any order.
- Double-quoted strings after `AT TIME ZONE`.

Redshift
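The Redshift `CREATE TABLE` options are accepted in any order; the usual shape is a loop that matches options until none apply, then hands control back. A minimal sketch with a simplified one-token value grammar (real values such as `SORTKEY (a, b)` need proper parsing):

```rust
/// Illustrative result type, not the parser's real AST.
#[derive(Default, Debug, PartialEq)]
struct TableOpts {
    diststyle: Option<String>,
    distkey: Option<String>,
    sortkey: Option<String>,
}

/// Match DISTSTYLE / DISTKEY / SORTKEY in any order; stop at the first
/// word that is none of them so the caller can resume parsing there.
fn parse_table_opts(words: &[&str]) -> TableOpts {
    let mut opts = TableOpts::default();
    let mut it = words.iter();
    while let Some(kw) = it.next() {
        match kw.to_ascii_uppercase().as_str() {
            "DISTSTYLE" => opts.diststyle = it.next().map(|v| v.to_string()),
            "DISTKEY" => opts.distkey = it.next().map(|v| v.to_string()),
            "SORTKEY" => opts.sortkey = it.next().map(|v| v.to_string()),
            _ => break, // not a table option
        }
    }
    opts
}
```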
- `(+)` outer-join marker.
- `DISTSTYLE` / `DISTKEY` / `SORTKEY` in any order on `CREATE TABLE`.
- `GENERATED AS IDENTITY (seed, step)` two-arg shorthand.

Hive / Athena
- Transform functions in `PARTITIONED BY` (`PARTITIONED BY (bucket(16, x), days(ts))`).
- `WITH SERDEPROPERTIES (…)` and `DELIMITED` suboptions.
- `COMMENT` and `CLUSTERED BY` clauses.

ClickHouse
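The ClickHouse change parses and discards the join strictness modifiers. A sketch of the skip step (token shapes are illustrative, and direction keywords such as LEFT/RIGHT/INNER are assumed to be handled by the base join parser):

```rust
/// Count how many leading modifier tokens ([GLOBAL] then [ANY|ASOF|ALL])
/// should be consumed and discarded before the base join parsing runs.
fn skip_join_modifiers(tokens: &[&str]) -> usize {
    let mut i = 0;
    if tokens.get(i).map_or(false, |t| t.eq_ignore_ascii_case("GLOBAL")) {
        i += 1;
    }
    if tokens.get(i).map_or(false, |t| {
        ["ANY", "ASOF", "ALL"].iter().any(|m| t.eq_ignore_ascii_case(m))
    }) {
        i += 1;
    }
    i
}
```

After the skip, the join is parsed and emitted exactly like its base form, which is why lineage is unchanged.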
- `[GLOBAL] [LEFT|RIGHT|INNER] [ANY|ASOF|ALL] JOIN` (JOIN docs) — modifier parsed and discarded; same lineage as the base outer join.
- `ON CLUSTER` clause in `DELETE FROM tbl ON CLUSTER … WHERE …` (distributed-DDL docs).

MySQL
- `REPLACE [INTO]` statement (docs) — same shape as `INSERT`.

DuckDB
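The DuckDB `USING SAMPLE` support gates on a one-token lookahead so that JOIN's `USING (cols)` keeps working. The gate reduces to a small predicate (a sketch over plain words, not the parser's real tokens):

```rust
/// USING starts a sample clause only when the very next word is SAMPLE;
/// otherwise it is left alone for JOIN's `USING (cols)`.
fn is_sample_clause(current: &str, next: Option<&str>) -> bool {
    current.eq_ignore_ascii_case("USING")
        && next.map_or(false, |n| n.eq_ignore_ascii_case("SAMPLE"))
}
```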
- `USING SAMPLE` clause (samples docs) — `tbl USING SAMPLE 10%`, `tbl USING SAMPLE SYSTEM (10 PERCENT) REPEATABLE (377)`. Triggered only when `USING` is followed by a bare `SAMPLE`, so JOIN's `USING (cols)` is unaffected.

T-SQL / MSSQL
- Temporal-table columns: `GENERATED ALWAYS AS ROW {START|END} [HIDDEN]`, `PERIOD FOR SYSTEM_TIME (start_col, end_col)`.

Cross-dialect
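Adjacent string-literal concatenation (`'foo' 'bar'` becoming `'foobar'`) can be sketched over a toy token type: fold every run of adjacent string literals into one literal and pass everything else through.

```rust
/// Toy token type for the sketch (not the parser's real token enum).
#[derive(Debug, PartialEq, Clone)]
enum Tok {
    Str(String),
    Other(String),
}

/// Fold runs of adjacent string literals into a single literal.
fn concat_adjacent_strings(tokens: &[Tok]) -> Vec<Tok> {
    let mut out: Vec<Tok> = Vec::new();
    for t in tokens {
        match (out.last_mut(), t) {
            // Previous output token is a string and so is this one: merge.
            (Some(Tok::Str(acc)), Tok::Str(s)) => acc.push_str(s),
            _ => out.push(t.clone()),
        }
    }
    out
}
```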
- Adjacent string-literal concatenation: `'foo' 'bar'` → `'foobar'`. Real customer SQL relies on this.
- `JSON_TABLE` / `XMLTABLE` `COLUMNS(...)` clause (MySQL `JSON_TABLE` docs) — consumed opaquely (output column shapes carry no input refs).
- `WITH [NO] DATA [AND [NO] STATISTICS]` on `CREATE TABLE AS` (Postgres docs, Teradata docs).
- Bare keyword before `)` treated as a column name, not a trailing-comma terminator.
- `CORRESPONDING [BY (cols)]` and `STRICT` set-op modifiers.
- Teradata column attributes `FORMAT`, `TITLE`, `COMPRESS`, `INLINE LENGTH`.

Corpus-side improvements (kernel-cll-corpus)
The anonymizer pipeline produced fragments that no parser could accept; fixes landed at the source rather than as fragile parser carve-outs:
- Number regex fixed to `\d+(?:\.\d+)?` — it was eating the trailing `.` in `proj.NNN.dataset`.
- Keywords excluded from anonymization: `TARGET`, `JS`, `PYTHON`, `IDENTIFIER`, `DIMENSIONS`, `METRICS`, `FACTS`, `SEMANTIC_VIEW`.
- Dropped `IF cond THEN` SQLs without `END IF` (truncated procedure-body fragments).
- Dropped fragments ending in `,`, `(`, `=`, a clause keyword (SELECT/FROM/BY/AS/…), or with an unclosed `CASE`. Removed ~4k Redshift truncations.
- Dropped the `'s'<word>` anonymizer corruption — when the anonymizer's regex misaligns on a token boundary inside a string literal (e.g. `INTERVAL '1 HOUR'` → `'s'HOUR`), the resulting SQL is unparseable. The pattern is unique to anonymizer output, with no false positives on hand-written fixtures.
- `_INTERNAL_QUERY_MARKERS` filter excludes warehouse-internal `/* DS_SVC */` queries.

Tests
Each parser commit ships a unit test in the appropriate dialect file. Full suite passes (`cargo nextest run --all-features`). Latest CI: Corpus / Check / Test Suite all green.
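The corpus truncation heuristic mentioned above (trailing punctuation or clause keyword, more `CASE` than `END`) can be sketched as a predicate. The keyword list here is abbreviated and illustrative, and the word matching is rough (punctuation glued to a word is not handled):

```rust
/// Rough sketch of the truncation filter: flag a SQL fragment that ends
/// in a separator or clause keyword, or has unclosed CASE expressions.
fn looks_truncated(sql: &str) -> bool {
    let trimmed = sql.trim_end();
    if trimmed.ends_with(',') || trimmed.ends_with('(') || trimmed.ends_with('=') {
        return true;
    }
    let words: Vec<String> = trimmed
        .split_whitespace()
        .map(|w| w.to_ascii_uppercase())
        .collect();
    if let Some(last) = words.last() {
        // Illustrative subset of the clause-keyword list.
        if ["SELECT", "FROM", "BY", "AS", "AND", "WHERE"].contains(&last.as_str()) {
            return true;
        }
    }
    // More CASE than END means the fragment was cut mid-expression.
    let cases = words.iter().filter(|w| *w == "CASE").count();
    let ends = words.iter().filter(|w| *w == "END").count();
    cases > ends
}
```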