Working with Postgres WAL format from Elixir by icehaunter · Pull Request #1 · electric-sql/electric

icehaunter · 2022-06-14T19:25:10Z

No description provided.

linear · 2022-06-14T19:25:20Z

VAX-57 Transform Postgres WAL files into internal replication message format

When `wal_level` is set to `logical`, Postgres writes WAL with enough detail to be decoded into logical replication streams. WAL replication is managed through replication slots. Once the consumer confirms that it has processed entries read from a replication slot, the server is free to discard those log entries.

In this task we want to read entries from a PG WAL file, decode and transform them into a format of our choice. We shall define the message format of the events that are passed down the pipeline.

This task is already partly done in https://github.com/vaxine-io/electric

Produce events from source Pg instances
Decode and handle Pg replication message
Specify the message format we use internally and transform incoming messages to that format
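
As a rough illustration of the "decode and transform" step, the sketch below decodes one logical replication message into an internal event. It assumes the stream comes from Postgres's built-in `pgoutput` plugin and handles only its `Begin` message: a `B` tag byte, an 8-byte final LSN, an 8-byte commit timestamp in microseconds since 2000-01-01, and a 4-byte transaction id, all big-endian. It is TypeScript for illustration only; the decoder in this PR is Elixir and is not reproduced here, and `BeginEvent`, `decodeBegin`, and the event shape are hypothetical names.

```typescript
// Hypothetical internal event shape; the PR does not specify this format.
interface BeginEvent {
  kind: "begin";
  finalLsn: string; // printed the way Postgres shows LSNs, e.g. "1/6B3748A0"
  commitTimestampMicros: number; // microseconds since 2000-01-01
  xid: number;
}

// Decode a pgoutput Begin message from its raw bytes.
function decodeBegin(bytes: Uint8Array): BeginEvent {
  if (bytes.length < 21 || bytes[0] !== 0x42) {
    throw new Error("not a pgoutput Begin message");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Read each 64-bit field as two big-endian 32-bit halves.
  const lsnHi = view.getUint32(1);
  const lsnLo = view.getUint32(5);
  const tsHi = view.getUint32(9);
  const tsLo = view.getUint32(13);
  return {
    kind: "begin",
    finalLsn:
      lsnHi.toString(16).toUpperCase() + "/" + lsnLo.toString(16).toUpperCase(),
    // Assumption: the timestamp is non-negative and below 2^53, which holds
    // for commit times after 2000-01-01 for the foreseeable future.
    commitTimestampMicros: tsHi * 4294967296 + tsLo,
    xid: view.getUint32(17),
  };
}
```

A full decoder would dispatch on the tag byte to handle the other message types (relation, insert, update, delete, commit) in the same way.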

thruflo

Love it :)

* browser: wip structuring the web browser adapter.
* browser: more wip on browser adapter.
* browser: tweak to await return values.
* browser: use run not exec in db adapter.
* deps: tidy up dependencies.
* browser: support opening multiple databases.
* browser: flesh out testing the db proxy API.
* browser: test the statement API.
* browser: worker side `ElectricStatement` proxy wrapper.
* browser: test db and statement commit notifications.
* browser: actually run all the browser tests.
* refactor: rename adapters to drivers.
* scripts: rm `./dist` before building.

chore: Fix types

Based on thorough validation against PostgreSQL documentation:

**Issue #1 - Troubleshooting "must be owner" error:**
- REMOVED incorrect suggestion to use GRANT ALL PRIVILEGES
- PostgreSQL ownership rights cannot be granted via privileges
- ALTER TABLE and adding tables to publications require actual ownership

**Issue #2 - Quick Start ownership transfer:**
- ADDED GRANT CREATE ON SCHEMA public before ownership transfer
- PostgreSQL requires new owner to have CREATE privilege on schema
- Without this, ALTER TABLE ... OWNER TO will fail

**Issue #3 - Publication ownership requirements:**
- CLARIFIED that you must own BOTH publication AND each table
- Updated Core Permission Requirements table
- Per PostgreSQL docs: Adding a table additionally requires owning that table

**Issue #4 - AWS wal_level defaults:**
- CORRECTED incorrect claim about wal_level defaults
- PostgreSQL default is replica (not minimal for RDS)
- RDS/Aurora use standard PostgreSQL defaults

All fixes validated by research agents against official PostgreSQL docs. Thanks to external reviewer for catching these critical errors.

Based on thorough validation against PlanetScale documentation:

**Issue #1 - Default postgres role and REPLICATION:**
- FIXED incorrect claim that default roles lack REPLICATION
- PlanetScale's default postgres role DOES include REPLICATION attribute
- Updated to: "You can use it directly, or create dedicated role for least-privilege"
- Source: PlanetScale roles documentation shows CREATE ROLE ... REPLICATION

**Issue #2 - Logical replication defaults:**
- SOFTENED absolute claim "not enabled by default"
- Changed to: "may not be enabled. Verify and configure as needed"
- More future-proof: PlanetScale deliberately avoids stating defaults
- Moved verification step (SHOW wal_level) to top for clarity

**Issue #3 - Connection limit defaults:**
- REMOVED undocumented "25 connections" default claim
- No public PlanetScale docs state this specific default
- Changed to: "Small clusters may start with low max_connections"
- Provided sizing guidance: "≥ 3× Electric's pool size" without hard numbers
- Example: 20 connections → set max_connections to at least 60

All fixes validated by research agents against official documentation. Thanks to external reviewer for catching these precision issues.

Fixes sentry errors [#1](https://electricsql-04.sentry.io/issues/74727274/) and [#2](https://electricsql-04.sentry.io/issues/74727257).

I have marked the `:disk_full` error as retryable: storage might be freed up automatically, or added by the db administrator in response to this error, so it should not shut down the system.

I have also marked `:duplicate_file` as retryable for the replication slot specifically, since it concerns a tmp file for an atomic write and looks like a race. It is interesting that this occurred, and it might be worth looking into if it keeps occurring.

Fixes an additional race condition where a stale abort completion could overwrite the state after resume() has been called.

Timeline of the bug:
1. Request #1 running with AbortController #1
2. Tab hidden → pause() sets state to 'pause-requested', aborts #1
3. Tab visible → resume() sets state to 'active', starts request #2
4. Old request #1's abort completes, sets state to 'paused'
5. Stream stuck because state is 'paused' but should be 'active'

Fix: Only transition to 'paused' if state is still 'pause-requested'. If resume() already changed it to 'active', don't overwrite it. This ensures that old abort completions don't interfere with the new active request started by resume().
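
The guard described in this fix can be sketched as a small state machine. This is a minimal illustration, not the client's actual code: `PausableStream` and `onAbortCompleted` are hypothetical names, and the real stream also manages AbortControllers and in-flight fetches, which are omitted here.

```typescript
type StreamState = "active" | "pause-requested" | "paused";

// Hypothetical stand-in for the stream's pause/resume bookkeeping.
class PausableStream {
  state: StreamState = "active";

  // Tab hidden: ask the in-flight request to stop.
  // A real client would also abort the current request here.
  pause(): void {
    if (this.state === "active") {
      this.state = "pause-requested";
    }
  }

  // Tab visible again: go back to active and start a new request.
  resume(): void {
    this.state = "active";
  }

  // Called when an aborted request finishes unwinding, possibly late.
  onAbortCompleted(): void {
    // Only complete the pause if nobody resumed in the meantime; a stale
    // completion must not overwrite the 'active' state set by resume().
    if (this.state === "pause-requested") {
      this.state = "paused";
    }
  }
}
```

With the guard in place, the sequence pause(), resume(), late abort completion leaves the state at 'active', while pause() followed directly by the abort completion still ends in 'paused'.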

### 🔗 Context

When this workflow is called by the Changesets workflow, the `github.event_name` context variable evaluates to `push` (the parent's trigger) rather than `workflow_call`. This caused our previous logic to skip the release flow and default to canary builds.

So the most recent [package publishing](https://github.com/electric-sql/electric/actions/runs/21003147814/job/60379128532) only pushed canary images:

```
#1 [internal] pushing docker.io/electricsql/electric:canary
#1 0.000 pushing sha256:e372c6ad86713cdbf726bbef83920eb32ecf9034d3794abde8d9fa73805413b1 to docker.io/electricsql/electric:canary
#1 DONE 1.9s
#1 [internal] pushing docker.io/electricsql/electric-canary:3516b9780
#1 0.000 pushing sha256:e372c6ad86713cdbf726bbef83920eb32ecf9034d3794abde8d9fa73805413b1 to docker.io/electricsql/electric-canary:3516b9780
#1 ...
#2 [internal] pushing docker.io/electricsql/electric-canary:latest
#2 0.000 pushing sha256:e372c6ad86713cdbf726bbef83920eb32ecf9034d3794abde8d9fa73805413b1 to docker.io/electricsql/electric-canary:latest
#2 DONE 1.8s
#1 [internal] pushing docker.io/electricsql/electric-canary:3516b9780
#1 DONE 1.8s
```

### 🛠️ Changes

- Updated `derive_build_vars` job to prioritize `inputs.release_tag` and `github.event.release.tag_name` over the event name string.
- Refactored shell logic to use more idiomatic `-n` (non-zero string) checks.

## Summary by CodeRabbit

* **Refactor**
  * Simplified the Docker Hub image sync workflow's release detection logic to prioritize explicit release tags and fallback to commit-based runs, preserving existing behavior for release and manual triggers.

- Remove expiry_manager.get_shape_count span: runs every 60s per stack, is a simple ETS count, generated 887K events/day
- Remove expiry_manager.get_least_recently_used span: ETS fold, only meaningful as part of the expire_shapes parent span
- Remove shape_status.validate_shape_handle span: sub-millisecond ETS hash comparison called 2x per request, was the #1 span by volume at 3.5M events/day

These operations are too fast and frequent to warrant individual spans. The parent spans (Plug_shape_get, expiry_manager.expire_shapes) provide sufficient context for debugging.

Fixes #3977

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds test/pbt-micro.test.ts — a dedicated PBT suite that exercises twelve narrow invariants in the ShapeStream client. Each TARGET's opening comment documents the invariant under test, and the PBTs shook loose eight real bugs that are fixed in this commit:

#1 canonicalShapeKey used URLSearchParams.set() for custom params, collapsing duplicate keys (e.g. ?tag=a&tag=b → ?tag=b) so two genuinely distinct shapes shared a cache key. Switched to append().

#2 Shape#process clobbered shouldNotify by assignment in three places, so any sequence where a change message followed a status-change message would silently drop the change's notification. OR-accumulate instead, and track hadData before must-refetch clears state so subscribers still see the reset.

#3 SubsetParams GET serialization dropped limit=0 and offset=0 via falsy checks. Switched to explicit !== undefined guards.

#4 Shape#requestedSubSnapshots dedup keyed on bigintSafeStringify, which preserves insertion order, so permutation-equivalent params produced different keys and re-execution fired the same snapshot N times. Added canonicalBigintSafeStringify to helpers.ts that recursively sorts object keys.

#5 snakeToCamel collapsed runs of underscores into a single camelCase boundary, so user_id and user__id (distinct db columns) decoded to the same app key, corrupting rows with mapped values. snakeToCamel now preserves (n-1) literal underscores for a run of n, and camelToSnake's boundary regex was widened to ([a-z_])([A-Z]) so the round-trip is injective.

#6 Shape#reexecuteSnapshots caught and discarded errors from stream.requestSnapshot, silently dropping failed sub-snapshot re-execution on shape rotation. Errors are now collected and the first one is surfaced via #error + #notify.

#7 SnapshotTracker populated xmaxSnapshots and snapshotsByDatabaseLsn in addSnapshot but never cleaned them up — neither on removeSnapshot nor on addSnapshot with a repeated mark. A later shouldRejectMessage eviction loop would walk the stale reverse index and wrongly delete the current snapshot, allowing duplicate change messages to slip through. Stored databaseLsn on each entry and added #detachFromReverseIndexes that runs on both add (before overwriting) and remove.

#8 Shape#awaitUpToDate never observed the stream's error state, so calling requestSnapshot on a terminally-errored stream would hang forever on the setInterval polling loop. The helper now checks #error / stream.error up front, subscribes to the stream's onError, and settles the internal promise via reject on any terminal error path.

Also:
- vitest.pbt.config.ts — dedicated config that skips the real-Electric globalSetup and includes both model-based and pbt-micro test files.
- bin/pbt-soak.sh — soak runner that loops PBT iterations with fresh seeds, captures counterexamples on failure.
- test/pbt-micro.test.ts TARGET 4 (UpToDateTracker) restores real timers in afterEach so TARGET 12's setInterval doesn't hang when the suite runs end-to-end.
- SPEC.md cross-references the L6 fetchSnapshotWithRetry PBT from the unconditional-409-cache-buster invariant.
- model-based.test.ts gains response builders for update, delete, and mixed-batch 200s with lsn/op_position/txids headers for more realistic change sequences.

Verified with 319 unit tests, 44 PBT tests at 500 runs each (and 2000-run soak), 62 column-mapper/snapshot-tracker tests, typecheck clean, eslint clean, static-analysis tests clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
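
Two of the fixes in this commit message (#3 and #4) are easy to show in isolation. The sketch below is an assumption-laden illustration, not the code from helpers.ts: `canonicalStringify` and `serializeSubset` are hypothetical names, and unlike the real canonicalBigintSafeStringify this version does not handle bigint values.

```typescript
// Recursively rebuild a value with object keys in sorted order, so that
// permutation-equivalent params serialize to the same string (fix #4).
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(canonicalize);
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = canonicalize(source[key]);
    }
    return sorted;
  }
  return value;
}

function canonicalStringify(value: unknown): string {
  return JSON.stringify(canonicalize(value));
}

// Serialize optional paging params with explicit undefined checks, so that
// limit=0 and offset=0 are kept rather than dropped as falsy (fix #3).
function serializeSubset(params: { limit?: number; offset?: number }): string {
  const parts: string[] = [];
  if (params.limit !== undefined) parts.push("limit=" + params.limit);
  if (params.offset !== undefined) parts.push("offset=" + params.offset);
  return parts.join("&");
}
```

Using the canonical string as a dedup key means `{ a: 1, b: 2 }` and `{ b: 2, a: 1 }` map to the same entry, which is the property the dedup in #4 relies on.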

…th (#4127)

## Summary

During deployments, we observed an onslaught of `ArgumentError` crashes when the stack is trying to get ready:

```
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: the table identifier does not refer to an existing ETS table

    (stdlib) :ets.next_lookup(#Reference<...>, {51460894745528, 0})
    pure_file_storage.ex:1148: Electric.ShapeCache.PureFileStorage.read_range_from_ets_cache/5
    pure_file_storage.ex:1076: Electric.ShapeCache.PureFileStorage.stream_main_log/3
```

### Root cause

Each shape's Consumer process owns an unnamed ETS buffer table for in-memory log data. The table's TID (reference) is stored in the stack-wide named ETS table so that HTTP reader processes can look it up via `for_shape/2` and `read_or_initialize_metadata/2`.

During stack restarts, Consumer processes terminate (or crash and restart), which destroys their buffer ETS tables. HTTP requests that are in-flight — or arrive during this window — can capture a stale TID and crash when they attempt `:ets.next_lookup` on it.

### Race windows identified

1. **Terminate ordering gap** — Between `:ets.delete(buffer_ets)` and `clean_shape_ets_entry` (which nils the reference in stack ETS), new readers could look up the dead TID from stack ETS.
2. **Already-captured TID** — Even after the reference is nil'd, readers that captured the TID *before* terminate started still hold the stale reference in a local variable.
3. **Startup churn** — During stack initialization many Consumers start simultaneously and some may fail/restart, creating repeated windows where stale TIDs exist in the stack ETS.

### Fix

- **`safe_next_lookup/2`**: Wraps `:ets.next_lookup` in a rescue for `ArgumentError`, returning `:ets_dead`. The recursive `read_range_from_ets_cache/5` handles this the same as `:"$end_of_table"` — returning whatever was accumulated so far. The callers at both call sites (pure ETS path and mixed disk+ETS path) already detect empty/partial reads and fall back to reading from disk. The rescue is isolated to this leaf function specifically to **preserve tail call optimization** in the recursive reader.
- **Reversed terminate ordering**: `clean_shape_ets_entry` (which nils the `ets_table` field) now runs *before* `:ets.delete(buffer_ets)`, closing race window #1 for new readers — they see `ets_table: nil` and hit the existing nil guard at `read_range_from_ets_cache(nil, _, _)`. Git history confirms the original intent was nil-before-delete (commit 5677df7), and the current ordering was an artifact of successive refactors.

### Data correctness analysis

The key invariant that makes this safe: **`terminate` calls `close_all_files` (which flushes buffered data to disk via `IO.binwrite` + `:file.datasync`) *before* touching the ETS table.** After terminate, all data that was in the buffer ETS is on disk.

The existing code already handles the semantically identical case of "ETS cleared by a concurrent flush" — comments at lines 1060 and 1079 describe this exact pattern. A deleted table is the same situation: data was in ETS, now it's on disk.

Failure scenario analysis:

| Scenario | Wrong data? | Wrong order? | Duplicates? | Missing data? |
|---|---|---|---|---|
| Clean terminate (flush succeeds) | No | No | No | No — chunk index finds flushed data |
| Partial ETS read before crash | No | No | No | No — partial data discarded, full range re-read from disk |
| Consumer killed without terminate | No | No | No | Yes — tail data lost (inherent to unclean kill, not introduced by this fix) |
| No chunk index entry yet | No | No | No | Yes — empty response, client retries |

In all cases: data returned is correct and ordered. The only failure mode is returning less data than the absolute latest, which triggers client retry.

## Test plan

- [x] New test: reader falls back to disk when ETS table is deleted (pure ETS path)
- [x] New test: reader falls back to disk when ETS table is deleted (mixed disk + ETS path)
- [x] Both tests fail with `ArgumentError` before the fix, pass after
- [x] Existing "ETS read/write race condition" tests still pass
- [x] Full `test/electric/shape_cache/` suite passes (181 tests, 0 failures)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
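
The "safe lookup plus disk fallback" pattern from this description can be sketched outside Elixir. The following TypeScript is an analogy only, not the PureFileStorage code: `MemoryBuffer`, `safeNext`, and `readRange` are hypothetical, a thrown error stands in for the `ArgumentError` on a dead TID, and the sketch collapses the two layers of the real design (leaf function returns a sentinel, caller detects the partial read) into one function for brevity.

```typescript
// Stand-in for the Consumer-owned in-memory buffer table.
class MemoryBuffer {
  private entries: Array<[number, string]> | null;

  constructor(entries: Array<[number, string]>) {
    this.entries = entries.slice().sort((a, b) => a[0] - b[0]);
  }

  // Analogous to deleting the table: later lookups throw.
  destroy(): void {
    this.entries = null;
  }

  // First entry with an offset greater than `after`, or null at the end.
  nextAfter(after: number): [number, string] | null {
    if (this.entries === null) throw new Error("table does not exist");
    for (const entry of this.entries) {
      if (entry[0] > after) return entry;
    }
    return null;
  }
}

type Lookup = [number, string] | "end_of_table" | "buffer_dead";

// Leaf wrapper: turn the "table is gone" error into a sentinel value.
function safeNext(buffer: MemoryBuffer, after: number): Lookup {
  try {
    const found = buffer.nextAfter(after);
    return found === null ? "end_of_table" : found;
  } catch (_err) {
    return "buffer_dead";
  }
}

// Read values with offsets in (from, to], preferring the buffer and
// re-reading the full range from disk if the buffer is missing or dies.
function readRange(
  buffer: MemoryBuffer | null,
  disk: Array<[number, string]>,
  from: number,
  to: number
): string[] {
  const fromDisk = (): string[] =>
    disk.filter((e) => e[0] > from && e[0] <= to).map((e) => e[1]);
  if (buffer === null) return fromDisk(); // reference already cleared
  const acc: string[] = [];
  let cursor = from;
  while (true) {
    const next = safeNext(buffer, cursor);
    if (next === "buffer_dead") return fromDisk(); // discard partial read
    if (next === "end_of_table" || next[0] > to) return acc;
    acc.push(next[1]);
    cursor = next[0];
  }
}
```

The fallback is only safe under the invariant stated above: the data is flushed to disk before the buffer is destroyed, so the disk copy is at least as complete as the buffer was.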

#1 — `mcp.json` is now explicit opt-in. New `loadProjectMcpConfig` option (boolean | string) on `BuiltinAgentsServer`; defaults to off because stdio servers spawn local commands. The Electron desktop and the `electric-ax` CLI opt in; library embedders get the safe default.

#2 — `stop()` tears down MCP resources. Added `Registry.close()` (closes every transport, forgets auth state, emits a final empty snapshot). `BuiltinAgentsServer.stop()` now also disposes the `mcp.json` chokidar watcher and unregisters the `mcp` tool provider.

#3 — `RuntimeRegistry.register()` accumulates types per runtime instead of last-write-wins, fixing `/api/runtimes` losing earlier types when entity-type registration POSTs land in parallel.

#4 — `applyMerged` is async/await so `mcpRegistry.applyConfig` rejections actually reach the catch (previously voided inside a `.then`, escaping as unhandled rejections). Optional `onConfigError` callback exposed for embedders.

#5 — `composeToolsWithProviders` warns when a named MCP server in `mcp.tools(['x'])` is unavailable (unknown or not yet ready). Wildcard sentinels stay silent; missing names dedupe within a single call.

#6a — `hashConfig()` includes `timeoutMs` so timeout-only edits no longer skip the reconfigure path and leave the entry's stale.

#6b — `mcp.tools()` (no arg) is the canonical "every registered server" form. `mcp.tools('*')` kept for back-compat. Built-ins `horton`/`worker` and the docs use the no-arg form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
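
Item #3 above contrasts accumulating registrations with last-write-wins. A minimal sketch of the accumulating behaviour follows; `RuntimeRegistrySketch` and `typesFor` are hypothetical names and the real `RuntimeRegistry` API may differ.

```typescript
// Hypothetical registry that merges entity types per runtime id.
class RuntimeRegistrySketch {
  private typesByRuntime: Record<string, string[]> = {};

  // Merge new types into whatever is already registered for this runtime,
  // rather than replacing the list (which would be last-write-wins).
  register(runtimeId: string, types: string[]): void {
    const existing = this.typesByRuntime[runtimeId] || [];
    for (const type of types) {
      if (existing.indexOf(type) === -1) existing.push(type);
    }
    this.typesByRuntime[runtimeId] = existing;
  }

  // Return a copy so callers cannot mutate the registry's state.
  typesFor(runtimeId: string): string[] {
    return (this.typesByRuntime[runtimeId] || []).slice();
  }
}
```

With this shape, two registrations for the same runtime that arrive in either order both remain visible, which is the behaviour the fix describes for parallel registration POSTs.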

icehaunter changed the title ~~Working with Postgres WAL format from Elixir~~ [DRAFT] Working with Postgres WAL format from Elixir Jun 14, 2022

icehaunter marked this pull request as draft June 15, 2022 13:33

icehaunter changed the title ~~[DRAFT] Working with Postgres WAL format from Elixir~~ Working with Postgres WAL format from Elixir Jun 15, 2022

icehaunter added 5 commits June 16, 2022 19:23

Imported logical replication decoder

45c0b79

Added logical replication encoder

6084d2b

Updated how the producer & client are configured

0455cdc

Added a basic producer test

267828a

Fixed message ordering within the transaction on producer

7f206c3

icehaunter force-pushed the ilia/vax-57-transform-postgres-wal-files-into branch from da5afce to 7f206c3 Compare June 16, 2022 16:25

icehaunter marked this pull request as ready for review June 16, 2022 16:25

icehaunter requested review from thruflo and v0idpwn June 16, 2022 16:26

thruflo approved these changes Jun 17, 2022


icehaunter merged commit af34a44 into main Jun 20, 2022

icehaunter deleted the ilia/vax-57-transform-postgres-wal-files-into branch June 20, 2022 08:51

KyleAMathews pushed a commit that referenced this pull request Nov 1, 2024

Merge pull request #1 from electric-sql/msfstef/fix-types

769f640

chore: Fix types

jonocodes mentioned this pull request Jun 5, 2025

Error: Table does not exist (using quickstart) #2802

Closed

msfstef mentioned this pull request Jan 15, 2026

fix: Race condition for reader after shape writer flushes data #3719

Merged

6 tasks

claude Bot mentioned this pull request Feb 11, 2026

feat: Write transaction fragments directly to storage to reduce consumer memory footprint #3783

Merged

KyleAMathews mentioned this pull request Feb 13, 2026

Separate data-completeness from connection-health in isUpToDate #3843

Open

claude Bot mentioned this pull request Feb 19, 2026

fix: Ensure that ShapeCache.await_snapshot_start() cannot loop indefinitely #3865

Merged

This was referenced Mar 3, 2026

Fix head-of-line blocking in SLC for subquery shapes via ETS link-values cache and inverted index #3937

Merged

feat(sync-service): Scale SQLite connection pool to 0 #3908

Merged

claude Bot mentioned this pull request Mar 9, 2026

Fix stale PID race in ConsumerRegistry.unregister_name/1 #3979

Closed

2 tasks

This was referenced Mar 23, 2026

fix: Upgrade existing publications to publish generated columns on PG18+ #4045

Closed

Improved reporting of top process groups by memory usage #4056

Merged

KyleAMathews mentioned this pull request Apr 4, 2026

test(client): add fast-check model-based property tests and retry bound analysis #4089

Merged

msfstef mentioned this pull request Apr 9, 2026

Fix replication connection drops from wal_sender_timeout during backpressure #4105

Merged

6 tasks

balegas mentioned this pull request Apr 10, 2026

fix(client): self-healing for permanently stuck expired shape handles #4087

Merged

msfstef mentioned this pull request Apr 14, 2026

fix(sync-service): handle deleted ETS buffer table in storage read path #4127

Merged

5 tasks

kevin-dp mentioned this pull request Apr 28, 2026

Add coder entity + let Horton spawn and converse with coders #4190

Merged

2 tasks

claude Bot mentioned this pull request May 7, 2026

feat(sync-service): hibernate before suspend to enable GC #4284

Draft

4 tasks

This was referenced May 13, 2026

feat(sync-service): add subset duration span attribute #4143

Closed

chore(sync-service): upgrade Elixir to 1.20.0 #3992

Open

feat(agents): pull-wake runner health check, principal rename, and lifecycle hardening #4339

Merged

balegas mentioned this pull request May 19, 2026

fix(agents-server): apply routing adapter to subscription paths #4356

Closed

7 tasks

This was referenced May 19, 2026

feat(agents-runtime)!: bash tool runs children with an env allowlist #4362

Closed

feat(sync-service): adaptive poll-based wait_until under StatusMonitor congestion #4376

Draft

feat: add DeepSeek provider support to Horton & UI #4406

Merged

claude Bot mentioned this pull request May 27, 2026

feat(agents-runtime): Sandbox primitive + Docker/E2B providers + sandbox profile picker #4369

Open

2 tasks

icehaunter merged 5 commits into main from ilia/vax-57-transform-postgres-wal-files-into
