Skip to content

Working with Postgres WAL format from Elixir#1

Merged
icehaunter merged 5 commits into
mainfrom
ilia/vax-57-transform-postgres-wal-files-into
Jun 20, 2022
Merged

Working with Postgres WAL format from Elixir#1
icehaunter merged 5 commits into
mainfrom
ilia/vax-57-transform-postgres-wal-files-into

Conversation

@icehaunter
Copy link
Copy Markdown
Contributor

No description provided.

@linear
Copy link
Copy Markdown

linear Bot commented Jun 14, 2022

VAX-57 Transform Postgres WAL files into internal replication message format

Postgres produces WAL with "logical" level of detail, which can be decoded into logical replication streams. WAL files replication is managed through replication slots. Once some entry is read from the replication slot, the log entries are cleared from the server.

In this task we want to read entries from a PG WAL file, decode and transform them into a format of our choice. We shall define the message format of the events that are passed down the pipeline.

This task is already partly done in https://github.com/vaxine-io/electric

  • Produce events from source Pg instances
  • Decode and handle Pg replication message
  • Specify the message format we use internally and transform incoming messages to that format

@icehaunter icehaunter changed the title Working with Postgres WAL format from Elixir [DRAFT] Working with Postgres WAL format from Elixir Jun 14, 2022
@icehaunter icehaunter marked this pull request as draft June 15, 2022 13:33
@icehaunter icehaunter changed the title [DRAFT] Working with Postgres WAL format from Elixir Working with Postgres WAL format from Elixir Jun 15, 2022
@icehaunter icehaunter force-pushed the ilia/vax-57-transform-postgres-wal-files-into branch from da5afce to 7f206c3 Compare June 16, 2022 16:25
@icehaunter icehaunter marked this pull request as ready for review June 16, 2022 16:25
@icehaunter icehaunter requested review from thruflo and v0idpwn June 16, 2022 16:26
Copy link
Copy Markdown
Contributor

@thruflo thruflo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it :)

@icehaunter icehaunter merged commit af34a44 into main Jun 20, 2022
@icehaunter icehaunter deleted the ilia/vax-57-transform-postgres-wal-files-into branch June 20, 2022 08:51
icehaunter pushed a commit that referenced this pull request Apr 17, 2023
* browser: wip structuring the web browser adapter.
* browser: more wip on browser adapter.
* browser: tweak to await return values.
* browser: use run not exec in db adapter.
* deps: tidy up dependencies.
* browser: support opening multiple databases.
* browser: flesh out testing the db proxy API.
* browser: test the statement API.
* browser: worker side `ElectricStatement` proxy wrapper.
* browser: test db and statement commit notifications.
* browser: actually run all the browser tests.
* refactor: rename adapters to drivers.
* scripts: rm `./dist` before building.
KyleAMathews pushed a commit that referenced this pull request Nov 1, 2024
KyleAMathews added a commit that referenced this pull request Oct 13, 2025
Based on thorough validation against PostgreSQL documentation:

**Issue #1 - Troubleshooting "must be owner" error:**
- REMOVED incorrect suggestion to use GRANT ALL PRIVILEGES
- PostgreSQL ownership rights cannot be granted via privileges
- ALTER TABLE and adding tables to publications require actual ownership

**Issue #2 - Quick Start ownership transfer:**
- ADDED GRANT CREATE ON SCHEMA public before ownership transfer
- PostgreSQL requires new owner to have CREATE privilege on schema
- Without this, ALTER TABLE ... OWNER TO will fail

**Issue #3 - Publication ownership requirements:**
- CLARIFIED that you must own BOTH publication AND each table
- Updated Core Permission Requirements table
- Per PostgreSQL docs: Adding a table additionally requires owning that table

**Issue #4 - AWS wal_level defaults:**
- CORRECTED incorrect claim about wal_level defaults
- PostgreSQL default is replica (not minimal for RDS)
- RDS/Aurora use standard PostgreSQL defaults

All fixes validated by research agents against official PostgreSQL docs.

Thanks to external reviewer for catching these critical errors.
KyleAMathews added a commit that referenced this pull request Oct 13, 2025
Based on thorough validation against PlanetScale documentation:

**Issue #1 - Default postgres role and REPLICATION:**
- FIXED incorrect claim that default roles lack REPLICATION
- PlanetScale's default postgres role DOES include REPLICATION attribute
- Updated to: "You can use it directly, or create dedicated role for least-privilege"
- Source: PlanetScale roles documentation shows CREATE ROLE ... REPLICATION

**Issue #2 - Logical replication defaults:**
- SOFTENED absolute claim "not enabled by default"
- Changed to: "may not be enabled. Verify and configure as needed"
- More future-proof: PlanetScale deliberately avoids stating defaults
- Moved verification step (SHOW wal_level) to top for clarity

**Issue #3 - Connection limit defaults:**
- REMOVED undocumented "25 connections" default claim
- No public PlanetScale docs state this specific default
- Changed to: "Small clusters may start with low max_connections"
- Provided sizing guidance: "≥ 3× Electric's pool size" without hard numbers
- Example: 20 connections → set max_connections to at least 60

All fixes validated by research agents against official documentation.

Thanks to external reviewer for catching these precision issues.
msfstef added a commit that referenced this pull request Nov 4, 2025
Fixes sentry errors
[#1](https://electricsql-04.sentry.io/issues/74727274/) and
[#2](https://electricsql-04.sentry.io/issues/74727257)

I have marked the `:disk_full` error as retryable, since storage might
be freed up automatically or added by the db administrator as a response
to this error and should thus not shut down the system.

I have marked the `:duplicate_file` for the replication slot
specifically as retryable as well, as it is a tmp file for an atomic
write that seems like a race. Interesting that this occurred and might
be worth looking into if it keeps occurring.
KyleAMathews pushed a commit that referenced this pull request Nov 13, 2025
Fixes an additional race condition where a stale abort completion
could overwrite the state after resume() has been called.

Timeline of the bug:
1. Request #1 running with AbortController #1
2. Tab hidden → pause() sets state to 'pause-requested', aborts #1
3. Tab visible → resume() sets state to 'active', starts request #2
4. Old request #1's abort completes, sets state to 'paused'
5. Stream stuck because state is 'paused' but should be 'active'

Fix: Only transition to 'paused' if state is still 'pause-requested'.
If resume() already changed it to 'active', don't overwrite it.

This ensures that old abort completions don't interfere with the
new active request started by resume().
alco added a commit that referenced this pull request Jan 20, 2026
### 🔗 Context
When this workflow is called by the Changesets workflow, the
`github.event_name` context variable evaluates to `push` (the parent's
trigger) rather than `workflow_call`. This caused our previous logic to
skip the release flow and default to canary builds.

So the most recent [package
publishing](https://github.com/electric-sql/electric/actions/runs/21003147814/job/60379128532)
only pushed canary images:

```
 #1 [internal] pushing docker.io/electricsql/electric:canary
#1 0.000 pushing sha256:e372c6ad86713cdbf726bbef83920eb32ecf9034d3794abde8d9fa73805413b1 to docker.io/electricsql/electric:canary
#1 DONE 1.9s
#1 [internal] pushing docker.io/electricsql/electric-canary:3516b9780
#1 0.000 pushing sha256:e372c6ad86713cdbf726bbef83920eb32ecf9034d3794abde8d9fa73805413b1 to docker.io/electricsql/electric-canary:3516b9780
#1 ...

#2 [internal] pushing docker.io/electricsql/electric-canary:latest
#2 0.000 pushing sha256:e372c6ad86713cdbf726bbef83920eb32ecf9034d3794abde8d9fa73805413b1 to docker.io/electricsql/electric-canary:latest
#2 DONE 1.8s

#1 [internal] pushing docker.io/electricsql/electric-canary:3516b9780
#1 DONE 1.8s
```


### 🛠️ Changes
- Updated `derive_build_vars` job to prioritize `inputs.release_tag` and
`github.event.release.tag_name` over the event name string.
- Refactored shell logic to use more idiomatic `-n` (non-zero string)
checks.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Simplified the Docker Hub image sync workflow's release detection
logic to prioritize explicit release tags and fallback to commit-based
runs, preserving existing behavior for release and manual triggers.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
balegas added a commit that referenced this pull request Mar 8, 2026
- Remove expiry_manager.get_shape_count span: runs every 60s per stack,
  is a simple ETS count, generated 887K events/day
- Remove expiry_manager.get_least_recently_used span: ETS fold, only
  meaningful as part of the expire_shapes parent span
- Remove shape_status.validate_shape_handle span: sub-millisecond ETS
  hash comparison called 2x per request, was the #1 span by volume
  at 3.5M events/day

These operations are too fast and frequent to warrant individual spans.
The parent spans (Plug_shape_get, expiry_manager.expire_shapes) provide
sufficient context for debugging.

Fixes #3977

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
KyleAMathews added a commit that referenced this pull request Apr 10, 2026
Adds test/pbt-micro.test.ts — a dedicated PBT suite that exercises
twelve narrow invariants in the ShapeStream client. Each TARGET's
opening comment documents the invariant under test, and the PBTs
shook loose eight real bugs that are fixed in this commit:

  #1 canonicalShapeKey used URLSearchParams.set() for custom params,
     collapsing duplicate keys (e.g. ?tag=a&tag=b → ?tag=b) so two
     genuinely distinct shapes shared a cache key. Switched to append().

  #2 Shape#process clobbered shouldNotify by assignment in three
     places, so any sequence where a change message followed a
     status-change message would silently drop the change's
     notification. OR-accumulate instead, and track hadData before
     must-refetch clears state so subscribers still see the reset.

  #3 SubsetParams GET serialization dropped limit=0 and offset=0
     via falsy checks. Switched to explicit !== undefined guards.

  #4 Shape#requestedSubSnapshots dedup keyed on bigintSafeStringify,
     which preserves insertion order, so permutation-equivalent
     params produced different keys and re-execution fired the
     same snapshot N times. Added canonicalBigintSafeStringify to
     helpers.ts that recursively sorts object keys.

  #5 snakeToCamel collapsed runs of underscores into a single
     camelCase boundary, so user_id and user__id (distinct db
     columns) decoded to the same app key, corrupting rows with
     mapped values. snakeToCamel now preserves (n-1) literal
     underscores for a run of n, and camelToSnake's boundary
     regex was widened to ([a-z_])([A-Z]) so the round-trip is
     injective.

  #6 Shape#reexecuteSnapshots caught and discarded errors from
     stream.requestSnapshot, silently dropping failed sub-snapshot
     re-execution on shape rotation. Errors are now collected and
     the first one is surfaced via #error + #notify.

  #7 SnapshotTracker populated xmaxSnapshots and
     snapshotsByDatabaseLsn in addSnapshot but never cleaned them
     up — neither on removeSnapshot nor on addSnapshot with a
     repeated mark. A later shouldRejectMessage eviction loop
     would walk the stale reverse index and wrongly delete the
     current snapshot, allowing duplicate change messages to slip
     through. Stored databaseLsn on each entry and added
     #detachFromReverseIndexes that runs on both add (before
     overwriting) and remove.

  #8 Shape#awaitUpToDate never observed the stream's error state,
     so calling requestSnapshot on a terminally-errored stream
     would hang forever on the setInterval polling loop. The
     helper now checks #error / stream.error up front, subscribes
     to the stream's onError, and settles the internal promise
     via reject on any terminal error path.

Also:
  - vitest.pbt.config.ts — dedicated config that skips the
    real-Electric globalSetup and includes both model-based
    and pbt-micro test files.
  - bin/pbt-soak.sh — soak runner that loops PBT iterations
    with fresh seeds, captures counterexamples on failure.
  - test/pbt-micro.test.ts TARGET 4 (UpToDateTracker) restores
    real timers in afterEach so TARGET 12's setInterval doesn't
    hang when the suite runs end-to-end.
  - SPEC.md cross-references the L6 fetchSnapshotWithRetry PBT
    from the unconditional-409-cache-buster invariant.
  - model-based.test.ts gains response builders for update,
    delete, and mixed-batch 200s with lsn/op_position/txids
    headers for more realistic change sequences.

Verified with 319 unit tests, 44 PBT tests at 500 runs each
(and 2000-run soak), 62 column-mapper/snapshot-tracker tests,
typecheck clean, eslint clean, static-analysis tests clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
msfstef added a commit that referenced this pull request Apr 14, 2026
…th (#4127)

## Summary

During deployments, we observed an onslaught of `ArgumentError` crashes
when the stack is trying to get ready:

```
** (ArgumentError) errors were found at the given arguments:
  * 1st argument: the table identifier does not refer to an existing ETS table
    (stdlib) :ets.next_lookup(#Reference<...>, {51460894745528, 0})
    pure_file_storage.ex:1148: Electric.ShapeCache.PureFileStorage.read_range_from_ets_cache/5
    pure_file_storage.ex:1076: Electric.ShapeCache.PureFileStorage.stream_main_log/3
```

### Root cause

Each shape's Consumer process owns an unnamed ETS buffer table for
in-memory log data. The table's TID (reference) is stored in the
stack-wide named ETS table so that HTTP reader processes can look it up
via `for_shape/2` and `read_or_initialize_metadata/2`.

During stack restarts, Consumer processes terminate (or crash and
restart), which destroys their buffer ETS tables. HTTP requests that are
in-flight — or arrive during this window — can capture a stale TID and
crash when they attempt `:ets.next_lookup` on it.

### Race windows identified

1. **Terminate ordering gap** — Between `:ets.delete(buffer_ets)` and
`clean_shape_ets_entry` (which nils the reference in stack ETS), new
readers could look up the dead TID from stack ETS.
2. **Already-captured TID** — Even after the reference is nil'd, readers
that captured the TID *before* terminate started still hold the stale
reference in a local variable.
3. **Startup churn** — During stack initialization many Consumers start
simultaneously and some may fail/restart, creating repeated windows
where stale TIDs exist in the stack ETS.

### Fix

- **`safe_next_lookup/2`**: Wraps `:ets.next_lookup` in a rescue for
`ArgumentError`, returning `:ets_dead`. The recursive
`read_range_from_ets_cache/5` handles this the same as
`:"$end_of_table"` — returning whatever was accumulated so far. The
callers at both call sites (pure ETS path and mixed disk+ETS path)
already detect empty/partial reads and fall back to reading from disk.
The rescue is isolated to this leaf function specifically to **preserve
tail call optimization** in the recursive reader.
- **Reversed terminate ordering**: `clean_shape_ets_entry` (which nils
the `ets_table` field) now runs *before* `:ets.delete(buffer_ets)`,
closing race window #1 for new readers — they see `ets_table: nil` and
hit the existing nil guard at `read_range_from_ets_cache(nil, _, _)`.
Git history confirms the original intent was nil-before-delete (commit
5677df7), and the current ordering was an artifact of successive
refactors.

### Data correctness analysis

The key invariant that makes this safe: **`terminate` calls
`close_all_files` (which flushes buffered data to disk via `IO.binwrite`
+ `:file.datasync`) *before* touching the ETS table.** After terminate,
all data that was in the buffer ETS is on disk.

The existing code already handles the semantically identical case of
"ETS cleared by a concurrent flush" — comments at lines 1060 and 1079
describe this exact pattern. A deleted table is the same situation: data
was in ETS, now it's on disk.

Failure scenario analysis:

| Scenario | Wrong data? | Wrong order? | Duplicates? | Missing data? |
|---|---|---|---|---|
| Clean terminate (flush succeeds) | No | No | No | No — chunk index
finds flushed data |
| Partial ETS read before crash | No | No | No | No — partial data
discarded, full range re-read from disk |
| Consumer killed without terminate | No | No | No | Yes — tail data
lost (inherent to unclean kill, not introduced by this fix) |
| No chunk index entry yet | No | No | No | Yes — empty response, client
retries |

In all cases: data returned is correct and ordered. The only failure
mode is returning less data than the absolute latest, which triggers
client retry.

## Test plan

- [x] New test: reader falls back to disk when ETS table is deleted
(pure ETS path)
- [x] New test: reader falls back to disk when ETS table is deleted
(mixed disk + ETS path)
- [x] Both tests fail with `ArgumentError` before the fix, pass after
- [x] Existing "ETS read/write race condition" tests still pass
- [x] Full `test/electric/shape_cache/` suite passes (181 tests, 0
failures)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
balegas added a commit that referenced this pull request May 8, 2026
#1 — `mcp.json` is now explicit opt-in. New `loadProjectMcpConfig`
option (boolean | string) on `BuiltinAgentsServer`; defaults to off
because stdio servers spawn local commands. The Electron desktop and
the `electric-ax` CLI opt in; library embedders get the safe default.

#2 — `stop()` tears down MCP resources. Added `Registry.close()`
(closes every transport, forgets auth state, emits a final empty
snapshot). `BuiltinAgentsServer.stop()` now also disposes the
`mcp.json` chokidar watcher and unregisters the `mcp` tool provider.

#3 — `RuntimeRegistry.register()` accumulates types per runtime
instead of last-write-wins, fixing `/api/runtimes` losing earlier
types when entity-type registration POSTs land in parallel.

#4 — `applyMerged` is async/await so `mcpRegistry.applyConfig`
rejections actually reach the catch (previously voided inside a
`.then`, escaping as unhandled rejections). Optional
`onConfigError` callback exposed for embedders.

#5 — `composeToolsWithProviders` warns when a named MCP server in
`mcp.tools(['x'])` is unavailable (unknown or not yet ready). Wildcard
sentinels stay silent; missing names dedupe within a single call.

#6a — `hashConfig()` includes `timeoutMs` so timeout-only edits no
longer skip the reconfigure path and leave the entry's stale.

#6b — `mcp.tools()` (no arg) is the canonical "every registered
server" form. `mcp.tools('*')` kept for back-compat. Built-ins
`horton`/`worker` and the docs use the no-arg form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants