Skip to content

fix(sync-service): handle deleted ETS buffer table in storage read path#4127

Merged
msfstef merged 1 commit into
mainfrom
msfstef/fix-race-to-read-ets-buffer
Apr 14, 2026
Merged

fix(sync-service): handle deleted ETS buffer table in storage read path#4127
msfstef merged 1 commit into
mainfrom
msfstef/fix-race-to-read-ets-buffer

Conversation

@msfstef
Copy link
Copy Markdown
Contributor

@msfstef msfstef commented Apr 14, 2026

Summary

During deployments, we observed an onslaught of ArgumentError crashes when the stack is trying to get ready:

** (ArgumentError) errors were found at the given arguments:
  * 1st argument: the table identifier does not refer to an existing ETS table
    (stdlib) :ets.next_lookup(#Reference<...>, {51460894745528, 0})
    pure_file_storage.ex:1148: Electric.ShapeCache.PureFileStorage.read_range_from_ets_cache/5
    pure_file_storage.ex:1076: Electric.ShapeCache.PureFileStorage.stream_main_log/3

Root cause

Each shape's Consumer process owns an unnamed ETS buffer table for in-memory log data. The table's TID (reference) is stored in the stack-wide named ETS table so that HTTP reader processes can look it up via for_shape/2 and read_or_initialize_metadata/2.

During stack restarts, Consumer processes terminate (or crash and restart), which destroys their buffer ETS tables. HTTP requests that are in-flight — or arrive during this window — can capture a stale TID and crash when they attempt :ets.next_lookup on it.

Race windows identified

  1. Terminate ordering gap — Between :ets.delete(buffer_ets) and clean_shape_ets_entry (which nils the reference in stack ETS), new readers could look up the dead TID from stack ETS.
  2. Already-captured TID — Even after the reference is nil'd, readers that captured the TID before terminate started still hold the stale reference in a local variable.
  3. Startup churn — During stack initialization many Consumers start simultaneously and some may fail/restart, creating repeated windows where stale TIDs exist in the stack ETS.

Fix

  • safe_next_lookup/2: Wraps :ets.next_lookup in a rescue for ArgumentError, returning :ets_dead. The recursive read_range_from_ets_cache/5 handles this the same as :"$end_of_table" — returning whatever was accumulated so far. The callers at both call sites (pure ETS path and mixed disk+ETS path) already detect empty/partial reads and fall back to reading from disk. The rescue is isolated to this leaf function specifically to preserve tail call optimization in the recursive reader.
  • Reversed terminate ordering: clean_shape_ets_entry (which nils the ets_table field) now runs before :ets.delete(buffer_ets), closing race window Working with Postgres WAL format from Elixir #1 for new readers — they see ets_table: nil and hit the existing nil guard at read_range_from_ets_cache(nil, _, _). Git history confirms the original intent was nil-before-delete (commit 5677df7), and the current ordering was an artifact of successive refactors.

Data correctness analysis

The key invariant that makes this safe: terminate calls close_all_files (which flushes buffered data to disk via IO.binwrite + :file.datasync) before touching the ETS table. After terminate, all data that was in the buffer ETS is on disk.

The existing code already handles the semantically identical case of "ETS cleared by a concurrent flush" — comments at lines 1060 and 1079 describe this exact pattern. A deleted table is the same situation: data was in ETS, now it's on disk.

Failure scenario analysis:

Scenario Wrong data? Wrong order? Duplicates? Missing data?
Clean terminate (flush succeeds) No No No No — chunk index finds flushed data
Partial ETS read before crash No No No No — partial data discarded, full range re-read from disk
Consumer killed without terminate No No No Yes — tail data lost (inherent to unclean kill, not introduced by this fix)
No chunk index entry yet No No No Yes — empty response, client retries

In all cases: data returned is correct and ordered. The only failure mode is returning less data than the absolute latest, which triggers client retry.

Test plan

  • New test: reader falls back to disk when ETS table is deleted (pure ETS path)
  • New test: reader falls back to disk when ETS table is deleted (mixed disk + ETS path)
  • Both tests fail with ArgumentError before the fix, pass after
  • Existing "ETS read/write race condition" tests still pass
  • Full test/electric/shape_cache/ suite passes (181 tests, 0 failures)

🤖 Generated with Claude Code

@msfstef msfstef added the claude label Apr 14, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.20%. Comparing base (e9db22c) to head (87b5460).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4127       +/-   ##
===========================================
+ Coverage   69.65%   89.20%   +19.54%     
===========================================
  Files          15       25       +10     
  Lines         557     2520     +1963     
  Branches        0      634      +634     
===========================================
+ Hits          388     2248     +1860     
- Misses        169      270      +101     
- Partials        0        2        +2     
Flag Coverage Δ
electric-telemetry ?
elixir ?
packages/experimental 87.73% <ø> (?)
packages/react-hooks 86.48% <ø> (?)
packages/start 82.83% <ø> (?)
packages/typescript-client 94.30% <ø> (?)
packages/y-electric 56.05% <ø> (?)
typescript 89.20% <ø> (?)
unit-tests 89.20% <ø> (+19.54%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 14, 2026

test via pipe

@msfstef msfstef force-pushed the msfstef/fix-race-to-read-ets-buffer branch from f59417b to c27a6cb Compare April 14, 2026 15:02
During stack restarts, HTTP readers can hold a stale ETS table TID from
a per-shape buffer that has been destroyed when the owning Consumer
terminates. This causes ArgumentError crashes on :ets.next_lookup in
read_range_from_ets_cache.

Extract safe_next_lookup/2 to rescue ArgumentError and return :ets_dead,
which triggers the existing fallback-to-disk logic. The rescue is
isolated to a leaf function to preserve tail call optimization in the
recursive read_range_from_ets_cache.

Also reverse the terminate ordering to nil the ets_table reference in
the stack ETS before deleting the actual table, narrowing the race
window for new readers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@msfstef msfstef force-pushed the msfstef/fix-race-to-read-ets-buffer branch from c27a6cb to 87b5460 Compare April 14, 2026 15:08
@claude
Copy link
Copy Markdown

claude Bot commented Apr 14, 2026

Claude Code Review

Summary

This PR fixes a race condition in PureFileStorage where stale ETS buffer table references (TIDs) cause ArgumentError crashes during stack restarts/deployments. The fix is minimal, well-analyzed, and uses two complementary approaches to cover both race windows.

What's Working Well

  • Thorough root-cause analysis: The PR description precisely identifies all three race windows and explains which part of the fix closes each one. The failure-scenario table is a model for this kind of concurrent-state analysis.
  • Defense-in-depth: Reordering terminate/1 prevents new readers from seeing a dead TID (closes window Working with Postgres WAL format from Elixir #1), while safe_next_lookup/2 provides a safety net for in-flight readers that already captured the stale TID (closes window [Merged on #3] Write inserts, updates and deletes to Vaxine #2). Neither alone is sufficient.
  • Rescue placement is deliberately isolated: Extracting the rescue into safe_next_lookup/2 to preserve TCO in the recursive read_range_from_ets_cache/5 is the correct approach and the comment explains it clearly.
  • Test design: Both tests reconstruct the exact stale-metadata scenario (manipulating stack ETS directly, then deleting the buffer ETS) and confirm the fallback produces correct data rather than just not crashing.
  • Correct data invariant respected: The fix correctly relies on the existing guarantee that close_all_files (flush to disk) runs before any ETS manipulation in terminate/1. This was already the intent per the original commit message cited in the PR.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

No linked issue

The PR description is comprehensive, but there is no linked issue. Per project convention, PRs should reference the bug report or issue that prompted the fix. This matters for tracking regressions and for future contributors trying to understand the history. If this was an internal observation during a deployment, a minimal issue documenting the observed crash (with the stack trace from the PR description) would be useful before merge.

Suggestions (Nice to Have)

Add a comment at the :ets_dead match arm in read_range_from_ets_cache/5

The :ets_dead arm is handled identically to :"$end_of_table" but without explanation. The adjacent code at lines 1060 and 1079 has comments explaining "Empty or partial read - ETS was cleared..." — a one-liner here would match that pattern and aid future readers:

:ets_dead ->
  # ETS table was deleted during terminate; data is on disk. Callers
  # detect partial reads and fall back to disk like a concurrent flush.
  {Enum.reverse(acc), last_offset}

Minor: _writer in first test

_writer = PureFileStorage.perform_scheduled_flush(writer, 0) is correct (writer not needed after this point), but _ = PureFileStorage.perform_scheduled_flush(writer, 0) is the more idiomatic Elixir convention when the variable is truly unused. Not a blocker.

Issue Conformance

No linked issue. The PR description itself is thorough enough to understand the bug and the fix, but a corresponding issue would be ideal for traceability. The implementation is internally consistent with what the description promises.

Previous Review Status

First review — no previous iteration.


Review iteration: 1 | 2026-04-14

@claude
Copy link
Copy Markdown

claude Bot commented Apr 14, 2026

Claude Code Review

Summary

Fixes a crash-during-deployment caused by HTTP readers encountering stale ETS table TIDs when Consumer processes terminate and destroy their buffer ETS tables. The fix has two complementary parts: reversing the terminate/1 operation ordering and adding a safe_next_lookup/2 helper that gracefully handles a deleted table.

What Is Working Well

  • Exemplary PR description: root cause, three race windows identified, data correctness analysis table, and test plan -- this is the gold standard for a bug fix PR
  • Surgical change: 17 lines of production code, 95 lines of tests -- the minimal diff that closes both race windows without touching unrelated logic
  • Smart TCO preservation: isolating the rescue to a leaf function (safe_next_lookup) keeps the tail-call in the recursive read_range_from_ets_cache/5 intact -- idiomatic Elixir and shows attention to runtime behaviour
  • Both code paths tested: pure ETS path and mixed disk+ETS path each have a dedicated test that fails before the fix and passes after -- exactly what a regression test should be

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

1. Test names imply data is in ETS but it is actually on disk

File: packages/sync-service/test/electric/shape_cache/pure_file_storage_test.exs:1176

The test name uses "pure ETS path" in two senses: the code branch triggered (when last_persisted <= min_offset) and where the data resides. In this test the data is on disk. The name might trip up a future reader. Consider "reader falls back to disk on deleted ETS table (last_persisted <= min_offset branch)" or similar.

2. Rescue scope comment could clarify that non-existence is the only ArgumentError source

File: packages/sync-service/lib/electric/shape_cache/pure_file_storage.ex:1165

The existing comment is good. One small addition: note that ArgumentError from :ets.next_lookup exclusively signals a deleted table. This prevents future readers from wondering whether other conditions could silently suppress real bugs.

Issue Conformance

No linked issue. Per project convention, PRs should reference the issue they address. Consider creating a retrospective issue documenting the deployment-crash symptom and linking back to this fix -- useful for future incident correlation.

The fix is complete and consistent with the PR description. All three race windows are handled (window 1 by ordering, window 2 by safe_next_lookup, window 3 implicitly by both together).

Previous Review Status

First review -- no prior iteration.


Review iteration: 1 | 2026-04-14

@msfstef
Copy link
Copy Markdown
Contributor Author

msfstef commented Apr 14, 2026

Thanks for the review. Addressing the suggestions:

  1. Test names: Keeping as-is — "pure ETS path" and "mixed disk + ETS path" match the terminology in the stream_main_log comments and are clear enough in context.
  2. Comment on :ets_dead match arm: The safe_next_lookup comment already explains why this is safe. Adding another at the call site would be redundant.
  3. ArgumentError scope: :ets.next_lookup only raises ArgumentError for a non-existent table — there's no other failure mode to confuse with.
  4. Linked issue: This was observed during a deployment, not tracked as a separate issue.

@msfstef msfstef marked this pull request as ready for review April 14, 2026 15:15
Copy link
Copy Markdown
Contributor

@magnetised magnetised left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@msfstef msfstef merged commit 2c54c16 into main Apr 14, 2026
48 checks passed
@msfstef msfstef deleted the msfstef/fix-race-to-read-ets-buffer branch April 14, 2026 15:48
msfstef added a commit that referenced this pull request Apr 14, 2026
Missing changeset from #4127.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants