fix(sync-service): handle deleted ETS buffer table in storage read path#4127
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4127 +/- ##
===========================================
+ Coverage 69.65% 89.20% +19.54%
===========================================
Files 15 25 +10
Lines 557 2520 +1963
Branches 0 634 +634
===========================================
+ Hits 388 2248 +1860
- Misses 169 270 +101
- Partials 0 2 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
test via pipe |
f59417b to
c27a6cb
Compare
During stack restarts, HTTP readers can hold a stale ETS table TID from a per-shape buffer that has been destroyed when the owning Consumer terminates. This causes ArgumentError crashes on :ets.next_lookup in read_range_from_ets_cache. Extract safe_next_lookup/2 to rescue ArgumentError and return :ets_dead, which triggers the existing fallback-to-disk logic. The rescue is isolated to a leaf function to preserve tail call optimization in the recursive read_range_from_ets_cache. Also reverse the terminate ordering to nil the ets_table reference in the stack ETS before deleting the actual table, narrowing the race window for new readers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
c27a6cb to
87b5460
Compare
Claude Code ReviewSummaryThis PR fixes a race condition in What's Working Well
Issues FoundCritical (Must Fix)None. Important (Should Fix)No linked issue The PR description is comprehensive, but there is no linked issue. Per project convention, PRs should reference the bug report or issue that prompted the fix. This matters for tracking regressions and for future contributors trying to understand the history. If this was an internal observation during a deployment, a minimal issue documenting the observed crash (with the stack trace from the PR description) would be useful before merge. Suggestions (Nice to Have)Add a comment at the The :ets_dead ->
# ETS table was deleted during terminate; data is on disk. Callers
# detect partial reads and fall back to disk like a concurrent flush.
{Enum.reverse(acc), last_offset}Minor:
Issue ConformanceNo linked issue. The PR description itself is thorough enough to understand the bug and the fix, but a corresponding issue would be ideal for traceability. The implementation is internally consistent with what the description promises. Previous Review StatusFirst review — no previous iteration. Review iteration: 1 | 2026-04-14 |
Claude Code ReviewSummaryFixes a crash-during-deployment caused by HTTP readers encountering stale ETS table TIDs when Consumer processes terminate and destroy their buffer ETS tables. The fix has two complementary parts: reversing the terminate/1 operation ordering and adding a safe_next_lookup/2 helper that gracefully handles a deleted table. What Is Working Well
Issues FoundCritical (Must Fix)None. Important (Should Fix)None. Suggestions (Nice to Have)1. Test names imply data is in ETS but it is actually on disk File: packages/sync-service/test/electric/shape_cache/pure_file_storage_test.exs:1176 The test name uses "pure ETS path" in two senses: the code branch triggered (when last_persisted <= min_offset) and where the data resides. In this test the data is on disk. The name might trip up a future reader. Consider "reader falls back to disk on deleted ETS table (last_persisted <= min_offset branch)" or similar. 2. Rescue scope comment could clarify that non-existence is the only ArgumentError source File: packages/sync-service/lib/electric/shape_cache/pure_file_storage.ex:1165 The existing comment is good. One small addition: note that ArgumentError from :ets.next_lookup exclusively signals a deleted table. This prevents future readers from wondering whether other conditions could silently suppress real bugs. Issue ConformanceNo linked issue. Per project convention, PRs should reference the issue they address. Consider creating a retrospective issue documenting the deployment-crash symptom and linking back to this fix -- useful for future incident correlation. The fix is complete and consistent with the PR description. All three race windows are handled (window 1 by ordering, window 2 by safe_next_lookup, window 3 implicitly by both together). Previous Review StatusFirst review -- no prior iteration. Review iteration: 1 | 2026-04-14 |
|
Thanks for the review. Addressing the suggestions:
|
Missing changeset from #4127. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
During deployments, we observed an onslaught of
ArgumentErrorcrashes when the stack is trying to get ready:Root cause
Each shape's Consumer process owns an unnamed ETS buffer table for in-memory log data. The table's TID (reference) is stored in the stack-wide named ETS table so that HTTP reader processes can look it up via
for_shape/2andread_or_initialize_metadata/2.During stack restarts, Consumer processes terminate (or crash and restart), which destroys their buffer ETS tables. HTTP requests that are in-flight — or arrive during this window — can capture a stale TID and crash when they attempt
:ets.next_lookupon it.Race windows identified
:ets.delete(buffer_ets)andclean_shape_ets_entry(which nils the reference in stack ETS), new readers could look up the dead TID from stack ETS.Fix
safe_next_lookup/2: Wraps:ets.next_lookupin a rescue forArgumentError, returning:ets_dead. The recursiveread_range_from_ets_cache/5handles this the same as:"$end_of_table"— returning whatever was accumulated so far. The callers at both call sites (pure ETS path and mixed disk+ETS path) already detect empty/partial reads and fall back to reading from disk. The rescue is isolated to this leaf function specifically to preserve tail call optimization in the recursive reader.clean_shape_ets_entry(which nils theets_tablefield) now runs before:ets.delete(buffer_ets), closing race window Working with Postgres WAL format from Elixir #1 for new readers — they seeets_table: niland hit the existing nil guard atread_range_from_ets_cache(nil, _, _). Git history confirms the original intent was nil-before-delete (commit 5677df7), and the current ordering was an artifact of successive refactors.Data correctness analysis
The key invariant that makes this safe:
terminatecallsclose_all_files(which flushes buffered data to disk viaIO.binwrite+:file.datasync) before touching the ETS table. After terminate, all data that was in the buffer ETS is on disk.The existing code already handles the semantically identical case of "ETS cleared by a concurrent flush" — comments at lines 1060 and 1079 describe this exact pattern. A deleted table is the same situation: data was in ETS, now it's on disk.
Failure scenario analysis:
In all cases: data returned is correct and ordered. The only failure mode is returning less data than the absolute latest, which triggers client retry.
Test plan
ArgumentErrorbefore the fix, pass aftertest/electric/shape_cache/suite passes (181 tests, 0 failures)🤖 Generated with Claude Code