fix: Race condition for reader after shape writer flushes data#3719
Conversation
📝 WalkthroughWalkthroughDetects empty or partial ETS reads in PureFileStorage and retries by refreshing metadata and reading from disk when ETS data is missing or truncated; no exported/public signatures were changed. Changes
Sequence Diagram(s)sequenceDiagram
participant Reader
participant ETS
participant Metadata
participant Disk
participant Writer
Note over Writer,ETS: Concurrent flush may clear/partially update ETS
Reader->>ETS: read_range_from_ets_cache(min,max)
ETS-->>Reader: {data, last_offset} or {[], nil}
alt ETS returned nil or last_offset < upper_read_bound
Reader->>Metadata: refresh_metadata()
Metadata-->>Reader: fresh_metadata (last_persisted)
Reader->>Disk: stream_from_disk(range_based_on_metadata)
Disk-->>Reader: full_data_stream
else ETS returned complete last_offset
ETS-->>Reader: data (used directly)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
📜 Recent review detailsConfiguration used: Organization UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used🧬 Code graph analysis (1)packages/sync-service/test/electric/shape_cache/pure_file_storage_test.exs (3)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
🔇 Additional comments (1)
✏️ Tip: You can disable this entire section by setting Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3719 +/- ##
=======================================
Coverage 87.36% 87.36%
=======================================
Files 23 23
Lines 2011 2011
Branches 532 528 -4
=======================================
Hits 1757 1757
Misses 252 252
Partials 2 2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This comment has been minimized.
This comment has been minimized.
✅ Deploy Preview for electric-next ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
912988a to
a33f712
Compare
| # Pure ETS read case | ||
| case read_range_from_ets_cache(ets, min_offset, upper_read_bound) do | ||
| {_data, last_offset} | ||
| when not retried? and (is_nil(last_offset) or last_offset < upper_read_bound_tuple) -> |
There was a problem hiding this comment.
I'm contemplating whether we want to retry, or ti simply directly go to the main log since we know that this happens if we've flushed.
There was a problem hiding this comment.
I've updated the code to now directly read from the disk if the ETS cache is empty, since we know it will have flushed in that case
9391aaf to
a33f712
Compare
When a reader uses stale metadata to decide to read from ETS, but a concurrent flush clears ETS before the read completes, the reader could return empty results even though data exists on disk. The fix adds retry logic to stream_main_log: when ETS returns empty unexpectedly, retry once with fresh metadata. After a flush, metadata is updated (last_persisted) BEFORE ETS is cleared, so the retry will see the new offset and read from disk instead. Also removes investigation files from previous commit. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Handle partial ETS reads (not just empty) by tracking last offset read - Unify empty and partial read cases into single retry condition - Add proper @SPEC annotations using :ets.tid() type - Remove is_reference guards that caused dialyzer warnings - Add test for partial ETS read race condition Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
a33f712 to
2defcd3
Compare
Without mode: :shared, Repatch patches can leak to other tests running in parallel, causing doctest failures with REPATCH- prefixed function names in error messages. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When detecting empty/partial ETS reads, read directly from disk using existing `upper_read_bound` and `boundary_info` instead of re-reading metadata. This is safe because: 1. Flush order guarantees: write to disk → update metadata → clear ETS 2. If ETS is cleared, data is already on disk up to `upper_read_bound` 3. Existing `boundary_info` is valid for reading disk data Removes the `read_from_disk_after_ets_race` helper - just call `stream_from_disk` directly with existing values. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Repatch instruments the entire module when patching a function, which was causing LogOffset doctests to fail with "REPATCH-new" in the function name. Simplify the race condition tests to directly test the fallback behavior without needing to simulate exact race timing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
This PR has been released! 🚀 The following packages include changes from this PR:
Thanks for contributing to Electric! |
Summary
Fixes a race condition in
PureFileStoragewhere readers can get empty or partial results when reading from ETS while a concurrent flush clears the buffer.The Bug
When a reader calls
get_log_stream, it reads metadata (includinglast_persisted_offset) as a snapshot, then uses that to decide whether to read from disk or ETS. However, a concurrent flush can happen between these steps:The data exists on disk but the reader misses it because it made its read decision based on stale metadata.
Two failure modes
Empty ETS: Flush completes entirely before reader's first
next_lookupcall → reader immediately gets'$end_of_table'and returns[]Partial ETS read: Reader iterates with multiple
next_lookupcalls, anddelete_all_objectsruns between calls:Reader got entries 1-2 but missed entries 3-N.
Note on ETS guarantees: The buffer is an
ordered_settable. Per Erlang ETS docs,delete_all_objectsis atomic and isolated—each individualnext_lookupcall sees either the full table or the empty table, never a mid-deletion state. However, a sequence ofnext_lookupcalls can span the deletion, resulting in partial data. Forordered_settables,next_lookupwill not crash during concurrent deletion; it simply returns'$end_of_table'when the table is empty.Both failure modes result in the client receiving less data than expected, but being told (via
electric-offsetheader) that they received data up to the requested offset.The Fix
Modified
read_range_from_ets_cacheto return{data, last_offset_read}wherelast_offset_readis the tuple offset of the last entry actually read (ornilif empty). This allows detecting incomplete reads.When an incomplete ETS read is detected (empty or partial), we fall back directly to reading from disk. This is safe because of the flush order in
write_loop.ex:IO.binwrite+:file.datasync— data written and synced to diskupdate_persistance_metadata— updateslast_persisted_offsetin stack ETStrim_ets— clears shape's ETS buffer via:ets.delete_all_objectsIf ETS is cleared, data is guaranteed to be on disk. We can read directly using the existing
upper_read_boundandboundary_infowithout re-reading metadata.Design Decision
The goal was to minimally alter the existing code. Alternative approaches like reader counters, delayed ETS clearing, or rearchitecting read/flush cycles were not considered—the fallback-to-disk approach requires only local changes to the reader path and doesn't affect the writer's eager memory cleanup behavior.
Test Plan
Summary by CodeRabbit
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.