Route DebuggerController operations through a per-controller worker queue#1090
Route DebuggerController operations through a per-controller worker queue#1090xusheng6 wants to merge 10 commits into
Conversation
The 17 `Launch`/`Attach`/`Connect`/`Go`/`Step*`/`RunTo*`/`Restart`/`Detach`/ `Quit`/`Pause` methods each spawn a detached `std::thread` that runs the corresponding `*AndWait` work, capturing `this` by reference. Nothing keeps the controller alive for the duration of the detached call, so if the controller's refcount drops to zero before the worker exits (e.g. the user closes the tab), the worker dereferences a dangling pointer. Capture a `DbgRef<DebuggerController>` by value into each lambda so the controller stays alive until the worker thread finishes. Fixes #1083. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er queue
Replaces the previous "spawn one std::thread per operation and detach it"
pattern with a single persistent worker thread per controller plus a FIFO
work queue. Every public Xxx()/XxxAndWait() method now goes through Submit().
The worker thread is owned by the controller, started in the constructor
and joined in the destructor (after draining), so any task can no longer
outlive the controller it touches -- the lifetime guarantee is structural
rather than reliant on per-call DbgRef captures (which this PR removes).
Highlights:
* Submit() detects re-entrant calls from the worker itself and runs them
inline, so e.g. RestartAndWaitOnWorker can call QuitAndWaitOnWorker and
LaunchAndWaitOnWorker without deadlocking on the queue.
* SubmitAndWait() submits a task and waits on the resulting future. Every
XxxAndWait now takes an optional std::chrono::milliseconds timeout,
defaulting to milliseconds::max() ("wait forever") so existing callers
in ffi.cpp and uinotification.cpp keep their current behavior. If a
non-infinite timeout elapses, the engine is signaled to break and we
still wait for the in-flight op to settle before returning.
* Pause()/PauseAndWait() are special-cased: they bypass the queue and call
m_adapter->BreakInto() directly on the caller's thread. The worker is
blocked inside ExecuteAdapterAndWait for whatever resume op is in
flight; BreakInto unsticks it. PauseAndWait then waits for the worker
to drain past the running task by submitting a no-op probe.
The existing public/internal split (XxxAndWaitInternal does the actual
work; the old XxxAndWait was the lock-Internal-notify wrapper) is
preserved. The old wrapper is renamed to XxxAndWaitOnWorker (private) and
the new public XxxAndWait(timeout) is a thin SubmitAndWait wrapper around
it. This keeps the public API surface identical to today for any caller
that doesn't pass a timeout.
Out of scope for this PR: adapter-internal threads (DbgEngAdapter::Attach
spawned thread, EngineLoop, LLDB EventListener, RSP send/receive loops).
Those have their own lifetime concerns and will be handled separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…stop machinery Adapter stops are now an internal signal from the adapter thread to the worker, not events on the public dispatcher queue. PostDebuggerEvent intercepts AdapterStoppedEventType and delivers the stop reason to a new mutex+condvar pair (m_adapterStopMutex / m_adapterStopCv / m_adapterStopPending). The worker consumes them via two paths: 1. In-flight ops: ExecuteAdapterAndWait now claims the channel for its entire duration (m_inAdapterWait = true), kicks off the adapter operation, then loops on WaitForAdapterStop. The conditional- breakpoint check moves here -- on a false condition we silently resume via m_adapter->Go() and wait again, without releasing the channel. This eliminates the old "drive m_adapter->Go() from the dispatcher thread" hack and the m_suppressResumeEvent flag that papered over it. 2. Spontaneous stops: when the user types e.g. `si` directly into the LLDB REPL while no controller op is in flight, m_inAdapterWait is false, so PostDebuggerEvent queues HandleSpontaneousAdapterStop on the worker -- which updates caches and calls NotifyStopped, the same effect as the dispatcher synthesis the old code did at debuggercontroller.cpp:2279. Target-exit / detach events continue through the dispatcher queue (they're user-visible) but ALSO signal the adapter-stop channel so an in-flight WaitForAdapterStop unblocks when the target dies. The dispatcher (DebuggerMainThread) no longer has any AdapterStoppedEventType-specific code paths. m_lastAdapterStopEventConsumed and m_suppressResumeEvent are removed; the temporary RegisterEventCallback/Semaphore pattern inside ExecuteAdapterAndWait is replaced by the direct WaitForAdapterStop wait. AdapterStoppedEventType remains in the public enum so adapters keep posting it the same way -- the change is purely in how the controller routes it once it arrives at PostDebuggerEvent. Adapter implementations need no changes. Stacked on the worker-queue refactor (this PR / #1087). Implements #1089. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs in how interrupting ops (Restart/Quit/Detach) interact with the
worker queue:
1. The public Restart/Quit/Detach methods just Submit'd their work to
the queue. If a resume op (Go/Step/RunTo) was in flight, the worker
was blocked inside WaitForAdapterStop -- the queued Restart/Quit/Detach
would sit indefinitely with nothing to interrupt the running target.
To the user, clicking Restart while running looked like a no-op.
Fix: these methods now call RequestInterrupt() (a small helper that
sets m_userRequestedBreak and invokes m_adapter->BreakInto()) on the
caller's thread before queueing the task -- same pattern as Pause().
The BreakInto unsticks the in-flight op's WaitForAdapterStop, that
op drains, and the queued Restart/Quit/Detach runs next. Applied to
both the async public methods and their *AndWait(timeout) variants.
2. QuitAndWaitOnWorker, when it observed IsRunning() inside a re-entrant
call (e.g. from Restart), called the public PauseAndWait() -- which
does BreakInto + Submit([]{}).wait(). On the worker thread Submit
detects re-entry and runs the empty lambda inline, so the wait was
a no-op. We then issued Quit on a still-running target, which DbgEng
in particular doesn't handle well.
Fix: switch that call to PauseAndWaitInternal(), which routes through
ExecuteAdapterAndWait(Pause) and actually waits for the engine via
the adapter-stop channel. With (1) in place this branch should rarely
fire, but it's the correct defensive behavior.
Also pulled the BreakInto + m_userRequestedBreak pair out of Pause and
PauseAndWait into the shared RequestInterrupt() helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The controller used to subscribe to its own broadcast: when something happened,
it posted a DebuggerEvent to its own dispatcher, the dispatcher pulled the
event back out, handed it to EventHandler (running on m_debuggerEventThread),
and EventHandler mutated m_state. That roundtrip is pure overhead for what
is structurally just a "controller updates its own state" operation, and it
created a race: the worker could observe stale m_state after WaitForAdapterStop
returned but before the dispatcher had gotten around to running EventHandler.
The previous commit (now reverted) patched that race by deferring the
worker-wake until after the dispatcher; this commit removes the roundtrip
entirely.
Renames EventHandler -> ApplyOwnStateForEvent (same body) and calls it inline
from PostDebuggerEvent, before the event is enqueued for the dispatcher and
before the adapter-stop channel is signaled. Drops the
RegisterEventCallback("Debugger Core") registration in the ctor. The
dispatcher now only fans out to genuinely-external consumers (UI widgets,
plugins, scripting providers); the controller's own state is updated
synchronously by whichever thread posted the event.
This also means the adhoc fix in d4973e0 (deferring the channel signal until
the dispatcher had updated state) is no longer needed -- the signal in
PostDebuggerEvent can fire immediately after ApplyOwnStateForEvent, because
the state is already current. That commit is reset out of this branch.
Thread-safety notes: m_state's connection/execution status fields are simple
assignments. The other touched fields (m_lastIP, m_currentIP, m_exitCode,
caches via UpdateCaches, etc.) used to be mutated on the single dispatcher
thread; they're now mutated on whichever thread posts each event. In practice
the engine produces one event at a time so concurrent mutation doesn't
arise, but if it ever does a mutex is trivial to add. Deferred m_accessor
deletion still uses the detached-thread trick so we don't self-join if the
worker is the one calling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d4973e0 to
3a44260
Compare
The two adapter mutexes were the synchronization mechanism for an older model where multiple threads (spawned per operation) could call into m_adapter concurrently. mutex1 covered Go/Step/Launch/etc.; mutex2 was separate so Pause/Quit/Detach could acquire their lock without blocking on a long-running resume op. With the worker queue, ExecuteAdapterAndWait runs exclusively on m_workerThread, so try_lock() always succeeded immediately and the mutexes did nothing. The new Pause path doesn't go through ExecuteAdapterAndWait at all -- it calls m_adapter->BreakInto() out-of-band from RequestInterrupt -- so there is no longer any "concurrent" adapter access that the mutexes were protecting against. Replaces them with a `assert(this_thread::get_id() == m_workerThreadId)` at the top of ExecuteAdapterAndWait to document and enforce the actual invariant. Separately addresses the m_adapter pointer-access race that the old mutexes never actually fixed: m_adapter is written from the worker (CreateDebugAdapter) and read from arbitrary threads (RequestInterrupt's BreakInto path). Made m_adapter `std::atomic<DebugAdapter*>` so the cross-thread read/write pair is race-free under C++20's memory model. All ~50 internal `m_adapter->...` sites become `m_adapter.load()->...`; null-checks and pointer-typed assignments keep working unchanged via std::atomic's implicit conversion. RequestInterrupt and SubmitAndWait's break-on-timeout path now snapshot the pointer with one acquire load to avoid load+call races. The pointed-to adapter object's lifetime is still guaranteed externally: the adapter is destroyed in ~DebuggerState (which runs inside ~DebuggerController after both worker threads have joined), and DbgRef holders on the controller prevent its destruction during cross-thread Pause/Restart/Quit/Detach calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…attern The std::atomic<DebugAdapter*> conversion in the previous commit was half-hygiene: it eliminated a torn-read / reorder race on the pointer bits, but the meaningful failure mode (calling into a destroyed adapter object) is not protected by pointer-level atomicity at all. That protection comes from external lifetime guarantees -- the adapter is destroyed only inside ~DebuggerState, which runs inside ~DebuggerController after both worker threads have joined, and cross-thread callers hold a DbgRef on the controller during their call. Given that, the atomic was paying ~50 ".load()" calls' worth of cosmetic noise for a corner-case race (Pause during Launch) that the UI doesn't even expose. Reverts m_adapter back to a plain DebugAdapter* and the ~50 .load() sites back to ordinary arrow access. The mutex removal from the same commit (m_adapterMutex / m_adapterMutex2 -> assert(this_thread == worker)) stays; that one is unambiguously correct. Retains the cross-thread snapshot pattern in RequestInterrupt and SubmitAndWait's timeout-break path -- "if (DebugAdapter* adapter = m_adapter) adapter->BreakInto()" -- so the null check and the call see the same pointer value even without atomicity. Good hygiene regardless. The real structural fix (refcount the DebugAdapter so the type system enforces the lifetime guarantee that DbgRef currently provides by convention) is filed as a follow-up issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the user interrupts a freely-running target (Pause, or the interrupt that Restart/Quit/Detach issue), the target stops but the controller state was never updated -- the UI kept showing "running" even though the process had halted. Cause: the worker op (GoAndWaitOnWorker etc.) guarded its NotifyStopped call with !m_userRequestedBreak. That guard is a leftover from the pre-refactor model, where the Pause path itself ran ExecuteAdapterAndWait(Pause) and called NotifyStopped, so the resume op deliberately deferred to it to avoid a double notify. In the new model Pause is out-of-band (RequestInterrupt just sets the flag and calls BreakInto) and does NOT notify, so the worker op is the only place left that can report the stop -- but it was skipping it precisely when a break had been requested. Drop the !m_userRequestedBreak condition from the 14 NotifyStopped sites so the worker op always reports a genuine stop. m_userRequestedBreak is still used by ShouldSilentResumeAfterStop to suppress conditional-breakpoint auto-resume when the user explicitly broke. This also restores the pre-refactor behavior for Quit/Restart-while-running, where the old QuitAndWait called PauseAndWait (which notified an intermediate stop) before quitting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds DebuggerState::m_adapterAccessMutex, taken around every live call into the adapter regardless of which thread makes it: worker control ops + cache refresh, and UI-thread memory/register reads and writes. This serializes adapter access so two threads can't hit it at once -- which for serial-protocol backends (gdb stub) would otherwise interleave packets on the wire and corrupt the stream -- and is the concurrency half of the #1091 data-race fixes. The lock is held ONLY around individual adapter calls, never across the run-wait, so Pause/BreakInto can acquire it while the target is running. In ExecuteAdapterAndWait it wraps just the resume-request switch (released before WaitForAdapterStop) and the silent-resume Go(); RequestInterrupt wraps BreakInto; the cache Update/Read/Write paths wrap their single adapter call. Critically, the lock is never held across a call into BN core. BN core, under its own analysis lock, calls back into the debugger's ReadMemory (via the file-accessor callback), which wants the adapter lock -- so holding the adapter lock across a BN-core call inverts the order and deadlocks. DebuggerThreads::Update therefore reads raw thread/frame data from the adapter under the lock, releases it, and only then calls SymbolizeFrames (which queries BN analysis). It's recursive_mutex to tolerate any synchronous re-entrant adapter call during a callback. Per-cache mutexes (m_registersMutex/m_memoryMutex/...) are retained; they guard the cache containers, which is a separate concern from serializing the adapter calls themselves. Not yet covered: lifetime (DebuggerState refcounting), the scalar-field races (m_currentIP / connection status), and the remaining direct adapter calls (breakpoints, TTD queries, GetProcessList, Get/SetProperty, GetAdapterSettings). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| // pair (which guarded against concurrent adapter access from multiple spawned threads) | ||
| // is no longer needed. The new Pause path bypasses this method entirely and calls | ||
| // m_adapter->BreakInto() out-of-band; everything else funnels through here on the worker. | ||
| assert(std::this_thread::get_id() == m_workerThreadId); |
There was a problem hiding this comment.
I'd recommend using BN_ASSERT or BN_RELEASE_ASSERT, depending on whether we want to validate this only in debug builds or also in our release builds. I'd lean towards the latter given that this check is cheap.
| DebuggerController::~DebuggerController() | ||
| { | ||
| { | ||
| std::lock_guard<std::mutex> lock(m_workQueueMutex); |
There was a problem hiding this comment.
I think there's an issue with the way mutexes and condition variables are used here that can lead to missed wakeups. Specifically, DebuggerController::WaitForAdapterStop uses m_adapterStopMutex with its condition variable, while this store occurs with m_workQueueMutex held. This means there's a race that can happen where m_workerShouldExit is set here but DebuggerController::WaitForAdapterStop does not see it.
This is related to the note at https://en.cppreference.com/cpp/thread/condition_variable that says:
Even if the shared variable is atomic, it must be modified while owning the mutex to correctly publish the modification to the waiting thread.
| // (GoAndWaitOnWorker etc.) is the sole notifier; it calls NotifyStopped on every | ||
| // genuine stop regardless of m_userRequestedBreak. m_userRequestedBreak only | ||
| // suppresses conditional-breakpoint auto-resume in ShouldSilentResumeAfterStop. | ||
| m_userRequestedBreak = true; |
There was a problem hiding this comment.
Is it safe to modify m_userRequestedBreak without taking any locks? It appears to potentially be read / written from different threads, so either needs to always be accessed with a lock held or to be std::atomic<bool>.
|
|
||
| private: | ||
| // m_adapter is the active debug adapter for this controller. Written exclusively | ||
| // from the worker thread (in CreateDebugAdapter); read both from the worker |
There was a problem hiding this comment.
Since this field is accessed by multiple threads, don't accesses need to be guarded by a mutex to avoid being data races?
|
|
||
| void DebuggerController::WorkerThreadMain() | ||
| { | ||
| m_workerThreadId = std::this_thread::get_id(); |
There was a problem hiding this comment.
This is set on the worker thread but read from other threads without any synchronization, which is a data race. This needs some form of synchronization.
Another option would be to skip storing a thread ID, and instead use a thread-local variable that points to the DebuggerController instance when on the worker thread and is nullptr otherwise. To check whether code is running on the worker thread you'd then instead test t_controllerOnWorker == this. Since it is a thread-local variable there is no need for synchronization.
| // caches updated via UpdateCaches) are mutated under the worker's serialization | ||
| // for ops the worker generates (TargetStopped, LaunchFailure, etc.) and under | ||
| // the adapter event thread for spontaneous events (TargetExited, RegisterChanged, | ||
| // ActiveThreadChanged). Concurrent mutation is unlikely in practice because the |
There was a problem hiding this comment.
I'm concerned by this reasoning. It seems to say that this code contains data races, but it's fine because they probably don't happen very often.
A data race is when two or more threads access the same memory location, with at least one of those accesses being a write, and there is no synchronization between the accesses (synchronization here is defined in terms of the happens-before relationship).
In C++, if a data race occurs, the behavior of the program is undefined. The compiler is free to assume that data races do not happen and optimize the code on that basis. This makes it impossible to reason about what the code is actually doing.
| // in-flight op to settle before returning). A timeout of milliseconds::max() means | ||
| // "wait forever" and bypasses wait_for entirely (avoids overflow inside the stdlib). | ||
| template<typename F> | ||
| auto SubmitAndWait(F&& f, std::chrono::milliseconds timeout) |
There was a problem hiding this comment.
What behavior does this have if run on the worker thread? Submit has an explicit case for it. Does this need one? If it's never expected to be called from the worker thread, an assert would serve to document that and catch any potential future misuse.
| DebugStopReason RestartAndWaitOnWorker(); | ||
| void DetachAndWaitOnWorker(); | ||
| void QuitAndWaitOnWorker(); | ||
| DebugStopReason PauseAndWaitOnWorker(); |
There was a problem hiding this comment.
PauseAndWaitOnWorker appears to not be referenced anywhere.
Replaces the "spawn one detached
std::threadper operation" model inDebuggerControllerwith a single per-controller worker thread + FIFO queue, and makes the controller own its state directly instead of bouncing it through the event dispatcher.Implements #1088 and #1089. Supersedes the auto-closed #1087.
Why
The per-op detach pattern (17 sites) captured
thisand detached, so a controller could be destroyed while a worker was still running — the recurring use-after-free family in Sentry (#1083, BINARYNINJA-1A/3X/5Z/60/7M). The queue makes lifetime structural: the worker is owned by the controller and joined in the destructor, so no adapter work can outlive it.What changed
Submit/SubmitAndWait. One worker thread per controller, joined on destruction.Submitruns inline if called from the worker itself (soRestart→Quit+Launchdoesn't self-deadlock).XxxAndWaitgains an optionalstd::chrono::milliseconds timeout(default = wait forever). Existing callers unchanged.BreakInto()on the caller's thread, then queue the actual op. Without this they'd sit behind the in-flight resume forever.AdapterStoppedEventis now an internal worker signal (condvar), not a public dispatcher event. Conditional-breakpoint silent-resume moved onto the worker; the dispatcher's stop machinery,m_lastAdapterStopEventConsumed, andm_suppressResumeEventare gone. Adapters are unchanged.EventHandler(the controller subscribing to its own broadcast) is replaced byApplyOwnStateForEvent, called inline inPostDebuggerEventbefore the event is broadcast. The dispatcher is now pure fan-out to external consumers (UI/plugins). This removes the worker-vs-dispatcher race that broke Restart.m_adapterMutex/m_adapterMutex2guarded concurrent adapter access that can no longer happen; replaced with anassert(on worker thread).NotifyStoppedeven on a user-requested break (the old Pause path used to do it; the new out-of-band Pause doesn't), fixing "target halts but UI still shows running".API compatibility
Public method signatures unchanged (timeout is an appended optional arg). Callers in
core/ffi.cppandui/uinotification.cppneed no changes.Out of scope (follow-ups)
m_targetControlMutexis now largely redundant; kept to limit churn.DebugAdapterwas tried and reverted — the adapter has no independent lifecycle (freed only in~DebuggerState), so it solved no real UAF.Test plan
ninja— core + ui)sityped directly in the LLDB REPL while idle (spontaneous stop)test/debugger_test.py