Route DebuggerController operations through a per-controller worker queue by xusheng6 · Pull Request #1090 · Vector35/debugger

xusheng6 · 2026-05-20T20:44:50Z

Replaces the "spawn one detached std::thread per operation" model in DebuggerController with a single per-controller worker thread + FIFO queue, and makes the controller own its state directly instead of bouncing it through the event dispatcher.

Implements #1088 and #1089. Supersedes the auto-closed #1087.

Why

The per-op detach pattern (17 sites) captured this and detached, so a controller could be destroyed while a worker was still running — the recurring use-after-free family in Sentry (#1083, BINARYNINJA-1A/3X/5Z/60/7M). The queue makes lifetime structural: the worker is owned by the controller and joined in the destructor, so no adapter work can outlive it.

What changed

Worker queue. All ops route through Submit / SubmitAndWait. One worker thread per controller, joined on destruction. Submit runs inline if called from the worker itself (so Restart → Quit + Launch doesn't self-deadlock).
Timeout. Every XxxAndWait gains an optional std::chrono::milliseconds timeout (default = wait forever). Existing callers unchanged.
Pause / Restart / Quit / Detach interrupt a running target out-of-band: they call BreakInto() on the caller's thread, then queue the actual op. Without this they'd sit behind the in-flight resume forever.
Adapter-stop channel. AdapterStoppedEvent is now an internal worker signal (condvar), not a public dispatcher event. Conditional-breakpoint silent-resume moved onto the worker; the dispatcher's stop machinery, m_lastAdapterStopEventConsumed, and m_suppressResumeEvent are gone. Adapters are unchanged.
State on the worker. EventHandler (the controller subscribing to its own broadcast) is replaced by ApplyOwnStateForEvent, called inline in PostDebuggerEvent before the event is broadcast. The dispatcher is now pure fan-out to external consumers (UI/plugins). This removes the worker-vs-dispatcher race that broke Restart.
Mutexes removed. m_adapterMutex / m_adapterMutex2 guarded concurrent adapter access that can no longer happen; replaced with an assert(on worker thread).
Stop is always reported. The worker op calls NotifyStopped even on a user-requested break (the old Pause path used to do it; the new out-of-band Pause doesn't), fixing "target halts but UI still shows running".

API compatibility

Public method signatures unchanged (timeout is an appended optional arg). Callers in core/ffi.cpp and ui/uinotification.cpp need no changes.

Out of scope (follow-ups)

Adapter-internal threads (EngineLoop, LLDB listener, RSP loops) — need adapter dtors as join barriers.
m_targetControlMutex is now largely redundant; kept to limit churn.
Refcounted DebugAdapter was tried and reverted — the adapter has no independent lifecycle (freed only in ~DebuggerState), so it solved no real UAF.
UI repaint throttling for fast stepping — UI widgets: mark-dirty-on-change, repaint at most every 100ms #1093.

Test plan

Builds clean (ninja — core + ui)
Launch / step / breakpoint / pause while running / detach mid-step / close tab mid-op
Restart on a running target
si typed directly in the LLDB REPL while idle (spontaneous stop)
Conditional breakpoint with a false condition (silent resume)
IL step surfaces one stop per user action
test/debugger_test.py

The 17 `Launch`/`Attach`/`Connect`/`Go`/`Step*`/`RunTo*`/`Restart`/`Detach`/ `Quit`/`Pause` methods each spawn a detached `std::thread` that runs the corresponding `*AndWait` work, capturing `this` by reference. Nothing keeps the controller alive for the duration of the detached call, so if the controller's refcount drops to zero before the worker exits (e.g. the user closes the tab), the worker dereferences a dangling pointer. Capture a `DbgRef<DebuggerController>` by value into each lambda so the controller stays alive until the worker thread finishes. Fixes #1083. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…er queue Replaces the previous "spawn one std::thread per operation and detach it" pattern with a single persistent worker thread per controller plus a FIFO work queue. Every public Xxx()/XxxAndWait() method now goes through Submit(). The worker thread is owned by the controller, started in the constructor and joined in the destructor (after draining), so any task can no longer outlive the controller it touches -- the lifetime guarantee is structural rather than reliant on per-call DbgRef captures (which this PR removes). Highlights: * Submit() detects re-entrant calls from the worker itself and runs them inline, so e.g. RestartAndWaitOnWorker can call QuitAndWaitOnWorker and LaunchAndWaitOnWorker without deadlocking on the queue. * SubmitAndWait() submits a task and waits on the resulting future. Every XxxAndWait now takes an optional std::chrono::milliseconds timeout, defaulting to milliseconds::max() ("wait forever") so existing callers in ffi.cpp and uinotification.cpp keep their current behavior. If a non-infinite timeout elapses, the engine is signaled to break and we still wait for the in-flight op to settle before returning. * Pause()/PauseAndWait() are special-cased: they bypass the queue and call m_adapter->BreakInto() directly on the caller's thread. The worker is blocked inside ExecuteAdapterAndWait for whatever resume op is in flight; BreakInto unsticks it. PauseAndWait then waits for the worker to drain past the running task by submitting a no-op probe. The existing public/internal split (XxxAndWaitInternal does the actual work; the old XxxAndWait was the lock-Internal-notify wrapper) is preserved. The old wrapper is renamed to XxxAndWaitOnWorker (private) and the new public XxxAndWait(timeout) is a thin SubmitAndWait wrapper around it. This keeps the public API surface identical to today for any caller that doesn't pass a timeout. Out of scope for this PR: adapter-internal threads (DbgEngAdapter::Attach spawned thread, EngineLoop, LLDB EventListener, RSP send/receive loops). Those have their own lifetime concerns and will be handled separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…stop machinery Adapter stops are now an internal signal from the adapter thread to the worker, not events on the public dispatcher queue. PostDebuggerEvent intercepts AdapterStoppedEventType and delivers the stop reason to a new mutex+condvar pair (m_adapterStopMutex / m_adapterStopCv / m_adapterStopPending). The worker consumes them via two paths: 1. In-flight ops: ExecuteAdapterAndWait now claims the channel for its entire duration (m_inAdapterWait = true), kicks off the adapter operation, then loops on WaitForAdapterStop. The conditional- breakpoint check moves here -- on a false condition we silently resume via m_adapter->Go() and wait again, without releasing the channel. This eliminates the old "drive m_adapter->Go() from the dispatcher thread" hack and the m_suppressResumeEvent flag that papered over it. 2. Spontaneous stops: when the user types e.g. `si` directly into the LLDB REPL while no controller op is in flight, m_inAdapterWait is false, so PostDebuggerEvent queues HandleSpontaneousAdapterStop on the worker -- which updates caches and calls NotifyStopped, the same effect as the dispatcher synthesis the old code did at debuggercontroller.cpp:2279. Target-exit / detach events continue through the dispatcher queue (they're user-visible) but ALSO signal the adapter-stop channel so an in-flight WaitForAdapterStop unblocks when the target dies. The dispatcher (DebuggerMainThread) no longer has any AdapterStoppedEventType-specific code paths. m_lastAdapterStopEventConsumed and m_suppressResumeEvent are removed; the temporary RegisterEventCallback/Semaphore pattern inside ExecuteAdapterAndWait is replaced by the direct WaitForAdapterStop wait. AdapterStoppedEventType remains in the public enum so adapters keep posting it the same way -- the change is purely in how the controller routes it once it arrives at PostDebuggerEvent. Adapter implementations need no changes. Stacked on the worker-queue refactor (this PR / #1087). Implements #1089. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two bugs in how interrupting ops (Restart/Quit/Detach) interact with the worker queue: 1. The public Restart/Quit/Detach methods just Submit'd their work to the queue. If a resume op (Go/Step/RunTo) was in flight, the worker was blocked inside WaitForAdapterStop -- the queued Restart/Quit/Detach would sit indefinitely with nothing to interrupt the running target. To the user, clicking Restart while running looked like a no-op. Fix: these methods now call RequestInterrupt() (a small helper that sets m_userRequestedBreak and invokes m_adapter->BreakInto()) on the caller's thread before queueing the task -- same pattern as Pause(). The BreakInto unsticks the in-flight op's WaitForAdapterStop, that op drains, and the queued Restart/Quit/Detach runs next. Applied to both the async public methods and their *AndWait(timeout) variants. 2. QuitAndWaitOnWorker, when it observed IsRunning() inside a re-entrant call (e.g. from Restart), called the public PauseAndWait() -- which does BreakInto + Submit([]{}).wait(). On the worker thread Submit detects re-entry and runs the empty lambda inline, so the wait was a no-op. We then issued Quit on a still-running target, which DbgEng in particular doesn't handle well. Fix: switch that call to PauseAndWaitInternal(), which routes through ExecuteAdapterAndWait(Pause) and actually waits for the engine via the adapter-stop channel. With (1) in place this branch should rarely fire, but it's the correct defensive behavior. Also pulled the BreakInto + m_userRequestedBreak pair out of Pause and PauseAndWait into the shared RequestInterrupt() helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The controller used to subscribe to its own broadcast: when something happened, it posted a DebuggerEvent to its own dispatcher, the dispatcher pulled the event back out, handed it to EventHandler (running on m_debuggerEventThread), and EventHandler mutated m_state. That roundtrip is pure overhead for what is structurally just a "controller updates its own state" operation, and it created a race: the worker could observe stale m_state after WaitForAdapterStop returned but before the dispatcher had gotten around to running EventHandler. The previous commit (now reverted) patched that race by deferring the worker-wake until after the dispatcher; this commit removes the roundtrip entirely. Renames EventHandler -> ApplyOwnStateForEvent (same body) and calls it inline from PostDebuggerEvent, before the event is enqueued for the dispatcher and before the adapter-stop channel is signaled. Drops the RegisterEventCallback("Debugger Core") registration in the ctor. The dispatcher now only fans out to genuinely-external consumers (UI widgets, plugins, scripting providers); the controller's own state is updated synchronously by whichever thread posted the event. This also means the adhoc fix in d4973e0 (deferring the channel signal until the dispatcher had updated state) is no longer needed -- the signal in PostDebuggerEvent can fire immediately after ApplyOwnStateForEvent, because the state is already current. That commit is reset out of this branch. Thread-safety notes: m_state's connection/execution status fields are simple assignments. The other touched fields (m_lastIP, m_currentIP, m_exitCode, caches via UpdateCaches, etc.) used to be mutated on the single dispatcher thread; they're now mutated on whichever thread posts each event. In practice the engine produces one event at a time so concurrent mutation doesn't arise, but if it ever does a mutex is trivial to add. Deferred m_accessor deletion still uses the detached-thread trick so we don't self-join if the worker is the one calling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The two adapter mutexes were the synchronization mechanism for an older model where multiple threads (spawned per operation) could call into m_adapter concurrently. mutex1 covered Go/Step/Launch/etc.; mutex2 was separate so Pause/Quit/Detach could acquire their lock without blocking on a long-running resume op. With the worker queue, ExecuteAdapterAndWait runs exclusively on m_workerThread, so try_lock() always succeeded immediately and the mutexes did nothing. The new Pause path doesn't go through ExecuteAdapterAndWait at all -- it calls m_adapter->BreakInto() out-of-band from RequestInterrupt -- so there is no longer any "concurrent" adapter access that the mutexes were protecting against. Replaces them with a `assert(this_thread::get_id() == m_workerThreadId)` at the top of ExecuteAdapterAndWait to document and enforce the actual invariant. Separately addresses the m_adapter pointer-access race that the old mutexes never actually fixed: m_adapter is written from the worker (CreateDebugAdapter) and read from arbitrary threads (RequestInterrupt's BreakInto path). Made m_adapter `std::atomic<DebugAdapter*>` so the cross-thread read/write pair is race-free under C++20's memory model. All ~50 internal `m_adapter->...` sites become `m_adapter.load()->...`; null-checks and pointer-typed assignments keep working unchanged via std::atomic's implicit conversion. RequestInterrupt and SubmitAndWait's break-on-timeout path now snapshot the pointer with one acquire load to avoid load+call races. The pointed-to adapter object's lifetime is still guaranteed externally: the adapter is destroyed in ~DebuggerState (which runs inside ~DebuggerController after both worker threads have joined), and DbgRef holders on the controller prevent its destruction during cross-thread Pause/Restart/Quit/Detach calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…attern The std::atomic<DebugAdapter*> conversion in the previous commit was half-hygiene: it eliminated a torn-read / reorder race on the pointer bits, but the meaningful failure mode (calling into a destroyed adapter object) is not protected by pointer-level atomicity at all. That protection comes from external lifetime guarantees -- the adapter is destroyed only inside ~DebuggerState, which runs inside ~DebuggerController after both worker threads have joined, and cross-thread callers hold a DbgRef on the controller during their call. Given that, the atomic was paying ~50 ".load()" calls' worth of cosmetic noise for a corner-case race (Pause during Launch) that the UI doesn't even expose. Reverts m_adapter back to a plain DebugAdapter* and the ~50 .load() sites back to ordinary arrow access. The mutex removal from the same commit (m_adapterMutex / m_adapterMutex2 -> assert(this_thread == worker)) stays; that one is unambiguously correct. Retains the cross-thread snapshot pattern in RequestInterrupt and SubmitAndWait's timeout-break path -- "if (DebugAdapter* adapter = m_adapter) adapter->BreakInto()" -- so the null check and the call see the same pointer value even without atomicity. Good hygiene regardless. The real structural fix (refcount the DebugAdapter so the type system enforces the lifetime guarantee that DbgRef currently provides by convention) is filed as a follow-up issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When the user interrupts a freely-running target (Pause, or the interrupt that Restart/Quit/Detach issue), the target stops but the controller state was never updated -- the UI kept showing "running" even though the process had halted. Cause: the worker op (GoAndWaitOnWorker etc.) guarded its NotifyStopped call with !m_userRequestedBreak. That guard is a leftover from the pre-refactor model, where the Pause path itself ran ExecuteAdapterAndWait(Pause) and called NotifyStopped, so the resume op deliberately deferred to it to avoid a double notify. In the new model Pause is out-of-band (RequestInterrupt just sets the flag and calls BreakInto) and does NOT notify, so the worker op is the only place left that can report the stop -- but it was skipping it precisely when a break had been requested. Drop the !m_userRequestedBreak condition from the 14 NotifyStopped sites so the worker op always reports a genuine stop. m_userRequestedBreak is still used by ShouldSilentResumeAfterStop to suppress conditional-breakpoint auto-resume when the user explicitly broke. This also restores the pre-refactor behavior for Quit/Restart-while-running, where the old QuitAndWait called PauseAndWait (which notified an intermediate stop) before quitting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds DebuggerState::m_adapterAccessMutex, taken around every live call into the adapter regardless of which thread makes it: worker control ops + cache refresh, and UI-thread memory/register reads and writes. This serializes adapter access so two threads can't hit it at once -- which for serial-protocol backends (gdb stub) would otherwise interleave packets on the wire and corrupt the stream -- and is the concurrency half of the #1091 data-race fixes. The lock is held ONLY around individual adapter calls, never across the run-wait, so Pause/BreakInto can acquire it while the target is running. In ExecuteAdapterAndWait it wraps just the resume-request switch (released before WaitForAdapterStop) and the silent-resume Go(); RequestInterrupt wraps BreakInto; the cache Update/Read/Write paths wrap their single adapter call. Critically, the lock is never held across a call into BN core. BN core, under its own analysis lock, calls back into the debugger's ReadMemory (via the file-accessor callback), which wants the adapter lock -- so holding the adapter lock across a BN-core call inverts the order and deadlocks. DebuggerThreads::Update therefore reads raw thread/frame data from the adapter under the lock, releases it, and only then calls SymbolizeFrames (which queries BN analysis). It's recursive_mutex to tolerate any synchronous re-entrant adapter call during a callback. Per-cache mutexes (m_registersMutex/m_memoryMutex/...) are retained; they guard the cache containers, which is a separate concern from serializing the adapter calls themselves. Not yet covered: lifetime (DebuggerState refcounting), the scalar-field races (m_currentIP / connection status), and the remaining direct adapter calls (breakpoints, TTD queries, GetProcessList, Get/SetProperty, GetAdapterSettings). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bdash · 2026-05-26T15:41:38Z

+	// pair (which guarded against concurrent adapter access from multiple spawned threads)
+	// is no longer needed. The new Pause path bypasses this method entirely and calls
+	// m_adapter->BreakInto() out-of-band; everything else funnels through here on the worker.
+	assert(std::this_thread::get_id() == m_workerThreadId);


I'd recommend using BN_ASSERT or BN_RELEASE_ASSERT, depending on whether we want to validate this only in debug builds or also in our release builds. I'd lean towards the latter given that this check is cheap.

bdash · 2026-05-26T17:27:18Z

 DebuggerController::~DebuggerController()
 {
+	{
+		std::lock_guard<std::mutex> lock(m_workQueueMutex);


I think there's an issue with the way mutexes and condition variables are used here that can lead to missed wakeups. Specifically, DebuggerController::WaitForAdapterStop uses m_adapterStopMutex with its condition variable, while this store occurs with m_workQueueMutex held. This means there's a race that can happen where m_workerShouldExit is set here but DebuggerController::WaitForAdapterStop does not see it.

This is related to the note at https://en.cppreference.com/cpp/thread/condition_variable that says:

Even if the shared variable is atomic, it must be modified while owning the mutex to correctly publish the modification to the waiting thread.

bdash · 2026-05-26T17:31:11Z

+	// (GoAndWaitOnWorker etc.) is the sole notifier; it calls NotifyStopped on every
+	// genuine stop regardless of m_userRequestedBreak. m_userRequestedBreak only
+	// suppresses conditional-breakpoint auto-resume in ShouldSilentResumeAfterStop.
+	m_userRequestedBreak = true;


Is it safe to modify m_userRequestedBreak without taking any locks? It appears to potentially be read / written from different threads, so either needs to always be accessed with a lock held or to be std::atomic<bool>.

bdash · 2026-05-26T17:34:08Z


 	private:
+		// m_adapter is the active debug adapter for this controller. Written exclusively
+		// from the worker thread (in CreateDebugAdapter); read both from the worker


Since this field is accessed by multiple threads, don't accesses need to be guarded by a mutex to avoid being data races?

bdash · 2026-05-26T17:42:41Z


+void DebuggerController::WorkerThreadMain()
+{
+	m_workerThreadId = std::this_thread::get_id();


This is set on the worker thread but read from other threads without any synchronization, which is a data race. This needs some form of synchronization.

Another option would be to skip storing a thread ID, and instead use a thread-local variable that points to the DebuggerController instance when on the worker thread and is nullptr otherwise. To check whether code is running on the worker thread you'd then instead test t_controllerOnWorker == this. Since it is a thread-local variable there is no need for synchronization.

bdash · 2026-05-26T18:00:58Z

+// caches updated via UpdateCaches) are mutated under the worker's serialization
+// for ops the worker generates (TargetStopped, LaunchFailure, etc.) and under
+// the adapter event thread for spontaneous events (TargetExited, RegisterChanged,
+// ActiveThreadChanged). Concurrent mutation is unlikely in practice because the


I'm concerned by this reasoning. It seems to say that this code contains data races, but it's fine because they probably don't happen very often.

A data race is when two or more threads access the same memory location, with at least one of those accesses being a write, and there is no synchronization between the accesses (synchronization here is defined in terms of the happens-before relationship).

In C++, if a data race occurs, the behavior of the program is undefined. The compiler is free to assume that data races do not happen and optimize the code on that basis. This makes it impossible to reason about what the code is actually doing.

bdash · 2026-05-26T18:07:57Z

+		// in-flight op to settle before returning). A timeout of milliseconds::max() means
+		// "wait forever" and bypasses wait_for entirely (avoids overflow inside the stdlib).
+		template<typename F>
+		auto SubmitAndWait(F&& f, std::chrono::milliseconds timeout)


What behavior does this have if run on the worker thread? Submit has an explicit case for it. Does this need one? If it's never expected to be called from the worker thread, an assert would serve to document that and catch any potential future misuse.

bdash · 2026-05-26T18:08:31Z

+		DebugStopReason RestartAndWaitOnWorker();
+		void DetachAndWaitOnWorker();
+		void QuitAndWaitOnWorker();
+		DebugStopReason PauseAndWaitOnWorker();


PauseAndWaitOnWorker appears to not be referenced anywhere.

xusheng6 and others added 5 commits May 20, 2026 15:02

xusheng6 force-pushed the test_refactor-worker-queue branch from d4973e0 to 3a44260 Compare May 21, 2026 19:58

xusheng6 and others added 2 commits May 21, 2026 16:29

This was referenced May 21, 2026

Refcount DebugAdapter to enforce cross-thread lifetime guarantees #1092

Closed

UI widgets: mark-dirty-on-change, repaint at most every 100ms #1093

Open

xusheng6 and others added 3 commits May 25, 2026 11:03

Remove unnecessary time delays in the unit test

445bec4

bdash reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Route DebuggerController operations through a per-controller worker queue#1090

Route DebuggerController operations through a per-controller worker queue#1090
xusheng6 wants to merge 10 commits into
devfrom
test_refactor-worker-queue

xusheng6 commented May 20, 2026 •

edited

Loading

Uh oh!

bdash May 26, 2026

Uh oh!

bdash May 26, 2026

Uh oh!

bdash May 26, 2026

Uh oh!

bdash May 26, 2026

Uh oh!

bdash May 26, 2026

Uh oh!

bdash May 26, 2026

Uh oh!

bdash May 26, 2026

Uh oh!

bdash May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

xusheng6 commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What changed

API compatibility

Out of scope (follow-ups)

Test plan

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xusheng6 commented May 20, 2026 •

edited

Loading