fix(sdk): fix observability events not reaching the server #63
Merged
Three fixes for control execution stats not showing in the UI:

1. EventBatcher lazy-start: when init() is called from sync context, the flush loop never started because no event loop was available. Added _try_attach_loop(), called from add_event(), to lazily attach to the running loop on first use in async context.
2. Enable observability by default: changed the observability_enabled default from False to True, matching the server-side default and eliminating the most common setup foot-gun.
3. Reduce flush interval from 10s to 5s: matches the UI polling interval and reduces event loss on short-lived test runs.
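The lazy-start idea in the first fix can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `LazyBatcher` and its attributes are hypothetical stand-ins for EventBatcher, and the flush body is a placeholder. Only `_try_attach_loop()` and `add_event()` are names taken from the commit message.

```python
import asyncio


class LazyBatcher:
    """Hypothetical stand-in for the SDK's EventBatcher (illustration only)."""

    def __init__(self, flush_interval=5.0):
        self.flush_interval = flush_interval
        self.events = []
        self._loop = None
        self._flush_task = None

    async def _flush_loop(self):
        while True:
            await asyncio.sleep(self.flush_interval)
            self.events.clear()  # placeholder for sending the batch

    def _try_attach_loop(self):
        # Already attached: nothing to do.
        if self._loop is not None:
            return True
        try:
            loop = asyncio.get_running_loop()
        except RuntimeError:
            # Called from sync code: no running loop, try again on a later event.
            return False
        self._loop = loop
        self._flush_task = loop.create_task(self._flush_loop())
        return True

    def add_event(self, event):
        self.events.append(event)
        self._try_attach_loop()


batcher = LazyBatcher()

# Sync context: the event is queued, but no flush task can be started yet.
batcher.add_event("from sync code")
attached_in_sync = batcher._flush_task is not None


async def main():
    # Async context: the first add_event() attaches to the running loop.
    batcher.add_event("from async code")
    attached = batcher._flush_task is not None
    batcher._flush_task.cancel()  # tidy up before the loop closes
    return attached


attached_in_async = asyncio.run(main())
```

Note that this sketch checks only whether `_loop` is `None`, so it does not notice a loop that has since been closed.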
sync_wrapper uses asyncio.run() per invocation, which creates and closes a loop on each call. EventBatcher only checked whether _loop is None, so a closed-but-non-None loop was treated as healthy: flush tasks silently died and events accumulated without being sent.

Fix _try_attach_loop() to detect closed loops via is_closed() and reattach to the current running loop. Move all loop lifecycle logic under _lock to eliminate the TOCTOU race on concurrent add_event() calls. Clear stale state in _schedule_flush() on RuntimeError so the next event can trigger reattachment.
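The closed-loop condition is easy to reproduce with the standard library alone. The snippet below is a small demonstration, not SDK code; `remember_loop` and `loop_is_usable` are hypothetical helper names. It shows that each `asyncio.run()` call creates its own loop and closes it on return, so a stored loop reference must be checked with `is_closed()` as well as against `None`.

```python
import asyncio

captured = []
usable_while_running = []


def loop_is_usable(loop):
    # A bare `loop is None` check misses loops that were closed after use.
    return loop is not None and not loop.is_closed()


async def remember_loop():
    # Record the loop that asyncio.run() created for this call.
    loop = asyncio.get_running_loop()
    captured.append(loop)
    usable_while_running.append(loop_is_usable(loop))


# Two separate asyncio.run() calls, as a per-invocation sync wrapper would make.
asyncio.run(remember_loop())
asyncio.run(remember_loop())

first, second = captured
```

After both calls return, `first` and `second` are distinct loop objects and both report `is_closed()` as true, even though neither reference is `None`.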
Replace the lazy loop-attachment approach with a dedicated daemon thread that owns its own event loop. This eliminates the entire class of caller-loop lifecycle bugs (closed loops from asyncio.run, sync-context init, TOCTOU races on reattachment).

- start() spawns a daemon thread running its own event loop
- Flush loop runs there independently of caller loops
- add_event() is a pure sync enqueue, no loop interaction
- shutdown() uses run_coroutine_threadsafe to flush remaining events, then stops the loop and joins the thread

Tradeoff: one extra thread per batcher instance, but consistent behavior across sync and async callers.
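The daemon-thread design described above might look roughly like this. It is a sketch under stated assumptions, not the SDK's EventBatcher: the class name `ThreadedBatcher`, the `send` callable, the `_drain()` helper, and the attribute names are all hypothetical. Only `start()`, `add_event()`, `shutdown()`, and the use of `run_coroutine_threadsafe` come from the commit message.

```python
import asyncio
import threading


class ThreadedBatcher:
    """Hypothetical sketch of a batcher that owns its own loop thread."""

    def __init__(self, send, flush_interval=5.0):
        self._send = send  # hypothetical callable that receives a list of events
        self._flush_interval = flush_interval
        self._events = []
        self._lock = threading.Lock()
        self._ready = threading.Event()
        self._loop = None
        self._thread = None
        self._flush_task = None

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        self._ready.wait()  # block until the loop and flush task exist

    def _run(self):
        # The loop lives and dies on this thread, independent of caller loops.
        self._loop = asyncio.new_event_loop()
        self._flush_task = self._loop.create_task(self._flush_loop())
        self._ready.set()
        try:
            self._loop.run_forever()
        finally:
            self._loop.close()

    async def _flush_loop(self):
        while True:
            await asyncio.sleep(self._flush_interval)
            await self._flush()

    async def _flush(self):
        with self._lock:
            batch, self._events = self._events, []
        if batch:
            self._send(batch)

    def add_event(self, event):
        # Pure sync enqueue: no event-loop interaction on the caller's side.
        with self._lock:
            self._events.append(event)

    async def _drain(self):
        # Stop the periodic task, then flush whatever is still queued.
        self._flush_task.cancel()
        try:
            await self._flush_task
        except asyncio.CancelledError:
            pass
        await self._flush()

    def shutdown(self, timeout=5.0):
        future = asyncio.run_coroutine_threadsafe(self._drain(), self._loop)
        future.result(timeout=timeout)
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=timeout)


sent = []
batcher = ThreadedBatcher(send=sent.append, flush_interval=0.05)
batcher.start()
batcher.add_event({"control": "a"})
batcher.add_event({"control": "b"})
batcher.shutdown()
delivered = [event for batch in sent for event in batch]
```

Because `add_event()` only touches a lock-protected list, it behaves the same whether the caller is sync code, code inside `asyncio.run(...)`, or a long-lived async application. How the events are split into batches depends on timing, but `shutdown()` drains anything still queued.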
namrataghadi-galileo approved these changes on Mar 5, 2026
siddhant-galileo approved these changes on Mar 6, 2026
galileo-automation pushed a commit that referenced this pull request on Mar 11, 2026
## [1.1.0](ts-sdk-v1.0.1...ts-sdk-v1.1.0) (2026-03-11)

### Features

* **examples:** add Google ADK Agent Control examples ([#69](#69)) ([4b83542](4b83542))
* **infra:** publish UI image and add compose UI service ([#57](#57)) ([207c1af](207c1af))
* **sdk:** 57143 strands extra ([#59](#59)) ([97f2518](97f2518))
* **sdk:** add shutdown() and ashutdown() lifecycle API ([#70](#70)) ([9e29d86](9e29d86))
* **sdk:** migrate strands integration to be a plugin ([#74](#74)) ([897ece3](897ece3))
* **server:** enforce admin-only control-plane mutations ([#62](#62)) ([579407f](579407f)), closes [#61](#61)
* **ui:** serve exported Agent Control UI from the FastAPI server ([#71](#71)) ([c140198](c140198))

### Bug Fixes

* **docs:** add centered logo, header, and badges to README ([#92](#92)) ([39c3cbf](39c3cbf))
* **docs:** Test all examples ([#16](#16)) ([39e95c2](39e95c2))
* **evaluators:** migrate sqlglot rs extra to sqlglot c ([#86](#86)) ([5e3e48c](5e3e48c))
* **infra:** fix docker compose to make ui work ([#82](#82)) ([5edbb6b](5edbb6b))
* **infra:** Remove UI service from docker-compose.yml ([#91](#91)) ([330ef55](330ef55))
* Revert "fix(sdk): bundle evaluators in sdk wheel" ([#90](#90)) ([b516ea6](b516ea6)), closes [#89](#89)
* **sdk:** bundle evaluators in sdk wheel ([#89](#89)) ([ea5889a](ea5889a))
* **sdk:** fix observability events not reaching the server ([#63](#63)) ([70016db](70016db))
* **ui:** name update being saved now ([#87](#87)) ([919672d](919672d))
* **ui:** Step name not getting saved ([#68](#68)) ([13abef9](13abef9))
🎉 This PR is included in version 1.1.0 🎉

Your semantic-release bot 📦🚀
Summary
Addresses the issue where control execution stats do not appear reliably in the UI for local SDK runs (see sc-57369).
This PR now includes four fixes:
1. [...] `asyncio.run(...)`, and async apps.
2. `_flush_loop` is now the sole owner of `_flush()` execution (signal-driven wakeups), avoiding fire-and-forget concurrent flush task races.
3. `observability_enabled` default changed to `True`.
4. `flush_interval` default changed from `10.0` to `5.0` seconds.

Test plan

- `make sdk-lint`
- `make sdk-typecheck`
- `make sdk-test` (284 passed, 1 skipped)
- `test_shutdown_drains_inflight_flush_without_data_loss`
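For readers unfamiliar with the single-owner, signal-driven flush pattern mentioned in the summary, here is one way it could be structured. This is an illustrative sketch only: `SingleOwnerFlusher`, `batch_size`, `stop()`, and the async `send` callable are hypothetical and are not taken from the SDK. The point is that `add_event()` never spawns a flush task; it only sets an `asyncio.Event`, and one long-lived coroutine performs every flush.

```python
import asyncio


class SingleOwnerFlusher:
    """Hypothetical sketch: exactly one coroutine ever calls _flush()."""

    def __init__(self, send, flush_interval=5.0, batch_size=100):
        self._send = send  # hypothetical async callable receiving a list
        self._flush_interval = flush_interval
        self._batch_size = batch_size
        self._events = []
        self._wake = asyncio.Event()
        self._stopping = False

    def add_event(self, event):
        self._events.append(event)
        if len(self._events) >= self._batch_size:
            # Signal the owner instead of creating a fire-and-forget task.
            self._wake.set()

    async def _flush(self):
        batch, self._events = self._events, []
        if batch:
            await self._send(batch)

    async def run(self):
        while not self._stopping:
            try:
                # Wake on a signal, or after flush_interval at the latest.
                await asyncio.wait_for(self._wake.wait(), self._flush_interval)
            except asyncio.TimeoutError:
                pass
            self._wake.clear()
            await self._flush()
        await self._flush()  # final drain on shutdown

    def stop(self):
        self._stopping = True
        self._wake.set()


async def demo():
    sent = []

    async def send(batch):
        await asyncio.sleep(0)  # stand-in for a network call
        sent.append(list(batch))

    flusher = SingleOwnerFlusher(send, flush_interval=0.05, batch_size=2)
    owner = asyncio.create_task(flusher.run())
    for i in range(5):
        flusher.add_event(i)
    await asyncio.sleep(0.01)  # give the owner a chance to run
    flusher.stop()
    await owner
    return sent


batches = asyncio.run(demo())
delivered = [event for batch in batches for event in batch]
```

Since only `run()` calls `_flush()`, two flushes can never overlap, and the final drain in `run()` delivers anything queued before `stop()`.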