fix(sdk): fix observability events not reaching the server #63

Merged
abhinav-galileo merged 9 commits into main from abhi/fix-observability-event-emission on Mar 6, 2026

abhinav-galileo (Collaborator) commented Mar 5, 2026

Summary

Addresses the issue where control execution stats do not appear reliably in the UI for local SDK runs (see sc-57369).

This PR now includes four fixes:

  • Dedicated EventBatcher worker loop: replaced caller-loop-dependent batching with a dedicated daemon worker thread + event loop, so flushing works consistently across sync callers, repeated asyncio.run(...), and async apps.
  • Single-owner flush execution: _flush_loop is now the only caller of _flush() and is woken by signals, which avoids races between concurrent fire-and-forget flush tasks.
  • Graceful shutdown drain: shutdown now wakes the worker, drains pending/in-flight batches safely, and then stops/joins the worker thread, preventing silent event loss during shutdown.
  • Safer defaults for observability:
    • observability_enabled default changed from False to True
    • flush_interval default changed from 10.0 to 5.0 seconds
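The single-owner idea from the list above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: the class `SingleOwnerFlusher`, its `batch_size` threshold, and the in-memory `sent` list (a stand-in for the HTTP call) are all hypothetical. Only the names `_flush_loop`, `_flush`, `add_event`, and `flush_interval` come from the PR text.

```python
import asyncio


class SingleOwnerFlusher:
    """Hypothetical sketch: only _flush_loop ever calls _flush."""

    def __init__(self, flush_interval: float = 5.0, batch_size: int = 3) -> None:
        self.flush_interval = flush_interval
        self.batch_size = batch_size  # assumed threshold, not from the PR
        self.pending: list[str] = []
        self.sent: list[list[str]] = []  # stand-in for batches sent to a server
        self._wake = asyncio.Event()
        self._stopping = False

    def add_event(self, event: str) -> None:
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            # Signal the owner instead of spawning a flush task here.
            self._wake.set()

    async def _flush(self) -> None:
        batch, self.pending = self.pending, []
        if batch:
            await asyncio.sleep(0)  # placeholder for the network round trip
            self.sent.append(batch)

    async def _flush_loop(self) -> None:
        while True:
            try:
                # Wake on a signal or after flush_interval, whichever is first.
                await asyncio.wait_for(self._wake.wait(), timeout=self.flush_interval)
            except asyncio.TimeoutError:
                pass
            self._wake.clear()
            await self._flush()
            if self._stopping and not self.pending:
                return

    async def shutdown(self, task: "asyncio.Task[None]") -> None:
        # Wake the owner so it drains what is left, then wait for it to exit.
        self._stopping = True
        self._wake.set()
        await task


async def demo() -> list[list[str]]:
    flusher = SingleOwnerFlusher(flush_interval=5.0, batch_size=3)
    task = asyncio.create_task(flusher._flush_loop())
    for i in range(4):
        flusher.add_event(f"e{i}")
    await flusher.shutdown(task)
    return flusher.sent


sent_batches = asyncio.run(demo())
```

Because `add_event()` only sets a signal, two flushes can never run at the same time in this sketch; the loop finishes one `_flush()` before it looks at the signal again.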

Test plan

  • make sdk-lint
  • make sdk-typecheck
  • make sdk-test (284 passed, 1 skipped)
  • Added/ran targeted regression test for shutdown race:
    • test_shutdown_drains_inflight_flush_without_data_loss
  • Manual verification: run a local example agent and confirm UI stats appear within ~5s

Three fixes for control execution stats not showing in the UI:

1. EventBatcher lazy-start: when init() is called from sync context,
   the flush loop never started because no event loop was available.
   Added _try_attach_loop() called from add_event() to lazily attach
   to the running loop on first use in async context.

2. Enable observability by default: changed observability_enabled
   default from False to True, matching the server-side default and
   eliminating the most common setup foot-gun.

3. Reduce flush interval from 10s to 5s: matches the UI polling
   interval and reduces event loss on short-lived test runs.
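The lazy-start behavior in item 1 might look roughly like the sketch below. It is an assumption-laden illustration of this commit's approach, which later commits in this PR replace: `LazyAttachBatcher` and its boolean return values are hypothetical, while `_try_attach_loop()` and `add_event()` are the names used in the commit message. `asyncio.get_running_loop()` raises `RuntimeError` when no loop is running, which is what a sync caller sees.

```python
import asyncio
from typing import Optional


class LazyAttachBatcher:
    """Hypothetical sketch of lazily attaching to the caller's event loop."""

    def __init__(self) -> None:
        self._loop: Optional[asyncio.AbstractEventLoop] = None
        self.events: list[str] = []

    def _try_attach_loop(self) -> bool:
        if self._loop is not None:
            return True
        try:
            # Only succeeds when called from code running inside an event loop.
            self._loop = asyncio.get_running_loop()
        except RuntimeError:
            return False  # sync context: nothing to attach to yet
        return True

    def add_event(self, event: str) -> bool:
        self.events.append(event)
        # Returns whether a loop is attached (for illustration only).
        return self._try_attach_loop()


batcher = LazyAttachBatcher()

# From sync code there is no running loop, so the event is only queued.
attached_from_sync = batcher.add_event("a")


async def use_from_async() -> bool:
    return batcher.add_event("b")


# From async code the first add_event() attaches to the running loop.
attached_from_async = asyncio.run(use_from_async())
```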

codecov Bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 80.74534% with 31 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sdks/python/src/agent_control/observability.py | 80.37% | 31 Missing ⚠️ |


sync_wrapper uses asyncio.run() per invocation, which creates and
closes a loop each call. EventBatcher only checked _loop is None,
so a closed-but-non-None loop was treated as healthy - flush tasks
silently died and events accumulated without being sent.

Fix _try_attach_loop() to detect closed loops via is_closed() and
reattach to the current running loop. Move all loop lifecycle logic
under _lock to eliminate the TOCTOU race on concurrent add_event()
calls. Clear stale state in _schedule_flush() on RuntimeError so
the next event can trigger reattachment.
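The closed-loop problem described in this commit can be reproduced with the standard library alone. The sketch below is hypothetical (`ReattachingBatcher` and `attach_count` are illustrative names); it shows that each `asyncio.run()` leaves behind a closed loop, and that checking `is_closed()` under a lock lets the next call reattach.

```python
import asyncio
import threading
from typing import Optional


class ReattachingBatcher:
    """Hypothetical sketch: treat a closed loop the same as no loop."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._loop: Optional[asyncio.AbstractEventLoop] = None
        self.attach_count = 0  # illustration only

    def _try_attach_loop(self) -> bool:
        # Check and update happen under one lock to avoid a check-then-act race.
        with self._lock:
            if self._loop is not None and not self._loop.is_closed():
                return True
            try:
                self._loop = asyncio.get_running_loop()
            except RuntimeError:
                self._loop = None  # clear stale state so a later call can retry
                return False
            self.attach_count += 1
            return True


batcher = ReattachingBatcher()


async def call() -> bool:
    return batcher._try_attach_loop()


first = asyncio.run(call())
# asyncio.run() closed its loop on exit, so the stored reference is now stale.
stale_after_first_run = batcher._loop is not None and batcher._loop.is_closed()
second = asyncio.run(call())
```

A check of `_loop is None` alone would have returned `True` on the second call while pointing at a dead loop, which matches the silent failure the commit message describes.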

Replace the lazy loop-attachment approach with a dedicated daemon
thread that owns its own event loop. This eliminates the entire
class of caller-loop lifecycle bugs (closed loops from asyncio.run,
sync-context init, TOCTOU races on reattachment).

- start() spawns a daemon thread running its own event loop
- Flush loop runs there independently of caller loops
- add_event() is a pure sync enqueue, no loop interaction
- shutdown() uses run_coroutine_threadsafe to flush remaining
  events, then stops the loop and joins the thread

Tradeoff: one extra thread per batcher instance, but consistent
behavior across sync and async callers.
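A rough sketch of the dedicated-thread design listed above, under stated assumptions: `ThreadedBatcher`, the in-memory `sent` list, and the cleanup in `_run()` are hypothetical and simplified (for example, there is no in-flight batch tracking). The calls used are standard library: `asyncio.new_event_loop()`, `asyncio.run_coroutine_threadsafe()`, and `loop.call_soon_threadsafe()`.

```python
import asyncio
import threading


class ThreadedBatcher:
    """Hypothetical sketch: a daemon thread owns the event loop and all flushing."""

    def __init__(self, flush_interval: float = 5.0) -> None:
        self.flush_interval = flush_interval
        self._lock = threading.Lock()
        self._pending: list[str] = []
        self.sent: list[str] = []  # stand-in for events delivered to the server
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        asyncio.set_event_loop(self._loop)
        task = self._loop.create_task(self._flush_loop())
        try:
            self._loop.run_forever()
        finally:
            # Cancel the periodic task and let it unwind before closing the loop.
            task.cancel()
            self._loop.run_until_complete(asyncio.gather(task, return_exceptions=True))
            self._loop.close()

    def add_event(self, event: str) -> None:
        # Pure sync enqueue: no interaction with any event loop.
        with self._lock:
            self._pending.append(event)

    async def _flush(self) -> None:
        with self._lock:
            batch, self._pending = self._pending, []
        await asyncio.sleep(0)  # placeholder for the network call
        self.sent.extend(batch)

    async def _flush_loop(self) -> None:
        while True:
            await asyncio.sleep(self.flush_interval)
            await self._flush()

    def shutdown(self, timeout: float = 5.0) -> None:
        # Flush what is left on the worker loop, then stop it and join the thread.
        future = asyncio.run_coroutine_threadsafe(self._flush(), self._loop)
        future.result(timeout=timeout)
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=timeout)


batcher = ThreadedBatcher(flush_interval=60.0)
batcher.start()
for i in range(3):
    batcher.add_event(f"event-{i}")  # called from plain sync code
batcher.shutdown()
worker_alive = batcher._thread.is_alive()
```

The final flush is awaited before the loop is asked to stop; stopping the loop from inside the flush coroutine itself could leave the caller's future unresolved, so the two steps are kept separate here.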
abhinav-galileo merged commit 70016db into main on Mar 6, 2026
6 of 7 checks passed
abhinav-galileo deleted the abhi/fix-observability-event-emission branch on March 6, 2026 at 14:42
galileo-automation pushed a commit that referenced this pull request Mar 11, 2026
## [1.1.0](ts-sdk-v1.0.1...ts-sdk-v1.1.0) (2026-03-11)

### Features

* **examples:** add Google ADK Agent Control examples ([#69](#69)) ([4b83542](4b83542))
* **infra:** publish UI image and add compose UI service ([#57](#57)) ([207c1af](207c1af))
* **sdk:** 57143 strands extra ([#59](#59)) ([97f2518](97f2518))
* **sdk:** add shutdown() and ashutdown() lifecycle API ([#70](#70)) ([9e29d86](9e29d86))
* **sdk:** migrate strands integration to be a plugin ([#74](#74)) ([897ece3](897ece3))
* **server:** enforce admin-only control-plane mutations ([#62](#62)) ([579407f](579407f)), closes [#61](#61)
* **ui:** serve exported Agent Control UI from the FastAPI server ([#71](#71)) ([c140198](c140198))

### Bug Fixes

* **docs:** add centered logo, header, and badges to README ([#92](#92)) ([39c3cbf](39c3cbf))
* **docs:** Test all examples ([#16](#16)) ([39e95c2](39e95c2))
* **evaluators:** migrate sqlglot rs extra to sqlglot c ([#86](#86)) ([5e3e48c](5e3e48c))
* **infra:** fix docker compose to make ui work ([#82](#82)) ([5edbb6b](5edbb6b))
* **infra:** Remove UI service from docker-compose.yml ([#91](#91)) ([330ef55](330ef55))
* **sdk:** Revert "fix(sdk): bundle evaluators in sdk wheel" ([#90](#90)) ([b516ea6](b516ea6)), closes [#89](#89)
* **sdk:** bundle evaluators in sdk wheel ([#89](#89)) ([ea5889a](ea5889a))
* **sdk:** fix observability events not reaching the server ([#63](#63)) ([70016db](70016db))
* **ui:** name update being saved now ([#87](#87)) ([919672d](919672d))
* **ui:** Step name not getting saved ([#68](#68)) ([13abef9](13abef9))

galileo-automation (Collaborator) commented

🎉 This PR is included in version 1.1.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
