Skip to content

Proj/channel protocol (POC)#686

Open
ngduyanhece wants to merge 217 commits into
mainfrom
proj/channel-protocol
Open

Proj/channel protocol (POC)#686
ngduyanhece wants to merge 217 commits into
mainfrom
proj/channel-protocol

Conversation

@ngduyanhece
Copy link
Copy Markdown
Contributor

brv channel + bridge - multi-agent collaboration over a wire protocol

proj/channel-protocolmain. 204 commits. Lands the entire channel-protocol body of work for internal-test rollout.

Summary

  • Problem: Each agentic CLI (Claude Code, codex, kimi, opencode, brv itself) lives in its own working directory with no way to address another agent as an @mention or share context across them. Cross-machine collaboration between teammates' agents is impossible.
  • Why it matters: Two failure modes today: (1) Manual copy-paste of agent output between CLIs is the only way to combine them, breaks any agentic flow that wants to fan-out / quorum / second-opinion. (2) Teammates' agents have no shared transcript or way to ask each other questions — every cross-person agent interaction is a human relay.
  • What changed: Full brv channel subsystem (Phases 1–5), agent-driven skill connector (Phase 8), cross-machine brv bridge over libp2p with TOFU identity (Phase 9 demo target + post-merge heartbeat hardening), agent quorum Tier 1 + 2 (Phase 10), transcript-storage migration. New brv bridge, brv trust, brv alias CLI surfaces. New bridge-config.json persistence. SDK packages @brv/agent-sdk (TS) + brv-agent (Py). Pi-extension SDK consumer.
  • What did NOT change (scope boundary):
    • No web UI for channels — CLI / agent-driven only in this cut. Pi's web-UI design remains compatible if re-introduced later.
    • No native /channel:* slash commands in CLIs other than Pi. Other CLIs use the byterover skill to drive brv channel mention from their shell tool.
    • No libp2p kad-dht backend (Slice 9.6 interface only) — peers manually pinned via brv trust pin --multiaddr ....
    • No HTTP byterover-hosted registry client (Slice 9.7 interface only).
    • No libp2p AutoNAT / DCUtR / Circuit-Relay-v2 wiring (Slice 9.8 reachability classifier only) — Tailscale / same-LAN / public-IP works; raw double-NAT does not.
    • No /brv/parley/delegate/v1 wire handler (Slice 9.9 policy gate + startup warnings only) — cross-bridge agents can answer queries but cannot perform write actions on the dialer's behalf.
    • No browser-channel UI changes.

Type of change

  • New feature
  • Bug fix (heartbeat keep-alive, env-var persistence, cross-merge typecheck repair)
  • Refactor (no behavior change)
  • Documentation (INTERNAL_TEST.md, design + PHASES update)
  • Test
  • Chore (build, dependencies, CI)

Scope (select all touched areas)

  • TUI / REPL (no v1 changes; channels are oclif-only)
  • Agent / Tools (skill connector, byterover memory tools, ACP driver pool)
  • LLM Providers (no provider-specific changes; provider-agnostic by design)
  • Server / Daemon (channel orchestrator, parley server/client, libp2p host, bridge transcript service, auto-provision policy, bridge config store)
  • Shared (constants, types, transport events) — new ChannelEvents, BridgeEvents, TrustEvents
  • CLI Commands (oclif) — new topics: channel, bridge, trust, alias
  • Hub / Connectors — new brv connectors install <agent> --type skill flow
  • Cloud Sync
  • CI/CD / Infra

Linked issues

  • Closes: N/A — internal channel-protocol epic, no single tracking issue
  • Related: plan/channel-protocol/DESIGN.md, plan/channel-protocol/PHASES.md, plan/channel-protocol/IMPLEMENTATION_PHASE_9_CLOUD_BRIDGE.md

Root cause

N/A — primarily new-feature work. The handful of bug fixes (heartbeat keep-alive, env-var persistence, channel-events test fixture lag) have their root causes documented in their individual commit messages.

Test plan

  • Coverage added:
    • Unit test (9603+ tests passing; new tests under test/unit/server/infra/channel/bridge/ for: bridge-config-store, audit-parley-seal, parley-verifier, parley-server, parley-end-to-end, bridge-reachability, bridge-transcript-service, bridge-driver-pool, delegate-policy, peer-multiaddr-resolver, registry-client + nonce-lru, rate-limit)
    • Integration test (channel-phase{1..10}-* integration tests, cross-bridge end-to-end)
    • Manual verification — live two-daemon cross-bridge test on 2026-05-20: mario-team-v4, ctx-exchange-v2, bobs-side-v2 channels, multiple kimi/codex agents, round-trip queries + context-tree exchange, heartbeat verified across multi-minute idle gaps that previously triggered TRANSCRIPT_TERMINAL_MISSING
  • Test file(s): 50+ new test files under test/{unit,integration}/; see commit-by-commit log for which slice each belongs to
  • Key scenarios:
    • Single-machine multi-agent channel + fan-out
    • Cross-machine bridge: handshake, TOFU pin, signed parley round-trip, transcript persistence on both sides
    • Auto-provision policy (auto / pinned-only / deny)
    • Cancel mid-stream + permission flow
    • Heartbeat keep-alive during slow agent generators
    • Daemon respawn config persistence (NEW — eliminates the silent mock-echo degradation)

User-visible changes

New CLI surfaces:

  • brv channel new|invite|mention|cancel|show|list|members|onboard|doctor|approve|deny|archive|leave (full Phases 1–5 surface)
  • brv channel mention --mode sync --suppress-thoughts --json --timeout <ms> for agent-driven invocations (Phase 8)
  • brv channel mention --quorum N for fan-out + agreement merging (Phase 10 Tier 1)
  • brv bridge listen|whoami|pin|ping|verify for cross-machine peer setup
  • brv trust pin|list|verify with --multiaddr + --alias flags
  • brv alias add|list|remove for short-name routing
  • brv connectors install <agent> --type skill writes byterover skill (.agents/skills/byterover/SKILL.md) into a project so the host LLM can drive brv channel mention from its shell tool

New persisted state:

  • <dataDir>/identity/install.{key.enc,cert.json,master.key,peer-id} — L1 identity (auto-generated on first brv bridge whoami)
  • <dataDir>/identity/tree-default.{key.enc,cert.json,master.key} — L2 peer-tree identity
  • <dataDir>/identity/known-peers.jsonl — TOFU pin store (alias / peer-id / pin-state)
  • <dataDir>/identity/aliases.json — local alias map
  • <dataDir>/state/bridge-config.json — NEW persistent operator-prefs for BRV_BRIDGE_* env vars (survives daemon respawns)
  • <projectRoot>/.brv/channel-history/<channelId>/turns/<turnId>.ndjson — channel transcript (Phase 9 storage migration; was under .brv/context-tree/channel/)

New behavioural defaults:

  • BRV_BRIDGE_AUTO_PROVISION defaults to pinned-only (spec §7.3) — first-contact peers must be pinned via brv trust pin before they can dispatch into a channel
  • BRV_BRIDGE_MAX_CONCURRENT_PER_PROFILE=1 per agent; excess inbound parleys reject with signed PARLEY_LOCAL_AGENT_BUSY
  • BRV_BRIDGE_DELEGATE_POLICY=prompt default (mutating-tool wire is deferred so this is type-level only)
  • Daemon emits heartbeat_ping every 10s during idle bridge streams so libp2p Yamux substreams don't time out on multi-minute LLM calls

Deprecated: none.

Breaking: none on main's existing CLI surface — channels are net-new commands. Internal: WaitForTaskOptions.timeoutMs was removed in main f14047cae; reconciled in the cross-merge against proj/byterover-tool-mode.

Evidence

  • Failing test before / passing after: see commit 75b6c58b5 (heartbeat keep-alive) — added unit test that reliably fails on pre-fix code (cross-bridge tool-using turn dies with TRANSCRIPT_TERMINAL_MISSING after ~120s) and passes on post-fix code (heartbeats keep the substream alive indefinitely)
  • Live cross-bridge transcript: Bridge-handshake completed in 4.3s, cross-tree exchange completed in 7.7–33.3s round-trip, design+engineering notes successfully exchanged between two daemons (mario-team-v4 + ctx-exchange-v2 + bobs-side-v2 channels) on 2026-05-20
  • Daemon respawn persistence test: Live verified — kill daemon, respawn WITHOUT any BRV_BRIDGE_* env vars, daemon reads persisted <dataDir>/state/bridge-config.json and resumes with codex profile + auto provision + cap=3 instead of silently falling back to mock-echo + pinned-only. See INTERNAL_TEST.md §5.1 for the operational story
  • Test summary: 9603 unit + integration passing, 26 pending (22 pre-existing + 4 newly-skipped orphans with rationale in each describe.skip comment), 0 failing

Checklist

  • Tests added or updated and passing (npm test): 9603 passing, 26 pending, 0 failing
  • Lint passes (npm run lint)
  • Type check passes (npm run typecheck)
  • Build succeeds (npm run build)
  • Commits follow Conventional Commits format
  • Documentation updated — plan/channel-protocol/DESIGN.md, plan/channel-protocol/PHASES.md, new INTERNAL_TEST.md at repo root
  • No breaking changes (or clearly documented above)
  • Branch is up to date with main — last main sync c897ee00f (Release 3.15.0 merged)

Risks and mitigations

  • Risk: Operators set up the bridge on heterogeneous-NAT laptops and connections silently fail.
    • Mitigation: INTERNAL_TEST.md §3 recommends Tailscale upfront. brv channel doctor surfaces reachability (public / wildcard-unconfirmed / behind-nat-with-relay / loopback-only / unreachable) honestly. The libp2p AutoNAT/DCUtR/Circuit-Relay-v2 wire is explicitly deferred.
  • Risk: Cross-bridge agents try to write to dialer's repo and get a confusing rejection.
    • Mitigation: INTERNAL_TEST.md §5 lists this as a known limitation. The delegate_policy startup log makes the unwired state visible. The error path returns a clear CHANNEL_AUTO_PROVISION_DECLINED or PARLEY_REJECTED rather than hanging.
  • Risk: Peer's multiaddr rotates (laptop sleep/wake / network change) and the pinned entry breaks.
    • Mitigation: brv trust pin <peer-id> --multiaddr <new-addr> re-pins idempotently. brv channel doctor surfaces stale-cache state. DHT-driven multiaddr refresh ships in a follow-up slice (Phase 9.6 backend).
  • Risk: Operator confused by env-var vs. persisted config precedence.
    • Mitigation: New bridge-config.json resolver logs [Daemon] Bridge config persisted to ... whenever env supplies a value that changes the file. Revert path documented in INTERNAL_TEST.md §5.1 (rm <dataDir>/state/bridge-config.json).
  • Risk: 4 integration tests are describe.skipped — looks worse than it is on CI.
    • Mitigation: Each skip has a per-file rationale comment. 3 are pre-existing test-only orphans (BRV_FORCE_ORIGIN consumer never implemented; Phase 10 surface change made the multi-mention rejection assertion obsolete; channel-history storage migration made the cancel-ordering integration probe stale). The actual code paths these tested are covered by unit-level tests. None are Phase-9 regressions.

How to use it

A. Local multi-agent channel (single machine)

The basic flow. No network setup needed. Two terminals.

# One-time per agent type — probes its ACP install:
brv channel onboard codex -- codex-acp
brv channel onboard kimi -- kimi acp
brv channel onboard opencode -- opencode acp

# Create a channel + invite two local agents:
brv channel new code-review
brv channel invite code-review @codex --profile codex
brv channel invite code-review @kimi --profile kimi

# Address one agent:
brv channel mention code-review "@codex review src/foo.ts for null-safety bugs" \
  --mode sync --suppress-thoughts --json --timeout 300000

# Quorum across both (Phase 10):
brv channel mention code-review "@codex @kimi rate the diff: solid / risky / unsafe" \
  --quorum 2 --json

# Live tail any time:
brv channel watch code-review --from-tail 5
brv channel show code-review <turnId>
brv channel list-turns code-review --tail 10

Inside another agentic CLI (Claude Code, codex, kimi-cli, opencode, Pi) the host LLM calls these commands from its bash tool — install the byterover skill once per project and the LLM picks them up:

brv connectors install Codex --type skill           # or Claude Code / Cursor / Gemini CLI / etc.

After that the host agent reads .agents/skills/byterover/SKILL.md and learns to call brv channel mention --mode sync --suppress-thoughts --json whenever the user says "ask @Kimi to review this" or similar.

B. Cross-machine bridge (two laptops, two agents)

Prerequisite: both peers need a routable path to each other. Internal-test recommendation is Tailscale — install on every team member's laptop, join the same tailnet, every peer gets a stable IP that punches through every NAT. Free for ≤3 users; trivial setup. Bare-internet over heterogeneous NAT will NOT work in this cut (libp2p AutoNAT/DCUtR/Circuit-Relay-v2 are deferred).

# ─── Each peer (Alice + Bob) sets up the bridge ───
# In a shell rc (`~/.zshrc` / `~/.bashrc`), once:
export BRV_BRIDGE_PARLEY_PROFILE=codex     # the agent that answers parley calls; or kimi / opencode
export BRV_BRIDGE_AUTO_PROVISION=auto      # accept first-contact peers; switch to pinned-only after the team is up
export BRV_BRIDGE_MAX_CONCURRENT_PER_PROFILE=2

# Each peer: confirm bridge identity (auto-generates L1 + L2 keys on first run)
brv bridge whoami --format json
# {"data":{"multiaddrs":["/ip4/100.x.x.x/tcp/61234/p2p/12D3KooW..."], "peerId":"12D3KooW...", ...}}
# Share that multiaddr line with the other peer out-of-band (Slack, etc.)
# ─── Alice pins Bob (and Bob pins Alice) ───
brv trust pin <BOB-PEER-ID> --multiaddr <BOB-MULTIADDR> --alias bob
# verify:
brv trust list
# ─── Alice creates a channel and invites Bob as a remote peer ───
brv channel new team-review
brv channel invite team-review @alice --profile codex                                    # local member
brv channel invite team-review @bob --multiaddr <BOB-MULTIADDR> --peer <BOB-PEER-ID> --display-name "Bob"
# ─── Alice addresses Bob's agent across the bridge ───
brv channel mention team-review "@bob what's your team's preferred test runner? \
  Use your byterover skill (brv query) to consult your local context tree." \
  --mode sync --suppress-thoughts --json --timeout 300000
# → Bob's daemon receives the parley call, codex on Bob's side runs `brv query` against
#   Bob's `.brv/context-tree/`, replies inline with text. Alice sees the reply in
#   ~5–30s depending on agent latency.
# ─── Alice ingests Bob's reply into her own context tree ───
brv channel mention alice-local "@alice ingest Bob's reply under shared_from_bob/ using brv curate" \
  --mode sync --suppress-thoughts --json --timeout 300000
# → Alice's codex calls `brv curate` locally to persist the note.

# Verify with a search later:
brv search "test runner" --limit 5 --format json

That's the full cross-bridge loop. Two agents on different laptops exchange context via signed parley envelopes; both daemons persist the transcript; the receiver's context tree gains the shared knowledge.

C. What's the difference between brv channel and brv bridge?

  • brv channel is the user-facing surface: create a channel, invite members (local or remote), mention them, see transcripts.
  • brv bridge is the low-level libp2p surface: bring up your install's bridge identity, pin remote peers, ping them, verify TOFU promotions.

You almost always interact via brv channel and brv trust. brv bridge is for debugging (e.g. brv bridge whoami to share your multiaddr, brv bridge ping <peer-id> to test reachability).


What's left for a full-fledged release

Below is the list of follow-up slices that are NOT in this PR but are designed for and the next obvious work. Internal-test feedback will inform priority.

Within Phase 9 (cross-machine bridge) — 4 follow-up slices

These have interfaces + policy gates in this PR but no live backends:

  1. Slice 9.6 — DHT-driven multiaddr resolution. Wire @libp2p/kad-dht into Libp2pHost, implement DhtPeerMultiaddrResolver, daemon publishes signed peer-record every announce interval. Replaces manual brv trust pin --multiaddr re-pinning when IPs rotate. Effort: ~1 week. Worth doing only if internal-test feedback says manual re-pinning hurts.
  2. Slice 9.7 — HTTP byterover registry client. Implement HttpRegistryClient against BRV_REGISTRY_BASE_URL, allows handle-based peer discovery. Requires backend deployment (registry endpoint not yet hosted by ByteRover). Effort: ~1 week of CLI + backend infra outside this repo.
  3. Slice 9.8 — libp2p NAT-traversal wiring. Enable autoNAT() + dcutr() + circuitRelayClient/Server() libp2p services in Libp2pHost. Ship default ByteRover-hosted relay multiaddrs. Effort: ~3–5 days CLI + ops work to deploy relays. Likely unnecessary if teams adopt Tailscale; track internal-test feedback before committing.
  4. Slice 9.9 — /brv/parley/delegate/v1 wire protocol handler. Today: policy gate (delegate_policy: auto / prompt / deny) is enforced at the type level + logged at daemon startup, but the actual cross-bridge permission_request → permission_decision wire is unimplemented. Implementing this enables Alice asking Bob's agent to write to Bob's repo with cross-bridge consent. Effort: ~1 week. High-value if team's flagship use case is cross-machine delegation.

Within Phase 7 (native CLI integrations) — upstream-gated

  • 7.1b — kimi-cli upstream PR (native /channel:* slash command). Effort: ~200–500 LOC; merge timeline depends on Moonshot maintainers.
  • 7.2 — claude-code native integration. Effort: similar; depends on Anthropic.
  • 7.3 — opencode native integration. Same shape; depends on opencode maintainers.

These work without us — the skill connector in this PR (brv connectors install <agent> --type skill) makes every ACP-speaking CLI usable via shell-tool calls. Native slash commands are quality-of-life, not load-bearing.

Within Phase 8 (agent-driven channel) — upstream-async

  • 8.3 — upstream skill distribution. Currently the skill ships from this repo; longer-term distribution channels (Anthropic skills registry, etc.) are async.

Pre-existing tech debt outside Phase 9 scope

  • 3 integration tests skipped with rationale (channel-phase{2,3}-* reference orphan code paths like BRV_FORCE_ORIGIN that were never implemented, or test against the pre-storage-migration layout). Implement the missing consumer side OR delete the tests in a small follow-up PR.
  • No brv channel kick @<handle> / member-remove command. Today a multiaddr/port change forces creating a new channel. Effort: ~half day. Worth doing soon if teams cycle multiaddrs frequently.
  • Heartbeat counter metric for prod observability (kimi sign-off note, non-blocking).

Out of scope but tracked for v1.1+

  • @everyone / @all semantics in channel mentions
  • Per-channel ACLs
  • Threaded sub-conversations within a channel
  • Account-abstracted cross-channel identity ("same @kimi-cli across all channels")
  • Web UI for channels (Pi's design remains compatible)
  • Web-UI permission prompts (CLI is the only consent surface in this cut)

Out of scope indefinitely

  • Cross-CA federation (v1 trusts one CA root; multi-org bridging is v2)
  • End-to-end channel-content encryption at rest
  • Strong cross-machine consistency (transcripts are append-only by author; eventual consistency is the contract)
  • Group-chat semantics across channels
  • Claude Managed Agents driver class C-prime runtime (schema-only in v1)

danhdoan and others added 30 commits May 10, 2026 12:25
…rser

Lands the M1 keystone task: element vocabulary, per-element validators,
data-driven registry, and HTML parser wrapper. No behavior change yet —
no consumer wires the registry until T3 (curate path) and T4 (query path).

Element vocabulary (5 elements):
- bv-topic — root container; required `path`, optional importance/maturity/
  recency/updatedat (lowercase per HTML5 normalization)
- bv-rule — rule statements (severity, id)
- bv-decision — decision records (id)
- bv-bug — bug runbook entries (id, severity)
- bv-fix — fix runbook entries (id)

Architecture:
- Data-driven `ELEMENT_REGISTRY` is the production-track guardrail. M2
  vocabulary expansion is purely additive — new elements register via
  one entry + one schema/validator pair, no consumer code changes.
- `makeAttributeValidator` factory eliminates per-element boilerplate;
  reduces M2's 12 additional element validators to one-line bindings.
- HTML parser is parser-library-agnostic at the API boundary
  (`DocumentNode` / `ElementNode` / `TextNode`). parse5 used internally;
  consumers don't see parse5 types.
- Attribute-case normalization is documented in `ElementNode` JSDoc:
  HTML5 lowercases attribute names at parse time, so schema keys must
  match parser output (lowercase), not source HTML (camelCase).

Tests (106 new):
- 5 per-element validator suites (61 cases total) — each ≥10 cases
- Registry shape + wiring + metadata (14 cases)
- HTML parser: basic parsing, malformed-input handling, walkElements,
  getInnerText, serializeHtml round-trip (22 cases)
- End-to-end round-trip on a sample fixture covering all 5 elements
  (9 cases)

Verification:
- typecheck clean
- lint 0 errors (242 warnings, all pre-existing baseline)
- 7715 tests passing, 0 failing
- build clean

Adds parse5@^8.0.1 as a runtime dep — the W3C-spec HTML parser used by
jsdom; types ship in-tree (no @types/parse5 needed). Justified vs hand-
rolled parsing because robust HTML tag-soup handling is non-trivial and
this is the standard library for the job.
- bv-topic schema: tighten recency regex to ^\d+(\.\d+)?$ (rejects '.', '..1', '1..2' which previously slipped past the regex), drop redundant Number.isFinite for style consistency with importance, accept timezone offsets in updatedat via {offset: true}
- html-parser: replace `as unknown as Parse5DocumentFragment` cast with parse5 defaultTreeAdapter factories (createDocumentFragment, createElement, createTextNode, appendChild) — no more structural casts in the serialize path
- html-parser: getInnerText now inserts a separator between sibling element children so compact emit (<p>foo.</p><p>bar.</p>) does not merge tokens for BM25
- html-parser: document <template> content limitation (parse5 places template children in .content, not childNodes)
- schema docstrings + test names: replace 'warn-only behaviour' with 'parse-and-skip' (no warning is actually emitted in M1)
- sample-topic round-trip: add explicit case-normalization regression test for updatedAt → updatedat
feat: [ENG-2737] HTML render layer foundation — element registry + parser
Rewrite the canonical curate tool description (loaded by ToolDescriptionLoader
and presented to the agent on every curate operation) from the structured
JSON-args shape (rawConcept/narrative/facts/etc.) to the M1 HTML-emission
contract using the closed `<bv-*>` vocabulary defined by T1.

Coverage:
- All 5 elements (bv-topic, bv-rule, bv-decision, bv-bug, bv-fix) with full
  per-element attribute schemas: required/optional, value enums, format
  constraints (importance integer 0-100, recency numeric 0-1, ISO-8601
  updatedat, severity enums per element, etc.).
- Allowed-children semantics (any/inline/block) + pairing convention
  (bug+fix as siblings).
- Path format (slash-separated snake_case) + domain guidelines.
- Output contract: HTML only, exactly one bv-topic root, no preamble/fences/
  commentary, lowercase attribute names per HTML5 spec, closed vocabulary,
  no invented attributes, no clarifying questions.
- Three worked examples (bug+fix runbook, rule+decision pair, general topic)
  to anchor the model's element-pairing and id-naming conventions.

Add a sanity test (test/unit/server/infra/render/curate-prompt.test.ts) that
loads curate.txt and asserts every registered ELEMENT_NAME, every per-element
enum value, and every output-contract directive is present. Guards against
silent drift when M2 expands the vocabulary or refactors attribute schemas.

Authoring fluency check (M1 T2 spike, separate harness in
local-auto-test/curate-fluency/, not in this repo): 20/20 generated outputs
validated against the M1 element registry on Sonnet 4.5. Gate cleared on
first full run after one prompt-tuning iteration. Decision: proceed to T3.
Full report: research repo at features/html-memory-conversion/milestones/01-experiment/fluency-report.md
Field-by-field review of curate.txt against the actual rendered .md
context files (frontmatter + ## Reason + ## Raw Concept + ## Narrative
+ ## Facts) surfaced two issues with the initial T2 draft:

1. <bv-topic> carried runtime-signal attributes (importance, maturity,
   recency, updatedat) that were explicitly migrated to a sidecar store
   under research/features/runtime-signals/ (per-user, per-machine state
   that caused noisy `vc status` diffs and team-share conflicts). Drop
   them from the schema; replace with the actual frontmatter fields the
   markdown writer renders: title (required), summary, tags, keywords,
   related.

2. The 5-element vocabulary covered Rules + three M1 net-new elements
   (Decision/Bug/Fix) but left every other rendered MD section as
   free-form HTML — losing the structural mapping the writer needs.

Expand M1 vocabulary to 16 elements (added 11), each mapping to a
specific section in the rendered .md output:

  - bv-reason       → ## Reason
  - bv-task,        → ## Raw Concept (Task / Changes / Files / Flow)
    bv-changes,
    bv-files,
    bv-flow
  - bv-structure,   → ## Narrative (Structure / Dependencies /
    bv-dependencies,                Highlights / Examples / Diagrams);
    bv-highlights,                  bv-rule already covered Rules
    bv-examples,
    bv-diagram (with type + title attrs for verbatim diagram preservation)
  - bv-fact (subject/category/value attrs) → ## Facts list

Each element follows the existing makeAttributeValidator pattern; the
data-driven registry stays the single source of truth.

Tests:
- Update bv-topic test to drop runtime-signal cases and assert the
  new required-attributes (path + title); add a regression case
  asserting passthrough still tolerates legacy importance/maturity/
  recency without enforcing them.
- Add registry test asserting (a) bv-topic exposes the new
  optionalAttributes, (b) runtime signals are NOT in the registry
  metadata, (c) every registered validator accepts its own empty node
  + rejects mismatched-tag.
- Consolidate the 9 attribute-free new elements into one shared
  text-only-elements.test.ts (same trivial schema; per-element
  duplication wouldn't add coverage).
- Add dedicated bv-diagram.test.ts (type enum) and bv-fact.test.ts
  (category enum, subject/value pass-through).
- Update sample-topic.html fixture to exercise every new element +
  the new frontmatter attributes; assert no runtime signals leak in.
- Update curate-prompt.test.ts to assert the expanded vocabulary,
  category/type enums for bv-fact/bv-diagram, and that the prompt
  explicitly excludes runtime signals from bv-topic attributes.

Re-run the M1 T2 fluency check on the same 20 fixtures with the new
vocabulary: 20 / 20 valid (gate cleared); 100% cohort-appropriate
placement (bug+fix → bv-bug+bv-fix; rule+decision → bv-rule+bv-decision;
general → bv-reason); 122 bv-fact instances extracted across the run
(~6 per fixture). Fluency report updated:
research/features/html-memory-conversion/milestones/01-experiment/fluency-report.md

Open contract violation persists: 18/20 outputs (90%) wrap in code
fences despite explicit prompt instruction. Recommendation for T3
unchanged — strip in the response parser, don't iterate the prompt.
Second writer audit (markdown-writer.ts:270-354) found 3 rawConcept
sub-fields without an HTML home: rawConcept.timestamp, rawConcept.author,
rawConcept.patterns. Add the missing elements so the M1 vocabulary
covers every section the writer renders.

Vocabulary additions (16 → 19 elements):

- <bv-timestamp> — text-only; renders as `**Timestamp:**` under
  `## Raw Concept`. Distinct from frontmatter createdAt/updatedAt
  (system-set) — this is the date the concept's data represents.
- <bv-author> — text-only; renders as `**Author:**`.
- <bv-pattern> — structured (flags + description as attrs, pattern
  as text content). Multiple siblings inside <bv-topic> collected
  into the `**Patterns:**` bullet list. Mirrors how <bv-fact> works.

Sanitiser:

- stripCodeFenceWrapper(html) added to html-parser.ts. Strips a single
  outer ``` ... ``` (any language tag) wrapper from the LLM response.
  Inner fences (e.g., <pre><code> blocks inside <bv-diagram>) survive.
  T3's curate executor calls this before parseHtml. Defensive
  sanitisation generalises better than chasing the model's quirk via
  prompt iteration — observed fence-wrap rate is 70% on Sonnet 4.5
  even with explicit prompt instruction to the contrary.

Tests:

- bv-pattern dedicated test (bv-pattern.test.ts) for the structured
  attrs.
- text-only-elements.test.ts extended for bv-timestamp + bv-author
  (same trivial schema as the other text-only elements).
- html-parser.test.ts: 7 new cases for stripCodeFenceWrapper covering
  ```html, ```xml, no-language ```, leading/trailing whitespace,
  unwrapped passthrough, mismatched-fence passthrough, inner-fence
  preservation, and parse-after-strip end-to-end.
- sample-topic-roundtrip.test.ts: renderable-MD coverage assertion
  extended to require all 15 dedicated body-section elements (was 12).
- sample-topic.html fixture exercises the 3 new elements.

Re-ran the M1 T2 fluency check on the same 20 fixtures with the
19-element vocabulary. Results:
- 20 / 20 valid (gate cleared)
- 100% cohort-appropriate placement (unchanged)
- bv-timestamp 60% adoption (used where input has a clear date)
- bv-author 10% (used only on inputs that name an owner)
- bv-pattern 0% — model correctly did not over-emit; no fixture
  contained regex content. Restraint is good behaviour.
- Cost: $0.61 (vs. $0.60 for the 16-element draft; negligible)

Updated fluency-report.md in research repo with final coverage matrix
and the 19-element → markdown-writer mapping for T3.
…Query/CurateLogEntry

Add per-call token usage (input, output, cached, cache-creation), latency tiers
(searchMs, llmMs, totalMs), and format mode (markdown vs html) to QueryLogEntry
and CurateLogEntry. Token rollup uses a per-task TaskUsageAggregator subscribed
to llmservice:usage events on the agent bus; LoggingContentGenerator emits the
events on both streaming and non-streaming paths. AiSdkContentGenerator folds
result.usage and providerMetadata.anthropic.cacheCreationInputTokens into a
synthesized rawResponse so per-provider extraction lands consistently across
Anthropic, OpenAI, and Google.

Live-validated against 30 curates × 3 providers; all four token fields populate
end-to-end with cacheCreationTokens correctly Anthropic-only.
[ENG-2738] HTML-emission contract in curate.txt + fluency check
- core/interfaces no longer reaches into infra: CurateUsageRecord moved to
  core/domain/entities/curate-log-entry, IUsageAggregator added in
  core/interfaces/telemetry/, TaskUsageAggregator now `implements` it.
- Consolidate curate telemetry assembly inside CurateExecutor:
  runAgentBody invokes reportTelemetry on both happy and error paths;
  agent-process passes onTelemetry that forwards via task:curateResult.
  Removes the duplicated inline assembly in agent-process.
- QueryExecutor.buildTiming prefers TaskUsageAggregator.getLlmMs() over the
  executeOnSession wall-clock so llmMs counts only LLM-call time, not
  tool-execution overhead. Falls back to wall-clock for tests.
- CurateLogHandler error finalization merges telemetryFields(state.usage)
  so failed curates don't underreport cost when the executor's best-effort
  error-path reportTelemetry has populated state.usage.
- OOD short-circuit in QueryExecutor routes through formatDetector.detect([])
  instead of returning format: undefined, so future HTML-aware detectors
  don't get silently bypassed.
- DRY: extract makeUsageListener helper in agent-process and reuse for
  both curate and query handlers.
- Document that PROVIDER_TYPES iteration in LoggingContentGenerator only
  matches 'aiSdk' in production (all live traffic flows through AiSdkContentGenerator).
…ocation

Re-review flagged the gap that error-path reportTelemetry being silent could
mask a future regression where failed-curate telemetry disappears from disk.
Pin both directions of the contract:
- happy path: onTelemetry fires once before runAgentBody returns
- error path: onTelemetry fires once before the throw propagates so the
  daemon can still emit task:curateResult ahead of task:error
feat: telemetry producer for QueryLogEntry + CurateLogEntry
…or wiring

Wires the M1 HTML context-tree experiment into the curate flow. When the
new BrvConfig.useHtmlContextTree flag is enabled, curate emits a
<bv-topic> HTML document using T1's element registry (19 elements after
T2's expansion) and atomically writes it to .brv/context-tree/<path>.html.
When the flag is off (the default), the existing markdown curate path
runs unchanged.

Rolled-in T2 review fixes (PR #628 auto-review):
- curate.txt: split the "Standard HTML inside <bv-*>" clause per
  allowed-children semantics. Inline-only subset (code/strong/em) for
  bv-rule, bv-task, bv-flow, bv-fact, bv-pattern, bv-timestamp,
  bv-author. Full block subset for bv-topic and the block-content
  elements. Removes the M1-passthrough/M2-strict drift the reviewer
  flagged.
- curate.txt: prefix the examples block with a disclaimer that the
  fences are for prompt readability only and the agent's actual output
  must be bare HTML. Lower-cost mitigation for the model's stubborn
  fence-wrap habit (T2's stripCodeFenceWrapper still handles the
  remaining cases).
- curate-prompt.test.ts: enum-value tests now anchor to the section
  after each element header instead of whole-file string match.
  Catches "severity moved from bv-bug to bv-decision" drift.
- curate-prompt.test.ts: new test that parses every fenced example
  block from curate.txt with parseHtml and runs each <bv-*> element
  through its registered validator. Catches example typos
  (severity="hihg") AND vocabulary drift in one assertion.

T3 proper:

BrvConfig:
- New optional boolean field useHtmlContextTree (default undefined ==
  false). Plumbed through the JSON validator, withReviewDisabled,
  withVersion, and withSpace withers. withoutSpace already used
  spread, so it picks up the new field automatically. 8 new BrvConfig
  unit tests cover round-trip + wither preservation.

format-detector (src/server/infra/render/format/format-detector.ts):
- Two pure functions: getFormatForWrite(config) reads the flag,
  getFormatForRead(filePath) inspects the extension. Default markdown.
  12 unit tests cover both branches plus edge cases (case-insensitive
  extensions, directory-named-foo.html with .md inside, etc.).

html-writer (src/server/infra/render/writer/html-writer.ts):
- Public surface: validateHtmlTopic(html) (pure validation, no I/O)
  and writeHtmlTopic({contextTreeRoot, rawHtml}) (validation + atomic
  write). Pipeline:
    1. stripCodeFenceWrapper (T2-provided) on raw input.
    2. parseHtml + walkElements; require exactly one <bv-topic> and
       a non-empty path attribute.
    3. Run every typed bv-* element through its registered validator.
       Reject on any failure with structured errors.
    4. Atomically write via DirectoryManager.writeFileAtomic to
       <contextTreeRoot>/<path>.html.
- Path-traversal hardening: backslashes normalised, `..` segments
  rejected, leading slashes stripped, resolved path must stay under
  contextTreeRoot. 15 unit tests cover validity branches, fence
  variants, atomic-write success/failure, and traversal attempts.

curate executor wiring:
- CurateExecuteOptions gains useHtmlContextTree?: boolean.
- runAgentBody branches the runtime prompt's final-step instruction:
  HTML mode tells the agent its FINAL RESPONSE is the bv-topic
  document and explicitly forbids tools.curate calls (which would
  write a sibling .md file). MD mode keeps the existing "verify via
  result.applied[].filePath" instruction unchanged.
- After agent.executeOnSession returns, HTML mode routes through a
  new handleHtmlCurateResponse() that calls writeHtmlTopic and shapes
  a CurationStatus from the result (success/failed). MD mode
  continues to call parseCurationStatus.
- 6 new unit tests cover: valid topic written, fence-wrapped response
  stripped, missing bv-topic → failed, invalid attribute → failed,
  flag=false uses MD path, flag=undefined uses MD path.

agent-process plumbing:
- The daemon's curate dispatcher reads cachedBrvConfig.useHtmlContextTree
  and forwards it on every CurateExecuteOptions call. Default false
  remains production-safe; the bench's Configuration B sets it to true
  via .brv/config.json.

Tests: 7892 pass, 0 fail. Typecheck clean, lint clean, build clean.

Out of scope (per the M1 plan):
- Storage path migration to .brv/render/ — M2.
- IR (ComponentIR) compilation — M2.
- Multi-tier query / structural selectors — T4.
- to-injection/to-evidence/to-summary/to-md adapters — M2.
- Migration tool for existing .brv/context-tree/ — M2.
HTML is the curate format; there is no MD/HTML branching to control.
Removing the flag simplifies the executor (one prompt, one response
handler), the BrvConfig schema (no per-project toggle), and the
format-detector (no write-side decision — only read-side dispatch
remains for legacy MD files).

Code changes:

- BrvConfig: remove the optional `useHtmlContextTree` field, the JSON
  validator branch, all wither plumbing, and the dedicated test block.
- ICurateExecutor: remove `useHtmlContextTree` from CurateExecuteOptions.
- CurateExecutor: collapse the prompt branch — every curate runs the
  HTML-emission instruction. Always route the agent's response through
  `handleHtmlCurateResponse`. Remove the dead `parseCurationStatus`
  method (the JSON-status-block path is gone).
- agent-process: drop the `useHtmlContextTree` plumbing on the curate
  dispatch site.
- format-detector: drop `getFormatForWrite` (no decision to make);
  keep `getFormatForRead` so the query/search read path can still
  route legacy `.md` files transparently. Drop the BrvConfig import.

Test changes:

- BrvConfig: drop the 8-case `useHtmlContextTree` describe block.
- format-detector: drop the 4 `getFormatForWrite` cases.
- curate-executor-html-mode: drop the two MD-fallback cases (no flag
  to toggle); the file is now strictly HTML-emission tests.

Comment cleanup (separate concern, same commit):

Strip milestone scaffolding (M1, M2, T1–T7) from comments across the
render layer. Comments now describe logic and decisions, not the
project plan that scaffolded them. Specifically:

- "M1 light validation; ADR-007 §13 strict validation is M2" →
  "Light validation; strict validation per ADR-007 §13 is future work"
- "M1 element registry"/"M1 vocabulary" → "element registry"/"closed
  bv-* vocabulary"
- "T3 wires the writer", "T4 owns the index", etc. → describe the
  consumer by role, not by milestone task
- "M1 keeps the existing .brv/context-tree/ storage" → describe what
  the resolver encodes, not which milestone made the decision
- Test descriptions: "(parse-and-skip — M1 light validation)" →
  "(parse-and-skip — light validation)"

Pipeline:
- 7878 tests pass (lost 14 with the dropped flag — 8 BrvConfig + 4
  format-detector write + 2 executor MD-fallback).
- Typecheck clean, lint clean, build clean.
End-to-end smoke testing surfaced a gap: the executor's runtime prompt
told the agent to emit HTML as its final response, but the base system
prompt (system-prompt.yml) had ~500 lines of detailed guidance for the
old structured-args curate flow (UPSERT/ADD/UPDATE/MERGE/DELETE with
rawConcept/narrative/facts content shape). The system prompt's
specifics overrode the runtime instruction; the agent kept building
JS data structures via code_exec and never emitted HTML, so the writer
saw nothing to validate and no .html file landed.

Rewrite the curate-related sections of system-prompt.yml to match the
HTML-emission contract:

- Update <TOOL_QUICK_REFERENCE>, <WORKFLOW_SUMMARY>,
  <PATH_TITLE_CLARIFICATION>, <PROHIBITED_BEHAVIORS>, <DIRECTORIES>,
  and the <SANDBOX_PROGRAMMING> curate example to instruct HTML
  emission. The agent's final response IS the bv-topic document; do
  NOT call tools.curate; do NOT wrap in code fences.
- Replace the entire ## Curation Workflow section (~500 lines, lines
  275–786) with a tighter HTML-focused workflow (~217 lines):
  - Keep the recon → mapExtract → dedup → groupBySubject preprocessing
    helpers (still useful for chunked/large inputs).
  - Document the bv-topic frontmatter (path/title required; summary,
    tags, keywords, related optional) and explicitly call out that
    importance/maturity/recency/updatedat are sidecar runtime signals,
    not authored.
  - Map MD sections to bv-* elements (Reason → bv-reason, Raw Concept
    sub-fields → bv-task/bv-changes/bv-files/bv-flow/bv-timestamp/
    bv-author/bv-pattern, Narrative subsections → bv-structure/
    bv-dependencies/bv-highlights/bv-rule/bv-examples/bv-diagram,
    Facts → bv-fact, runbooks → bv-decision/bv-bug/bv-fix).
  - Document inline-content vs block-content elements and the standard
    HTML allowed inside each.
  - Keep facts extraction guidance, content quality requirements,
    temporal preservation, and content preservation rules — translated
    to the bv-* element vocabulary.
  - Drop the operation-type table (UPSERT/ADD/UPDATE/MERGE/DELETE),
    Two-Part Context Model section, domainContext/topicContext/
    subtopicContext (these belonged to the structured-args flow that
    no longer exists).
- Update the query-side temporal-reasoning hint to reference
  bv-timestamp content + sidecar runtime signals instead of
  rawConcept/narrative timestamps.
- Update the related-topics search hint to reference the `related`
  attribute on bv-topic alongside the legacy `relations` frontmatter
  on .md topics.

Net change: 196 insertions, 494 deletions. The rewritten prompt is
~217 lines shorter and aligned with the actual emission contract in
curate.txt.

Verified end-to-end via local-auto-test/curate-html-e2e/ harness:
- 3/3 fixtures (bug-fix, rule-decision, general-context) curate to
  registry-valid HTML topic files (Sonnet 4.5).
- File paths land at <projectDir>/.brv/context-tree/<path>.html as
  expected.
- Per-fixture latency: 19.3s, 30.5s, 27.3s.

Pipeline: 7878 unit tests pass, lint clean, typecheck clean, build
clean.
…e time

The previous markdown writer rendered `createdAt` and `updatedAt` in
the YAML frontmatter of every topic file. The HTML writer was dropping
both — the LLM doesn't author them and the writer wasn't injecting
them either, so HTML topics had no temporal metadata on disk.

Inject both as `<bv-topic>` attributes during the writer's atomic
write, after fence-stripping and registry validation:

- `updatedat` is always set to the current ISO-8601 instant.
- `createdat` is preserved from the existing file on disk if one
  exists at the resolved path; otherwise it is set to the current ISO
  instant. Re-writes leave `createdat` alone, advance `updatedat`.
- Any value the LLM authored for these attributes is overridden — the
  agent is not allowed to choose its own timestamps.

Implementation: surgical regex edit on the bv-topic opening tag (no
parse → re-serialize round-trip) so the LLM's formatting (whitespace,
attribute order, quoting style) survives intact. Two helpers added to
html-writer.ts:
- setBvTopicAttributes(html, attrs) — replace-or-insert each attr.
- readExistingTopicAttribute(filePath, name) — surgical read of an
  attribute value from an existing on-disk topic without parsing the
  whole document.

Tests:
- Existing assertions about file content equaling VALID_TOPIC updated
  to match-with-injected-timestamps (body preserved verbatim, bv-topic
  opening tag now has createdat/updatedat).
- Three new cases under "system-managed timestamps":
  1. First write injects both timestamps within the test window.
  2. Re-writes preserve createdat, advance updatedat.
  3. LLM-authored values for either attribute are overridden by the
     system value.

Verified end-to-end via local-auto-test/curate-html-e2e/ — all 3
fixtures produce HTML topic files with system-injected createdat and
updatedat ISO-8601 timestamps on the bv-topic root, alongside the
LLM's frontmatter (path/title/summary/tags/keywords/related).

Pipeline: 18 html-writer unit tests pass (3 new), full suite green,
typecheck clean, lint clean.
CI test failure (curate-executor-html-mode.test.ts:79) — the file-content
assertion still expected the LLM's raw HTML; now matches the
timestamp-injected file by checking the body is preserved verbatim and
that createdat/updatedat are present. Same fix in the fence-stripping
test variant.

Review feedback addressed:

1. Telemetry format hardcoded to 'markdown' (curate-executor.ts:485).
   The flag is gone; HTML is the only curate format. Updated to emit
   'html' and reworded the doc comment. Updated the regression test in
   curate-executor.test.ts:491 to assert 'html' accordingly.

2. validateHtmlTopic does not catch path-traversal (html-writer.ts).
   The doc comment promised "safe to use the topic" but a payload like
   `<bv-topic path="../../../etc/passwd">` returned `{ok: true}` —
   path-traversal was only caught later when topicPathToFilePath threw.
   That works end-to-end (executor catches and surfaces as failed
   status), but is a foot-gun for any future caller using
   validateHtmlTopic standalone (preview, dry-run).

   Move the path-segment safety check into validateHtmlTopic and
   surface as a new structured error kind 'unsafe-path'.
   topicPathToFilePath keeps path.resolve as defence-in-depth. Added
   two unit tests covering `..` segments and single-`.` segments;
   updated the existing executor-side test to expect a validation
   failure (with `unsafe-path` errors) instead of a thrown exception.

3. Attribute-validation errors swallowed the `field` discriminant
   (curate-executor.ts:344). The writer captures which attribute
   failed (`severity`, `id`, etc.) but the executor's
   verification.missing string only included the tag. So a malformed
   `severity="urgent"` was logged as `attribute-validation
   (bv-rule): …` rather than `attribute-validation (bv-rule.severity):
   …`. Updated the mapping to include the field for
   attribute-validation errors. Non-blocking but improves bench-log
   debuggability.

Reviewed but obsolete (not actioned):

- "format-detector.ts ships an unused getFormatForWrite" — removed in
  the flag-drop commit f701cd2.
- "agent-process.ts double-coerces useHtmlContextTree" — also obsolete
  (flag dropped).

Pipeline: 7883 unit tests pass, lint clean, typecheck clean, build
clean. E2e smoke (curate-html-e2e/) green on the rebuilt CLI.
[ENG-2739] curate path — HTML writer + format detector + executor wiring
…ch routing

Wires the search/query path to read HTML topic files transparently
alongside markdown ones. Curate emits HTML; the legacy MD path stays
in for `brv swarm` and existing trees that haven't migrated. The BM25
index is format-agnostic (HTML inner-text decodes to the same plain
text the markdown tokenizer expects); the indexer dispatches on file
extension at read time.

New modules:

- src/server/infra/render/reader/html-reader.ts — readHtmlTopicSync(html)
  returns {bodyText, elements, topicAttributes}. bodyText is the
  entity-decoded inner text, ready to feed BM25. elements is a flat,
  document-order list of every typed `<bv-*>` node with its attribute
  map. topicAttributes lifts the bv-topic frontmatter (path, title,
  summary, tags, keywords, related). Forgiving on malformed input
  (parse5 returns a best-effort tree).

- src/server/infra/render/reader/element-axis-index.ts — in-memory
  ElementAxisIndex. Two query keys:
    findByTag(tag) → paths
    findByAttribute(tag, attr, value) → paths
  Reverse path→keys map keeps invalidation O(keys-touched) instead of
  full-scan. Built in lockstep with the BM25 index; same lifetime,
  same invalidation triggers. Today's consumer is the optional
  `elementHint` filter on SearchOptions; the structural-selector
  grammar will hook in here without further reshaping.

Search service changes (search-knowledge-service.ts):

- File discovery: glob extended from `**/*.md` to `**/*.{html,md}`.
- Indexing: new readIndexableContent() helper dispatches on
  getFormatForRead(filePath). HTML files route through readHtmlTopicSync
  (the BM25 input is bodyText, not raw markup) and contribute their
  element list to the structural-axis index. MD files index their raw
  content as before.
- IndexedDocument gains a `format: 'html' | 'markdown'` field, set
  from the file extension at index time and surfaced on every
  SearchKnowledgeResult entry. Telemetry/query-log consumers
  (T5-shipped QueryLogEntry.format) populate from this without
  format-detection at the call site.
- CachedIndex gains `elementAxisIndex: ElementAxisIndex`, populated
  during buildFreshIndex from the per-document elements walk. The
  empty-tree fallback also constructs an empty index so callers can
  treat the field as non-nullable.
- SearchOptions gains `elementHint?: ElementHint`. When present, the
  search composes a MiniSearch filter that intersects scope +
  element-axis matches before BM25 ranking. No caller passes the hint
  today; M2 selector grammar is the consumer.
- INDEX_SCHEMA_VERSION bumped to 6 so previously-cached indexes built
  against MD-only schema rebuild on next query.
- Title resolution: HTML files use the bv-topic title attribute; MD
  files retain the existing extractTitle frontmatter walk.
- The structural-selector hint is also exposed on the public
  SearchKnowledgeOptions as `SearchKnowledgeElementHint` — wired
  through ToolsSDK so SDK consumers can pass it without going through
  the internal SearchOptions surface.

Tests:

- html-reader (9 cases): bodyText extraction, entity decoding,
  attribute lift, elements-list document order + de-duplication,
  unknown-bv-* skipping, missing bv-topic graceful behaviour,
  malformed input non-throw.
- element-axis-index (10 cases): findByTag/findByAttribute, empty-array
  for unknown tags, case-sensitivity on values, single-file multiple-
  contribution dedup, remove() drops all keys + cleans empty key sets,
  remove() on unknown path no-op, clear().
- search-knowledge-service-html-routing (8 cases): HTML files indexed,
  format field populated for both HTML and MD, raw markup is NOT
  searchable (inner-text only), bv-topic title attribute lifts as
  doc title, elementHint pre-filter (no-match returns empty,
  tag-only restricts, tag.attr=value restricts further).

Pipeline: 7910 unit tests pass (27 new), typecheck clean, lint clean
(0 errors), build clean.

Out of scope (per the M1 plan):
- Structural selector grammar — M2 (this PR ships the substrate).
- findReferences reverse-graph — M2.
- Signal-precise cache invalidation — mtime-based today is sufficient.
- BM25 ranking parity test (MD vs HTML same content) — M2 bench
  measures end-to-end; the routing tests above cover the
  determinism boundary.
- HTML BM25 input now concatenates `<bv-topic>` summary/tags/keywords/related
  into the indexable content. Without this, a term living only in those
  attributes ranked HTML topics far below the markdown equivalent (where
  YAML frontmatter is fed verbatim into BM25).
- HTML title fallback uses `?.trim() || fallbackTitle` so a `title=""` or
  whitespace-only attribute doesn't leak into BM25's 3x-boosted title field.
- `ElementHint` (service + tools-sdk) becomes a discriminated union so
  `attribute` and `value` must be supplied together at compile time.
- `ElementAxisIndex` storage swaps the stringly-keyed
  `${tag}.${attr}=${value}` table for nested Maps — eliminates the
  `.`/`=`-in-attribute collision class without a delimiter discipline.
- `html-reader` lifts attributes off the FIRST `<bv-topic>` encountered
  unconditionally rather than the first non-empty one (matches the
  contract on malformed input).
- Glob discovery does two parallel passes (`**/*.html`, `**/*.md`)
  instead of relying on the underlying engine to expand `{html,md}`.

Tests: +7 (BM25 attribute-payload parity, title-fallback, FS-backed
reader round-trip, topic-attribute lift contract, attribute-name/value
collision robustness x3). All 7917 pass; lint + typecheck clean.
[ENG-2740] query path — HTML reader + element-axis index + search routing
…sponse

Tier 2 (`brv query`'s direct-search response — no LLM) reads the full
file at `query-executor.ts:701` and ships the bytes back to the user
through `formatDirectResponse`. For HTML topics that's raw
`<bv-topic ...><bv-rule severity="must">...</bv-rule>...` markup.

Two consequences observed during the M1 multi-tier audit:

  1. The 5000-char content budget in `direct-search-responder.ts:11`
     burns on tags and attribute syntax before reaching meaningful
     text. Long topics get truncated mid-element.
  2. Any LLM-shaped consumer downstream (Tier 4 agent re-reads, the
     bench harness, third-party integrations) re-parses the HTML.
     Wasted tokens; lossy semantics.

Stripping every tag (bodyText only) loses the typed-element signals
the format exists to carry — severity on rules, subject/value on
facts, id on decisions. New `renderHtmlTopicForLlm(html)` walks the
parsed tree and renders each `<bv-*>` element with a semantic prefix:

  - `<bv-rule severity="must" id="r-1">x</bv-rule>` →
    `- **Rule** [must] (r-1): x`
  - `<bv-fact subject="alg" value="RS256">…</bv-fact>` →
    `- **Fact** (subject=alg, value=RS256): …`

Wired into `tryDirectSearchResponse`: when `result.format === 'html'`
the full content goes through the renderer; markdown topics pass
through verbatim. Renderer is forgiving on malformed input (mirrors
the rest of the parse5-driven reader pipeline); on any throw we fall
back to raw bytes so a single bad topic can't blank the response.

Tests: +15 (12 renderer, 2 Tier 2 HTML branch, 1 Tier 2 markdown
regression guard). 7931 pass; lint + typecheck + build clean.
fix: Tier 2 direct response renders HTML topics into structured markdown
…annel auth

Phase 1, Slice 1.0 of the channel-protocol rollout. Prepares the transport
server for the daemon-token auth middleware that arrives in Slice 1.4
(CHANNEL_PROTOCOL.md §2; DESIGN.md §5.6).

Three connected changes:

- Widen TransportServerConfig.corsOrigin from string-only to the full
  Socket.IO origin union (string | string[] | RegExp | RegExp[] |
  OriginCallback). Channel auth needs callback and regex shapes for
  loopback-only origin enforcement; string-only blocked them.

- Extend RequestHandler with an optional third ctx parameter exposing
  handshake auth.token and origin. Backward-compatible: existing handlers
  continue to match the (data, clientId) shape; new channel handlers
  consume ctx to enforce CHANNEL_UNAUTHORIZED before dispatching.

- Add DaemonTokenStore (readOrCreateDaemonAuthToken) that persists a
  256-bit token to <data-dir>/state/daemon-auth-token with mode 0600,
  regenerating on missing file, wrong perms, or empty contents. Atomic
  rename for crash safety; POSIX-only perm check documented for v1.

15 new tests across the three layers; 22 pre-existing
socket-io-transport-server tests still green.

See plan/channel-protocol/IMPLEMENTATION_PHASE_1.md Slice 1.0 for the
full scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-HTML-migration, every QueryLogEntry on disk reports
`format: "markdown"` regardless of whether the recall touched HTML or
markdown topics. Telemetry consumers (query-log summaries, analytics
dashboards, the brv-bench harness) cannot distinguish HTML vs markdown
traffic — every chart flat-lines as "all markdown" even on a fully
migrated tree.

Root cause: T5 (ENG-2741) introduced `MarkdownOnlyFormatDetector` as a
deliberate stub with the explicit comment "swap when the format-detector
task lands the real extension-aware detector." T3 (ENG-2739) shipped
`format-detector.ts` for read-routing inside `search-knowledge-service`
but did not swap the executor's binding. The stub stayed wired as the
default in `QueryExecutor`.

This change:

- Adds `ExtensionAwareFormatDetector` — inspects each
  `matchedDoc.path` extension and returns `'html'` if any HTML doc was
  retrieved, `'markdown'` for legacy-only recalls, `undefined` for
  empty recalls (cache hit, OOD short-circuit, tier 4 LLM-only).
- Single-`'html'` policy is deliberate: any HTML doc is the
  load-bearing signal for telemetry rollups in a partially-migrated
  tree. Reporting `'markdown'` for a mixed result would hide HTML
  traffic from cost / coverage dashboards.
- Strips shared-source `[alias]:` prefixes before extension parsing so
  shared-source paths classify the same as local.
- Defaults to `'markdown'` for extensionless paths (defensive).
- Wires as the default `formatDetector` binding in `QueryExecutor`.
- Keeps `MarkdownOnlyFormatDetector` for tests that pin pre-migration
  semantics; updates its docstring + comment to reflect the new role.

Tests: +9 (extension-aware detector full coverage). 7940 pass; lint
runs through the pre-existing submodule config issue, otherwise clean.
ByteRover-provider runs produce QueryLogEntry records with no
`inputTokens` / `outputTokens` / `cachedInputTokens` /
`cacheCreationTokens` even though Tier 3 calls take seconds of llmMs.
Telemetry consumers cannot compute per-query cost.

Root cause: T5's telemetry pipeline depends on
`GenerateContentChunk.rawResponse` being set on the terminating
chunk so `LoggingContentGenerator.generateContentStream()` can
capture the last non-undefined `rawResponse` and feed it to
`pickRawUsage()` (`.usage ?? .usageMetadata`). The AI-SDK content
generator wires this on its `'finish-step'` case but
`ByteRoverLlmHttpService.extractContentFromResponse()` (the simulated
stream backing every `ByteRoverContentGenerator` call) yielded chunks
without ever attaching the backend response. Result: `lastRawResponse`
stays undefined, no `llmservice:usage` event fires, no tokens land on
the log entry.

Fix: attach the full `GenerateContentResponse` as `rawResponse` on the
terminating chunk in both branches of `extractContentFromResponse` —
empty-content and full-content paths. The `pickRawUsage` helper
already handles the Gemini-shaped (`usageMetadata`) and Claude-shaped
(`usage`) payloads the ByteRover backend returns through this code
path, and the existing `extractUsage('google'|'anthropic')` mappers
take it from there.

Tests: +3 (terminating-chunk rawResponse coverage across content,
empty-parts, and function-call termination cases). 7943 pass.
The companion prompt loaded into every curate-mode invocation
(`curate-detail-preservation.yml`, auto-discovered by
`companion-contributor.ts:discoverCompanionPrompts('curate')` against
the `curate-*.yml` glob) still instructed the agent to "store in
narrative.rules / narrative.highlights / narrative.diagrams /
narrative.structure / narrative.dependencies / narrative.examples /
rawConcept.flow / content.facts / snippets" — every one of which is a
JSON-schema field on the pre-T3 curate-tool API that no longer exists.
T3 (ENG-2739) dropped `useHtmlContextTree` and made HTML the sole
emission format, but T2 (ENG-2738, the HTML-emission contract) only
updated `curate.txt` and missed this companion file.

Empirically observable on the proj/html-mem-conversion branch: a
LoCoMo-bench curate against `conv_26/session_1` produced a `<bv-topic>`
with one giant `<bv-rule>` element wrapping a markdown narrative —
the agent saw "store in narrative.rules" and treated `<bv-rule>` as
the closest match. Other sessions in the same run emitted clean
typed-element shapes; the variance correlates with whether the
legacy companion prompt dominated that turn's context.

This change rewrites the file so every preservation rule maps to the
correct `<bv-*>` element from the closed vocabulary
(`bv-rule`, `bv-task`, `bv-fact`, `bv-decision`, `bv-bug`, `bv-fix`,
`bv-flow`, `bv-diagram`, `bv-highlights`, `bv-structure`,
`bv-dependencies`, `bv-examples`, `bv-pattern`). General preservation
principles (completeness over conciseness, exact preservation, never
summarize) are kept verbatim. Adds an explicit guard at the end of
the prompt telling the agent NOT to use `narrative.*` / `rawConcept.*` /
`content.*` JSON-schema field names so any residual model bias toward
the pre-migration API surfaces is suppressed.

Tests: no test fixture asserts on this file's contents (it's a runtime
agent prompt, exercised via end-to-end curate runs). YAML still parses;
full suite green at 7943.
… goalposts

Phase 1, Slice 1.1 of the channel-protocol rollout. Ships only test files;
no production code. The tests compile and fail at runtime against the
ChannelTestHarness stub with a clear "Slice 1.1 stub" message pointing at
the slices that will turn each test green. This is the red signal Outside-In
TDD requires before any orchestrator / handler / oclif code lands.

New files:

- test/helpers/temp-dir.ts — tmpdir + recursive cleanup helper.
- test/helpers/temp-context-tree.ts — scratch project dir with the
  .brv/context-tree/ root expected by the orchestrator.
- test/helpers/channel-test-harness.ts — boot() + run() + shutdown() +
  simulateSnapshotLoss() API. All methods are stubs that throw with a
  pointer to the implementing slice. Documents both runtime options
  (in-process daemon vs subprocess) for Slice 1.4/1.5 to pick.
- test/integration/channel-phase1-happy-path.test.ts — encodes the §6.4
  demo: new → list → get → post → list-turns → show → archive. Includes
  the auth-rejection canary (Phase 1 DoD §5): orphan BRV_DATA_DIR with no
  daemon-auth-token MUST fail with CHANNEL_UNAUTHORIZED or
  ERR_BRV_DAEMON_NOT_INITIALISED.
- test/integration/channel-phase1-crash-recovery.test.ts — encodes DoD §2:
  drop turn.json, replay from events.jsonl.
- test/integration/channel-phase1-append-finalize-race.test.ts — encodes
  DoD §3: N concurrent posts to one channel must serialise through the
  per-turn write lock with no torn writes.

Red signal verified: `npx mocha test/integration/channel-phase1-*.test.ts`
reports 3 failing, 0 passing, all at the boot() stub. Slice 1.0 tests
remain green (37 passing).

See plan/channel-protocol/IMPLEMENTATION_PHASE_1.md Slice 1.1 for the full
scope and Slices 1.2-1.5 for the green-path schedule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1, Slice 1.2 of the channel-protocol rollout. Defines the shared wire
types and event constants the daemon, oclif commands, and adapter SDKs will
consume. Two concerns landed together so Phase-1 request validation works
end-to-end:

  - Full ChannelEvents constants object (all 19 events per CHANNEL_PROTOCOL.md
    §3) is exported from day one with logical lifecycle/membership/turns/
    broadcast grouping. Names are locked across phases so the wire never
    churns; only handler registration and request schemas grow per phase.

  - Phase-1 request/response zod schemas for the 7 client-to-host events
    (create, list, get, archive, post, list-turns, get-turn) plus the two
    Phase-1 broadcasts (turn-event, state-change). Phase-2 events are present
    as constants but have no request schema until Phase 2 wires them up.

Supporting domain types in src/shared/types/channel.ts:
  - ContentBlock (ACP-aligned discriminated union, passthrough() so unmodeled
    ACP fields round-trip).
  - TurnAuthor, ChannelMember, ChannelMemberSummary (discriminated unions
    matching CHANNEL_PROTOCOL.md §4 + §5.1).
  - Turn, TurnDelivery, TurnEvent (full §7.1 union — Phase 1 only exercises
    `message` and `turn_state_change`, but the schema is forward-complete so
    later slices don't re-touch the file).
  - Channel + ChannelMeta (wire vs on-disk shapes per §10 + §4.2).
  - TurnState / TurnDeliveryState enums.

ACP source resolution (Phase 1 plan Open Decision 4): installed
@agentclientprotocol/sdk@^0.21.0; package exports ContentBlock /
PromptRequest / RequestPermission* / PermissionOption / PlanEntry /
SessionUpdate as TypeScript types. Our local zod schemas mirror these for
runtime validation; orchestrator code that needs strict ACP type alignment
imports from the SDK directly. The promptBlocks field is typed properly (not
unknown) so §8.4 prompt-empty validation is accurate.

17 new tests; full prior suite (54 transport+auth tests) still green.

See plan/channel-protocol/IMPLEMENTATION_PHASE_1.md Slice 1.2 for scope and
plan/channel-protocol/CHANNEL_PROTOCOL.md §3/§4/§8 for the spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses three live-test failures from the 2026-05-24 cross-bridge session
(see plan/channel-protocol/BUG_REPORT_PARLEY_TIMEOUTS_2026-05-24.md):

1. PARLEY_LOCAL_AGENT_PROFILE_MISSING when BRV_BRIDGE_CLAUDE_UNSAFE is
   unset but BRV_BRIDGE_PARLEY_PROFILE=claude-code is set. Root cause:
   AcpAdapter shadows the reserved claude-code profile name. Fix: new
   BUILTIN_PARLEY_PROFILE_NAMES Set<string> in parley-adapter-registry.
   AcpAdapter registration is skipped (NOT early-returned) for reserved
   names; strict resolve then fails-fast with the existing UNSAFE hint.

2. TRANSCRIPT_TERMINAL_MISSING after substantive work. Root cause: seal-
   emission race where the dialer's libp2p stream is torn down before
   the seal frame lands. Two-layer fix:
   - Responder (parley-server.ts): wrap transcript_seal sendFrame in
     try/catch on both success + error paths; log failures with
     channelId/turnId; don't crash the dispatcher.
   - Dialer (parley-client.ts): degraded-completion fallback in
     verifyResponseStream. When the frame set contains a SIGNED
     stream_end as the last non-heartbeat frame plus ≥1 agent_message_
     chunk but no transcript_seal, return the completion result with
     sealOrigin='implicit-from-signed-terminal' + integrityDegraded=
     true. Signature verification binds the same payload fields as
     verifyResponseTerminal (channel_id, delivery_id, protocol,
     request_envelope_hash, seq, turn_id, terminal_payload) so the
     fallback never trusts a forged terminal. Schema additions in
     shared/types/channel.ts surface the markers in
     `brv channel show --json`.

3. ACP_PROMPT_FAILED: operation aborted after ~10min with zero tool
   calls. Root cause likely libp2p connection-level abort but
   unconfirmed (no 10-min constant in the source). Defensive fix:
   - Split timeouts: BRV_BRIDGE_PARLEY_DIAL_TIMEOUT_MS (30s default)
     covers ONLY the dial phase; BRV_BRIDGE_PARLEY_TURN_IDLE_TIMEOUT_MS
     (60min default) is an idle timer that resets on every responder
     frame (chunks, heartbeats, thoughts, tool_use).
   - Twin AbortControllers in remote-member-driver.ts. onDialComplete
     callback clears the dial timer + starts the idle timer + resets
     lastActivityAt to "now" so the idle window measures from post-
     dial, not turn start. onFrameReceived resets lastActivityAt on
     every received frame.
   - New PhaseStampedAbort error class with phase/elapsedMs/frameCount/
     lastFrameKind/lastFrameSeq/localTimeoutFired wired into both
     dial and idle abort paths. signal.reason preservation through
     dialAndSendAndConsume + readResponseFrames race so the original
     timeout reason propagates instead of being replaced.
   - libp2p-host.dialAndSendAndConsume gains optional signal arg
     (backwards-compatible overload); abort listener cleaned up in
     finally; signal threaded into the body callback.

Cross-cutting: subscribe lost-wakeup race when --turn is set without
--after-seq. Fix: subscribe.ts resolveReplayCursor defaults
afterSeq=0 when --turn is set without an explicit cursor, triggering
the existing replay path. Listener-first ordering preserved (no new
fetch-vs-listener race).

Deferred (codex tracked):
- Layer D libp2p ping hardening: defer until a retest shows it's
  needed. The library is not auto-keepalive; periodic-ping would
  require explicit setInterval calling node.services.ping.ping().
- BridgeTranscriptService broadcaster: responder-side observability
  for cross-bridge turns still TODO.

Tests: 8713 passing (was 8674; +39 new). Lint + build clean.

Plan: plan/bridge-smoothness/PARLEY_TIMEOUT_FIXES.md
Codex review: 4 rounds — plan r1/r2 + impl r1/r2. All blockers
resolved before merge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 24, 2026

Claude finished @ngduyanhece's task in 9m 0s —— View job


Re-review — Channel + Bridge protocol (synchronize, 2026-05-24 05:20 UTC push)

Overall: 4 / 5 (unchanged). Phase 9.5.7 is a tightly-scoped fix-pack for three live-test bugs from the 2026-05-24 cross-bridge session. The commit message + plan doc explicitly track every codex round; the resulting code is structurally sound and well-tested.

  • Identify commits added since the prior review (2026-05-23 12:13 UTC)
  • Diff the new commits
  • Spot-check new code (Phase 9.5.7: parley timeout + observability hardening)
  • Verify prior-round findings (any addressed?)
  • Post inline findings (5 buffered)
  • Post final summary + score

Diff since 2026-05-23 12:13 UTC

SHA Change
1abbe7a5 Phase 9.5.7 — parley timeout + observability hardening. Three live-test bugs fixed: (1) AcpAdapter shadowing the reserved claude-code profile name → new BUILTIN_PARLEY_PROFILE_NAMES Set, AcpAdapter registration is skipped (not early-returned) for reserved names so downstream built-in registration still runs; strict resolve fails-fast with the existing UNSAFE hint. (2) TRANSCRIPT_TERMINAL_MISSING seal-emission race → responder wraps both seal-send paths in try/catch with diagnostic logs (Layer B); dialer's verifyResponseStream now async and falls back to a verified stream_end terminal when seal is missing AND chunks exist AND stream_end is the last non-heartbeat frame, marking the result sealOrigin='implicit-from-signed-terminal' + integrityDegraded=true (Layer A). Signature verification binds the same payload fields as verifyResponseTerminal (channel_id, delivery_id, protocol, request_envelope_hash, seq, turn_id, terminal_payload), so the fallback never trusts a forged terminal. (3) ACP_PROMPT_FAILED 10-min abort → split timeouts (BRV_BRIDGE_PARLEY_DIAL_TIMEOUT_MS default 30s, BRV_BRIDGE_PARLEY_TURN_IDLE_TIMEOUT_MS default 60min) with twin AbortControllers in remote-member-driver.ts; onDialComplete clears dial timer + starts idle interval + resets lastActivityAt to "now"; onFrameReceived resets lastActivityAt on every frame (heartbeats count). New PhaseStampedAbort error class carries phase/elapsedMs/frameCount/lastFrameKind/lastFrameSeq/localTimeoutFired. libp2p-host.dialAndSendAndConsume gains optional signal arg (BC overload preserved). Plus a subscribe.ts lost-wakeup fix: --turn set without --after-seq defaults afterSeq=0 to trigger replay.
21437c58 Docs only — plan/bridge-smoothness/PLAN.md + docs/channel-events.md (already reviewed in prior pass).

Total delta on this branch since 2026-05-23 12:13 UTC: 18 files, +2445 / -38.

What's good

  • Reserved-name guard is the right shape. BUILTIN_PARLEY_PROFILE_NAMES as ReadonlySet<string> for .has() membership; AcpAdapter registration skipped (not early-returned) so ClaudeCodeHeadlessAdapter registration still runs. The log line is operator-friendly with the recovery hint (BRV_BRIDGE_CLAUDE_UNSAFE=1).
  • Degraded-completion fallback verifies the stream_end signature using the SAME payload binding as the normal terminal-verify path (channel_id, delivery_id, protocol, request_envelope_hash, seq, turn_id, terminal_payload). Forged-key tests confirm the negative case throws TRANSCRIPT_TERMINAL_MISSING. Strict pre-conditions (signed stream_end + ≥1 chunk + stream_end is last non-heartbeat) — no path silently accepts an under-bound terminal. seq monotonicity check (line 320) defends against drop/reorder of unsigned chunks within the Noise session.
  • Twin AbortController + PhaseStampedAbort correctly separates dial-phase vs frame-read-phase timeouts; onDialComplete reset of lastActivityAt to "now" closes the round-2 codex finding (post-dial idle window shouldn't be shortened by slow dial); signal listener removed in finally so it never leaks.
  • signal.reason propagation through dialAndSendAndConsume and readResponseFramesInternal preserves the PhaseStampedAbort verbatim — operators get a stamped error with phase + frame state, not a generic The operation was aborted.
  • Test coverage is proportionate to the addition — 5 new test files (~1300 LOC) cover: degraded-fallback positive + 3 negative cases, reserved-name guard, env-config parsing, phase-stamped abort, replay cursor, abort-signal propagation, twin-timer behaviour.

New inline findings (5 buffered)

  1. parley-server.ts:538 — asymmetric protection: error-path errorFrame send is NOT wrapped in the §3.2 Layer B try/catch, but the immediately-following seal-send IS. If the dialer disconnects mid-generation (the exact Layer B scenario), the errorFrame throws first and the seal-send try/catch never runs, defeating the operator-visibility intent.
  2. libp2p-host.ts:191 — TOCTOU race between dialProtocol returning and signal.addEventListener('abort', onAbort, {once: true}). If the abort fires in that microsecond gap, the listener attaches against an already-aborted signal (AbortSignal doesn't replay), stream.abort() never runs, and the subsequent stream.send() can hang — defeating §3.3 Layer C in a narrow window. Cheap fix: if (signal.aborted) onAbort() after attach.
  3. parley-client.ts:344 — degraded-fallback comment says "chunks transported under same authenticated libp2p session", but the load-bearing in-session integrity property is the seq-monotonicity check at line 320-328. Note worth anchoring in the fallback rationale. PLUS: integrityDegraded=true only propagates through --json; a text-mode banner in brv channel show would surface this safety-critical state to operators who don't know to look.
  4. parley-timeout-config.ts:43 — no validation that dialTimeoutMs < idleTimeoutMs. Operator who sets IDLE=10s without overriding default DIAL=30s gets a structurally-uncatchable configuration. Warn-log at parse fixes it.
  5. subscribe.ts:39resolveReplayCursor returns {afterSeq, turn} but callers destructure only afterSeq; turn is identity. Shape polish.

Prior-round findings — status

1abbe7a5 did not touch any of the files flagged in earlier rounds. All ~15 prior-round findings still open as written (master-key colocation, NFC normalization writer-only, unbounded Map family across parley-nonce-lru.ts / parley-rate-limit.ts / auto-create-quota.ts, process-lock.ts doc/code drift, non-unique tmp paths in alias-store.ts + parley-adapter-session-store.ts, discovery.py bool-int, channel-skill/install-lib.js directory accept, claude-code-headless-adapter.ts env-passthrough + unbounded stderrChunks + flushStdoutBuffer race + doc/code drift + missing --verbose test, parley-adapter-registry.ts UNSAFE-without-deps logs-and-continues, bridge-config-store.ts listenAddrs not multiaddr-refined, ai-sdk-content-generator.ts synthetic usage masking, INTERNAL_TEST.md §3.1 / §3.5 / §5 / §5.1 doc trio). Consistent with the PR's "internal-test cut, follow-ups tracked" framing.

Verdict for this push

LGTM-incremental. Phase 9.5.7 is a high-quality fix-pack: each of the three live-test bugs has a documented root cause, a structural (not symptomatic) fix, and proportionate test coverage. The degraded-completion fallback is the most security-sensitive piece and its signature verification is properly bound to the same payload fields as the normal terminal-verify path — this is the right shape. The five inline findings are robustness/polish items (asymmetric error-path protection, TOCTOU race window, documentation anchor, config sanity warning, dead-return shape), not blockers.
· branch

seq: nextSeq(),
})
emittedFrames.push(errorFrame)
await sendFrame(stream, errorFrame)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (asymmetric protection, low): the §3.2 Layer B try/catch wraps the seal-send on the error path (lines 539-560) but the immediately-preceding errorFrame send at line 538 is NOT wrapped. The motivating scenario for Layer B — "dialer torn down the stream before the seal lands" — applies equally to the error frame: if the dialer disconnects during generation, the error frame send fails first, throws out of the catch handler, and the carefully-crafted seal-send try/catch never runs. The diagnostic [parley-server] Failed to send transcript_seal frame (error path) log that §3.2 Layer B is designed to surface never fires.

Cheap fix — wrap both sendFrame calls symmetrically:

try {
  await sendFrame(stream, errorFrame)
} catch (errFrameError) {
  console.warn(`[parley-server] Failed to send error frame for turn=${envelope.turn_id} ...`)
  // Don't return — still try the seal for the dialer's implicit-fallback path
}
try {
  await sendFrame(stream, buildTranscriptSealFrame({...}))
} catch (sealError) { /* existing log */ }

Or accept the throw and add a single outer try/catch around both sends. Non-blocking; on the happy path nothing changes, but the operator-visibility intent of Layer B is partially defeated for the most common torn-down-mid-generation scenario.

readonly signal?: AbortSignal
},
): Promise<T>
/**
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (TOCTOU race, low-medium — defeats §3.3 Layer C in a narrow window): there's a small race window between dialProtocol returning (line 184) and signal.addEventListener('abort', onAbort, {once: true}) (line 191). If combinedAbortController.signal aborts in that gap (e.g. RemoteMemberDriver's dial-timer fires exactly as dialProtocol resolves), the listener attaches against an already-aborted signal — and AbortSignal does NOT replay events for listeners attached after the abort fired. Result: stream.abort() is never called, the listener never runs, and the next await stream.send(payload) can hang because nothing tore the substream down.

The likelihood is low (the window is microseconds), but the consequence — a turn that hangs until libp2p's own connection-level timeout fires, which is the very pathology Phase 9.5.7 is fixing — is the exact failure-mode this PR sets out to make impossible.

Cheap fix — replay the abort check right after attaching:

if (signal !== undefined) {
  signal.addEventListener('abort', onAbort, {once: true})
  // Replay: if signal aborted between dialProtocol returning and addEventListener
  // attaching, onAbort would never fire (AbortSignal doesn't replay). Run it once
  // synchronously so the stream is torn down before stream.send().
  if (signal.aborted) onAbort()
}

Non-blocking; the race is narrow and not directly the original bug. But this is the exact kind of edge case that surfaces in live tests months later.

// (c) at least one agent_message_chunk exists,
// then we can reconstruct the turn result as "completed but integrity-
// degraded" — the responder said it's done (via the signed terminal), and
// the chunks were transported under the same authenticated libp2p session.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (security documentation, low — but load-bearing for future reviewers): the fallback-rationale comment at lines 332-347 says the chunks "were transported under the same authenticated libp2p session." That's true but it's the peer-authentication property of Noise XX, not what defends against drop/reorder of the unsigned chunks within that session. The load-bearing security claim for the degraded path is:

  1. Noise XX binds the peer identity — only Bob's daemon could have emitted these chunks.
  2. The seq-monotonicity check at lines 320-328 — checked BEFORE the fallback runs — rejects any frame stream where seq isn't strictly increasing, so a tampering MITM (who somehow has Bob's Noise key) couldn't reorder/drop chunks, and Bob himself can't accidentally lose a chunk without the dialer noticing.

The signed stream_end then anchors "Bob says the turn ended at seq=N" to the L2 key — but the content of seqs 1..N-1 is bound only by the combination of (Noise peer-auth + seq monotonicity). A one-sentence note pointing at the seq-check as the in-session integrity anchor would make the proof airtight for future reviewers.

Separate / heavier — operator visibility: integrityDegraded=true now propagates through TurnDelivery and is visible only via brv channel show --json. For the most safety-critical state in this entire codepath (turn succeeded but the cryptographic binding is missing), a text-mode banner — e.g. brv channel show prints ⚠ integrity-degraded next to the turn-id — would be safer than relying on operators to remember to add --json | jq .integrityDegraded. Operators only think to check JSON when they suspect a problem; this is the case where they don't know to suspect.

Non-blocking; the degraded-fallback logic itself is sound.

export function parseParleyTimeoutEnv(env: Record<string, string | undefined>): ParleyTimeoutConfig {
const dialTimeoutMs = readPositiveIntEnv(env.BRV_BRIDGE_PARLEY_DIAL_TIMEOUT_MS) ?? DEFAULT_DIAL_TIMEOUT_MS
const idleTimeoutMs = readPositiveIntEnv(env.BRV_BRIDGE_PARLEY_TURN_IDLE_TIMEOUT_MS) ?? DEFAULT_IDLE_TIMEOUT_MS
return {dialTimeoutMs, idleTimeoutMs}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (config sanity, low): parseParleyTimeoutEnv silently accepts any positive integer for both values, including combinations where dialTimeoutMs >= idleTimeoutMs. The two timers compose temporally — the idle timer doesn't start until onDialComplete fires (see remote-member-driver.ts:191-207) — so an operator who sets

export BRV_BRIDGE_PARLEY_TURN_IDLE_TIMEOUT_MS=10000

without overriding the default BRV_BRIDGE_PARLEY_DIAL_TIMEOUT_MS=30_000 ends up with a 30s dial timeout that the idle timer is structurally incapable of catching (the dial finishes after the idle threshold the operator wanted). The values "work" in the sense that the timers fire, but the operator's intent is silently misapplied.

Cheap fix — warn at parse time when dialTimeoutMs >= idleTimeoutMs:

if (dialTimeoutMs >= idleTimeoutMs) {
  console.warn(
    `[parley-timeout] BRV_BRIDGE_PARLEY_DIAL_TIMEOUT_MS=${dialTimeoutMs}ms >= ` +
    `BRV_BRIDGE_PARLEY_TURN_IDLE_TIMEOUT_MS=${idleTimeoutMs}ms; the idle timer ` +
    `cannot catch slow dials.`,
  )
}

Non-blocking; defaults make this unreachable, but internal-test users who tune one value without the other will hit it silently.

return {afterSeq: 0, turn: args.turn}
}

return {afterSeq: args.afterSeq, turn: args.turn}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (dead-return, low): resolveReplayCursor returns {afterSeq, turn} but the only caller (line 183) destructures only afterSeq. The returned turn is identical to the input turn on both branches (the function never modifies it). Either drop turn from the return signature (it's redundant) or have the caller use the destructured turn for the downstream flags.turn access — that would defend against future refactors where the function might want to normalize turn too. The current shape passes the linter but invites the next contributor to ignore one return field.

Functional behaviour is correct; this is shape polish only.

…ibp2p observability

Live retest on 2026-05-24 (S5 turn, 17m 48s wall-clock) hit failure mode #2
from the original bug report. Phase 9.5.7's degraded-completion fallback
required a stream_end frame as pre-condition, but the responder never got
to emit one — the libp2p connection torn down mid-stream while the
subprocess was still working. This slice closes the gap with three fixes.

Fix A — responder terminal-send hardening
  New sendFrameDiagnostic() helper in parley-server.ts wraps every
  terminal-frame send (stream_end success path, error path, transcript_seal
  on both paths) in try/catch with structured logging including channelId
  and turnId. Replaces the previous bare sendFrame calls at the terminal
  emission sites. Failures are logged but never re-thrown — work product
  is on disk and the dialer's fallback (Fix B) covers the wire-level loss.

Fix B — dialer second-tier fallback for "no terminal at all"
  parley-client.ts:verifyResponseStream gains a verifyNoSealFallback that
  handles two cases the original 9.5.7 fallback couldn't:
  1. chunks + signed error + no seal — verify the signed error terminal
     using the same payload binding as verifyResponseTerminal (codex
     round-3 constraint: bind channel_id, delivery_id, protocol,
     request_envelope_hash, seq, turn_id, terminal_payload). On valid
     signature, return endedState='errored' with errorCode/errorMessage
     populated, integrityDegraded=true, sealOrigin='implicit-from-signed-
     terminal'. On invalid signature, throw.
  2. chunks + no terminal at all — second-tier soft completion. The
     chunks were transported under the same authenticated libp2p session
     but no responder-signed terminal arrived. Return sealOrigin=
     'implicit-from-stream-eof', terminalMissing=true, integrityDegraded=
     true. Codex round-1 caveat: this trusts unsigned chunks; the
     terminalMissing marker is the guardrail — operators MUST surface
     this clearly.

  Schema additions (additive only) on TurnDelivery: sealOrigin literal
  union gains 'implicit-from-stream-eof'; new optional terminalMissing
  boolean field.

Fix C — libp2p connection-state observability
  libp2p-host.ts registers connection:open, connection:close, peer:connect,
  peer:disconnect listeners at boot time. Each logs peerId/addr/direction/
  status via console.warn ([libp2p] prefix). Plus a 30s setInterval that
  logs age/status for every open connection. Cleanup in stop(). This is
  the diagnostic instrumentation the original bug report's "what kills
  the connection at ~17min?" mystery needs — next retest produces an
  actionable timeline instead of guesses.

Marker propagation through driver + orchestrator + CLI (codex round-1 +
round-2 blockers):
  - remote-member-driver.ts: prompt() generator yield-order is now
    chunks → parley_integrity agent_meta → error frame surfacing (was:
    error frame surfacing first, dropping chunks + meta). For chunks+
    signed_error+no_seal turns, orchestrator now persists the integrity
    record into TurnDelivery BEFORE the delivery transitions to errored.
  - orchestrator.ts: existing delivery-write path now forwards sealOrigin,
    integrityDegraded, terminalMissing into the TurnDelivery snapshot
    when emitted via agent_meta with subKind='parley_integrity'.
  - bridge/ping.ts: non-JSON output emits an operator-visible warning
    block when result.integrityDegraded or result.terminalMissing is
    true. JSON output exposes the fields unchanged.

Tests: 8732 passing (was 8713; +19 new across the three fixes). Lint +
build clean.

Codex review: 3 rounds for 9.5.8 (impl r1 blocker, impl r2 blocker on
yield-order, r3 sign-off no findings).

Driving incident: plan/channel-protocol/BUG_REPORT_PARLEY_TIMEOUTS_2026-05-24.md
Plan context: plan/bridge-smoothness/PARLEY_TIMEOUT_FIXES.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 24, 2026

Claude finished @ngduyanhece's task in 4m 50s —— View job


Re-review — Channel + Bridge protocol (synchronize, 2026-05-24 07:24 UTC push)

Overall: 4 / 5 (unchanged). Phase 9.5.8 closes the failure mode #2 gap from the 2026-05-24 live-test bug report with a sound three-fix structure (responder terminal hardening, dialer second-tier fallback, libp2p observability). The cryptographic discipline of 9.5.7 carries through — the new fallback verifies the signed error frame using the same payload binding as the normal verifyResponseError path. Findings below are robustness and semantic-clarity items, not blockers.

  • Identify commits added since prior review (2026-05-24 05:20 UTC)
  • Diff the new commits
  • Spot-check new code (terminal-send hardening, dialer fallback ladder, libp2p observability, marker propagation)
  • Verify prior-round findings (any addressed?)
  • Post inline findings (5 buffered)
  • Post final summary + score

Diff since 2026-05-24 05:20 UTC

SHA Change
c4ee1135 Phase 9.5.8 — close TRANSCRIPT_TERMINAL_MISSING gap + libp2p observability. Three coordinated fixes for failure mode #2 from the live retest. Fix A (parley-server.ts): new sendFrameDiagnostic() wraps every terminal-frame send (success-path stream_end, error-path error frame, transcript_seal on both paths) in try/catch with structured [parley-server] Failed to send <label> for turn=... (channelId=...): <err> logging. Failures are logged but not re-thrown — work product is on disk and the dialer's fallback covers the wire-level loss. Fix B (parley-client.ts:verifyResponseStream is now async, verifyNoSealFallback handles two new branches): (1) chunks + signed error + no seal → verify the error terminal using the same payload binding as verifyResponseError, return endedState='errored' with errorCode/errorMessage populated, integrityDegraded=true, sealOrigin='implicit-from-signed-terminal'; (2) chunks + no terminal at all → second-tier soft completion: return endedState='completed' with sealOrigin='implicit-from-stream-eof', terminalMissing=true, integrityDegraded=true. Schema additions on TurnDelivery: sealOrigin enum gains 'implicit-from-stream-eof'; new optional terminalMissing: boolean. Fix C (libp2p-host.ts): connection:open / connection:close / peer:connect / peer:disconnect listeners log via console.warn + a 30s setInterval heartbeat logs age/status for every open connection; cleanup in stop(). Marker propagation: RemoteMemberDriver.prompt() yield-order is now chunks → parley_integrity agent_meta → error frame surfacing (was: error frame first, dropping chunks + meta). ChannelOrchestrator extracts markers from the parley_integrity event into TurnDelivery. bridge ping text mode emits a ⚠ integrity-degraded warning when integrityDegraded or terminalMissing is set.

Total delta: 13 files, +1215 / -120. +19 new tests across the three fixes.


What's good

  • Fix B's signed-error verification path uses the same payload binding (channel_id, delivery_id, protocol, request_envelope_hash, seq, turn_id, terminal_payload) as the normal verifyResponseError path. On invalid signature → throws TRANSCRIPT_TERMINAL_MISSING. The fallback never silently trusts an under-bound terminal. (parley-client.ts:406-440)
  • Yield-order fix in remote-member-driver.ts:303-329 is the right shape. Chunks + the parley_integrity agent_meta marker yield BEFORE the error-frame surfacing throw, so the orchestrator persists both before the delivery transitions to errored. The prior order would have dropped chunks + integrity meta for any chunks+signed_error+no_seal turn.
  • Marker propagation flows end-to-end: driver → orchestrator → on-disk TurnDeliverybrv channel show --jsonbrv bridge ping text-mode ⚠ integrity-degraded warning. The operator-visibility chain that the prior review flagged as missing for --json is now also surfaced in text mode (ping.ts:formatPingResult).
  • sendFrameDiagnostic discipline is consistent: every terminal/seal send inside dispatchResponseStream now logs + swallows. parley-server.ts:540-554 (error path) and parley-server.ts:571-585 (success path) are symmetric — this directly addresses the prior round's parley-server.ts:538 finding about asymmetric error-path protection.
  • Test coverage is proportionateparley-client-verify.test.ts gains 5 new cases for Fix B (signed-error positive, signed-error forged-sig negative, no-terminal-at-all soft completion, 9.5.7 path unchanged regression, empty-frames throw); orchestrator-phase2.test.ts gains 2 marker-persistence integration tests; ping.test.ts is new (4 cases) for the text-mode warning; libp2p-host.test.ts exercises connection-open/close listener wiring.
  • Heartbeat timer cleanup in stop() (lines 357-361) is correct shape — clearInterval before node.stop() so the closed-node reference isn't accessed.

Inline findings (5 buffered)

  1. parley-server.ts:870-871writeErrorTerminal (pre-dispatch verifier-reject path) still uses bare sendFrame. Fix A only closed the gap for dispatchResponseStream; the verifier-reject paths (IMPLEMENTATION_THROW / verifyResponseStream failure / CHANNEL_AUTO_PROVISION_DECLINED) are the most likely scenarios for a dialer-already-giving-up tear-down and are now the asymmetric remainder. Same sendFrameDiagnostic helper would close it.
  2. parley-client.ts:459-477 — second-tier fallback labels an unknown completion state as endedState: 'completed'. The path is engaged when chunks arrived but neither stream_end nor error did — meaning we cannot distinguish "actually completed, libp2p torn down at the wrong instant" from "actually errored" from "actually still generating". terminalMissing=true is the only differentiator, and downstream consumers that destructure endedState (metrics, future auto-actioners) will read this as authoritative success. Recommend introducing endedState: 'unknown' (preferred) or reusing 'cancelled' for the v1.0 hardening pass — internal-test ship can absorb the optimistic label because formatPingResult surfaces the warning.
  3. parley-client.ts:280-282Promise.race([iterator.next(), abortPromise]) correctly unblocks on abort but the in-flight iterator.next() promise is never settled and the iterator is never .return()'d. Underlying network buffers / partial decode state aren't released until the source finalizes, which in the abort-during-read scenario may never happen. Cheap fix: iterator.return?.().catch(() => {}) in the abort listener (or try { ... } finally { iterator.return?.()... } around the loop).
  4. libp2p-host.ts:420-435 — two concerns on the new connection heartbeat: (a) log volume: 30s heartbeat fires unconditionally with console.warn (= daemon stderr); a daemon holding one Tailscale connection during a multi-hour session emits ~120 lines/hour. Gate behind BRV_BRIDGE_LIBP2P_HEARTBEAT_LOG=1 or lower to console.debug. (b) robustness: heartbeat closure doesn't catch node.getConnections() errors; an already-queued tick that fires after clearInterval but before node.stop() finishes could throw. Wrap body in try/catch.
  5. orchestrator.ts:1517-1540 — implicit driver contract: marker propagation is hard-coded against payload.kind === 'agent_meta' && payload.subKind === 'parley_integrity' with Record<string, unknown> field extraction. A future driver that wants to surface integrity markers must match the exact subKind + field names or the markers silently disappear. Extract a typed ParleyIntegrityMarkers interface in shared/types/channel.ts, OR add a // Driver contract: see RemoteMemberDriver.prompt():323-326 anchor comment.

Prior-round findings — status

c4ee1135 addressed one prior round finding directly: the parley-server.ts:538 asymmetric error-path protection from the 2026-05-24 05:29 UTC review is now closed (both errorFrame and seal sends inside dispatchResponseStream go through sendFrameDiagnostic). The new finding above (writeErrorTerminal's verifier-reject path) is structurally the same shape on a different code path — call it a partial close.

All other prior findings still open as written (master-key colocation, NFC normalisation writer-only, unbounded Map family, process-lock doc/code drift, non-unique tmp paths, discovery.py bool-int, channel-skill/install-lib.js directory accept, claude-code-headless-adapter.ts env-passthrough + unbounded stderrChunks + flushStdoutBuffer race + doc/code drift + missing --verbose test, parley-adapter-registry.ts UNSAFE-without-deps logs-and-continues, bridge-config-store.ts listenAddrs not multiaddr-refined, ai-sdk-content-generator.ts synthetic usage masking, INTERNAL_TEST.md §3.1 / §3.5 / §5 / §5.1 doc trio, Phase 9.5.7 quartet from the 05:29 UTC review). Consistent with the PR's "internal-test cut, follow-ups tracked" framing.


Verdict for this push

LGTM-incremental. Phase 9.5.8 is a high-quality fix-pack with the same proportionate test discipline as 9.5.7. The signed-error fallback's payload binding is rigorous; the marker-propagation chain is plumbed end-to-end; the libp2p observability instrumentation will materially help the next live-test debug session. The 5 inline findings are robustness/clarity polish (verifier-reject path symmetry, optimistic endedState labeling, iterator cleanup on abort, heartbeat log volume, implicit driver contract on markers), not ship blockers.
· branch

Comment on lines +870 to +871
await sendFrame(stream, errorFrame)
await sendFrame(stream, seal)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (asymmetric protection, low-medium — same shape as the prior round's §3.2 Layer B finding, this PR's Fix A only partially closes it): §9.5.8 Fix A wrapped every terminal-frame send inside dispatchResponseStream in sendFrameDiagnostic (lines 541, 572, 583), but the pre-dispatch verifier-reject path at lines 235-242 (IMPLEMENTATION_THROW), 260-267 (verifyResponseStream failure), and 310-322 (CHANNEL_AUTO_PROVISION_DECLINED) calls into writeErrorTerminal, which still uses bare sendFrame here at lines 870-871.

These are the exact code paths most likely to hit a torn-down stream — the dialer who hit a verifier failure may already be in the process of giving up. If sendFrame throws here, the rejection propagates up through the libp2p handle callback registered at host.handle(PARLEY_QUERY_PROTOCOL, async (stream) => {...}) (line 202), and the entire diagnostic-visibility intent of Fix A is bypassed for the verifier-reject paths.

Cheap fix — wrap both calls with the same sendFrameDiagnostic helper, threading channel_id + turn_id from bound (or use sentinel "unknown" for the no-parse path):

await sendFrameDiagnostic({
  channelId: bound.channel_id,
  frame: errorFrame,
  label: 'error terminal frame (verifier reject path)',
  stream,
  turnId: bound.turn_id,
})
await sendFrameDiagnostic({
  channelId: bound.channel_id,
  frame: seal,
  label: 'transcript_seal frame (verifier reject path)',
  stream,
  turnId: bound.turn_id,
})

Non-blocking — the operator-visibility intent of Fix A is what's at stake, not work-product loss (the verifier-reject path has no on-disk artifact to preserve). But the inconsistency invites the next contributor to copy the wrong pattern.

Comment on lines +459 to +470
if (chunks.length > 0 && streamEndFrame === undefined && errorFrame === undefined) {
args.log?.(
`[parley-client] No transcript_seal AND no stream_end AND no error frame received for turn=${args.expectedTurnId} ` +
`(channelId=${args.expectedChannelId}). ${chunks.length} chunk(s) salvaged under the ` +
`authenticated libp2p session. Returning soft completion (sealOrigin=implicit-from-stream-eof, ` +
`terminalMissing=true, integrityDegraded=true). Likely cause: dialer libp2p connection ` +
`torn down before responder could emit terminal frame.`,
)
const content = chunks.map((f) => (f as {content: string}).content).join('')
return {
content,
endedState: 'completed' as const,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (semantic correctness, medium — the most semantically loaded choice in 9.5.8): the second-tier "no terminal at all" fallback returns endedState: 'completed' for what is genuinely an unknown completion state. The path is engaged when chunks arrived but neither stream_end NOR error did — meaning we cannot distinguish:

  1. Responder generated a partial answer, was about to emit stream_end, libp2p tore down at the wrong instant → actually completed.
  2. Responder errored mid-generation, was about to emit a signed error frame, libp2p tore down at the wrong instant → actually errored.
  3. Responder was still generating, libp2p tore down before the rest of the answer → actually cancelled / incomplete.

Labeling all three as endedState: 'completed' (with terminalMissing: true as the only guardrail) biases toward optimism. The formatPingResult warning at ping.ts:46-50 correctly surfaces the degradation when an operator runs brv bridge ping in text mode, but any other downstream consumer that simply destructures endedState — metrics, brv channel show in default mode, future code that auto-actions on completion — will read this as authoritative success.

Two reasonable fixes:

  1. Introduce a distinct endedState: 'unknown' (preferred — lexically unambiguous):

    endedState: 'unknown' as const,

    and update TurnDeliverySchema in src/shared/types/channel.ts to extend the enum.

  2. Re-use 'cancelled' (the existing wire-protocol value closest to "responder didn't finish on the record") and document this as the convention.

The current shape — 'completed' with a terminalMissing flag — is fine for the internal-test ship if every consumer is audited to check the flag. For follow-up work it's worth tightening to make the unknown-completion case structurally undeniable rather than relying on a Boolean side-channel.

Non-blocking for internal-test; flag for the v1.0 hardening pass.

Comment on lines +276 to +282
// Use the iterator protocol explicitly so we can race .next() against abort.
const decoded = lp.decode(source as AsyncIterable<Uint8Array>)
const iterator = decoded[Symbol.asyncIterator]()
while (true) {
// Race: either the next length-prefixed chunk arrives, or the signal fires.
// eslint-disable-next-line no-await-in-loop
const next = await Promise.race([iterator.next(), abortPromise])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (resource cleanup, low): Promise.race([iterator.next(), abortPromise]) correctly unblocks the loop on abort, but the in-flight iterator.next() promise is never settled and never explicitly closed. If the underlying length-prefixed source's iterator is holding network buffers, partial decode state, or a stream reference, those aren't released until the source's own finalization runs — which, in the abort-during-read scenario this code is built for, may never happen because the connection went away.

Cheap fix — formally close the iterator on abort:

const abortPromise: Promise<never> =
  signal === undefined
    ? new Promise<never>(() => {})
    : new Promise<never>((_, reject) => {
        signal.addEventListener(
          'abort',
          () => {
            // Formally close the iterator so any held resources release.
            // Errors here are swallowed — we're already aborting.
            iterator.return?.().catch(() => {})
            reject(abortReasonAsError(signal))
          },
          {once: true},
        )
      })

Note: iterator would need to be hoisted out of the while setup at line 277-278 to be in scope here. Alternatively, wrap the loop in try { ... } finally { iterator.return?.().catch(() => {}) } so the iterator is closed on all loop exits (abort, seal, parse-fail, source-EOF), not just abort.

Non-blocking; the leak is bounded to the lifetime of the libp2p stream tear-down, which is itself bounded. But this is the kind of pattern that compounds badly under load.

Comment on lines +420 to +435
// Periodic connection heartbeat every 30 s — logs all open connections
// so post-mortem analysis can reconstruct the connection timeline.
const {node} = this
this.connectionHeartbeatTimer = setInterval(() => {
const conns = node.getConnections()
if (conns.length === 0) return
for (const conn of conns) {
const age = Date.now() - conn.timeline.open
console.warn(
`[libp2p] heartbeat peer=${conn.remotePeer.toString()} ` +
`age=${age}ms ` +
`status=${conn.status} ` +
`direction=${conn.direction}`,
)
}
}, 30_000)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (operator UX / log volume, low — and one robustness gap):

  1. Log volume. The 30s heartbeat fires unconditionally and emits one console.warn line per open connection. For internal-test (rare, short-lived turns) this is fine — the commit message frames it as "the diagnostic instrumentation the original bug report's '17-minute abort mystery' needs." But for a long-running daemon that holds a single Tailscale connection open during a multi-hour session, that's 120 lines per hour per connection in the server-*.log, on top of the existing connection:open / connection:close listeners that already give event-driven coverage of the same data. Recommend either:

    • Gate behind a BRV_BRIDGE_LIBP2P_HEARTBEAT_LOG=1 env flag (default off), so operators chasing the 17-min abort can opt in.
    • Lower to console.debug and let the standard logging filter handle it.
  2. Robustness gap. The heartbeat closure captures const {node} = this (line 422) but doesn't guard against node.getConnections() throwing if the body runs after stop() has begun. There's a small race window: an interval callback that was already queued before clearInterval(...) in stop() will still execute, and by the time it does, await node.stop() may be in flight. Wrap the body in try/catch:

    this.connectionHeartbeatTimer = setInterval(() => {
      try {
        const conns = node.getConnections()
        if (conns.length === 0) return
        for (const conn of conns) { /* … */ }
      } catch (error) {
        // Race: stop() invalidated the node between clearInterval and this tick.
        // Drop silently; the heartbeat timer is observability-only.
      }
    }, 30_000)

Non-blocking; both are polish items. The 17-minute-mystery use case is served fine by the event listeners alone, which don't poll.

Comment on lines +1519 to +1540
// §9.5.8 Blocker 2 — propagate integrity-degraded markers from the
// remote-member driver into the delivery record so they are persisted
// to disk and visible in `brv channel show <turnId> --json`.
// The driver emits an `agent_meta` event with subKind='parley_integrity'
// immediately after yielding chunks when sealOrigin !== 'explicit'.
if (
payload.kind === 'agent_meta' &&
payload.subKind === 'parley_integrity' &&
typeof payload.payload === 'object' &&
payload.payload !== null
) {
const markers = payload.payload as Record<string, unknown>
if (typeof markers.sealOrigin === 'string') {
delivery.sealOrigin = markers.sealOrigin as typeof delivery.sealOrigin
}

if (typeof markers.integrityDegraded === 'boolean') {
delivery.integrityDegraded = markers.integrityDegraded
}

if (markers.terminalMissing === true) {
delivery.terminalMissing = true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (coupling, low): the integrity-marker propagation is hard-coded against a single event shape — agent_meta events with subKind === 'parley_integrity' — and the field names (sealOrigin / integrityDegraded / terminalMissing) are extracted via Record<string, unknown> rather than a typed payload. Two consequences:

  1. Implicit driver contract. If a future driver (kimi-acp, a yet-unwritten quorum adapter, anything in the cross-machine wiring after 9.6/9.7/9.9) wants to surface integrity-degraded markers, it MUST emit precisely this subKind with precisely these field names — or the marker silently disappears. There's no compile-time check.

  2. Field-name drift. A typo on the driver side (e.g. sealOrigin vs seal_origin) is silent — the check typeof markers.sealOrigin === 'string' just evaluates false and skips.

Cheap fixes — pick one:

  • Defensive (recommended for the internal-test cut): extract a typed ParleyIntegrityMarkers interface (it'd live alongside TurnDelivery in shared/types/channel.ts), and have the driver yield a discriminated-union payload typed against that. The orchestrator would then narrow with a type guard rather than Record<string, unknown> lookups.

  • Light (single comment): add a // Driver contract: see RemoteMemberDriver.prompt():323-326 for the canonical emit shape. block-comment here so future readers of either side know the coupling exists.

Non-blocking; the actual marker propagation is correct for the only driver that emits them today.

ngduyanhece and others added 2 commits May 25, 2026 00:27
Driven by 2026-05-24 investigation where the laptop daemon ran pre-9.5.7
code for hours because Node's require() cache held stale modules even
after npm run build updated dist. Every live test was against the wrong
code. ~6 hours lost to misdiagnosis. This slice closes that class of
foot-gun plus six other channel-state robustness improvements.

Seven fixes (reviewed by codex across 6 rounds + kimi second-eyes):

§2.1 Build-version stamping + runtime mismatch warning
  - scripts/generate-build-info.ts writes dist/build-info.json as runtime
    artifact (NOT compiled-in constant). Build chain reordered:
    rm -rf dist → generate-build-info → tsc → cp templates → build:ui.
    Fields: buildId (canonical compare key), buildAtIso, gitSha, gitDirty,
    packageVersion.
  - Daemon reads file at startup, exposes via system:build-info handler.
  - CLI reads same file at process boot. On first connection, calls
    system:build-info, compares buildId, prints loud stderr warning once
    per CLI process on mismatch. Wired in daemon-client.checkBuildVersionOnce.
  - Pure utilities in src/shared/build-info-check.ts (no transport/server
    imports — clean layer per codex round-2).

§2.2 Post-build daemon-staleness check
  - scripts/check-daemon-staleness.ts runs as postbuild. Reads daemon.json,
    warns if running daemon predates the build by >60s.
  - Opt-in fatal mode via BRV_BUILD_STRICT=1.

§2.4 listChannels skip-not-fail tolerance
  - channel-store.ts:listChannels mirrors runProjectWarm's Promise.allSettled
    pattern. Single malformed legacy meta no longer fails the whole call.
    Closes the documented orchestrator.ts:2126-2132 TODO.

§2.7 Persist all BRV_BRIDGE_* env vars to bridge-config.json
  - bridge-config-store.ts gains optional fields: claudeUnsafe,
    parleyDialTimeoutMs, parleyTurnIdleTimeoutMs, parleyHardTimeoutMs,
    autoCreateQuota.
  - Consumers wired with env > persisted > default precedence:
    - createDefaultRegistry accepts persisted claudeUnsafe.
    - RemoteMemberDriver constructor accepts persistedTimeouts.
    - AutoCreateQuota construction reads bridgeRuntime.autoCreateQuota.

§2.5 Accept-but-mark partial auto-create as inbound-only
  - addressability enum gains 'inbound-only' literal alongside
    'bootstrap-only' and 'pinned'.
  - bridge-transcript-service auto-create writes the marker when
    multiaddr OR remoteL2PubKey is missing (was: refuse the write).
  - channel-doctor surfaces new INBOUND_ONLY health code with copy-
    paste 'brv bridge connect' recovery hint.
  - orchestrator outbound-mention guard throws new
    BridgeInboundOnlyMemberError (extends ChannelError, code
    CHANNEL_ERROR_CODE.BRIDGE_INBOUND_ONLY_MEMBER) instead of dialing.
  - Opportunistic startup migration in src/server/infra/channel/migrations/
    mark-inbound-only.ts upgrades existing partial records via
    channelStore.updateChannelMeta (existing meta-lock path).

§2.6 .brv/ lifecycle observability
  - brv-dir-watcher.ts uses fs.watch on the project's .brv/ with parent-
    dir watcher to catch .brv/ itself being deleted. Lifecycle events
    by default; verbose all-writes behind BRV_DEBUG_DIR_WATCH=1.
  - Log wording is OBSERVATIONAL not attributional (codex round-1 +
    kimi review): "OBSERVED deletion ... cause unknown — check daemon
    logs + external tools." Watcher cannot identify the deleter.

§2.3 Dev-mode auto-restart convenience
  - npm run rebuild = npm run build && npm run dev:kill.

Production wiring: channel-project-startup.ts hooks the migration +
watcher into daemon startup once per project on first connection.

DEFERRED from this slice (held for 9.5.10):
- reconstructMissingMetas — kimi review (turnId 7h-RAyyU6GEy0mRdjI9ay)
  flagged two data-corruption vectors:
  (1) TOCTOU between access(metaPath) and rename(tempFile) — a real
      turn firing in the gap would have its meta clobbered by the empty
      stub. Fix needs writeFile {flag:'wx'} or a write lock.
  (2) Reconstructed members:[] is a lie for channels that had members.
      Should either reconstruct members from turn snapshot records or
      flag the stub as 'reconstructed-stub'.
  Function + tests stay in tree (channel-meta-reconstruction.ts) but
  are unwired from production. 9.5.10 picks up the fix.

Tests: 8797 passing (was 8732; +65 net new across two impl passes).
4 pending (1 newly skipped: the reconstruction test). Lint clean.
TS strict + build clean.

Reviewer trail:
- Codex: 6 rounds (plan r1/r2/r3 + impl r1/r2/r3, signed off).
- Kimi: second-eyes (turnId 7h-RAyyU6GEy0mRdjI9ay) caught the
  reconstruction TOCTOU + members lie that codex missed.

Plan: plan/bridge-smoothness/PHASE_9_5_9.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…k test

Companion test file for assertBuildVersionMatch — covers mismatch warning,
silent match, and graceful degradation when CLI build-info.json is missing
or the daemon returns no buildId. Was inadvertently omitted from a9c4762.

4 tests, all passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ngduyanhece ngduyanhece requested a review from DatPham-6996 as a code owner May 24, 2026 23:47
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 24, 2026

Claude finished @ngduyanhece's task in 6m 29s —— View job


Re-review — Channel + Bridge protocol (synchronize, 2026-05-24 23:47 UTC push)

Overall: 4 / 5 (unchanged). Phase 9.5.9 is a tightly-scoped fix-pack motivated by the "6-hour lost-to-stale-require()-cache" diagnostic incident — the build-info stamping + runtime mismatch warning is the structural fix; the migration to addressability='inbound-only' cleanly addresses the partial-auto-create UX gap; listChannels skip-not-fail mirrors the runProjectWarm pattern correctly. The deferred reconstruction function (with kimi-flagged TOCTOU + members:[] corruption vectors) is correctly held out of production. Findings below are robustness / observability polish, not blockers.

  • Identify commits added since the prior review (2026-05-24 07:24 UTC)
  • Diff the new commits
  • Spot-check new code (Phase 9.5.9: build-info, listChannels, inbound-only, bridge config persistence, dir watcher)
  • Verify prior-round findings (any addressed?)
  • Post inline findings (5 buffered)
  • Post final summary + score

Diff since 2026-05-24 07:24 UTC

SHA Change
a9c4762b Phase 9.5.9 — anti-stale-state + channel robustness. Seven coordinated fixes for the 2026-05-24 live-test incident where the laptop daemon ran pre-9.5.7 code for hours due to Node's require() cache holding stale modules even after npm run build updated dist. §2.1 build-info stamping + runtime mismatch warning (centralized assertBuildVersionMatch in shared/build-info-check.ts, daemon exposes system:build-info, CLI compares once-per-process via daemon-client.checkBuildVersionOnce). §2.2 postbuild daemon-staleness check (scripts/check-daemon-staleness.ts, opt-in fatal via BRV_BUILD_STRICT=1). §2.4 listChannels Promise.allSettled tolerance (closes the orchestrator.ts:2126-2132 TODO). §2.5 new addressability='inbound-only' literal — bridge-transcript-service writes the marker when multiaddr OR L2 pubkey is missing (was: refuse), channel-doctor surfaces DOCTOR_INBOUND_ONLY, orchestrator's outbound-mention guard throws BridgeInboundOnlyMemberError extends ChannelError, opportunistic startup migration in mark-inbound-only.ts (routes through channelStore.updateChannelMeta() to use the per-channel write-serializer lock). §2.6 BrvDirWatcher observability (dual recursive + parent-dir, observational-only wording per codex+kimi review). §2.7 persist 5 more BRV_BRIDGE_* env vars to bridge-config.json (claudeUnsafe, parleyDialTimeoutMs, parleyTurnIdleTimeoutMs, parleyHardTimeoutMs, autoCreateQuota) — RemoteMemberDriver, createDefaultRegistry, createAutoCreateQuota all gain persisted* arg with env > persisted > default precedence. §2.3 npm run rebuild convenience. Reconstruction held for 9.5.10 — kimi-flagged TOCTOU + members:[] corruption vectors documented in channel-project-startup.ts:21-35; function + tests stay in tree but unwired.
0b1a02b6 Companion test file (test/unit/shared/build-info-runtime-check.test.ts, 4 cases) for §2.1 runtime build-info-check — inadvertently omitted from a9c4762b.

Total delta: 33 files, +3159 / -23. 8797 tests passing (was 8732 in prior round); 1 newly skipped (reconstruction test, with rationale).


What's good

  • Build-version mismatch detection is the right structural fix for the require()-cache foot-gun. Stamping buildId into dist/build-info.json at build time and reading it from both ends (daemon at boot, CLI at first connection) sidesteps any "did dist actually get regenerated" / "did the daemon hot-reload" / "what version is the in-memory module cache holding" uncertainty. The shared/build-info-check.ts module is layer-clean (no transport/oclif/server imports, per codex round-2). The system:build-info handler returns the boot-time value, not a re-read — exactly correct for "is the daemon's in-memory cache stale" semantics. (brv-server.ts:230-258, daemon-client.ts:37-63)
  • assertBuildVersionMatch degrades gracefully in two paths that matter: daemon doesn't return a buildId (e.g. pre-9.5.9 build, deployment skew) → silent return; CLI's dist/build-info.json is absent (e.g. dev-mode bin/dev.js without a prior npm run build) → silent return. Neither path warns spuriously. (build-info-check.ts:111-129)
  • CLI buildInfoPath fix in brv-server.ts:251 — codex round-2 caught a path-segment count bug (4 segments resolved to repo-root/build-info.json instead of dist/build-info.json). The corrected ..\..\..\ + sanity comment is a small but reliability-critical detail.
  • addressability='inbound-only' is the right discriminator. Previously, partial auto-create either refused the write (losing the turn record) or wrote a partial member that silently produced dial failures. Adding the explicit 'inbound-only' literal threads through: schema enum gains it (channel.ts:227-228), bridge-transcript-service.ts:478-485 writes the marker, the doctor surfaces DOCTOR_INBOUND_ONLY with a copy-pasteable recovery hint (doctor-service.ts:143-160), and the orchestrator's outbound-mention guard throws BridgeInboundOnlyMemberError (extends ChannelError per codex impl r2 fix at errors.ts:563). The error's top-level .code === 'BRIDGE_INBOUND_ONLY_MEMBER' propagates instead of being buried in .details.code — operators grep'ing find it directly.
  • runMarkInboundOnlyMigration routes through channelStore.updateChannelMeta when a store is supplied (Issue 4 fix from codex §4), using the per-channel write-serializer lock for atomicity. The direct-FS fallback is preserved for callers without a full ChannelStore. Idempotent on already-marked entries.
  • listChannels Promise.allSettled skip-not-fail is a faithful port of the runProjectWarm pattern (channel-store.ts:175-200). Logs the malformed-meta event so operators can investigate. Mirrors the existing convention rather than inventing a new error-handling shape.
  • Deferring reconstructMissingMetas is the right call. The TOCTOU between access(metaPath) and rename(tmp, metaPath) is a real corruption vector, and members: [] would surface a channel as legitimately empty post-reconstruction. Keeping function + tests in-tree (with it.skip rationale) lets 9.5.10 pick up cleanly. The 4th await Promise.allSettled over turn-file reads is the correct fix for the round-1 await-in-loop finding.
  • Persisted env-var fields (autoCreateQuota, claudeUnsafe, parleyDialTimeoutMs, parleyTurnIdleTimeoutMs, parleyHardTimeoutMs) are each correctly type-narrowed (z.number().int().positive() and z.boolean()). The env > persisted > default precedence is implemented consistently at each consumer (driver factory, registry, auto-create quota).
  • Test coverage is proportionate — 14 new unit-test files (~1380 LOC) across build-info, daemon-staleness, channel-store, bridge-config schema, inbound-only error path, migration v1+v2, dir-watcher, channel-project-startup wiring, and the deferred reconstruction (with kimi-rationale skip).

Inline findings (5 buffered)

  1. brv-dir-watcher.ts:62-96 — fresh-project observability gap. Recursive watcher attach fails silently when .brv/ doesn't exist at start() time; parent-dir watcher fires on later .brv/ create but has no else-branch to re-attach. Net: a daemon that starts before .brv/ exists permanently has zero recursive observability — the entire §2.6 intent silently fails for the most common fresh-project case.
  2. channel-project-startup.ts:67 — "BrvDirWatcher started" log emits unconditionally after start() swallows attach errors. Companion to the prior finding; misleads operators chasing observability gaps.
  3. bridge-config-store.ts:48listenAddrs still validates as bare z.array(z.string().min(1)). The prior-round multiaddr-refinement gap remains while §2.7 adds 5 new correctly-narrowed fields, making the inconsistency more glaring.
  4. brv-server.ts:1239-1243 — per-invite resolveBridgeRuntimeConfig causes redundant disk I/O on every remote-peer invite (one of FOUR independent BridgeConfigStore instances during startup). The values are process-stable; thread once at startup.
  5. channel-meta-reconstruction.ts:93 — function deferred to 9.5.10. The third issue (non-unique ${metaPath}.reconstruct.tmp — same shape as prior alias-store.ts:198 finding) belongs in the same fix-pack. The natural TOCTOU fix (writeFile {flag: 'wx'}) closes all three concerns in one pass.

Prior-round findings — status

a9c4762b and 0b1a02b6 did not touch any of the files flagged in the prior 7 review rounds. All prior findings still open as written (master-key colocation, NFC normalization writer-only, unbounded Map family across parley-nonce-lru.ts / parley-rate-limit.ts / auto-create-quota.ts, process-lock.ts doc/code drift, non-unique tmp paths in alias-store.ts + parley-adapter-session-store.ts, discovery.py bool-int, channel-skill/install-lib.js directory accept, claude-code-headless-adapter.ts env-passthrough + unbounded stderrChunks + flushStdoutBuffer race + doc/code drift + missing --verbose test, parley-adapter-registry.ts UNSAFE-without-deps logs-and-continues, bridge-config-store.ts listenAddrs not multiaddr-refined — now compounded with this round's finding #3, ai-sdk-content-generator.ts synthetic usage masking, INTERNAL_TEST.md §3.1/§3.5/§5/§5.1 doc trio, Phase 9.5.7 quartet, Phase 9.5.8 quintet). Consistent with the PR's "internal-test cut, follow-ups tracked" framing.


Verdict for this push

LGTM-incremental. Phase 9.5.9 turns a hard-won live-incident discovery (the stale-require()-cache foot-gun) into structural protection that future operators won't have to learn the hard way. The build-info stamping is small, well-tested, and the diagnostic warning's tone is exactly right ("daemon is running an older build" — observational, not attributional). The inbound-only literal closes a genuine UX gap from the partial-auto-create flow. The reconstruction deferral is correctly self-aware. The five inline findings are robustness / observability polish (fresh-project watcher gap, misleading log line, persistent listenAddrs validation parity, per-invite I/O, pre-emptive tmp-collision fix), not blockers.
· branch

Comment on lines +62 to +96
// ── Recursive watcher on .brv/ ──────────────────────────────────────
try {
this.brvWatcher = watch(brvDir, {recursive: true}, (eventType, filename) => {
if (eventType !== 'rename') return
if (filename === null) return

const norm = filename.replaceAll('\\', '/') // normalise Windows path seps

if (!isLifecyclePath(norm) && !verbose) return

const fullPath = join(brvDir, norm)
// Use access() to check existence without throwing.
access(fullPath, (err) => {
if (err === null) {
this.info(`[brv-dir] created ${norm}`)
} else {
const isChannelState =
norm.startsWith('context-tree/channel/') || norm === 'context-tree/channel'

if (isChannelState) {
// Codex round-2 wording — observational only, no attribution.
this.warn(
`[brv-dir] OBSERVED deletion of channel state at ${norm} ` +
`(daemon PID=${process.pid}); cause unknown — check daemon logs + ` +
`external tools (IDE sync, git operations, manual rm).`,
)
} else {
this.info(`[brv-dir] observed deletion of ${norm}`)
}
}
})
})
} catch {
// .brv/ may not exist yet — watcher will just not fire.
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (semantic, medium — observability silently degrades on fresh-project daemons): the dual-watcher design has a coverage gap when .brv/ doesn't exist at start() time:

  1. The recursive watcher (line 64) wraps watch(brvDir, {recursive: true}, ...) in try/catch and swallows the ENOENT that fs.watch throws when brvDir is absent. this.brvWatcher stays undefined. The catch comment frames this as benign (// .brv/ may not exist yet — watcher will just not fire), but…
  2. …the parent-dir watcher (line 100) at ${projectRoot}/ does fire when .brv/ is later created (the rename event), but its handler at line 104-112 only checks for .brv/ deletion (access(brvDir, (err) => { if (err !== null) warn(...) })). There is no else-branch to re-attach the recursive watcher when .brv/ is observed being created.

Net effect: on any daemon that runs runChannelProjectStartup before .brv/ exists (= every fresh project, every brv init flow, every brv channel new that creates the first channel against a daemon that had already started), the recursive watcher is never attached, and the operator-visible "BrvDirWatcher started" log from channel-project-startup.ts:67 is misleading. The §2.6 §"observability" intent silently fails for the most common first-time setup case.

Cheap fix — re-attach inside the parent-dir watcher's else-branch:

access(brvDir, (err) => {
  if (err !== null) {
    this.warn(`[brv-dir] OBSERVED deletion of ENTIRE .brv/ directory ...`)
    return
  }
  // .brv/ came into existence (or came back after a deletion) — attach the
  // recursive watcher now if it wasn't already attached.
  if (this.brvWatcher === undefined) {
    try {
      this.brvWatcher = watch(brvDir, {recursive: true}, /* ...same handler... */)
      this.info(`[brv-dir] recursive watcher attached on observed .brv/ create`)
    } catch (e) { /* race with re-deletion */ }
  }
})

Non-blocking for internal-test, but the failure mode is "operator sees BrvDirWatcher started, assumes observability is live, hours later wonders why no lifecycle events surfaced for a channel-state deletion." Worth tracking for v1.0 hardening.

// Step 2: start the .brv/ lifecycle watcher (observability only).
const watcher = new BrvDirWatcher({info: log, projectRoot, warn})
watcher.start()
log(`[channel-project-startup] BrvDirWatcher started for ${projectRoot}`)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (observability accuracy, low — companion to brv-dir-watcher.ts:62-96 finding): the log line emits BrvDirWatcher started for ${projectRoot} unconditionally after watcher.start(), but start() swallows fs.watch failures silently in two places (both try/catch in brv-dir-watcher.ts:62-94 and :99-114). When .brv/ doesn't exist at startup time, the recursive watcher never attaches; when projectRoot isn't watchable (rare but possible inside e.g. read-only fuse mounts or some sandboxes), the parent watcher fails too. In both cases, this log line still emits "started" — operators chasing observability gaps will conclude the watcher is alive when in fact it is silently dormant.

Two reasonable fixes:

  1. Surface attach status — have BrvDirWatcher.start() return {recursive: 'attached' | 'failed', parent: 'attached' | 'failed'} and emit a structured log line here:
    const status = watcher.start()
    log(`[channel-project-startup] BrvDirWatcher: recursive=${status.recursive}, parent=${status.parent} (projectRoot=${projectRoot})`)
  2. Log specific failures — keep start() void but have it call the injected info / warn callback when each attach succeeds/fails, so the operator sees:
    [brv-dir] recursive watcher attached on /path/to/proj/.brv
    [brv-dir] parent watcher attached on /path/to/proj
    
    …or the warn equivalent if either failed.

(Fix 1 is more testable; fix 2 is closer to the existing logging pattern in the file.) Non-blocking, but the observability claim should match reality.

Comment on lines +27 to +67
export const BridgePersistedConfigSchema = z
.object({
/**
* Phase 9.5.9 §2.7 — persist BRV_BRIDGE_AUTO_CREATE_QUOTA so daemon
* respawns without env inherit the operator-configured quota.
*/
autoCreateQuota: z.number().int().positive().optional(),
autoProvision: z.enum(['auto', 'pinned-only', 'deny']).optional(),
/**
* Phase 9.5.9 §2.7 — persist BRV_BRIDGE_CLAUDE_UNSAFE so a daemon
* respawn without BRV_BRIDGE_CLAUDE_UNSAFE in env does not silently
* fall back and fail Claude Code adapter registration.
*/
claudeUnsafe: z.boolean().optional(),
delegatePolicy: z.enum(['auto', 'prompt', 'deny']).optional(),
// libp2p listen multiaddrs the daemon-integrated bridge binds.
// DEFAULT_BRIDGE_CONFIG.listen_addrs is `['/ip4/127.0.0.1/tcp/0']`
// (loopback-only, ephemeral port). Cross-machine bridge needs an
// externally-routable address — operators set `BRV_BRIDGE_LISTEN_ADDRS`
// (comma-separated) to something like
// `/ip4/0.0.0.0/tcp/60001,/ip4/100.x.x.x/tcp/60001`.
listenAddrs: z.array(z.string().min(1)).min(1).optional(),
maxConcurrentPerProfile: z.number().int().positive().optional(),
/**
* Phase 9.5.9 §2.7 — persist BRV_BRIDGE_PARLEY_DIAL_TIMEOUT_MS
* (Phase 9.5.7 split-timeout config) so respawns inherit it.
*/
parleyDialTimeoutMs: z.number().int().positive().optional(),
/**
* Phase 9.5.9 §2.7 — for future use (Phase 9.5.7 hard-cap timeout).
*/
parleyHardTimeoutMs: z.number().int().positive().optional(),
parleyProfile: z.string().min(1).optional(),
/**
* Phase 9.5.9 §2.7 — persist BRV_BRIDGE_PARLEY_TURN_IDLE_TIMEOUT_MS
* (Phase 9.5.7 split-timeout config) so respawns inherit it.
*/
parleyTurnIdleTimeoutMs: z.number().int().positive().optional(),
projectRoot: z.string().min(1).optional(),
})
.strict()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (validation gap, still open from prior round — and the schema's growth makes it more glaring): Phase 9.5.9 adds 5 new persisted fields here (autoCreateQuota, claudeUnsafe, parleyDialTimeoutMs, parleyHardTimeoutMs, parleyTurnIdleTimeoutMs) and each one is correctly type-narrowed (z.number().int().positive(), z.boolean()). But listenAddrs at line 48 still validates as z.array(z.string().min(1)).min(1).optional() — bypassing the multiaddr() refinement in the canonical bridge-config.ts schema, as flagged in the 2026-05-22 15:17 UTC review.

With the schema now governing 9 distinct knobs, the listenAddrs string-only validation is the conspicuous outlier. A malformed BRV_BRIDGE_LISTEN_ADDRS=:60001 still gets persisted to disk and then trips every subsequent daemon respawn with an opaque libp2p error.

Same fix shape as before — either:

import {multiaddr} from '@multiformats/multiaddr'
const multiaddrString = z.string().refine((s) => {
  try { multiaddr(s); return true } catch { return false }
})
// …
listenAddrs: z.array(multiaddrString).min(1).optional(),

OR route the constructed config through parseBridgeConfig at the brv-server.ts:916-920 spread point.

Non-blocking for internal-test — the operator persona of this PR is technical enough to recover from a multiaddr typo. But every PR cycle adds more schema surface where listenAddrs is the lone "string-min(1)" exception; the inconsistency invites the next contributor to copy the wrong pattern.

Comment on lines +1239 to +1243
const _bridgeRuntimeForDriver = resolveBridgeRuntimeConfig({
env: process.env,
log,
store: new BridgeConfigStore({stateDir: join(getGlobalDataDir(), 'state')}),
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (per-invite disk I/O, low — but compounds with the multi-instance pattern across the file): every call to remotePeerDriverFactory (= every brv channel invite <remote>) constructs a fresh BridgeConfigStore and runs resolveBridgeRuntimeConfig from scratch. That call does:

  1. store.load()readFileSync(bridge-config.json) + Zod parse
  2. Re-parse all BRV_BRIDGE_* env vars
  3. Build the env-vs-file overlay
  4. If anyDefined(envSnapshot) is true, call persistConfigIfChanged which compares + potentially writeFileSync + renameSync

Across the file there are four independent new BridgeConfigStore({stateDir: ...}) constructions (lines 956, 976, 1242, 1487), each doing its own load. The values being resolved here (parleyDialTimeoutMs / parleyTurnIdleTimeoutMs) are determined at process start by env + persisted-file state; neither input can change for the lifetime of the daemon (env is process-level, the file is only written from inside this same daemon). Resolving once at startup and threading bridgeRuntime through the factory closure removes the per-invite I/O.

Concretely — hoist bridgeRuntime out, capture it in the closure:

// At daemon startup, once:
const bridgeRuntime = resolveBridgeRuntimeConfig({env: process.env, log, store: bridgeConfigStore})

// Driver factory uses captured value, no disk hit per invite:
const remotePeerDriverFactory = async (args) => {
  const host = await ensureBridgeHost()
  return new RemoteMemberDriver({
    /* … */
    persistedTimeouts: {
      dialTimeoutMs: bridgeRuntime.parleyDialTimeoutMs,
      idleTimeoutMs: bridgeRuntime.parleyTurnIdleTimeoutMs,
    },
    /* … */
  })
}

Same applies to the _bridgeQuotaStore block at lines 956-961 and to the four-instance pattern in general. Non-blocking — the per-invite cost is small and configsEqual short-circuits the redundant writes — but the redundant work is invisible to operators and grows linearly with invite count.

Comment on lines +91 to +95
// Atomic rename write.
await fs.mkdir(dirname(metaPath), {recursive: true})
const tmp = `${metaPath}.reconstruct.tmp`
await fs.writeFile(tmp, JSON.stringify(minimal, null, 2), 'utf8')
await fs.rename(tmp, metaPath)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (pre-emptive — fix this before 9.5.10 wires this function back into production): the commit message + channel-project-startup.ts doc-comment defer this function to 9.5.10 to fix the two data-corruption vectors kimi flagged (TOCTOU between access(metaPath) at line 50 and rename(tmp, metaPath) here; and the members: [] reconstruction lie). Adding a third item to that 9.5.10 punch list while it's fresh:

${metaPath}.reconstruct.tmp at line 93 is a non-unique tmp path — same atomic-write tmp-collision shape as the prior alias-store.ts:198 finding (still open) and parley-adapter-session-store.ts:126 finding (still open). When this function is wired and runs concurrently across multiple project roots (or after a crash leaves a stale tmp), two flows can race on the same path.

When fixing the TOCTOU at line 50-54 (per the kimi note), the natural fix is writeFile(metaPath, ..., {flag: 'wx'}) which makes both the existence check and the write happen atomically — and at that point the tmp path goes away entirely (no rename needed for wx). That'd close all three concerns in one pass:

try {
  await fs.writeFile(metaPath, JSON.stringify(minimal, null, 2), {encoding: 'utf8', flag: 'wx'})
  args.log(`[channel-meta-reconstruction] ...`)
} catch (err) {
  // meta.json materialized between our access check and write — leave the
  // real meta alone (the channel is back to a healthy state).
  if ((err as NodeJS.ErrnoException).code === 'EEXIST') return
  throw err
}

If rename-based atomic-write is preferred (matches the migration's existing pattern), copy the ${target}.tmp.${process.pid}.${randomBytes(4).toString('hex')} shape from install-identity-service.ts:427. Either is fine; both close the tmp-collision concern.

Non-blocking; the function is deferred. Flag belongs to the 9.5.10 work.

…ed fixes)

Picks up the reconstruction slice held from 9.5.9. Closes the data-corruption
vectors kimi flagged (turnId 7h-RAyyU6GEy0mRdjI9ay) and the recordType bug
codex found while reviewing the plan.

Fixes:
- TOCTOU race: new ChannelStore.reconstructIfMissing() uses the SAME
  writeSerializer.withLock(metaLockKey) as createChannel. Overwrite of a
  real meta is impossible. Stub-wins-by-design for the remaining
  create-vs-reconstruct race: doctor → invite recovery path.
- members:[] honest: schema-violating fake ChannelMember records avoided.
  Stub carries optional reconstructionStatus='reconstructed-from-history'
  and inferredHandles[] derived from author + mentions +
  delivery_snapshot.memberHandle. Filtered to /^@/, deduped, sorted.
- recordType bug: scanner now reads 'turn_snapshot' (matches the writer
  at snapshot-writer.ts:98) instead of 'snapshot'. Scans ALL NDJSON lines
  (not just first non-empty) so real files with event lines before the
  terminal snapshot are handled. Picks earliest startedAt across turns
  via Date.parse (not lex-sort) to handle mixed subsecond precision.
- Datetime validation: delegated to z.string().datetime() so the guard
  cannot drift from ChannelMetaSchema.createdAt (Z-only).
- Per-channel error logging: Promise.allSettled rejects are no longer
  silently swallowed; each failed channel is named in the log.

Doctor surface:
- DOCTOR_RECONSTRUCTED_FROM_HISTORY (warning) emitted when meta carries
  the reconstructionStatus flag. Message lists inferredHandles and
  recovery command with correct arg order:
  `brv channel invite <channelId> <handle> --profile <name>`.

Wiring:
- Re-wired reconstruction into runChannelProjectStartup as Step 0
  (before the inbound-only migration). Best-effort: any throw logged
  and swallowed.
- Un-skipped the daemon-startup reconstruction test from 9.5.9.

Schema:
- ChannelMetaSchema gains three optional fields: reconstructionStatus,
  reconstructedAt, inferredHandles. All optional so existing healthy
  metas continue to parse.

Review chain:
- Codex plan-review: 3 rounds, all blockers folded in
  (turnId SeZXehzrBoBKhk2U1P6GS)
- Codex impl-review: 3 rounds — caught doctor cmd arg-order +
  datetime offset accepting (turnId nlBVaypx29yWAPKddJ15D)
- Kimi second-eyes: caught lex-sort precision edge case + uninspected
  allSettled rejects (turnId BiAx_ssnAoWFa2qSVdtXy). Both fixed.

Test counts: ~25 new/changed tests (14 reconstruction + 5 ChannelStore +
2 doctor + 1 un-skipped startup + 3 fake-store interface plumbing
updates). Full suite: 9924 passing, 24 pending, 0 failing.

Out of scope (defer):
- Reconstruct full member records from delivery_snapshot (no
  addressability info inferable from history).
- Root cause for "context-tree/channel/ vanishes" — still unknown;
  BrvDirWatcher gives observability, this slice gives recovery.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 25, 2026

Claude finished @ngduyanhece's task in 4m 44s —— View job


Re-review — Channel + Bridge protocol (synchronize, 2026-05-25 02:00 UTC push)

Overall: 4 / 5 (unchanged). Phase 9.5.10 picks up the reconstruction slice deferred from 9.5.9 and folds in every kimi + codex finding from the review chain (3 + 3 + 2 rounds documented in the commit message). The structural shape is sound: reconstructIfMissing shares the same metaLockKey write-serializer as createChannel, so the TOCTOU overwrite race is closed; members:[] is now honest via reconstructionStatus + inferredHandles; the _recordType === 'turn_snapshot' scan iterates ALL NDJSON lines and uses Date.parse to pick the chronologically earliest startedAt across mixed subsecond precision. Findings below are operator-UX / schema-rigidity / resource polish, not blockers.

  • Identify commits added since the prior review (2026-05-24 23:47 UTC)
  • Diff the new commits — single substantive commit 9fbc7ac6 (+1549/-128, 15 files)
  • Spot-check new code (reconstructIfMissing, scanTurnsDir, doctor surface, schema additions, startup wiring)
  • Verify prior-round findings (3 directly addressed: TOCTOU + members + recordType — all from the 23:47 round's channel-meta-reconstruction.ts:93 finding)
  • Post inline findings (4 buffered)
  • Post final summary + score

Diff since 2026-05-24 23:47 UTC

SHA Change
9fbc7ac6 Phase 9.5.10 — channel meta reconstruction (kimi-flagged fixes). New ChannelStore.reconstructIfMissing (returns 'wrote' | 'already-exists') takes the same metaLockKey per-channel lock as createChannel. scanTurnsDir reads turn_snapshot (writer literal, was wrong 'snapshot') + delivery_snapshot across ALL non-empty NDJSON lines (was: first line only); collects localStartedAt + localHandles from turn.author.handle / turn.mentions[] / delivery.memberHandle; filters via /^@/ (drops the local-user 'you' placeholder); deduplicates + sorts. pickEarliest uses Date.parse (was: lex-sort — kimi found 2026-05-24T10:00:00.001Z < 2026-05-24T10:00:00Z because . < Z). Schema adds reconstructionStatus: z.literal('reconstructed-from-history').optional(), reconstructedAt, inferredHandles[]. Doctor emits DOCTOR_RECONSTRUCTED_FROM_HISTORY (warning) with recovery hint brv channel invite <channelId> <handle> --profile <name> (codex r2 caught the arg order). Wired as Step 0 in runChannelProjectStartup (before the inbound-only migration so the migration sees the reconstructed stubs). Per-channel error logging in the outer Promise.allSettled loop (kimi r2 — was silently swallowing rejects).

Total delta: 15 files, +1549/-128. ~25 new/changed tests; suite at 9924 passing.


What's good

  • Lock-protected reconstruction. ChannelStore.reconstructIfMissing (channel-store.ts:355-364) shares writeSerializer.withLock(metaLockKey(...)) with createChannel — the kimi-flagged TOCTOU overwrite race (accessrename gap) is structurally impossible now. The concurrent-reconstruct test (channel-store-reconstruct-if-missing.test.ts:94-105) verifies exactly-one-writes semantics.
  • Stub-wins-by-design is documented. The // Phase 9.5.10 Fix A block at lines 348-354 explicitly calls out the remaining race resolution: reconstructIfMissing publishes first → subsequent createChannel fails fast with "already exists" → operator recovers via doctor → invite. That's the right shape because reconstruction encodes a strictly weaker guarantee than user intent; the loud-fail forces the operator's hand. The test at channel-store-reconstruct-if-missing.test.ts:107-135 covers it.
  • NDJSON scanner is rigorous. Scans turn_snapshot (matches writer literal at snapshot-writer.ts:98) AND delivery_snapshot; iterates ALL non-empty lines; corrupt NDJSON lines are skipped (not abort); isIsoDatetime delegates to z.string().datetime() so the schema and the guard can never drift (codex r2). Date.parse-based pickEarliest handles the mixed-subsecond-precision case kimi caught.
  • Inferred-handles list is operator-actionable. Filtering /^@/ drops the local-user 'you' placeholder; deduping + sorting makes the doctor recovery hint stable. delivery_snapshot.memberHandle is included so dialer-side handles surface even when only the dialer mentioned the channel.
  • Schema additions are non-breaking. All three new fields are .optional() — healthy pre-9.5.10 metas parse unchanged. The doctor-side discriminator at doctor-service.ts:127 matches the literal exactly.
  • Step ordering is load-bearing and correctly documented. channel-project-startup.ts:15-26 makes the dependency explicit: reconstruction runs at Step 0 so runMarkInboundOnlyMigration sees the rebuilt stubs (no-op for reconstructed-stub members:[] but the comment is honest about why ordering matters). Best-effort wrappers swallow + log; daemon startup is not gated.
  • Test coverage is proportionate. 14 reconstruction-test cases (real turn_snapshot recordType, subsecond ordering, multiple turns + earliest pick, delivery_snapshot inferred handles, /^@/ filter, schema round-trip, idempotence, error tolerance) + 5 ChannelStore tests + 2 doctor tests + the un-skipped 9.5.9 startup-wiring test.

Inline findings (4 buffered)

  1. doctor-service.ts:127reconstructionStatus is sticky with no graduation path. The recovery hint reads "Recovery: run brv channel invite ... for each to restore membership," but neither inviteMember (orchestrator.ts:913-921) nor uninviteMember (orchestrator.ts:1150) clears the flag — the mutate function spreads ...meta verbatim. An operator who follows the recovery hint to the letter will keep seeing this warning forever. Suggest clearing the flag inside inviteMember's mutate when members.length becomes non-empty, OR surfacing a brv channel graduate <id> command. The current shape wires a warning the operator can never silence.
  2. channel.ts:493z.literal('reconstructed-from-history') makes any future provenance variant ('reconstructed-from-snapshot', 'reconstructed-from-remote') a schema-breaking change. z.enum(['reconstructed-from-history']) is the v1-equivalent and lets future variants land non-breakingly. Same nit on inferredHandles: z.array(z.string()).optional() — writer filters via /^@/ but schema doesn't carry that invariant (z.array(z.string().regex(/^@/))).
  3. channel-meta-reconstruction.ts:76 — cross-product parallelism (N channels × M turn files), no concurrency bound. For the function's load-bearing scenario ("the dir vanished" on a real-sized project), 50 channels × 100 turns = 5000 simultaneously-open FDs → EMFILE territory on default ulimits. Cheap fix: chunk the outer loop to 8 channels at a time, or use p-limit (already a dep).
  4. channel-meta-reconstruction.ts:210entries[i] zip-by-index coupling. Works (Promise.allSettled spec-guarantees order) but invites the wrong refactor. Capture the channelId inside the rejected reason or zip explicitly.

Prior-round findings — status

9fbc7ac6 directly closed the prior round's channel-meta-reconstruction.ts:93 finding — all three concerns (TOCTOU, members:[] lie, non-unique tmp path) are addressed structurally. The non-unique tmp shape inside writeAtomically (channel-store.ts:80) uses process.pid + Date.now(), which is sufficient because:

  • The in-process writeSerializer.withLock(metaLockKey(...)) serializes per-channel writes within a daemon
  • Cross-daemon is prevented by the daemon.json advisory lock
  • process.pid + Date.now() collisions would require same-pid + same-ms-tick from different daemons — vanishingly unlikely under the advisory-lock invariant

Compare to install-identity-service.ts:427 which uses randomBytes(4): strictly more entropy, but the channel-store's lock-protected case doesn't need it. The earlier round's alias-store.ts:198 finding remains the canonical "extract a shared atomicWrite() helper" worth doing eventually.

All other prior-round findings still open as written (master-key colocation, NFC normalization writer-only, unbounded Map family across parley-nonce-lru.ts / parley-rate-limit.ts / auto-create-quota.ts, process-lock.ts doc/code drift, alias-store.ts:198 + parley-adapter-session-store.ts:126 non-unique tmp paths, discovery.py bool-int, channel-skill/install-lib.js directory accept, claude-code-headless-adapter.ts:207 env-passthrough + :229 unbounded stderrChunks + :357-361 flushStdoutBuffer race + :20 doc/code drift + :439 missing --verbose test, parley-adapter-registry.ts UNSAFE-without-deps logs-and-continues, bridge-config-store.ts:48 listenAddrs not multiaddr-refined, ai-sdk-content-generator.ts:390 synthetic usage masking, INTERNAL_TEST.md §3.1/§3.5/§5/§5.1 doc trio, Phase 9.5.7 quartet, Phase 9.5.8 quintet, Phase 9.5.9 quintet). Consistent with the PR's "internal-test cut, follow-ups tracked" framing.


Verdict for this push

LGTM-incremental. Phase 9.5.10 closes the data-corruption vectors deferred from 9.5.9 with the cryptographic + concurrency discipline the prior rounds set the bar at: the lock invariant is shared with createChannel, the scanner is type-narrowing-safe and corruption-tolerant, the schema additions are non-breaking, and the doctor surface gives the operator a concrete recovery action. The four inline findings (sticky reconstructionStatus + no graduation path, schema rigidity on the literal + handle shape, cross-product parallelism on recovery, zip-coupling in error logging) are robustness/polish items, not ship blockers — only the first is structural (a warning the operator can't silence) and worth picking up before the v1.0 surface freezes.
· branch

// memberKind/peerId/multiaddr from history), so dispatch will fail
// until the operator re-invites participants. The diagnostic message
// names each inferred handle so the operator knows whom to re-invite.
if (meta.reconstructionStatus === 'reconstructed-from-history') {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (operator UX, low-medium — reconstructionStatus is sticky with no graduation path): the diagnostic message implies a clear recovery flow ("Recovery: run brv channel invite ... for each to restore membership"), but neither inviteMember (orchestrator.ts:913-921) nor uninviteMember (orchestrator.ts:1150) clears reconstructionStatus / inferredHandles. The mutate function spreads ...meta verbatim, so the flag persists across every invite. End state: an operator who runs brv channel invite <channelId> @alice --profile codex followed by the same for @bob, etc., and then re-runs brv channel doctor <channelId>, will keep seeing this warning forever — the recovery action they were told to take doesn't actually graduate the channel out of the reconstructed state.

Two reasonable fixes — either is fine:

  1. Have inviteMember clear the flag once members.length becomes non-empty. Inside the mutate function at orchestrator.ts:913-921, when adding a member to a channel whose meta carries reconstructionStatus, also unset the flag + inferredHandles + reconstructedAt:

    const next = {...meta, members: [...existing, member], updatedAt: this.clock().toISOString()}
    if (meta.reconstructionStatus !== undefined && next.members.length > 0) {
      delete next.reconstructionStatus
      delete next.reconstructedAt
      delete next.inferredHandles
    }
    return next

    This matches operator mental model: once you've re-invited people, the channel is recovered.

  2. Surface a brv channel graduate <channelId> command that explicitly clears the flag. More ceremonious, but gives the operator agency to assert "I've verified this channel is fully recovered" rather than relying on inviteMember to do it implicitly.

Without one of these, the recovery hint at lines 134-136 is misleading. The current code wires a "warning that the operator can never silence." Non-blocking for internal-test (the warning is informational, not error-level) but flag for v1.0 hardening.

Fix this →

// Phase 9.5.10 — set by `reconstructMissingMetas` when meta.json was
// rebuilt from channel-history because the original vanished. Downstream
// (doctor) surfaces this so the operator knows the channel is a stub.
reconstructionStatus: z.literal('reconstructed-from-history').optional(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (schema rigidity, low): z.literal('reconstructed-from-history') makes adding a second reconstruction provenance ("reconstructed-from-snapshot", "reconstructed-from-remote", etc.) a schema-breaking change — every reader (doctor-service, tests, future webui dashboards) becomes wrong simultaneously. The intent is clearly "one of a small enum of reconstruction sources" — making this z.enum(['reconstructed-from-history']) from the start would let future variants land without breaking the v1 reader's type narrowing (still 'reconstructed-from-history' for now, just typed as an enum singleton).

Same nit on inferredHandles: z.array(z.string()).optional() two lines down — the writer in channel-meta-reconstruction.ts:124 enforces /^@/ via filter, but the schema doesn't carry that invariant. Defense in depth: z.array(z.string().regex(/^@/)). Cheap, makes the schema describe the writer's contract.

Non-blocking; refactor-friendliness only.

// Read each NDJSON file once; iterate ALL non-empty lines (not just
// first) — real turn files have many event lines BEFORE the terminal
// `turn_snapshot`.
const fileResults = await Promise.allSettled(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (resource, low — cross-product parallelism on a recovery storm): reconstructMissingMetas runs Promise.allSettled over every channel-history entry (line 202-206), and each scanTurnsDir here runs Promise.allSettled over every .ndjson turn file. There is no concurrency bound on either layer. For internal-test usage (5-10 channels with a handful of turns each) this is fine. But the function's whole purpose is "the .brv/context-tree/channel/ directory vanished and we need to recover" — exactly the scenario where the operator's project has many channels each with many historical turns. A project with 50 channels × 100 turns = 5000 simultaneously-open file descriptors during scanTurnsDir.readFile. EMFILE / ENFILE territory on default ulimits.

Cheap fix — bound outer concurrency in reconstructMissingMetas:

// Process channels in chunks of 8 so we don't open ulimit_n FDs at once.
const CHUNK = 8
for (let i = 0; i < entries.length; i += CHUNK) {
  const slice = entries.slice(i, i + CHUNK)
  // eslint-disable-next-line no-await-in-loop
  await Promise.allSettled(slice.map(/* ... */))
}

OR use p-limit (already in package.json) for a simpler bounded-parallelism pattern. Non-blocking — the practical hit is at extremes — but the comment block at line 11-37 (the function's intent) names "the dir vanished" as the load-bearing scenario, which is exactly when the project is biggest.

for (const [i, r] of results.entries()) {
if (r.status === 'rejected') {
args.log(
`[channel-meta-reconstruction] reconstruct error for channel ${entries[i]} (continuing): ` +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (zip coupling, low): entries[i] reads from the outer array using the index from results.entries() — relies on the invariant that Promise.allSettled preserves input order. It does (spec-guaranteed), so this is correct, but the zip-by-index pattern is fragile against future refactors that introduce a .filter() before the Promise.allSettled or switch to a streaming variant. Preferable to capture the channelId inside the rejected reason or zip explicitly:

const promises = entries.map(async (channelId) => {
  try {
    await reconstructOne({channelId, channelStore: args.channelStore, log: args.log, projectRoot: args.projectRoot})
    return {channelId, status: 'fulfilled' as const}
  } catch (error) {
    return {channelId, error, status: 'rejected' as const}
  }
})
for (const r of await Promise.all(promises)) {
  if (r.status === 'rejected') args.log(`...channel ${r.channelId}...`)
}

Polish only — current shape works.

…sh prevention)

The recurring "context-tree/channel/<id>/meta.json vanishes" symptom that
motivated 9.5.9 (BrvDirWatcher observability) and 9.5.10 (reconstruction
recovery) has a root cause: `channel/` was intentionally cogit-synced per
a pre-bridge design, so VC tree-replace ops (checkout, reset, clone, merge)
on tracked channel files removed them from the working directory. The audit
trail in plan/bridge-smoothness/PHASE_9_5_11.md walks through which delete
sites were ruled out and why VC was the only path consistent with the
observed pattern (channel-history always survives because it's structurally
outside context-tree).

Since the libp2p bridge took over cross-host channel state, VC sync of
channels is no longer load-bearing. Adding `/channel/` to
CONTEXT_TREE_GITIGNORE_PATTERNS prevents future writes from being staged
by `brv vc add`, so subsequent tree-replace ops cannot touch channel/.

This is a PREVENTIVE fix. The 9.5.10 reconstruction layer remains the
safety net for pre-existing vanish events. Existing tracked channel
files in a user's local git index stay tracked until manually untracked
(`cd .brv/context-tree && git rm --cached -r channel/`); plan doc has
the recipe.

Out of scope:
- Auto-untrack migration (operator-driven; preserves user intent).
- TRANSCRIPT_TERMINAL_MISSING regression observed during the @gcp collab
  attempt for this slice (separate investigation; file as 9.5.12).

Tests: 4 new (1 constant assertion + 3 ensureContextTreeGitignore
behaviours). Full suite: 9928 passing, 24 pending, 0 failing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants