
Telemetry: stop retrying into 429s, honour Retry-After, fix userAgent #354

Open
samikshya-db wants to merge 5 commits into main from telemetry-429-circuit-breaker-and-user-agent

Conversation

@samikshya-db
Collaborator

@samikshya-db samikshya-db commented May 5, 2026

Summary

After v1.11.0 enabled telemetry by default via the server feature flag, high-QPS workloads produced excessive 429s on /telemetry-ext. Three issues compounded:

  1. Double-retry. The telemetry exporter ran its own retry loop on top of the retryablehttp-wrapped HTTP client (internal/client.RetryableClient), which already retries 429/5xx and honours Retry-After. Result: up to (RetryMax+1) × (MaxRetries+1) HTTP attempts per export, all collapsed into a single circuit-breaker outcome — so the breaker barely opened against persistent throttling.
  2. Untraceable in access logs. Telemetry POSTs and feature-flag GETs sent no User-Agent, so 429s landed in access logs tagged as Go-http-client/1.1 and could not be attributed to godatabrickssqlconnector by driver version.
  3. High request volume. The defaults FlushInterval=5s / BatchSize=100 meant each client flushed frequently, in small batches.
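
For a concrete sense of the amplification in item 1: the commit notes below put retryablehttp's RetryMax at 4 (up to 5 attempts per call), so with a hypothetical exporter setting of MaxRetries=2 a single export could issue up to 5 × 3 = 15 requests against an endpoint that is already throttling.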

Changes

Retry behavior

  • telemetry/exporter.go — Removed doExport's retry loop entirely. doExport now makes a single HTTP request; transient retries (429/5xx, Retry-After) are owned by the underlying retryablehttp client. Each export() call now corresponds to exactly one HTTP transaction = one breaker outcome.
  • telemetry/config.go, telemetry/driver_integration.go — Removed MaxRetries / RetryDelay from telemetry.Config and TelemetryInitOptions. telemetry_retry_count / telemetry_retry_delay DSN params still parse without error for backwards compatibility but are no-ops.
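
To make the new shape of doExport concrete, here is a minimal sketch of the single-attempt behaviour described in the bullets above; the struct fields, endpoint handling, and error wrapping are illustrative assumptions, not the actual exporter code.

```go
package telemetry

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// exporter stands in for the real telemetry exporter; the field names are
// assumptions made for this sketch.
type exporter struct {
	client    *http.Client // in the driver this is the retryablehttp-wrapped client
	endpoint  string
	userAgent string
}

// doExport sends one batch as a single HTTP request. Transient retries
// (429/5xx, Retry-After) are owned by the underlying client, so each call
// maps to exactly one circuit-breaker outcome.
func (e *exporter) doExport(ctx context.Context, payload []byte) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, e.endpoint, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("User-Agent", e.userAgent)

	resp, err := e.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode >= http.StatusBadRequest {
		return fmt.Errorf("telemetry export: unexpected status %s", resp.Status)
	}
	return nil
}
```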

Identifiability

  • connector.go — New buildUserAgent helper mirroring internal/client/client.go:295-302 exactly: DriverName/DriverVersion + optional UserAgentEntry + agent product.
  • telemetry/exporter.go, telemetry/featureflag.go — Set User-Agent on telemetry POST and feature-flag GET. Plumbed via TelemetryInitOptions.UserAgent.
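
A rough sketch of what such a helper can look like, assuming the format described in the bullet above; the constant names and version value are placeholders, and the trailing agent-product token appended by internal/client is elided:

```go
package dbsql

// Placeholder values; the real constants live alongside internal/client/client.go.
const (
	driverName    = "godatabrickssqlconnector"
	driverVersion = "x.y.z"
)

// buildUserAgent mirrors the main client's User-Agent shape:
// "DriverName/DriverVersion", then an optional user-supplied entry. The
// agent-product suffix used by internal/client is omitted in this sketch.
func buildUserAgent(userAgentEntry string) string {
	ua := driverName + "/" + driverVersion
	if userAgentEntry != "" {
		ua += " " + userAgentEntry
	}
	return ua
}
```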

Cadence and breaker tuning

  • telemetry/config.go — FlushInterval 5s → 30s, BatchSize 100 → 200.
  • telemetry/circuitbreaker.go — minimumNumberOfCalls 20 → 10 (so low-traffic clients can trip the breaker now that each export is one signal), waitDurationInOpenState 30s → 60s (to respect a typical Retry-After). The tuned defaults are sketched after this list.
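
The tuned defaults restated as a hedged sketch (field names are approximate; the real values live in telemetry/config.go and telemetry/circuitbreaker.go):

```go
package telemetry

import "time"

// Approximate shape only; the real structs and field names may differ.
var tunedDefaults = struct {
	FlushInterval           time.Duration
	BatchSize               int
	MinimumNumberOfCalls    int
	WaitDurationInOpenState time.Duration
}{
	FlushInterval:           30 * time.Second, // was 5s
	BatchSize:               200,              // was 100
	MinimumNumberOfCalls:    10,               // was 20: breaker can trip on fewer signals
	WaitDurationInOpenState: 60 * time.Second, // was 30s: closer to a typical Retry-After
}

// At one export every 30s, accumulating 10 outcomes takes roughly five
// minutes, which bounds how quickly a sustained outage can trip the breaker.
```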

Tests

  • Removed obsolete retry/backoff tests (TestExport_RetryOn5xx, TestExport_ExponentialBackoff, TestIsRetryableStatus, retry-config parsing tests).
  • Added TestExport_SingleAttemptPerExport covering 4xx/429/5xx, asserting the exporter never retries.
  • Added TestExport_SetsUserAgent and TestFetchFeatureFlag_SetsUserAgent.

Mitigation

While this rolls out, users can opt out via DSN: enableTelemetry=false. Server-side: disable enableTelemetryForGoDriver for affected workspaces.
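
For example, a client-side opt-out looks roughly like this; the host, HTTP path, and token are placeholders, and the DSN shape follows the driver's usual token@host:port/path?params form:

```go
package main

import (
	"database/sql"

	_ "github.com/databricks/databricks-sql-go"
)

func main() {
	// enableTelemetry=false turns the client-side telemetry pipeline off
	// regardless of the server feature flag. Host, warehouse path, and token
	// below are placeholders.
	dsn := "token:dapiXXXX@dbc-example.cloud.databricks.com:443/sql/1.0/warehouses/abc123?enableTelemetry=false"
	db, err := sql.Open("databricks", dsn)
	if err != nil {
		panic(err)
	}
	defer db.Close()
}
```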

Test plan

  • go test ./... — all green locally.
  • Exporter never retries (4xx, 429, 500, 503).
  • User-Agent set on telemetry POST and feature-flag GET.
  • Verify in Lumberjack post-deploy that /telemetry-ext and /api/2.0/connector-service/feature-flags/GOLANG/... requests carry godatabrickssqlconnector/<version> in http_user_agent.
  • Confirm 429 rate against /telemetry-ext drops after rollout.

This pull request and its description were written by Isaac.

@samikshya-db samikshya-db changed the title Stop telemetry 429s from amplifying load and identify telemetry traffic Stop telemetry 429s from retrying with backoff only for telemetry endpoint + fix userAgent on telemetry May 5, 2026
@samikshya-db samikshya-db force-pushed the telemetry-429-circuit-breaker-and-user-agent branch from 1300f1a to fbb0263 Compare May 5, 2026 10:19
After v1.11.0 enabled telemetry by default via the server feature flag,
high-QPS workloads produced excessive 429s on /telemetry-ext. Three
issues compounded:

1. Double-retry. The exporter ran its own retry loop on top of the
   retryablehttp-wrapped HTTP client (internal/client.RetryableClient),
   which already retries 429/5xx with Retry-After. Result: up to
   (RetryMax+1) * (MaxRetries+1) HTTP attempts per export, all collapsed
   into one circuit-breaker outcome — so the breaker barely opened.
2. Untraceable in access logs. Telemetry POSTs and feature-flag GETs
   sent no User-Agent, so 429s were tagged Go-http-client/1.1 and
   could not be attributed to godatabrickssqlconnector by version.
3. High request volume. FlushInterval=5s, BatchSize=100.

Changes:

- telemetry/exporter.go: drop the retry loop entirely. doExport now
  makes a single HTTP request; transient retries (429/5xx, Retry-After)
  are owned by the underlying retryablehttp client. Each export call
  → exactly one breaker outcome.
- telemetry/exporter.go, telemetry/featureflag.go: set User-Agent
  header on telemetry POST and feature-flag GET. Built once at the
  connector site (buildUserAgent in connector.go), mirroring
  internal/client/client.go format
  (DriverName/DriverVersion + UserAgentEntry + agent product), and
  plumbed via TelemetryInitOptions.UserAgent.
- telemetry/config.go: FlushInterval 5s → 30s, BatchSize 100 → 200.
  Remove MaxRetries/RetryDelay from telemetry.Config and
  TelemetryInitOptions; telemetry_retry_count/_delay DSN params still
  parse for backwards compat but are no-ops.
- telemetry/circuitbreaker.go: lower minimumNumberOfCalls 20 → 10
  (so low-traffic clients can still trip the breaker on a sustained
  outage now that each export is one signal), and raise
  waitDurationInOpenState 30s → 60s (respect typical Retry-After).
- Tests: removed obsolete retry/backoff tests; added single-attempt
  assertion across 4xx/429/5xx; added User-Agent assertions on both
  endpoints.

Co-authored-by: Isaac
@samikshya-db samikshya-db force-pushed the telemetry-429-circuit-breaker-and-user-agent branch from fbb0263 to d47bf99 Compare May 5, 2026 10:40
@samikshya-db samikshya-db changed the title Stop telemetry 429s from retrying with backoff only for telemetry endpoint + fix userAgent on telemetry Telemetry: stop double-retrying, identify traffic, tune breaker May 5, 2026
@samikshya-db samikshya-db changed the title Telemetry: stop double-retrying, identify traffic, tune breaker Telemetry: stop double-retrying, fix user-agent, tune circuit breaker May 5, 2026
… first flush

The previous commit removed the exporter's inner retry loop, but the
retryablehttp layer (RetryMax=4) was still amplifying each telemetry
export into up to 5 attempts on a 429 — defeating the breaker because
only the final outcome was observed.

- internal/client: add WithSkipRateLimitRetry context helper. RetryPolicy
  treats 429 as non-retryable when the flag is set; 5xx and transport
  errors are unaffected. Set on telemetry POSTs and feature-flag GETs so
  each rate-limited request is exactly one HTTP transaction = one
  breaker signal.
- telemetry/circuitbreaker: record server-provided Retry-After deltas
  from 429 responses; the open-state wait becomes
  max(waitDurationInOpenState, hint). Hint cleared on the next
  half-open -> closed transition.
- telemetry/aggregator: offset the first flush by a random
  [0, FlushInterval) so a fleet of clients booted together does not
  phase-align bursts.
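
As an illustration of the skip-429-retry flag described in the first bullet above, here is a minimal sketch; only the helper name and its observable behaviour come from this commit, while the context-key type and retry-policy wiring are assumptions:

```go
package client

import (
	"context"
	"net/http"
)

// skipRateLimitRetryKey marks requests whose 429 responses should surface
// immediately rather than being retried by the transport layer.
type skipRateLimitRetryKey struct{}

// WithSkipRateLimitRetry returns a context that tells the retry policy not to
// retry 429 responses for this request.
func WithSkipRateLimitRetry(ctx context.Context) context.Context {
	return context.WithValue(ctx, skipRateLimitRetryKey{}, true)
}

func skipRateLimitRetry(ctx context.Context) bool {
	v, _ := ctx.Value(skipRateLimitRetryKey{}).(bool)
	return v
}

// retryPolicy shows only the relevant branch (signature matches
// retryablehttp's CheckRetry): with the flag set, a 429 is returned as one
// non-retried outcome; 5xx and transport-error handling is elided.
func retryPolicy(ctx context.Context, resp *http.Response, err error) (bool, error) {
	if err == nil && resp != nil && resp.StatusCode == http.StatusTooManyRequests && skipRateLimitRetry(ctx) {
		return false, nil
	}
	// ... existing 5xx / transport-error handling unchanged ...
	return false, err
}
```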

Co-authored-by: Isaac
@samikshya-db samikshya-db changed the title Telemetry: stop double-retrying, fix user-agent, tune circuit breaker Telemetry: stop retrying into 429s, honour Retry-After, identify traffic May 13, 2026
@samikshya-db samikshya-db changed the title Telemetry: stop retrying into 429s, honour Retry-After, identify traffic Telemetry: stop retrying into 429s, honour Retry-After, fix userAgent May 13, 2026
- Doc accuracy on retry semantics: WithSkipRateLimitRetry only suppresses
  429s. Generic 5xx and transport errors are governed by ClientMethod, so
  telemetry contexts (which set no ClientMethod) get one-shot semantics on
  those failure modes; 503 still retries via the method-agnostic path.
  Updated SkipRateLimitRetry godoc and doExport comment to match.
- Drop dead TelemetryRetryCount / TelemetryRetryDelay fields from
  UserConfig. The DSN keys are still consumed (so they do not leak into
  session params) but no longer stored. Cleaned up assertions in
  config_test.go and connector_test.go.
- Clear retryAfterHint when the breaker transitions Open -> Half-Open, so
  a stale hint cannot extend a later reopen that did not carry its own
  Retry-After. Collapsed the two-step lock dance in execute() into one
  critical section. Updated godoc.
- Guard rand.Int63n against a zero flushInterval in flushLoop.
- Document the first-caller-wins User-Agent behaviour on the per-host
  telemetry client singleton, since later connections on the same host
  reuse the cached client.
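
A minimal sketch of the jittered first flush with the zero-interval guard mentioned above (the function signature and plumbing are assumptions; rand.Int63n panics for a non-positive argument, which is what the guard protects against):

```go
package telemetry

import (
	"math/rand"
	"time"
)

// flushLoop delays the first flush by a random offset in [0, flushInterval)
// so a fleet of clients started together does not phase-align its bursts.
// rand.Int63n panics when its argument is <= 0, hence the guard.
func flushLoop(flushInterval time.Duration, flush func(), done <-chan struct{}) {
	firstDelay := flushInterval
	if flushInterval > 0 {
		firstDelay = time.Duration(rand.Int63n(int64(flushInterval)))
	}

	timer := time.NewTimer(firstDelay)
	defer timer.Stop()

	for {
		select {
		case <-timer.C:
			flush()
			timer.Reset(flushInterval)
		case <-done:
			return
		}
	}
}
```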

Co-authored-by: Isaac
@samikshya-db samikshya-db requested a review from gopalldb May 13, 2026 20:49
- gofmt -s on connector_test.go, internal/config/config_test.go,
  telemetry/exporter.go (struct alignment + missing blank line).
- internal/config/config.go: telemetry_retry_count is parsed for
  backwards compat then discarded, so call extract (2 returns) rather
  than extractAsInt (3 returns) to satisfy dogsled.
- telemetry/circuitbreaker.go: drop the now-unused setState helper.
  execute() now transitions to half-open via setStateUnlocked under
  an already-held lock.

Co-authored-by: Isaac
@gopalldb
Collaborator

Must-fix

  1. "One breaker signal per export" is only true for 429. WithSkipRateLimitRetry(ctx) in exporter.go suppresses retryablehttp's 429 retries, but 502/503/5xx still retry internally per
    isRetryableServerResponse in internal/client/client.go. A 503 storm will go through N internal retries before surfacing one error to the breaker — same problem the PR claims to fix, just one code path
    narrower. Either pass a stricter "skip all retries" context flag from telemetry, or accept the asymmetry and document it explicitly.
  2. Retry-After is honored twice on 429. retryablehttp's backoff() (client.go ~L749) already waits Retry-After before re-attempting; the exporter then calls extendOpenStateAtLeast(...) with the same value, so
    the breaker sleeps Retry-After + waitDurationInOpenState. With both ~60s the effective pause is ~120s. Either move Retry-After handling exclusively to the breaker (don't retry at the transport layer for
    telemetry) or skip the breaker extension when retryablehttp owned the wait.
  3. Retry-After not parsed on 503. Exporter only extracts the header when resp.StatusCode == 429. A 503 with Retry-After is silently ignored — breaker stays open the default 60s regardless of what the server
    asked for. Apply the same Retry-After parse to 503 as well as 429; a sketch of one approach follows this list.
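
A minimal sketch of what a shared Retry-After parse for both statuses could look like, as suggested in item 3; the helper name and placement are assumptions, and Retry-After may arrive as delta-seconds or an HTTP-date:

```go
package telemetry

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfterHint extracts a Retry-After delay from throttled (429) and
// unavailable (503) responses alike, accepting both the delta-seconds and
// HTTP-date forms. It returns 0 when no usable hint is present.
func retryAfterHint(resp *http.Response) time.Duration {
	if resp == nil {
		return 0
	}
	if resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable {
		return 0
	}
	v := resp.Header.Get("Retry-After")
	if v == "" {
		return 0
	}
	if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
		return time.Duration(secs) * time.Second
	}
	if t, err := http.ParseTime(v); err == nil {
		if d := time.Until(t); d > 0 {
			return d
		}
	}
	return 0
}
```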

Should-fix

  1. DSN backcompat is silent. telemetry_retry_count / telemetry_retry_delay are extracted and discarded (_, _ = params.extract(...) style). Users with legacy DSNs won't know their config is ignored. Log a
    one-time DEBUG (or WARN) when these params are present.
  2. Breaker math change isn't documented. minimumNumberOfCalls 20→10 with FlushInterval 5s→30s means worst-case ~5 minutes to detect a sustained outage at default cadence. State this in a code comment so
    future maintainers don't accidentally regress it.
  3. Sliding-window size didn't change with minimumNumberOfCalls. Window stayed at 30 while min dropped to 10. Verify intentional — the half-open recovery window now spans a smaller fraction, which shifts the
    trip-to-recover ratio.
  4. buildUserAgent is duplicated, not extracted. PR body says it "mirrors internal/client/client.go:295-302 exactly" — two copies of the same logic will drift. Extract to a shared helper (e.g.,
    internal/agent/useragent.go) and have both call sites use it.

Test coverage

  1. TestExport_SingleAttemptPerExport uses a bare *http.Client, not the retryablehttp-wrapped one. That correctly proves the exporter layer doesn't loop, but doesn't catch Must-fix item 1 above (transport-layer
    retry on 503). Add an integration-flavor test that uses the real RetryableClient() and asserts the combined behavior matches "one breaker signal per export" for all status codes the PR cares about.
  2. TestIsRetryableStatus was deleted, not relocated. isRetryableStatus moved from exporter to the transport layer (isRetryableServerResponse); the predicate test should move with it into
    internal/client/client_test.go.
  3. Breaker edge cases. New circuitbreaker_test.go (+118) covers happy paths for the Retry-After hint. Missing:
    - Two consecutive extendOpenStateAtLeast calls with different hints — which wins?
    - Half-open → fail → re-open while a prior hint is pending — does the hint persist? Reading the code, hints only clear on Close, so a stale hint can linger.
