cosmos_database_container_item_query: 93% of failures are auth credential errors (CredentialUnavailableException) #2290

@yunjchoi

Summary

Telemetry analysis of cosmos_database_container_item_query over the last 30 days shows 21,183 total failures. About 93% of them (19,640) share a single root cause: CredentialUnavailableException (HTTP 401), raised for users whose Azure CLI credentials are expired or not configured.

This suggests the tool's error handling and user guidance for auth failures could be improved.

Failure Breakdown (30d)

| # | Exception Type | Status | Count | % | Avg Latency | Root Cause |
|---|---|---|---|---|---|---|
| 1 | Azure.Identity.CredentialUnavailableException | 401 | 19,640 | 92.7% | 249ms | Azure CLI not logged in or token expired |
| 2 | (no exception captured) | | 1,190 | 5.6% | 61,000ms | Silent timeouts: 61s avg, no error details surfaced |
| 3 | Microsoft.Azure.Cosmos.CosmosException | 403 | 150 | 0.7% | ~2s | User lacks Cosmos DB data-plane RBAC |
| 4 | System.ArgumentNullException | 500 | 100 | 0.5% | ~1s | Null required parameter in tool invocation |
| 5 | ValidationError | | 35 | 0.2% | ~1s | Missing required args (--account, --database, --container, --subscription) |
| 6 | Azure.RequestFailedException (AuthorizationFailed) | 403 | ~30 | 0.1% | ~8s | ARM-level RBAC denial |
| 7 | System.Net.Http.HttpRequestException | 503 | 18 | <0.1% | ~3s | Cosmos DB service unavailable |
| 8 | CosmosOperationCanceledException | 500 | 8 | <0.1% | 22 min | Extreme query timeouts |
| 9 | InvalidAuthenticationTokenTenant | 401 | ~6 | <0.1% | ~2s | Wrong Entra ID tenant |
| 10 | System.TypeInitializationException | 500 | 10 | <0.1% | ~30ms | SDK initialization failure |
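
For reference, the percentage column can be reproduced by dividing each count by the 30-day total of 21,183. The snippet below is only a quick arithmetic check over the rows with exact counts; the dictionary keys are shortened labels chosen for illustration.

```python
# Quick arithmetic check of the "%" column, using the 30-day total
# and the exact (non-approximate) counts from the breakdown.
TOTAL_FAILURES = 21_183

counts = {
    "CredentialUnavailableException": 19_640,
    "(no exception captured)": 1_190,
    "CosmosException (403)": 150,
    "ArgumentNullException": 100,
    "ValidationError": 35,
}


def share(count, total=TOTAL_FAILURES):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


shares = {name: share(n) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```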

Recommendations

1. Better auth error UX (addresses 93% of failures)

When CredentialUnavailableException is caught, return a clear, actionable error message instead of a generic failure:

"Azure credentials not found. Please run az login to authenticate, then try again."

Consider proactively checking credential availability before attempting the Cosmos DB call.
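
A minimal sketch of what this could look like, written in Python purely for illustration (the tool itself surfaces .NET exceptions, so the real change would live in that codebase). `CredentialUnavailable`, `cli_credentials_available`, and `run_query` are hypothetical names, and the preflight assumes that shelling out to `az account get-access-token` is an acceptable way to probe for a usable CLI login.

```python
import shutil
import subprocess

AUTH_HINT = (
    "Azure credentials not found. "
    "Please run az login to authenticate, then try again."
)


class CredentialUnavailable(Exception):
    """Hypothetical stand-in for Azure.Identity.CredentialUnavailableException."""


def cli_credentials_available(run=subprocess.run, which=shutil.which):
    """Best-effort preflight: True if the Azure CLI can mint a token."""
    az_path = which("az")
    if az_path is None:
        return False  # Azure CLI is not installed or not on PATH
    try:
        result = run(
            [az_path, "account", "get-access-token", "--output", "none"],
            capture_output=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def run_query(query_fn, preflight=cli_credentials_available):
    """Run query_fn, turning auth failures into an actionable error payload."""
    if not preflight():
        return {"status": 401, "error": AUTH_HINT}
    try:
        return {"status": 200, "result": query_fn()}
    except CredentialUnavailable:
        # Credentials can still expire between the preflight and the call,
        # so the exception path returns the same guidance.
        return {"status": 401, "error": AUTH_HINT}
```

The preflight adds a process launch to every call, so in practice it may be better to cache its result briefly or to rely on the exception path alone; either way the user sees the same guidance.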

2. Surface error details for silent timeouts (addresses 5.6%)

The 1,190 failures with no exception type captured average 61 seconds in duration, which suggests connection timeouts where the error is swallowed. Ensure the timeout exception and its message are captured in telemetry.
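
One way to make sure a timeout never goes unrecorded is to route every call through a wrapper that logs the exception type and message before re-raising. The sketch below is in Python for illustration only; `call_with_timeout` and the list-based `telemetry` sink are hypothetical, not part of the tool.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, telemetry):
    """Run fn with a deadline; record the type and message of any failure."""
    started = time.monotonic()
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)

    def record(outcome, **fields):
        duration = round(time.monotonic() - started, 3)
        telemetry.append({"outcome": outcome, "duration_s": duration, **fields})

    try:
        result = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError as exc:
        message = f"no response within {timeout_s}s"
        record("failure", exception_type="TimeoutError", message=message)
        raise TimeoutError(message) from exc
    except Exception as exc:
        # Any other failure is recorded with its concrete type and message.
        record("failure", exception_type=type(exc).__name__, message=str(exc))
        raise
    finally:
        # Do not block on the worker; note that a timed-out call keeps
        # running in the background because threads cannot be cancelled.
        pool.shutdown(wait=False)

    record("success")
    return result
```

With a wrapper like this, a 61-second hang would appear in telemetry as a `TimeoutError` with a message rather than as a row with no exception at all.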

Environment

  • Data source: RawEventsDependencies table in AzureDevExp
  • Time range: Last 30 days (as of 2026-03-30)
  • Clients affected: Primarily VS Code (clientname == 'Visual Studio Code')
