Skip to content

feat(provider): selective credential passthrough#1355

Open
laitingsheng wants to merge 10 commits into
mainfrom
fix/894-customisable-credential-placeholder
Open

feat(provider): selective credential passthrough#1355
laitingsheng wants to merge 10 commits into
mainfrom
fix/894-customisable-credential-placeholder

Conversation

@laitingsheng
Copy link
Copy Markdown
Contributor

@laitingsheng laitingsheng commented May 13, 2026

Summary

Adds an opt-in passthrough mechanism for provider credentials. The supervisor injects the real value into the agent's environment instead of the canonical openshell:resolve:env:<KEY> placeholder, and the L7 proxy performs no substitution for that key. Required for SDKs that validate credential format in-process (Slack xoxb-*/xapp-*), credentials consumed by in-process crypto (HMAC signing secrets, signature verification), and transports the proxy cannot rewrite (WebSocket payloads after HTTP 101 upgrade).

Related Issue

Fixes #894.

Related: #872, #913, #988.

Groundwork for NVIDIA/NemoClaw#1569, NVIDIA/NemoClaw#2024, NVIDIA/NemoClaw#2031, NVIDIA/NemoClaw#2085, and NVIDIA/NemoClaw#3179.

Changes

Adds one new optional field to Provider in proto/datamodel.proto:

  • passthrough_credentials: repeated string listing credential keys whose real value is injected directly into the agent process. Drops the "agent never sees the real secret" invariant for the listed keys; treated as a per-credential security trade-off.

GetSandboxProviderEnvironmentResponse gains passthrough_keys so the supervisor knows which entries in environment are real values rather than canonical-placeholder targets.

Custom-format placeholders (the second mode in earlier drafts of this PR) are deliberately deferred. Passthrough alone covers all three failure classes from #894 (in-process format check, opaque transport, in-process crypto). A custom-placeholder field is purely additive when a future Class A use case justifies it.

Server changes (crates/openshell-server):

  • grpc/provider.rs: round-trip on Create/Update/Get/List. resolve_provider_environment returns a ProviderEnvironment struct containing the merged env map and the union of passthrough keys; on duplicate keys across providers, the passthrough flag follows the winning value.
  • grpc/validation.rs: validate each entry as a valid env var name, present in credentials, no duplicates, under the entry-count cap.
  • grpc/policy.rs: plumb passthrough_keys into GetSandboxProviderEnvironmentResponse.
  • Update merge has three modes:
    • clear_passthrough_credentials=true on UpdateProviderRequest = replace with the empty list. proto3 cannot distinguish "absent" from "empty list" on a repeated field, so this is the only way to revoke passthrough for every key. Mutually exclusive with a non-empty passthrough_credentials on the same request;
    • empty incoming + clear=false = preserve existing list, but auto-prune entries whose credential was deleted in the same update — without this, marking a credential passthrough and then deleting that credential locks the provider in an invalid state;
    • non-empty incoming = replace verbatim; explicit orphans still rejected as a caller bug.

Sandbox changes (crates/openshell-sandbox):

  • grpc_client.rs: fetch_provider_environment now returns a ProviderEnvironment carrying both the env map and the passthrough key list.
  • secrets.rs: new SecretResolver::from_provider_env_with_passthrough — for passthrough keys puts the real value in child_env and skips resolver registration; non-passthrough keys keep the existing canonical-placeholder behaviour. Resolver is None when every key is passthrough.
  • lib.rs: thread the passthrough list through the supervisor launch path; OCSF log includes both env_count and passthrough_count.

CLI changes (crates/openshell-cli):

  • provider create/provider update: --passthrough KEY (repeatable; key must also appear in --credential on create).
  • provider update: --clear-passthrough (bool) wipes the passthrough list; mutually exclusive with --passthrough.
  • provider get: annotates passthrough keys (KEY (passthrough)).
  • provider list: adds a PASSTHROUGH count column.

Architecture docs (architecture/sandbox.md): new "Selective Passthrough" section covering motivation, configuration, resolution-and-merge semantics, supervisor injection, the auto-prune-on-credential-deletion behaviour, and the security trade-off.

User-facing docs (docs/sandboxes/manage-providers.mdx): how-to with example, <Warning> on the security trade-off, and a "Revoking passthrough" subsection covering --clear-passthrough.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Add an opt-in passthrough mechanism for provider credentials that bypass the
canonical openshell:resolve:env:* placeholder and L7 proxy substitution.

Provider gains a repeated string passthrough_credentials field listing
credential keys whose real value should be injected directly into the agent
process environment. Required for SDKs that validate credential format
in-process (Slack xoxb-*/xapp-*), credentials consumed by in-process crypto
(HMAC signing secrets, signature verification), and transports the proxy
cannot rewrite (WebSocket payloads after HTTP 101 upgrade). Passthrough
drops the "agent never sees the real secret" invariant for the listed
keys; documented as a per-credential security trade-off.

Server: validation rejects keys not in credentials, invalid env var names,
duplicates, and over-cap. resolve_provider_environment returns a
ProviderEnvironment struct with the merged env map and the union of
passthrough keys; passthrough flag follows the winning value on duplicate
keys across providers. Update merge auto-prunes orphaned passthrough
entries when their credential is deleted in the same call (proto3 cannot
distinguish "field unset" from "set to empty list"), preserving the
"recover via update" path. Explicit non-empty incoming list still rejects
orphans as a caller bug.

Sandbox: SecretResolver::from_provider_env_with_passthrough handles the
new mode — passthrough keys go to child env as real values, resolver does
not learn them; non-passthrough keys still get the canonical placeholder.

CLI: provider create/update accept --passthrough KEY (repeatable);
provider get annotates passthrough keys, provider list adds a PASSTHROUGH
column.

Refs #894 (selective passthrough lands; custom-format placeholders deferred
to a follow-up).

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Resolve conflicts from upstream:
- Architecture doc consolidation (#1184) deleted sandbox-providers.md;
  migrate the Selective Passthrough section to architecture/sandbox.md.
- Provider credential rotation infrastructure (provider_env_revision +
  ProviderCredentialState) and selective passthrough share the same proto
  field slot and the same SecretResolver entry point. Merge them: proto
  GetSandboxProviderEnvironmentResponse keeps both passthrough_keys and
  provider_env_revision; SecretResolver::from_provider_env_for_revision
  takes both passthrough_keys and revision; ProviderCredentialState plumbs
  passthrough_keys through from_environment / install_environment;
  ProviderEnvironment carries all three fields end-to-end.
- Provider proto struct picked up passthrough_credentials in our branch;
  add the missing field to test fixtures in run.rs and grpc/sandbox.rs.
- provider_create grew a passthrough argument; pass &[] in
  provider_commands_integration.rs.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

# Conflicts:
#	mise.lock
@laitingsheng laitingsheng marked this pull request as draft May 13, 2026 04:47
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Resolve three conflicts from upstream credential-rewrite work (#1286):

- `secrets.rs`: the inner constructor now accepts both `passthrough_keys`
  (from this branch) and `include_current_aliases` (from main). Passthrough
  keys short-circuit into `child_env` directly; the alias-fallback insert
  runs only for placeholder keys. Threaded `passthrough_keys` through both
  public wrappers (`from_provider_env_for_revision`,
  `from_provider_env_for_current_revision`).
- `provider_credentials.rs`: `from_environment` and `install_environment`
  keep this branch's `passthrough_keys` parameter and main's three-tuple
  return shape (`generation_resolver` + `current_resolver`).
- `run.rs`: an additive `service_expose`/`service_list`/... block from main
  collided with this branch's `#[allow(clippy::too_many_arguments)]`
  annotation on `provider_create`. Keep both — annotation stays on
  `provider_create`, services block sits before it.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions
Copy link
Copy Markdown

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng marked this pull request as ready for review May 13, 2026 14:19
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng added the test:e2e Requires end-to-end coverage label May 13, 2026
@github-actions
Copy link
Copy Markdown

Label test:e2e applied for ada8cee. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@amfred
Copy link
Copy Markdown
Contributor

amfred commented May 26, 2026

My code review agent GPT-5.5 flagged the following issue, and I agree that there should be a way to clear the passthrough credentials:

[P1] UpdateProvider cannot clear the last passthrough credential
crates/openshell-server/src/grpc/provider.rs:162

The update merge treats an empty incoming passthrough_credentials list as “preserve existing,” and a non-empty list as replacement. Because this is a proto3 repeated field, clients have no presence bit to distinguish “omitted” from “explicitly set to empty.” That means once a provider has exactly one passthrough key, there is no direct API or CLI path to turn passthrough off while keeping the credential. openshell provider update --passthrough ... can replace with another non-empty list, but openshell provider update with no passthrough preserves the old list.

That is a security-relevant footgun: passthrough is the mode that exposes raw secrets to the agent process, so operators need a clean revocation path. I’d add either an explicit clear_passthrough_credentials field, an update field mask, or a dedicated provider passthrough update RPC/operation. Then add tests covering “single passthrough key -> clear to empty.”

Related code:

  • crates/openshell-server/src/grpc/provider.rs:155-170
  • crates/openshell-cli/src/run.rs:4253-4257
  • docs/sandboxes/manage-providers.mdx documents update support but not a way to clear passthrough.

@laitingsheng laitingsheng added test:e2e Requires end-to-end coverage and removed test:e2e Requires end-to-end coverage labels May 27, 2026
@github-actions
Copy link
Copy Markdown

Label test:e2e applied for ba366d7. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…quest

proto3 cannot distinguish 'absent' from 'empty list' on a repeated field,
so an empty incoming passthrough_credentials list is treated as 'preserve
existing'. A bool clear_passthrough_credentials on UpdateProviderRequest
gives callers an explicit revoke-all path. The CLI exposes it as
`openshell provider update --clear-passthrough`, mutually exclusive
with `--passthrough KEY`.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Apply review feedback on PR #1355:
- Format provider_update test call to satisfy `cargo fmt --check`.
- Update six provider_create test sites and one Provider literal to
  carry the passthrough column added by the merge from main.
- Qualify the "agent never sees real credentials" claim with a pointer
  to selective passthrough so the default placeholder path stays the
  obvious choice but the opt-in is discoverable.
- Clarify that `--clear-passthrough` only affects sandboxes launched
  (or relaunched) after the update; live agent processes keep their
  injected env until restart.
- Document the `clear_passthrough_credentials = true` update mode in
  the sandbox architecture doc.
- Replace "honour" with "apply" in the doc comment and MDX page for
  consistency with the rest of the codebase.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
})? {
Some(record) => {
hasher.update(record.id.as_bytes());
hasher.update(record.updated_at_ms.to_le_bytes());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GPT 5.5 code review:

Provider env revisions still rely on record.updated_at_ms, and the new passthrough state is not hashed directly. The sandbox only refreshes provider credentials when provider_env_revision changes (crates/openshell-sandbox/src/lib.rs (line 2335)), so two provider writes inside the same millisecond can leave a sandbox using stale provider mode or credentials. That is especially risky for passthrough revocation because the old real-secret exposure may persist until another detectable revision change.

Suggested fix: hash record.resource_version in compute_provider_env_revision; optionally also hash sorted passthrough_credentials explicitly for defense in depth.

I asked for a more in-depth explanation because I didn't follow this initially myself.

The core issue is that compute_provider_env_revision() decides whether a running sandbox should refresh provider credentials. The sandbox poll loop compares the latest provider_env_revision against its current value; if unchanged, it does nothing. So this revision has to change for every provider state change that can affect the sandbox environment.

Right now it hashes record.updated_at_ms:

hasher.update(record.id.as_bytes());
hasher.update(record.updated_at_ms.to_le_bytes());

That is fragile because updated_at_ms only has millisecond precision. If two provider updates land in the same millisecond, the persisted provider can change while the computed revision stays the same. For this PR, that matters more than usual because provider updates now control whether a credential is injected as a real secret via passthrough or as a placeholder. A missed revision can mean a sandbox keeps stale credential behavior until some later unrelated config change wakes it up.

The best fix is to hash the store’s monotonic resource_version instead of, or in addition to, updated_at_ms:

hasher.update(record.id.as_bytes());
hasher.update(record.resource_version.to_le_bytes());

I would prefer replacing updated_at_ms with resource_version if that field is guaranteed to increment on every provider write, which the SQLite persistence path appears to do. Keeping both is also fine:

hasher.update(record.id.as_bytes());
hasher.update(record.resource_version.to_le_bytes());
hasher.update(record.updated_at_ms.to_le_bytes());

For defense in depth, I’d also hash the actual fields that affect env resolution, including passthrough:

let mut passthrough_keys: Vec<_> = provider.passthrough_credentials.iter().collect();
passthrough_keys.sort();
for key in passthrough_keys {
    hasher.update(b"passthrough");
    hasher.update(key.as_bytes());
}

That makes the revision logic self-documenting: credentials, expiry, and passthrough mode all participate directly. Then add a focused test that performs two provider updates without sleeping and asserts the revision changes after each update. The existing test sleeps for 2ms, which hides exactly the race this should protect against.


impl ProviderEnvironment {
#[cfg(test)]
fn is_empty(&self) -> bool {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GPT 5.5 code review.

crates/openshell-server/src/grpc/provider.rs (line 51): CI branch checks are blocked by Clippy because ProviderEnvironment::is_empty is unused under test builds. Local reproduction: cargo clippy -p openshell-server --all-targets -- -D warnings fails with method is_empty is never used. Suggested fix: remove the helper, use it in tests, or add a narrow #[allow(dead_code)] if it is intentionally retained.

More in-depth explanation:

The PR adds test-only helper methods on ProviderEnvironment:

impl ProviderEnvironment {
    #[cfg(test)]
    fn is_empty(&self) -> bool {
        self.environment.is_empty()
    }

    #[cfg(test)]
    fn get(&self, key: &str) -> Option<&String> {
        self.environment.get(key)
    }

    #[cfg(test)]
    fn contains_key(&self, key: &str) -> bool {
        self.environment.contains_key(key)
    }
}

get and contains_key are used by tests, but is_empty is not. That produces a dead_code warning in test builds. CI runs Clippy with -D warnings, so the warning becomes a hard failure:

error: method `is_empty` is never used
  --> crates/openshell-server/src/grpc/provider.rs:51:8
   |
51 |     fn is_empty(&self) -> bool {
   |        ^^^^^^^^
   = note: `-D dead-code` implied by `-D warnings`

So this is not a sandbox escape or behavior bug. It is a branch-protection blocker: the Rust checks cannot pass until it is fixed.

The best fix is simply to remove is_empty if no test uses it:

impl ProviderEnvironment {
    #[cfg(test)]
    fn get(&self, key: &str) -> Option<&String> {
        self.environment.get(key)
    }

    #[cfg(test)]
    fn contains_key(&self, key: &str) -> bool {
        self.environment.contains_key(key)
    }
}

I would avoid adding #[allow(dead_code)] here. That would make CI green, but it preserves unused API surface for no benefit and weakens the signal of -D warnings. Using the helper in a test is also acceptable if there is a natural assertion like “detached provider env resolves empty,” but adding a test assertion solely to justify the helper is a little backwards.

Recommended review comment:

is_empty is test-only but unused, and cargo clippy -p openshell-server --all-targets -- -D warnings fails because dead_code is denied. Please remove this helper unless a test actually needs it. Prefer removing it over adding an allow, since the helper is private and has no behavioral value right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Credential placeholder model breaks SDKs that validate token format before making network calls

2 participants