
openshell-driver-substrate

Agent Substrate (gVisor + checkpoint/restore via runsc) compute driver for OpenShell.

Read first, depending on what you want:

  • docs/poc-intro.md — joint POC overview for teammates familiar with OpenShell or Substrate. Explains what this is, why OpenShell is better with Substrate, how the boot path degrades safely under gVisor, and the boundary between this crate and upstream.
  • examples/helpdesk/README.md: the 10-beat driver-driven helpdesk demo. It runs in three acts (provisioning, lifecycle, hygiene), and every CreateSandbox/ListSandboxes/DeleteSandbox call flows through openshell-gateway → openshell-driver-substrate → ate-api-server. Covers prereqs, quick-start, expected output, and troubleshooting.

Status (2026-05-24): Driver crate is now load-bearing in a real OpenShell gateway. The M3 wiring landed on dims/OpenShell@chore/gvisor-degraded-netns (commits M3.14 = 917e969 and M3.16 = 8343b8d); the helpdesk demo above exercises every CreateSandbox/ListSandboxes/DeleteSandbox through the driver against a real substrate kind cluster. Verified end-to-end on bigbox 2026-05-24 evening.

This repository depends on a small change in OpenShell that lets the supervisor tolerate the bootstrap subsystems gVisor degrades. Two alternative shapes are filed upstream; one of them will land:

  • NVIDIA/OpenShell#1548 [WIP]: OPENSHELL_BEST_EFFORT_FAILURES env-var gate (3 files, +51/-7).
  • NVIDIA/OpenShell#1549: SandboxFailureHandler trait + set_failure_handler (3 files, +71/-7). Programmatic override only: no env var, no CLI flag.
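
To make the difference between the two shapes concrete, here is a minimal sketch. It is not the code from either PR. Only the names OPENSHELL_BEST_EFFORT_FAILURES, SandboxFailureHandler, StrictHandler, and set_failure_handler come from the PR descriptions; the Verdict enum, the on_failure signature, BestEffortHandler, the Supervisor struct, and the rule that only the value "1" opts in are assumptions made for illustration.

```rust
/// What the supervisor does when a bootstrap subsystem fails (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Abort,
    Continue,
}

/// Shape 1 (#1548-style): an env-var gate. Strict by default, best-effort only
/// when the variable is set to "1" (assumed accepted value). The lookup is
/// injected so the rule can be exercised without touching the real environment.
pub fn verdict_from_env(lookup: impl Fn(&str) -> Option<String>) -> Verdict {
    match lookup("OPENSHELL_BEST_EFFORT_FAILURES").as_deref() {
        Some("1") => Verdict::Continue,
        _ => Verdict::Abort,
    }
}

/// Shape 2 (#1549-style): a trait the embedder implements. The method
/// signature here is an assumption, not the upstream one.
pub trait SandboxFailureHandler: Send + Sync {
    fn on_failure(&self, subsystem: &str, error: &str) -> Verdict;
}

/// Default handler: every failure is fatal.
pub struct StrictHandler;

impl SandboxFailureHandler for StrictHandler {
    fn on_failure(&self, _subsystem: &str, _error: &str) -> Verdict {
        Verdict::Abort
    }
}

/// Hypothetical handler a gVisor-aware embedder might install.
pub struct BestEffortHandler;

impl SandboxFailureHandler for BestEffortHandler {
    fn on_failure(&self, subsystem: &str, error: &str) -> Verdict {
        eprintln!("degraded subsystem {subsystem}: {error}");
        Verdict::Continue
    }
}

/// Hypothetical owner of the handler, with a programmatic setter only.
pub struct Supervisor {
    handler: Box<dyn SandboxFailureHandler>,
}

impl Supervisor {
    pub fn new() -> Self {
        Self { handler: Box::new(StrictHandler) }
    }

    pub fn set_failure_handler(&mut self, handler: Box<dyn SandboxFailureHandler>) {
        self.handler = handler;
    }

    pub fn report_failure(&self, subsystem: &str, error: &str) -> Verdict {
        self.handler.on_failure(subsystem, error)
    }
}
```

With the first shape a caller would pass `|key| std::env::var(key).ok()` as the lookup; with the second, the decision lives in code and nothing in the process environment can flip it.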

Cargo's openshell-core dep is pinned to the corresponding dims/OpenShell fork tip.

How to use it

The crate is a library — consumers link it from Cargo and wire it into their compute-runtime dispatcher. The canonical consumer is OpenShell's openshell-server; the wiring landed on dims/OpenShell@chore/gvisor-degraded-netns as M3.14 + M3.16. For a fresh consumer the three pieces are:

1. Cargo dep. Add to openshell-server/Cargo.toml:

openshell-driver-substrate = { path = "../openshell-driver-substrate" }

2. Dispatcher arm. SubstrateComputeDriver implements ComputeDriver directly (same WatchSandboxesStream type the gateway expects), so the constructor mirrors new_kubernetes but skips the adapter:

let driver: SharedComputeDriver =
    Arc::new(SubstrateComputeDriver::new(config));
ComputeRuntime::from_driver(driver, /* … */).await

3. Activate in gateway.toml:

[openshell.gateway]
compute_drivers = ["substrate"]

[openshell.drivers.substrate]
api_endpoint          = "api.ate-system.svc:443"
api_tls_ca_path       = "/etc/openshell-substrate/ca.crt"
api_bearer_token_path = "/etc/openshell-substrate/token"
default_namespace     = "ate-demo-helpdesk"
default_worker_pool   = "helpdesk-pool"
pause_image           = "registry.k8s.io/pause:3.10.2@sha256:…"
snapshots_location    = "gs://ate-snapshots/ate-demo-helpdesk/"
runsc_amd64_sha256    = "a397…"
runsc_amd64_url       = "gs://gvisor/releases/nightly/…/runsc"
gateway_endpoint      = ""    # empty → supervisors stay in standalone mode

With those three pieces in place, every openshell.v1.OpenShell.CreateSandbox call routes through this crate. A working sample — gateway image build, projected SA-token + CA bundle wiring, kustomize-shaped Deployment, RBAC — lives at examples/helpdesk/gateway/; the 10-beat helpdesk demo at examples/helpdesk/ drives it end-to-end.
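
As a rough illustration of how the gateway.toml selection and the dispatcher arm relate, the sketch below uses simplified stand-ins. The ComputeDriver trait here has a single hypothetical method (the real one is a gRPC service trait), SubstrateConfig carries only a few of the keys shown above, and select_driver, SupervisorMode, and the "non-empty gateway_endpoint means cluster mode" rule are assumptions inferred from the config comment, not the crate's actual API.

```rust
use std::sync::Arc;

/// A few of the `[openshell.drivers.substrate]` keys, as plain strings.
#[derive(Debug, Clone, Default)]
pub struct SubstrateConfig {
    pub api_endpoint: String,
    pub default_namespace: String,
    pub default_worker_pool: String,
    pub gateway_endpoint: String,
}

/// Hypothetical view of what `gateway_endpoint` implies for supervisors.
#[derive(Debug, PartialEq, Eq)]
pub enum SupervisorMode {
    Standalone,
    Cluster { gateway: String },
}

impl SubstrateConfig {
    /// Empty endpoint keeps supervisors standalone; anything else is
    /// assumed to mean "connect to this gateway".
    pub fn supervisor_mode(&self) -> SupervisorMode {
        let endpoint = self.gateway_endpoint.trim();
        if endpoint.is_empty() {
            SupervisorMode::Standalone
        } else {
            SupervisorMode::Cluster { gateway: endpoint.to_string() }
        }
    }
}

/// Stand-in for OpenShell's ComputeDriver trait (one illustrative method).
pub trait ComputeDriver: Send + Sync {
    fn name(&self) -> &'static str;
}

pub type SharedComputeDriver = Arc<dyn ComputeDriver>;

pub struct SubstrateComputeDriver {
    config: SubstrateConfig,
}

impl SubstrateComputeDriver {
    pub fn new(config: SubstrateConfig) -> Self {
        Self { config }
    }

    pub fn config(&self) -> &SubstrateConfig {
        &self.config
    }
}

impl ComputeDriver for SubstrateComputeDriver {
    fn name(&self) -> &'static str {
        "substrate"
    }
}

/// Hypothetical dispatcher: build the driver named first in `compute_drivers`.
pub fn select_driver(
    compute_drivers: &[&str],
    config: SubstrateConfig,
) -> Result<SharedComputeDriver, String> {
    match compute_drivers.first().copied() {
        Some("substrate") => {
            // No adapter layer: the driver is used as a trait object directly.
            let driver: SharedComputeDriver = Arc::new(SubstrateComputeDriver::new(config));
            Ok(driver)
        }
        Some(other) => Err(format!("unsupported compute driver: {other}")),
        None => Err("compute_drivers is empty".to_string()),
    }
}
```

The point of the sketch is the shape of the arm: because the driver already implements the trait the gateway expects, the arm is a plain `Arc::new` with no wrapper type in between.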

What's in the box

| Path | What |
| --- | --- |
| src/lib.rs | SubstrateComputeDriver: implements OpenShell's ComputeDriver gRPC trait against Substrate's ateapi.Control. The driver synthesizes ate.dev/v1alpha1 ActorTemplate resources and injects OPENSHELL_BEST_EFFORT_FAILURES=1 into the supervisor container's env. |
| src/template.rs | kube-rs mirror of Substrate's ActorTemplate CRD; just the fields the driver writes and waits on. |
| proto/ateapi.proto | Vendored from agent-substrate/substrate; build.rs runs tonic_build over it. |
| tests/live.rs | Four live integration tests against a real ate-api-server (#[ignore]d; gated on SUBSTRATE_LIVE_* env vars). |
| tests/integration/ | Feature-observation harness: builds the patched supervisor image, applies templates, spawns an actor, dumps [oshl-test] markers from worker pod logs. |
| tests/integration/gateway/ | §7b end-to-end harness: deploys a real openshell-gateway (with a docker:28-dind sidecar + stub supervisor_bin), mints Ed25519 JWT signing material via generate-jwt-keys.sh (private key never lands in the repo), spawns a test actor wired with OPENSHELL_ENDPOINT + OPENSHELL_SANDBOX_TOKEN + OPENSHELL_SANDBOX_ID, and runs verify-features.sh to record PASS/FAIL for each of the five gateway-driven features. |
| examples/helpdesk/ | 10-beat OpenShell-on-Substrate demo; the beats include cold ask → suspend → idle → follow-up (memory preserved) → exfil deny → pod-kill migration. Builds on tests/integration/. See examples/helpdesk/README.md. |

Build

cargo build --release

Cargo resolves openshell-core from the pinned-rev git dep on first build; subsequent builds are cached.

Unit tests (no cluster required):

cargo test --lib

Live integration tests

tests/live.rs exercises the full driver lifecycle against a running ate-api-server. The required env vars are listed at the top of tests/live.rs; the tests skip silently when any of them is missing. They are also #[ignore]d, so pass --ignored as shown:

SUBSTRATE_LIVE_API_ENDPOINT=127.0.0.1:18443 \
SUBSTRATE_LIVE_NAMESPACE=ate-openshell-m0 \
SUBSTRATE_LIVE_CA_PATH=/tmp/ate-servicedns-ca.pem \
SUBSTRATE_LIVE_BEARER_TOKEN_PATH=/tmp/ate-bearer.token \
SUBSTRATE_LIVE_TLS_SERVER_NAME=api.ate-system.svc \
SUBSTRATE_LIVE_WORKER_POOL=openshell-m0-pool \
SUBSTRATE_LIVE_SNAPSHOTS_LOCATION=gs://ate-snapshots/ate-openshell-m0/ \
SUBSTRATE_LIVE_RUNSC_AMD64_SHA=... \
SUBSTRATE_LIVE_RUNSC_AMD64_URL=gs://gvisor/releases/.../runsc \
SUBSTRATE_LIVE_PAUSE_IMAGE=registry.k8s.io/pause:3.10.2@sha256:... \
SUBSTRATE_LIVE_TEMPLATE_NAME=supervisor \
SUBSTRATE_LIVE_TEST_IMAGE=localhost:5001/oshl-feature-test@sha256:... \
  cargo test --test live -- --ignored --test-threads=1
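
The "gated on env vars, skip silently" pattern can be sketched as follows. This is an assumption-laden illustration rather than the contents of tests/live.rs: LiveEnv, live_env, and the test name are hypothetical, only three of the variables are read, and treating an empty value as missing is a choice made here.

```rust
/// Subset of the settings a live run needs (hypothetical struct).
#[derive(Debug, PartialEq, Eq)]
pub struct LiveEnv {
    pub api_endpoint: String,
    pub namespace: String,
    pub worker_pool: String,
}

/// Returns `None` as soon as any required variable is absent or empty.
/// The lookup is injected so the gate can be checked without a cluster.
pub fn live_env(lookup: impl Fn(&str) -> Option<String>) -> Option<LiveEnv> {
    let get = |key: &str| lookup(key).filter(|value| !value.is_empty());
    Some(LiveEnv {
        api_endpoint: get("SUBSTRATE_LIVE_API_ENDPOINT")?,
        namespace: get("SUBSTRATE_LIVE_NAMESPACE")?,
        worker_pool: get("SUBSTRATE_LIVE_WORKER_POOL")?,
    })
}

#[test]
#[ignore = "requires a running ate-api-server"]
fn sandbox_lifecycle_smoke() {
    // Silent skip: an incomplete environment is not a failure.
    let Some(env) = live_env(|key| std::env::var(key).ok()) else {
        return;
    };
    // A real test would dial `env.api_endpoint` here and drive the lifecycle.
    assert!(!env.api_endpoint.is_empty());
}
```

Combining `#[ignore]` with the early return gives two layers of protection: a plain `cargo test` never runs the test, and an `--ignored` run on a machine without the variables still passes without touching the network.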

Feature-observation harness

tests/integration/ builds a feature-test supervisor image, applies the templates it depends on, spawns an actor via grpcurl, and dumps the [oshl-test] markers from the worker pod's stdout for inspection.

The supervisor binary is built from the patched OpenShell source. build-image.sh resolves the source tree in this order: $OPENSHELL_REPO, a sibling ../OpenShell checkout, then a clone at the pinned commit. The resulting image bakes in OPENSHELL_BEST_EFFORT_FAILURES=1 via the Dockerfile, and the YAML templates restate it in containers[].env for visibility.
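
The resolution order reads naturally as a three-way fallback. The sketch below expresses it in Rust for consistency with the rest of the crate; build-image.sh itself is a shell script, and SourceTree, resolve_source, and the rule that an empty $OPENSHELL_REPO counts as unset are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Where the supervisor source comes from (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SourceTree {
    /// $OPENSHELL_REPO was set and wins outright.
    EnvOverride(PathBuf),
    /// A checkout next to this repository.
    Sibling(PathBuf),
    /// Nothing local: clone at the pinned commit.
    FreshClone,
}

/// Applies the documented order: env var, then sibling directory, then clone.
/// `is_dir` is injected so the order can be checked without a real filesystem.
pub fn resolve_source(
    openshell_repo: Option<&str>,
    repo_root: &Path,
    is_dir: impl Fn(&Path) -> bool,
) -> SourceTree {
    if let Some(dir) = openshell_repo.filter(|d| !d.is_empty()) {
        return SourceTree::EnvOverride(PathBuf::from(dir));
    }
    let sibling = repo_root.join("..").join("OpenShell");
    if is_dir(&sibling) {
        return SourceTree::Sibling(sibling);
    }
    SourceTree::FreshClone
}
```

A caller working against the real filesystem would pass `|p| p.is_dir()` for the last argument.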

Operator first-run:

  1. From the substrate repo (agent-substrate/substrate or a fork): KO_DOCKER_REPO=localhost:5001 ko publish ./cmd/servers/ateom-gvisor and export ATEOM_IMAGE='localhost:5001/ateom-gvisor@sha256:...'.
  2. From this repo: ./tests/integration/run.sh.

Subsequent runs need only ./tests/integration/run.sh: the ATEOM_IMAGE value is captured in the live WorkerPool spec on first apply, so it does not have to be exported again.
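
For readers who want to post-process the harness output, pulling the [oshl-test] markers out of a captured worker pod log might look like the sketch below. Only the marker string comes from the harness description; extract_markers is a hypothetical helper, and nothing is assumed about what follows the marker on each line.

```rust
/// Marker the harness looks for in worker pod stdout.
pub const MARKER: &str = "[oshl-test]";

/// Returns the text after the marker for every line that contains it,
/// trimmed of surrounding whitespace, in log order.
pub fn extract_markers(log: &str) -> Vec<&str> {
    log.lines()
        .filter_map(|line| {
            line.find(MARKER)
                .map(|idx| line[idx + MARKER.len()..].trim())
        })
        .collect()
}
```

Lines without the marker are dropped, so the result can be diffed between runs without log timestamps or unrelated supervisor output getting in the way.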

§7b gateway-integration harness

tests/integration/gateway/ stands up a real openshell-gateway Deployment alongside the worker pool and exercises the supervisor's cluster-mode features (settings poll, inference routing, log push, SSH attach via RelayStream, cross-sandbox identity guard).

# One-time, before the first run on a fresh cluster:
export ATEOM_IMAGE='localhost:5001/ateom-gvisor@sha256:...'

cd tests/integration/gateway
./run-gateway-integration.sh        # builds + deploys + spawns + captures
./verify-features.sh /tmp/oshl-v3-<TS>   # PASS/FAIL summary for F1..F5

generate-jwt-keys.sh mints (or reuses) the Ed25519 JWT signing material at $OPENSHELL_JWT_DIR (default: /tmp) and renders the gateway Secret manifest to stdout, so the private key never enters the repo. Three features (F1 settings poll, F2 inference routing, F3 log push) are verified PASS end-to-end. F4 (SSH attach) and F5 (cross-sandbox IDOR) are deferred: the template wiring exists, but verification needs an external SSH driver / per-actor JWTs. See ~/notes/openshell-on-substrate/2026-05-23-openshell-features-findings.md §7b verification for the full results and the sharp-edges register (SE-8..SE-13).

Companion changes upstream

| PR | Effect |
| --- | --- |
| NVIDIA/OpenShell#1548 | [WIP] OPENSHELL_BEST_EFFORT_FAILURES env-var gate. 3 files, +51/-7. Default strict; opt-in via the env var. Alternative shape; one of #1548 / #1549 will land. |
| NVIDIA/OpenShell#1549 | SandboxFailureHandler trait + StrictHandler default + set_failure_handler setter. 3 files, +71/-7. Programmatic override only: no env var, no CLI flag, no main.rs changes. Alternative shape; one of #1548 / #1549 will land. |
| agent-substrate/substrate#66 | ateom-gvisor eth0 move/restore idempotency + deferred rollback. Without it, the test harness alternates between green and red runs. |
| agent-substrate/substrate#67 | install-ate-kind.sh builds + pushes ateom-gvisor automatically, so a WorkerPool is usable out of --deploy-ate-system. Closes the manual ko publish operator step. |
| agent-substrate/substrate#73 | Per-container securityContext on ActorTemplate.spec.containers[]: capabilities.add + runAsUser / runAsGroup. Empty templates produce the same OCI bundle as before. Unblocks the driver's synthesize_template from emitting capability adds + a non-root supervisor start UID once it merges. |
| agent-substrate/substrate#75 | ateapi/syncer: release actor when host pod is deleted. WorkerPoolSyncer's pod-delete hook resets the bound actor to STATUS_SUSPENDED so the next request migrates it onto a free worker, instead of stranding it pointing at a dead pod. Beat 9 of the helpdesk demo (pod-kill migration with multi-tenant proof) depends on it; verified end-to-end on bigbox 2026-05-24. |
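
The behaviour described for substrate#75 can be pictured with a small state table. Substrate itself is a separate project and this is not its syncer code: ActorStatus, ActorTable, and on_pod_deleted are hypothetical names, and the only facts taken from the PR description are that a pod-delete hook resets actors bound to the deleted pod to a suspended status so a later request can place them elsewhere.

```rust
use std::collections::HashMap;

/// Simplified actor state (stand-in for Substrate's status enum).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ActorStatus {
    /// Bound to a worker pod.
    Running { pod: String },
    /// Not bound anywhere; the next request may place it on a free worker.
    Suspended,
}

/// Hypothetical in-memory view of actor bindings.
#[derive(Default)]
pub struct ActorTable {
    actors: HashMap<String, ActorStatus>,
}

impl ActorTable {
    pub fn bind(&mut self, actor: &str, pod: &str) {
        self.actors
            .insert(actor.to_string(), ActorStatus::Running { pod: pod.to_string() });
    }

    pub fn status(&self, actor: &str) -> Option<&ActorStatus> {
        self.actors.get(actor)
    }

    /// Pod-delete hook: every actor bound to `pod` is reset to Suspended
    /// instead of being left pointing at a dead pod. Returns the released
    /// actor names, sorted for stable output.
    pub fn on_pod_deleted(&mut self, pod: &str) -> Vec<String> {
        let mut released = Vec::new();
        for (name, status) in self.actors.iter_mut() {
            let on_deleted_pod = matches!(
                &*status,
                ActorStatus::Running { pod: bound } if bound.as_str() == pod
            );
            if on_deleted_pod {
                *status = ActorStatus::Suspended;
                released.push(name.clone());
            }
        }
        released.sort();
        released
    }
}
```

Actors on other pods are untouched, which is the property the multi-tenant part of the pod-kill beat relies on.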

About

OpenShell ComputeDriver targeting Agent Substrate (gVisor + checkpoint/restore via runsc) plus the substrate-aware supervisor wrapper binary
