You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Agent Substrate decouples the high-frequency lifecycle of Actors (sandboxed agent workloads) from Workers (physical Kubernetes Pods). To meet our North Star target of sub-100ms workload activation (wakeup) latency at a scale of 1 billion registered actors and 1,000+ concurrent wakeups/second, Substrate bypasses the standard Kubernetes API server for real-time scheduling and state routing, offloading dynamic states to a dedicated high-performance database.
This RFC aims to evaluate the trade-offs of our current database choice (Redis/Valkey) against alternatives like Aerospike or traditional consensus-backed relational databases (e.g. Cloud Spanner, CockroachDB), summarizing local benchmark results, production operations concerns, and data structure redesign requirements.
2. Architectural Dimensions & Trade-offs
To choose the right storage engine for Substrate's dynamic instance state, we evaluate candidates across five key dimensions:
Comparative Database Matrix
Dimension
Redis (Valkey)
Aerospike (Community)
Cloud Bigtable (NoSQL)
Cloud Spanner / CockroachDB
TBD (e.g. DynamoDB / ScyllaDB)
Primary Storage Model
100% In-Memory (RAM)
Hybrid (Index in RAM, Data on SSD)
Wide-column key-value (Colossus SSD)
Distributed transactional disk/file
Your proposal here
Write Latency
Ultra-low (sub-100µs)
Very low (sub-300µs)
Low (2ms – 5ms)
High (5ms – 15ms due to Paxos sync)
TBD
Consensus & Replication
Async or Sync via WAIT
Paxos-like mesh replication
High-speed master-replica replication
Paxos/Raft consensus (Strict serializability)
TBD
Node Consistency
Eventually consistent (lag risk)
Strong consistency per record
Strongly consistent per row (local)
External consistency (immediate)
TBD
License Type
BSD-3 (Valkey) / SSPL (Redis)
AGPL-3.0 / Commercial
Proprietary (Google Cloud Managed)
Proprietary / Business Source License (BSL)
TBD (prefer Open Source)
Managed Service Availability
High (Memorystore, ElastiCache)
Limited (Aerospike Cloud)
High (Fully managed by GCP)
High (Fully managed serverless)
TBD
Deep Dive: Production Concerns by Database
A. Redis / Valkey
Data Durability Concerns: By default, Redis uses asynchronous replication. If a master node fails, writes in the replication buffer are lost. While appendfsync always provides complete durability, it historically degrades throughput on standard cloud storage.
Operational Mitigation: Implementing the WAIT command allows synchronous replication (blocking writes until committed to $N$ replicas) without sacrificing in-memory speed.
Memory Cost at Scale: At 1KB per actor record, storing 1 billion total registered actors requires 1 Terabyte of RAM, costing thousands of dollars per month in idle memory.
License Considerations: Redis recently migrated to a restrictive SSPL/RSALv2 license. Valkey (a Linux Foundation-backed BSD-3 fork) is the approved, open-source drop-in replacement.
B. Aerospike
Cost Efficiency at Scale: Unlike Redis, Aerospike stores the raw data records on raw block SSDs/NVMe, keeping only the keys/indexes in memory. This reduces the RAM footprint by up to 90%, making 1 billion actors highly cost-efficient.
Consensus Model: Uses Paxos-like mesh clustering. It supports highly reliable, durable single-record writes with built-in generation checks.
Community Restrictions: Community Edition (AGPL-3.0) lacks advanced features like multi-site cross-datacenter replication (XDR) and dynamic partition rebalancing under severe cluster splits, which are gated behind expensive enterprise commercial licenses.
C. Cloud Spanner / CockroachDB
Write Latency Penalty: To guarantee absolute transaction durability, these databases require network roundtrips for consensus (Raft/Paxos) and fsync to disk. This increases write latencies to 5ms – 15ms, making them completely incompatible with a 100ms wakeup budget if placed in the critical path.
The Hybrid Cache Model: In this model, Spanner acts as the durable, cold source of truth for 1 billion actors, while Redis/Valkey is used strictly as a transient, ephemeral session routing cache for currently active actors.
D. Cloud Bigtable (GCP Managed NoSQL)
Write Latency & Throughput: Bigtable provides excellent write throughput and low latency (2ms – 5ms for writes under steady-state), which is significantly faster than Spanner because it bypasses multi-zone Paxos synchronous roundtrips on local writes.
The Lock Challenge: Bigtable does not support multi-row ACID transactions or native atomic TTL/lease deletions (Bigtable garbage collection and TTL cells are eventually consistent and can take hours to delete). Implementing Substrate's lease-lock contract (AcquireLock/ReleaseLock) directly on Bigtable is very difficult and would require an external coordinator like etcd or Redis.
Scheduling / Range Scans: Bigtable does not support $O(1)$ random member pops natively. To select an idle worker, we would need a row key indexing strategy (e.g., worker:<pool>#STATUS_IDLE#<pod_name>) to perform fast prefix range scans, avoiding full-table scans.
Operational & License Concerns: Fully managed by Google Cloud, eliminating operational overhead. However, it is proprietary and has a minimum cost profile (minimum 1-node cluster cost) which can be high for small local developer setups.
E. TBD / Other Backends (e.g., AWS DynamoDB, ScyllaDB, Apache Cassandra)
Encouraging Contributions: We strongly welcome community proposals and pull requests to implement additional backend stores that satisfy Substrate's store.Interface contract!
Design Considerations: If you are interested in implementing a client for AWS DynamoDB, ScyllaDB, or Cassandra, please use this RFC issue thread to discuss key schema patterns, distributed locking strategies, and how you plan to avoid $O(N)$ worker range scan overhead.
3. Local Benchmarks Performed
We implemented a comparative, unified benchmark suite (store_benchmark_test.go) to test Valkey/Redis against Aerospike on a high-performance local workstation.
Benchmark Configuration:
Valkey/Redis: Standalone on localhost:6379, tested with everysec (default) and always (sync fsync) AOF policies.
Aerospike: Community Edition 8.1.2 on localhost:3000, using optimized PutBins APIs and generation concurrency.
Hardware: Intel(R) Xeon(R) CPU @ 2.20GHz, backed by high-speed local SSD storage.
Note
Important Networking Caveat: These benchmarks were performed locally using virtual interfaces (vNIC) on Docker containers on a single host workstation. Network hop latency was negligible (<10µs). In a distributed production GKE cluster where client pods and database pods reside on different physical nodes, each query roundtrip will incur real cross-node network hop latency, which typically adds ~300µs – 500µs to every database request. This overhead must be factored into production latency projections.
Local SSD/NVMe Performance: Enabling appendfsync always on Redis only increased latency by ~9% (shifting from 79.8µs to 87.0µs). On ultra-fast storage backings (like GCP Local SSDs), Redis can achieve RPO = 0 durability with virtually zero latency penalty.
Aerospike Performance: At 252 microseconds per write, Aerospike is ~3.1x slower than Redis but remains exceptionally fast, consuming only 27 CPU allocations and operating well within our sub-millisecond budget.
4. Core Data Structure & Scheduling Requirements
Regardless of the chosen storage engine, Substrate's control plane must refactor two legacy database patterns to support scaling:
A. Eliminating the $O(N)$ Scheduling Bottleneck
Currently, when an actor is resumed, the AssignWorkerStep calls ListWorkers (running a full-table keyspace scan over all master nodes), filters for idle ones, and shuffles them in memory. This scale-blocking bottleneck must be replaced by an $O(1)$Idle Worker Queue/Set:
graph LR
subgraph Idle Worker Pool
Set["pool:{default:pool-alpha}:idle_workers"]
Set -->|1. SPOP| Worker["pod-abc"]
end
ControlPlane -->|Claim Worker| Set
ControlPlane -->|2. Update Worker State| DB_Worker["worker:default:pool-alpha:pod-abc"]
Loading
B. Slot-Aware Sharding (Multi-Key Transactions)
Redis Cluster restricts transactions (WATCH/MULTI) to keys that hash to the exact same cluster slot.
To atomically assign a worker to an actor without key collision errors:
We must co-locate keys on the same shard using Redis Hash Tags (e.g. {actor:<actor-id>}).
Example keys:
Actor: actor:{actor-123}
Worker lease: {actor:actor-123}:worker
5. Discussion Points for the Community
We would love feedback from the maintainers and community on these directions:
Should we officially support a Hybrid Cache architecture?
Using Cloud Spanner/Postgres as the persistent registry for 1 billion actors, while keeping Valkey/Redis solely as a transient, fast in-memory session routing table for currently active actors.
Does the community favor open-source Valkey (BSD-3) or the commercial/AGPL Aerospike Community Edition?
Considering the operational overhead and licensing constraints of both.
Should we proceed with implementing the $O(1)$ Idle Worker Set scheduling pull request?
We have prototyped this locally and can submit it to clean up the $O(N)$ scan bottleneck in workflow_resume.go.
1. Objective & Context
Agent Substrate decouples the high-frequency lifecycle of Actors (sandboxed agent workloads) from Workers (physical Kubernetes Pods). To meet our North Star target of sub-100ms workload activation (wakeup) latency at a scale of 1 billion registered actors and 1,000+ concurrent wakeups/second, Substrate bypasses the standard Kubernetes API server for real-time scheduling and state routing, offloading dynamic states to a dedicated high-performance database.
This RFC aims to evaluate the trade-offs of our current database choice (Redis/Valkey) against alternatives like Aerospike or traditional consensus-backed relational databases (e.g. Cloud Spanner, CockroachDB), summarizing local benchmark results, production operations concerns, and data structure redesign requirements.
2. Architectural Dimensions & Trade-offs
To choose the right storage engine for Substrate's dynamic instance state, we evaluate candidates across five key dimensions:
Comparative Database Matrix
WAITDeep Dive: Production Concerns by Database
A. Redis / Valkey
appendfsync alwaysprovides complete durability, it historically degrades throughput on standard cloud storage.WAITcommand allows synchronous replication (blocking writes until committed toB. Aerospike
C. Cloud Spanner / CockroachDB
D. Cloud Bigtable (GCP Managed NoSQL)
AcquireLock/ReleaseLock) directly on Bigtable is very difficult and would require an external coordinator like etcd or Redis.worker:<pool>#STATUS_IDLE#<pod_name>) to perform fast prefix range scans, avoiding full-table scans.E. TBD / Other Backends (e.g., AWS DynamoDB, ScyllaDB, Apache Cassandra)
store.Interfacecontract!3. Local Benchmarks Performed
We implemented a comparative, unified benchmark suite (
store_benchmark_test.go) to test Valkey/Redis against Aerospike on a high-performance local workstation.Benchmark Configuration:
localhost:6379, tested witheverysec(default) andalways(sync fsync) AOF policies.localhost:3000, using optimizedPutBinsAPIs and generation concurrency.Note
Important Networking Caveat: These benchmarks were performed locally using virtual interfaces (vNIC) on Docker containers on a single host workstation. Network hop latency was negligible (<10µs). In a distributed production GKE cluster where client pods and database pods reside on different physical nodes, each query roundtrip will incur real cross-node network hop latency, which typically adds ~300µs – 500µs to every database request. This overhead must be factored into production latency projections.
Performance Comparison (Actor Creation / Writes):
gantt title Write Operation Latency (ns) dateFormat X axisFormat %s section Latencies Redis (everysec) :active, 0, 79814 Redis (always fsync) :crit, 0, 87054 Aerospike (durable NVMe) : 0, 252157Crucial Benchmark Insights:
appendfsync alwayson Redis only increased latency by ~9% (shifting from79.8µsto87.0µs). On ultra-fast storage backings (like GCP Local SSDs), Redis can achieve RPO = 0 durability with virtually zero latency penalty.4. Core Data Structure & Scheduling Requirements
Regardless of the chosen storage engine, Substrate's control plane must refactor two legacy database patterns to support scaling:
A. Eliminating the$O(N)$ Scheduling Bottleneck
Currently, when an actor is resumed, the$O(1)$ Idle Worker Queue/Set:
AssignWorkerStepcallsListWorkers(running a full-table keyspace scan over all master nodes), filters for idle ones, and shuffles them in memory. This scale-blocking bottleneck must be replaced by angraph LR subgraph Idle Worker Pool Set["pool:{default:pool-alpha}:idle_workers"] Set -->|1. SPOP| Worker["pod-abc"] end ControlPlane -->|Claim Worker| Set ControlPlane -->|2. Update Worker State| DB_Worker["worker:default:pool-alpha:pod-abc"]B. Slot-Aware Sharding (Multi-Key Transactions)
Redis Cluster restricts transactions (
WATCH/MULTI) to keys that hash to the exact same cluster slot.To atomically assign a worker to an actor without key collision errors:
{actor:<actor-id>}).actor:{actor-123}{actor:actor-123}:worker5. Discussion Points for the Community
We would love feedback from the maintainers and community on these directions:
Using Cloud Spanner/Postgres as the persistent registry for 1 billion actors, while keeping Valkey/Redis solely as a transient, fast in-memory session routing table for currently active actors.
Considering the operational overhead and licensing constraints of both.
We have prototyped this locally and can submit it to clean up the
workflow_resume.go.