Skip to content

[RFC][Architecture] Control Plane Persistence Layer: Scalability, Durability, and Latency Evaluation #12

@MushuEE

Description

1. Objective & Context

Agent Substrate decouples the high-frequency lifecycle of Actors (sandboxed agent workloads) from Workers (physical Kubernetes Pods). To meet our North Star target of sub-100ms workload activation (wakeup) latency at a scale of 1 billion registered actors and 1,000+ concurrent wakeups/second, Substrate bypasses the standard Kubernetes API server for real-time scheduling and state routing, offloading dynamic states to a dedicated high-performance database.

This RFC aims to evaluate the trade-offs of our current database choice (Redis/Valkey) against alternatives like Aerospike or traditional consensus-backed relational databases (e.g. Cloud Spanner, CockroachDB), summarizing local benchmark results, production operations concerns, and data structure redesign requirements.


2. Architectural Dimensions & Trade-offs

To choose the right storage engine for Substrate's dynamic instance state, we evaluate candidates across five key dimensions:

Comparative Database Matrix

Dimension Redis (Valkey) Aerospike (Community) Cloud Bigtable (NoSQL) Cloud Spanner / CockroachDB TBD (e.g. DynamoDB / ScyllaDB)
Primary Storage Model 100% In-Memory (RAM) Hybrid (Index in RAM, Data on SSD) Wide-column key-value (Colossus SSD) Distributed transactional disk/file Your proposal here
Write Latency Ultra-low (sub-100µs) Very low (sub-300µs) Low (2ms – 5ms) High (5ms – 15ms due to Paxos sync) TBD
Consensus & Replication Async or Sync via WAIT Paxos-like mesh replication High-speed master-replica replication Paxos/Raft consensus (Strict serializability) TBD
Node Consistency Eventually consistent (lag risk) Strong consistency per record Strongly consistent per row (local) External consistency (immediate) TBD
License Type BSD-3 (Valkey) / SSPL (Redis) AGPL-3.0 / Commercial Proprietary (Google Cloud Managed) Proprietary / Business Source License (BSL) TBD (prefer Open Source)
Managed Service Availability High (Memorystore, ElastiCache) Limited (Aerospike Cloud) High (Fully managed by GCP) High (Fully managed serverless) TBD

Deep Dive: Production Concerns by Database

A. Redis / Valkey

  • Data Durability Concerns: By default, Redis uses asynchronous replication. If a master node fails, writes in the replication buffer are lost. While appendfsync always provides complete durability, it historically degrades throughput on standard cloud storage.
  • Operational Mitigation: Implementing the WAIT command allows synchronous replication (blocking writes until committed to $N$ replicas) without sacrificing in-memory speed.
  • Memory Cost at Scale: At 1KB per actor record, storing 1 billion total registered actors requires 1 Terabyte of RAM, costing thousands of dollars per month in idle memory.
  • License Considerations: Redis recently migrated to a restrictive SSPL/RSALv2 license. Valkey (a Linux Foundation-backed BSD-3 fork) is the approved, open-source drop-in replacement.

B. Aerospike

  • Cost Efficiency at Scale: Unlike Redis, Aerospike stores the raw data records on raw block SSDs/NVMe, keeping only the keys/indexes in memory. This reduces the RAM footprint by up to 90%, making 1 billion actors highly cost-efficient.
  • Consensus Model: Uses Paxos-like mesh clustering. It supports highly reliable, durable single-record writes with built-in generation checks.
  • Community Restrictions: Community Edition (AGPL-3.0) lacks advanced features like multi-site cross-datacenter replication (XDR) and dynamic partition rebalancing under severe cluster splits, which are gated behind expensive enterprise commercial licenses.

C. Cloud Spanner / CockroachDB

  • Write Latency Penalty: To guarantee absolute transaction durability, these databases require network roundtrips for consensus (Raft/Paxos) and fsync to disk. This increases write latencies to 5ms – 15ms, making them completely incompatible with a 100ms wakeup budget if placed in the critical path.
  • The Hybrid Cache Model: In this model, Spanner acts as the durable, cold source of truth for 1 billion actors, while Redis/Valkey is used strictly as a transient, ephemeral session routing cache for currently active actors.

D. Cloud Bigtable (GCP Managed NoSQL)

  • Write Latency & Throughput: Bigtable provides excellent write throughput and low latency (2ms – 5ms for writes under steady-state), which is significantly faster than Spanner because it bypasses multi-zone Paxos synchronous roundtrips on local writes.
  • The Lock Challenge: Bigtable does not support multi-row ACID transactions or native atomic TTL/lease deletions (Bigtable garbage collection and TTL cells are eventually consistent and can take hours to delete). Implementing Substrate's lease-lock contract (AcquireLock/ReleaseLock) directly on Bigtable is very difficult and would require an external coordinator like etcd or Redis.
  • Scheduling / Range Scans: Bigtable does not support $O(1)$ random member pops natively. To select an idle worker, we would need a row key indexing strategy (e.g., worker:<pool>#STATUS_IDLE#<pod_name>) to perform fast prefix range scans, avoiding full-table scans.
  • Operational & License Concerns: Fully managed by Google Cloud, eliminating operational overhead. However, it is proprietary and has a minimum cost profile (minimum 1-node cluster cost) which can be high for small local developer setups.

E. TBD / Other Backends (e.g., AWS DynamoDB, ScyllaDB, Apache Cassandra)

  • Encouraging Contributions: We strongly welcome community proposals and pull requests to implement additional backend stores that satisfy Substrate's store.Interface contract!
  • Design Considerations: If you are interested in implementing a client for AWS DynamoDB, ScyllaDB, or Cassandra, please use this RFC issue thread to discuss key schema patterns, distributed locking strategies, and how you plan to avoid $O(N)$ worker range scan overhead.

3. Local Benchmarks Performed

We implemented a comparative, unified benchmark suite (store_benchmark_test.go) to test Valkey/Redis against Aerospike on a high-performance local workstation.

Benchmark Configuration:

  • Valkey/Redis: Standalone on localhost:6379, tested with everysec (default) and always (sync fsync) AOF policies.
  • Aerospike: Community Edition 8.1.2 on localhost:3000, using optimized PutBins APIs and generation concurrency.
  • Hardware: Intel(R) Xeon(R) CPU @ 2.20GHz, backed by high-speed local SSD storage.

Note

Important Networking Caveat: These benchmarks were performed locally using virtual interfaces (vNIC) on Docker containers on a single host workstation. Network hop latency was negligible (<10µs). In a distributed production GKE cluster where client pods and database pods reside on different physical nodes, each query roundtrip will incur real cross-node network hop latency, which typically adds ~300µs – 500µs to every database request. This overhead must be factored into production latency projections.

Performance Comparison (Actor Creation / Writes):

BenchmarkCreateActor_Redis-24                  14149   79,814 ns/op   1,518 B/op   25 allocs/op
BenchmarkCreateActor_Redis_AlwaysFsync-24      13494   87,054 ns/op   1,388 B/op   24 allocs/op
BenchmarkCreateActor_Aerospike-24               4785  252,157 ns/op   1,867 B/op   27 allocs/op
gantt
    title Write Operation Latency (ns)
    dateFormat  X
    axisFormat %s
    section Latencies
    Redis (everysec)  :active, 0, 79814
    Redis (always fsync) :crit, 0, 87054
    Aerospike (durable NVMe) : 0, 252157
Loading

Crucial Benchmark Insights:

  1. Local SSD/NVMe Performance: Enabling appendfsync always on Redis only increased latency by ~9% (shifting from 79.8µs to 87.0µs). On ultra-fast storage backings (like GCP Local SSDs), Redis can achieve RPO = 0 durability with virtually zero latency penalty.
  2. Aerospike Performance: At 252 microseconds per write, Aerospike is ~3.1x slower than Redis but remains exceptionally fast, consuming only 27 CPU allocations and operating well within our sub-millisecond budget.

4. Core Data Structure & Scheduling Requirements

Regardless of the chosen storage engine, Substrate's control plane must refactor two legacy database patterns to support scaling:

A. Eliminating the $O(N)$ Scheduling Bottleneck

Currently, when an actor is resumed, the AssignWorkerStep calls ListWorkers (running a full-table keyspace scan over all master nodes), filters for idle ones, and shuffles them in memory. This scale-blocking bottleneck must be replaced by an $O(1)$ Idle Worker Queue/Set:

graph LR
    subgraph Idle Worker Pool
        Set["pool:{default:pool-alpha}:idle_workers"]
        Set -->|1. SPOP| Worker["pod-abc"]
    end
    ControlPlane -->|Claim Worker| Set
    ControlPlane -->|2. Update Worker State| DB_Worker["worker:default:pool-alpha:pod-abc"]
Loading

B. Slot-Aware Sharding (Multi-Key Transactions)

Redis Cluster restricts transactions (WATCH/MULTI) to keys that hash to the exact same cluster slot.
To atomically assign a worker to an actor without key collision errors:

  • We must co-locate keys on the same shard using Redis Hash Tags (e.g. {actor:<actor-id>}).
  • Example keys:
    • Actor: actor:{actor-123}
    • Worker lease: {actor:actor-123}:worker

5. Discussion Points for the Community

We would love feedback from the maintainers and community on these directions:

  1. Should we officially support a Hybrid Cache architecture?
    Using Cloud Spanner/Postgres as the persistent registry for 1 billion actors, while keeping Valkey/Redis solely as a transient, fast in-memory session routing table for currently active actors.
  2. Does the community favor open-source Valkey (BSD-3) or the commercial/AGPL Aerospike Community Edition?
    Considering the operational overhead and licensing constraints of both.
  3. Should we proceed with implementing the $O(1)$ Idle Worker Set scheduling pull request?
    We have prototyped this locally and can submit it to clean up the $O(N)$ scan bottleneck in workflow_resume.go.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions