[RFC][Architecture] Control Plane Persistence Layer: Scalability, Durability, and Latency Evaluation

## 1. Objective & Context

Agent Substrate decouples the high-frequency lifecycle of **Actors** (sandboxed agent workloads) from **Workers** (physical Kubernetes Pods). To meet our North Star target of **sub-100ms workload activation (wakeup) latency** at a scale of **1 billion registered actors** and **1,000+ concurrent wakeups/second**, Substrate bypasses the standard Kubernetes API server for real-time scheduling and state routing and offloads dynamic state to a dedicated high-performance database.

This RFC evaluates the trade-offs of our current database choice (Redis/Valkey) against alternatives such as Aerospike, Cloud Bigtable, and consensus-backed relational databases (e.g. Cloud Spanner, CockroachDB). It summarizes local benchmark results, production operations concerns, and data structure redesign requirements.

---

## 2. Architectural Dimensions & Trade-offs

To choose the right storage engine for Substrate's dynamic instance state, we evaluate candidates across six key dimensions:

### Comparative Database Matrix

| Dimension | Redis (Valkey) | Aerospike (Community) | Cloud Bigtable (NoSQL) | Cloud Spanner / CockroachDB | TBD (e.g. DynamoDB / ScyllaDB) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Primary Storage Model** | 100% In-Memory (RAM) | Hybrid (Index in RAM, Data on SSD) | Wide-column key-value (Colossus SSD) | Distributed transactional disk/file | *Your proposal here* |
| **Write Latency** | Ultra-low (sub-100µs) | Very low (sub-300µs) | Low (2ms – 5ms) | High (5ms – 15ms due to Paxos sync) | *TBD* |
| **Consensus & Replication** | Async by default; `WAIT` blocks for replica acks | Paxos-like mesh replication | Asynchronous, eventually consistent replication between clusters | Paxos/Raft consensus (Strict serializability) | *TBD* |
| **Node Consistency** | Eventually consistent (lag risk) | Strong consistency per record | Strongly consistent per row (local) | External consistency (immediate) | *TBD* |
| **License Type** | BSD-3 (Valkey) / RSALv2 + SSPLv1 (Redis) | AGPL-3.0 / Commercial | Proprietary (Google Cloud Managed) | Proprietary / Business Source License (BSL) | *TBD (prefer Open Source)* |
| **Managed Service Availability** | High (Memorystore, ElastiCache) | Limited (Aerospike Cloud) | High (Fully managed by GCP) | High (Fully managed serverless) | *TBD* |

---

### Deep Dive: Production Concerns by Database

### A. Redis / Valkey
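* **Data Durability Concerns**: By default, Redis replicates asynchronously. If a master node fails, writes it acknowledged but had not yet propagated to a replica are lost. `appendfsync always` fsyncs the append-only file before acknowledging each write, which protects against process and host crashes, but it has historically degraded throughput on standard network-attached cloud storage.
* **Operational Mitigation**: Issuing the **`WAIT` command** after a write blocks the client until $N$ replicas have acknowledged it (or a timeout expires). This narrows the loss window, but it adds a replica round trip to each guarded write and does not make Redis strongly consistent: an acknowledged write can still be lost in some failover scenarios.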
* **Memory Cost at Scale**: At 1KB per actor record, storing 1 billion registered actors requires roughly **1 Terabyte of RAM** for a single copy, before replicas and per-key overhead. Most of that memory holds idle actors, yet it costs thousands of dollars per month.
* **License Considerations**: Redis moved from BSD to a dual RSALv2/SSPLv1 license in March 2024 (Redis 8 later added an AGPLv3 option). **Valkey** (a Linux Foundation-backed BSD-3 fork) is the approved, open-source drop-in replacement.
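
The memory figure above can be reproduced with a short calculation. The following is a minimal sketch: the function name `estimateRAMBytes`, the per-key overhead parameter, and the replica count are illustrative assumptions, not measured values from Substrate.

```go
package main

import "fmt"

// estimateRAMBytes returns the RAM needed to keep every actor record in memory.
// overheadBytes (per-key dictionary/allocator overhead) and copies (master plus
// replicas) are illustrative assumptions, not measurements.
func estimateRAMBytes(actors, recordBytes, overheadBytes, copies int64) int64 {
	return actors * (recordBytes + overheadBytes) * copies
}

func main() {
	const actors = 1_000_000_000

	// The RFC's figure: 1 KB per record, no overhead, a single copy.
	single := estimateRAMBytes(actors, 1000, 0, 1)
	fmt.Printf("single copy:      %.2f TB\n", float64(single)/1e12)

	// Hypothetical: one replica per master doubles the footprint.
	replicated := estimateRAMBytes(actors, 1000, 0, 2)
	fmt.Printf("with one replica: %.2f TB\n", float64(replicated)/1e12)
}
```

Any replication needed for durability multiplies the baseline, so the 1 TB figure is best read as a lower bound.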

### B. Aerospike
* **Cost Efficiency at Scale**: Unlike Redis, Aerospike stores records directly on raw block devices (SSD/NVMe) and keeps only the primary index in memory. This reduces the RAM footprint by up to **90%**, making 1 billion actors far more cost-efficient.
* **Consensus Model**: Uses Paxos-like mesh clustering. It supports highly reliable, durable single-record writes with built-in generation checks.
* **Community Restrictions**: Community Edition (AGPL-3.0) lacks advanced features like multi-site cross-datacenter replication (XDR) and dynamic partition rebalancing under severe cluster splits, which are gated behind expensive enterprise commercial licenses.
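
To illustrate what a generation check buys us, here is a minimal in-memory sketch of generation-based optimistic concurrency. It does not use the Aerospike client; `genStore`, `PutIfGeneration`, and `ErrGenerationMismatch` are hypothetical names that only model the semantics: a write succeeds only if the record's generation is unchanged since it was read.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrGenerationMismatch signals that the record changed since it was read.
var ErrGenerationMismatch = errors.New("generation mismatch")

type record struct {
	value      string
	generation uint32
}

// genStore is a hypothetical in-memory stand-in for a store that keeps a
// per-record generation counter.
type genStore struct {
	mu   sync.Mutex
	data map[string]record
}

func newGenStore() *genStore { return &genStore{data: make(map[string]record)} }

// Get returns the value and the generation it was read at (0 if absent).
func (s *genStore) Get(key string) (string, uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r := s.data[key]
	return r.value, r.generation
}

// PutIfGeneration writes only if the stored generation still equals expected,
// then bumps the generation so concurrent stale writers are rejected.
func (s *genStore) PutIfGeneration(key, value string, expected uint32) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur := s.data[key]
	if cur.generation != expected {
		return ErrGenerationMismatch
	}
	s.data[key] = record{value: value, generation: cur.generation + 1}
	return nil
}

func main() {
	s := newGenStore()
	_, gen := s.Get("actor-123")

	// Two control-plane replicas race to assign a worker using the same read.
	fmt.Println(s.PutIfGeneration("actor-123", "worker-a", gen)) // first writer wins
	fmt.Println(s.PutIfGeneration("actor-123", "worker-b", gen)) // stale write is rejected
}
```

The losing writer re-reads the record and retries, which gives single-record compare-and-set without a separate lock.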

### C. Cloud Spanner / CockroachDB
* **Write Latency Penalty**: To guarantee transaction durability, these databases require network round trips for consensus (Raft/Paxos) plus an fsync to disk. This raises write latencies to **5ms – 15ms**. A single write fits inside a **100ms wakeup budget**, but a wakeup that issues several sequential writes on the critical path could spend a large share of that budget on persistence alone, leaving little headroom for scheduling and workload start.
* **The Hybrid Cache Model**: In this model, Spanner acts as the durable, cold source of truth for 1 billion actors, while Redis/Valkey is used strictly as a transient, ephemeral session routing cache for currently active actors.

### D. Cloud Bigtable (GCP Managed NoSQL)
* **Write Latency & Throughput**: Bigtable provides excellent write throughput and low latency (**2ms – 5ms** for writes under steady-state), which is significantly faster than Spanner because it bypasses multi-zone Paxos synchronous roundtrips on local writes.
* **The Lock Challenge**: Bigtable **does not** support multi-row ACID transactions or native atomic TTL/lease deletions (garbage collection of expired cells is eventually consistent, and expired data can persist long after its TTL before it is actually removed). Implementing Substrate's lease-lock contract (`AcquireLock`/`ReleaseLock`) directly on Bigtable is therefore difficult: atomicity is limited to a single row, and lease expiry would have to be enforced by the client with explicit timestamps or delegated to an external coordinator like etcd or Redis.
* **Scheduling / Range Scans**: Bigtable does not support $O(1)$ random member pops natively. To select an idle worker, we would need a row key indexing strategy (e.g., `worker:<pool>#STATUS_IDLE#<pod_name>`) to perform fast prefix range scans, avoiding full-table scans.
* **Operational & License Concerns**: Fully managed by Google Cloud, eliminating operational overhead. However, it is proprietary and has a minimum cost profile (minimum 1-node cluster cost) which can be high for small local developer setups.
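
The row key strategy above can be sketched without the Bigtable client. The helpers below build keys in the `worker:<pool>#STATUS_IDLE#<pod_name>` pattern and compute the half-open range `[prefix, end)` that a prefix scan would cover. The function names are hypothetical, and the sketch assumes pool names never contain `#`.

```go
package main

import "fmt"

// idleWorkerKey builds a row key following the pattern from the text:
// worker:<pool>#STATUS_IDLE#<pod_name>.
func idleWorkerKey(pool, pod string) string {
	return "worker:" + pool + "#STATUS_IDLE#" + pod
}

// idlePrefix is the shared prefix of all idle-worker rows in a pool.
func idlePrefix(pool string) string {
	return "worker:" + pool + "#STATUS_IDLE#"
}

// prefixEnd returns the smallest key greater than every key that starts with
// prefix, so [prefix, prefixEnd) is the scan range. An empty result means the
// range has no upper bound.
func prefixEnd(prefix string) string {
	b := []byte(prefix)
	for i := len(b) - 1; i >= 0; i-- {
		if b[i] < 0xff {
			b[i]++
			return string(b[:i+1])
		}
	}
	return ""
}

func main() {
	start := idlePrefix("pool-alpha")
	end := prefixEnd(start)
	key := idleWorkerKey("pool-alpha", "pod-abc")

	fmt.Printf("scan range: [%q, %q)\n", start, end)
	fmt.Println("idle key in range:", key >= start && key < end)
}
```

Because the status is part of the key, a status change means deleting one row and writing another, and those two mutations are not atomic in Bigtable. Any implementation would need to tolerate a worker briefly appearing under both statuses or neither.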

### E. TBD / Other Backends (e.g., AWS DynamoDB, ScyllaDB, Apache Cassandra)
* **Encouraging Contributions**: We strongly welcome community proposals and pull requests to implement additional backend stores that satisfy Substrate's `store.Interface` contract!
* **Design Considerations**: If you are interested in implementing a client for AWS DynamoDB, ScyllaDB, or Cassandra, please use this RFC issue thread to discuss key schema patterns, distributed locking strategies, and how you plan to avoid $O(N)$ worker range scan overhead.

---

## 3. Local Benchmarks Performed

We implemented a comparative, unified benchmark suite (`store_benchmark_test.go`) to test **Valkey/Redis** against **Aerospike** on a high-performance local workstation.

### Benchmark Configuration:
* **Valkey/Redis**: Standalone on `localhost:6379`, tested with `everysec` (default) and `always` (sync fsync) AOF policies.
* **Aerospike**: Community Edition 8.1.2 on `localhost:3000`, using optimized `PutBins` APIs and generation-based optimistic concurrency.
* **Hardware**: Intel(R) Xeon(R) CPU @ 2.20GHz, backed by high-speed local SSD storage.

> [!NOTE]
> **Important Networking Caveat**: These benchmarks were performed locally using virtual interfaces (vNIC) on Docker containers on a single host workstation. Network hop latency was negligible (<10µs). In a distributed production GKE cluster where client pods and database pods reside on different physical nodes, each query roundtrip will incur real cross-node network hop latency, which typically adds **~300µs – 500µs** to every database request. This overhead must be factored into production latency projections.

### Performance Comparison (Actor Creation / Writes):

```text
BenchmarkCreateActor_Redis-24                  14149   79,814 ns/op   1,518 B/op   25 allocs/op
BenchmarkCreateActor_Redis_AlwaysFsync-24      13494   87,054 ns/op   1,388 B/op   24 allocs/op
BenchmarkCreateActor_Aerospike-24               4785  252,157 ns/op   1,867 B/op   27 allocs/op
```

```mermaid
gantt
    title Write Operation Latency (ns)
    dateFormat  X
    axisFormat %s
    section Latencies
    Redis (everysec)  :active, 0, 79814
    Redis (always fsync) :crit, 0, 87054
    Aerospike (durable NVMe) : 0, 252157
```

### Crucial Benchmark Insights:
1. **Local SSD/NVMe Performance**: Enabling `appendfsync always` on Redis increased latency by only **~9%** (from `79.8µs` to `87.0µs`). On fast local storage (like GCP Local SSDs), **per-write fsync adds little latency in this single-node test**. This protects only against process and host crashes: data on GCP Local SSDs does not survive VM termination or host failure, so RPO = 0 in production would still require replication or persistent disks.
2. **Aerospike Performance**: At **252 microseconds** per write, Aerospike is ~3.2x slower than Redis but still fast in absolute terms, with 27 client-side heap allocations per operation and latency well under one millisecond.

---

## 4. Core Data Structure & Scheduling Requirements

Regardless of the chosen storage engine, Substrate's control plane must refactor two legacy database patterns to support scaling:

### A. Eliminating the $O(N)$ Scheduling Bottleneck
Currently, when an actor is resumed, the `AssignWorkerStep` calls `ListWorkers` (running a full-table keyspace scan over all master nodes), filters for idle ones, and shuffles them in memory. This scale-blocking bottleneck must be replaced by an $O(1)$ **Idle Worker Queue/Set**:

```mermaid
graph LR
    subgraph Idle Worker Pool
        Set["pool:{default:pool-alpha}:idle_workers"]
        Set -->|1. SPOP| Worker["pod-abc"]
    end
    ControlPlane -->|Claim Worker| Set
    ControlPlane -->|2. Update Worker State| DB_Worker["worker:default:pool-alpha:pod-abc"]
```
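
The data structure in the diagram can be modeled in memory as follows. This is a sketch of the semantics only, not the Redis-backed implementation: `idleSet` is a hypothetical type that supports $O(1)$ add and random pop (the behavior `SPOP` provides), and unlike Redis it is not safe for concurrent use without external locking.

```go
package main

import (
	"fmt"
	"math/rand"
)

// idleSet is an in-memory sketch of a set with O(1) add and random pop.
type idleSet struct {
	items []string
	index map[string]int // worker name -> position in items
}

func newIdleSet() *idleSet { return &idleSet{index: make(map[string]int)} }

// Add marks a worker as idle; adding an existing member is a no-op.
func (s *idleSet) Add(worker string) {
	if _, ok := s.index[worker]; ok {
		return
	}
	s.index[worker] = len(s.items)
	s.items = append(s.items, worker)
}

// Pop removes and returns a random idle worker, mirroring SPOP.
// It swaps the chosen element with the last one so removal stays O(1).
func (s *idleSet) Pop() (string, bool) {
	n := len(s.items)
	if n == 0 {
		return "", false
	}
	i := rand.Intn(n)
	worker, last := s.items[i], s.items[n-1]
	s.items[i] = last
	s.index[last] = i
	s.items = s.items[:n-1]
	delete(s.index, worker)
	return worker, true
}

func main() {
	pool := newIdleSet()
	for _, w := range []string{"pod-abc", "pod-def", "pod-ghi"} {
		pool.Add(w)
	}
	if w, ok := pool.Pop(); ok {
		fmt.Println("claimed worker:", w)
	}
	fmt.Println("idle workers left:", len(pool.items))
}
```

The cost of claiming a worker no longer depends on pool size. The trade-off is that every worker state transition must also add or remove the set member, or the set drifts from the per-worker records.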

### B. Slot-Aware Sharding (Multi-Key Transactions)
Redis Cluster restricts transactions (`WATCH`/`MULTI`) to keys that hash to the exact same cluster slot; commands that span slots fail with a `CROSSSLOT` error.
To atomically assign a worker to an actor:
* We must co-locate keys on the same shard using **Redis Hash Tags** (e.g. `{actor:<actor-id>}`). Only the text inside the braces is hashed, so every key in the transaction must carry an identical tag.
* Example keys:
  * Actor: `{actor:actor-123}`
  * Worker lease: `{actor:actor-123}:worker`
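
To check that two keys land in the same slot, the slot calculation can be reproduced locally. The sketch below follows the Redis Cluster rule (CRC-16/XMODEM of the hash tag, modulo 16384). The helper names `crc16`, `hashTag`, and `keySlot` are our own; in a live cluster, `CLUSTER KEYSLOT <key>` gives the authoritative answer.

```go
package main

import (
	"fmt"
	"strings"
)

// crc16 implements CRC-16/XMODEM (polynomial 0x1021, initial value 0),
// the variant Redis Cluster uses for key slots.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// hashTag returns the part of the key that is hashed: the text between the
// first '{' and the next '}' when that text is non-empty, else the whole key.
func hashTag(key string) string {
	start := strings.IndexByte(key, '{')
	if start < 0 {
		return key
	}
	end := strings.IndexByte(key[start+1:], '}')
	if end <= 0 { // no closing brace, or an empty tag "{}"
		return key
	}
	return key[start+1 : start+1+end]
}

// keySlot maps a key to one of the 16384 cluster slots.
func keySlot(key string) int {
	return int(crc16([]byte(hashTag(key))) % 16384)
}

func main() {
	for _, k := range []string{"{actor:actor-123}", "{actor:actor-123}:worker", "actor:{actor-123}"} {
		fmt.Printf("%-26s tag=%-16q slot=%d\n", k, hashTag(k), keySlot(k))
	}
}
```

The first two keys share the tag `actor:actor-123` and therefore share a slot. The third hashes on `actor-123` alone, so nothing guarantees it lands in the same slot, which is why the tag must be byte-for-byte identical across all keys in a transaction.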

---

## 5. Discussion Points for the Community

We would love feedback from the maintainers and community on these directions:

1. **Should we officially support a Hybrid Cache architecture?** 
   Using Cloud Spanner/Postgres as the persistent registry for 1 billion actors, while keeping Valkey/Redis solely as a transient, fast in-memory session routing table for currently active actors.
2. **Does the community favor open-source Valkey (BSD-3) or the commercial/AGPL Aerospike Community Edition?**
   Considering the operational overhead and licensing constraints of both.
3. **Should we proceed with implementing the $O(1)$ Idle Worker Set scheduling pull request?**
   We have prototyped this locally and can submit it to clean up the $O(N)$ scan bottleneck in `workflow_resume.go`.

