Tech Overview of Compute-Compute Separation: A New Cloud Architecture for Real-Time Analytics
April 11, 2023
Rockset hosted a tech talk on its new cloud architecture that separates compute from storage and compute from compute for real-time analytics. With compute-compute separation in the cloud, users can allocate multiple, isolated clusters for ingest compute or query compute while sharing the same real-time data.
The talk was led by Rockset co-founder and CEO Venkat Venkataramani and principal architect Nathan Bronson as they shared how Rockset solves the challenge of compute contention by:
- Isolating streaming ingest and query compute for predictable performance even in the face of high-volume writes or reads. This enables users to avoid overprovisioning to handle bursty workloads.
- Supporting multiple applications on shared real-time data. Rockset separates compute from hot storage and allows multiple compute clusters to operate on the shared data.
- Scaling out across multiple clusters for high concurrency applications.
Below, I cover the high-level implementation shared in the talk and recommend checking out the recording for more details on compute-compute separation.
Embedded content: https://youtu.be/jUDDokvuDLw
What is the problem?
There is a fundamental challenge with real-time analytics database design: streaming ingest and low latency queries use the same compute unit. Shared compute architectures have the advantage of making recently generated data immediately available for querying. The downside is that shared compute architectures also experience contention between ingest and query workloads, leading to unpredictable performance for real-time analytics at scale.
There are three common but insufficient techniques used to tackle the challenge of compute contention:
Sharding: Scale out the database across multiple nodes. Sharding misdiagnoses the problem as a lack of compute rather than a lack of workload isolation. With database sharding, queries can still step on one another, and queries from one application can step on another application.
Replicas: Many users attempt to create database replicas for isolation, designating the primary replica for ingestion and secondary replicas for querying. The issue is the duplicate work required by each replica: every replica needs to process, store, and index the incoming data. The more replicas you have, the more data movement, which makes scaling up or down slow. Replicas work at small scale, but this technique quickly falls apart under the weight of frequent ingestion.
Query directly from shared storage: Cloud data warehouses run separate compute clusters on shared data, solving the challenge of contention between clusters. That architecture does not go far enough for real-time analytics because it does not make newly generated data immediately available for querying: newly generated data must flush to storage before it can be queried, adding latency.
How does Rockset solve the problem?
Rockset introduces compute-compute separation for real-time analytics. A virtual instance is an isolated cluster of compute and memory. Now, you can allocate separate virtual instances for streaming ingestion, for queries, and for each of multiple applications, all sharing the same real-time data.
Let’s delve under the hood on how Rockset built this new cloud architecture by first separating compute from hot storage and then separating compute from compute.
Separating compute from hot storage
First Generation: Sharded shared compute
Rockset uses RocksDB as its storage engine under the hood. RocksDB is a key-value store developed by Meta and used at Airbnb, LinkedIn, Microsoft, Pinterest, Yahoo and more.
Each RocksDB instance represents a shard of the overall dataset, meaning that the data is distributed among a number of RocksDB instances. There is a complex M:N mapping between Rockset documents and RocksDB key-values. That’s because Rockset has a Converged Index with a columnar store, row store and search index under the hood. For example, Rockset stores many values from the same column in a single RocksDB key to support fast aggregations.
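To make the M:N mapping more concrete, here is a minimal, illustrative sketch of how one document might fan out into row-store, column-store, and search-index entries. The key prefixes, layout, and function name are assumptions made for this example, not Rockset's actual key encoding:

```python
# Illustrative only: a toy converged-index encoder. The key layout and
# prefixes are assumptions for this sketch, not Rockset's real format.

def converged_index_entries(doc_id: str, doc: dict) -> list[tuple[str, str]]:
    entries = []
    # Row store: the whole document under one key, for point lookups.
    entries.append((f"R/{doc_id}", str(doc)))
    for field, value in doc.items():
        # Column store: keyed by (field, doc_id) so one column's values are
        # adjacent on disk, which makes aggregations a narrow range scan.
        entries.append((f"C/{field}/{doc_id}", str(value)))
        # Search index: keyed by (field, value, doc_id) for fast filtering.
        entries.append((f"S/{field}/{value}/{doc_id}", ""))
    return entries

# One document becomes many key-value pairs; a real encoder would also pack
# several values into a single key, giving the overall M:N mapping.
print(converged_index_entries("doc1", {"city": "Oakland", "clicks": 3}))
```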
RocksDB memtables are in-memory buffers that hold the most recent writes. In this architecture, the query execution path reads from the memtable, ensuring that the most recently generated data is available for querying. Rockset also stores a complete copy of the data on SSD for fast data access.
Second Generation: Compute-storage separation
In the second generation architecture, Rockset separates compute and hot storage for faster scale up and down of virtual instances. Rockset uses RocksDB’s pluggable file system to create a disaggregated storage layer. The storage layer is a shared hot storage service that is flash-heavy with a direct-attached SSD tier.
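RocksDB's pluggable file system is what lets Rockset redirect files to a disaggregated storage layer. Below is a minimal conceptual sketch, assuming a write-through design with a durable object-store tier and a hot SSD tier; the class and method names are invented for illustration and are not RocksDB's actual FileSystem API or Rockset's implementation:

```python
# Toy write-through "file system" with a durable tier and a hot tier.
# All names here are illustrative assumptions.

class DisaggregatedFS:
    def __init__(self, durable_store: dict, hot_store: dict):
        self.durable = durable_store  # stand-in for S3
        self.hot = hot_store          # stand-in for the shared SSD tier

    def write_file(self, path: str, data: bytes) -> None:
        # LSM writes are large and latency-insensitive, so a synchronous
        # write to the durable tier is acceptable.
        self.durable[path] = data
        self.hot[path] = data         # keep the hot tier a complete copy

    def read_file(self, path: str) -> bytes:
        # Point reads are latency-critical: serve from the SSD tier when
        # possible and fall back to the durable tier only on a (rare) miss.
        if path in self.hot:
            return self.hot[path]
        data = self.durable[path]
        self.hot[path] = data         # backfill the cache
        return data
```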
Third Generation: Query from shared storage
In the third generation, Rockset enables the shared hot storage layer to be accessed by multiple virtual instances.
The primary virtual instance ingests in real time, while secondary instances periodically refresh their view of the data. Secondary instances access snapshots from shared hot storage without getting access to fine-grain updates from the memtables. This architecture isolates virtual instances for multiple applications, making it possible for Rockset to support both real-time and batch workloads efficiently.
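As a rough illustration of the refresh model, a secondary instance can simply reopen the newest snapshot it finds on shared hot storage on a schedule. The snapshot listing, the open_snapshot callback, and the 60-second interval below are assumptions for the sketch, not Rockset's actual mechanism:

```python
import time

# Illustrative sketch of a secondary virtual instance's periodic refresh.
# list_snapshots(), open_snapshot(), and the interval are assumed names/values.

def refresh_loop(shared_storage, open_snapshot, interval_s: int = 60):
    current = None
    while True:
        latest = max(shared_storage.list_snapshots())  # newest snapshot ID
        if latest != current:
            open_snapshot(latest)  # queries now see data as of this snapshot
            current = latest
        time.sleep(interval_s)     # freshness is bounded by the refresh interval
```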
Separating ingest compute from query compute
Fourth Generation: Compute-compute separation
In the fourth generation of the Rockset architecture, Rockset separates ingest compute from query compute.
Rockset has built upon previous generations of its architecture to add fine-grain replication of RocksDB memtables between multiple virtual instances. In this leader-follower architecture, the leader is responsible for translating ingested data into index updates and performing RocksDB compaction. This frees the follower from almost all of the compute load of ingest.
The leader creates a replication stream and sends updates and metadata changes to follower virtual instances. Since follower virtual instances no longer need to perform the brunt of the ingestion work, they use 6-10x less compute to process data from the replication stream. The implementation comes with a data delay of less than 100 milliseconds between the leader and follower virtual instances.
Key Design Decisions
Primer on LSM Trees
Understanding the key design decisions behind compute-compute separation first requires a basic knowledge of the Log-Structured Merge tree (LSM) architecture in RocksDB. In this architecture, writes are buffered in memory in a memtable. Megabytes of writes accumulate before being flushed to disk. Each file is immutable; rather than updating files in place, new files are created when data is changed. A background compaction process occasionally merges files to make storage more efficient. It merges old files into new files, sorting data and removing overwritten values. The benefit of compaction, in addition to minimizing the storage footprint, is that it reduces the number of locations from which the data needs to be read.
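To make the primer concrete, here is a deliberately tiny LSM sketch (illustrative only, nothing like RocksDB's actual implementation) showing the three behaviors described above: writes buffer in a memtable, the memtable flushes to an immutable sorted file once it is large enough, and compaction merges files while dropping overwritten values:

```python
# Toy LSM tree: memtable buffering, flush to immutable sorted "files",
# and compaction that merges files and drops overwritten values.

class ToyLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}        # in-memory buffer of the most recent writes
        self.files = []           # immutable sorted files, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into an immutable, sorted file; never update in place.
        self.files.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        # Merge all files into one; newer values overwrite older ones, so
        # reads afterwards touch fewer locations.
        merged = {}
        for f in self.files:      # oldest first, so newer files win
            merged.update(f)
        self.files = [dict(sorted(merged.items()))]

    def get(self, key):
        # Most recent data first: memtable, then files from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for f in reversed(self.files):
            if key in f:
                return f[key]
        return None
```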
The important characteristic of LSM writes is that they are big and latency-insensitive. This gives us lots of options for making them durable in a cost-effective way.
Point reads are important for queries that use Rockset’s inverted indexes. Unlike the large latency-insensitive writes performed by an LSM, point reads result in small reads that are latency-critical. The core insight of Rockset’s disaggregated storage architecture is that we can simultaneously use two storage systems, one to get durability and one to get efficient fast reads.
Big Writes, Small Reads
Rockset stores copies of data in S3 for durability and a single copy in hot storage on SSDs for fast data access. Queries are up to 1000x faster on shared hot storage than S3.
Near Perfect Hot Storage Cache
As Rockset is a real-time database, latency matters to our customers, and we cannot afford the latency of a miss in shared hot storage on the query path. Rockset hot storage is a near-perfect S3 cache: most days there are no cache misses anywhere in our production infrastructure.
Here’s how Rockset handles each potential source of cache misses:
- Cold misses: To ensure data is always available in the cache, Rockset does a synchronous prefetch on file creation and scans S3 on a periodic basis.
- Capacity misses: Rockset has auto-scaling to ensure that the cluster does not run out of space. As a belt-and-suspenders strategy, if we do run out of disk space we evict the least-recently accessed data first.
- Software restarts for upgrades: Rockset uses dual-head serving when rolling out new software, bringing up the new process and making sure it is online before shutting down the old version of the service.
- Cluster resizing: If Rockset cannot find the indexed data during resizing, it runs a second-chance read using the old cluster configuration.
- Failure recovery: If a single machine fails, we distribute the recovery across all the machines in the cluster using rendezvous hashing, as sketched below.
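Rendezvous hashing is a standard technique for assigning keys to nodes so that when a node disappears, only the keys it owned move, and they spread evenly across the survivors. The sketch below shows the general idea; the node names, file names, and replica count are illustrative, not Rockset's configuration:

```python
import hashlib

def score(node: str, key: str) -> int:
    # Deterministic pseudo-random weight for a (node, key) pair.
    return int(hashlib.sha256(f"{node}/{key}".encode()).hexdigest(), 16)

def owners(nodes: list[str], key: str, replicas: int = 1) -> list[str]:
    # Rendezvous hashing: the nodes with the highest scores own the key.
    return sorted(nodes, key=lambda n: score(n, key), reverse=True)[:replicas]

nodes = ["node-a", "node-b", "node-c", "node-d"]
files = [f"sst-{i}" for i in range(8)]

before = {f: owners(nodes, f)[0] for f in files}
survivors = [n for n in nodes if n != "node-b"]          # node-b fails
after = {f: owners(survivors, f)[0] for f in files}

# Only the files previously owned by the failed node move, and their recovery
# is distributed across the remaining machines rather than a single one.
print({f: after[f] for f in files if before[f] != after[f]})
```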
Consistency and Durability
Rockset’s leader-follower architecture is designed to be consistent and durable even when there are multiple copies of the data. One way Rockset sidesteps some of the challenges of building a consistent and durable distributed database is by using persistent and durable infrastructure under the hood.
Leader-Follower Architecture
In the leader-follower architecture, the data stream feeding into the ingest process is consistent and durable. It is effectively a durable logical redo log, enabling Rockset to go back to the log to retrieve newly generated data in the case of a failure.
Rockset uses an external strongly-consistent metadata store to perform leader election. Each time a leader is elected it picks a cookie, a random nonce that is included in the S3 object path for all of the actions taken by that leader. The cookie ensures that even if an old leader is still running, its S3 writes won’t interfere with the new leader and its actions will be ignored by followers.
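The cookie works like a fencing token embedded in object names. Here is a minimal sketch of the idea, with a path layout and helper names that are assumptions for the example rather than Rockset's actual S3 key scheme:

```python
import uuid

# Illustrative fencing-by-cookie sketch; path layout and names are assumed.

def new_leader_cookie() -> str:
    return uuid.uuid4().hex  # random nonce chosen at leader election

def object_path(cookie: str, filename: str) -> str:
    # Every object a leader writes is namespaced by its cookie, so a stale
    # leader's writes land under a path that followers no longer read.
    return f"collection-123/leader-{cookie}/{filename}"

current_cookie = new_leader_cookie()   # published via the metadata store
stale_cookie = new_leader_cookie()     # an old leader that hasn't noticed yet

def follower_accepts(path: str) -> bool:
    # Followers only trust objects written under the current leader's cookie.
    return path.startswith(f"collection-123/leader-{current_cookie}/")

print(follower_accepts(object_path(current_cookie, "000042.sst")))  # True
print(follower_accepts(object_path(stale_cookie, "000042.sst")))    # False
```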
The replication log is a superset of the RocksDB write-ahead logs, augmenting WAL entries with additional events such as leader election. Key/value changes from the replication log are inserted directly into the memtable of the follower. When the log indicates that the leader has written a memtable to disk, the follower does not redo that work: it simply starts reading the file the leader has already created on disaggregated storage. Similarly, when the follower gets notification that a compaction has finished, it can start using the new compaction results directly without doing any of the compaction work.
In this architecture, the shared hot storage accomplishes just-in-time physical replication of the bytes of RocksDB’s SST files, including the physical file changes that result from compaction, while the leader/follower replication log carries only logical changes. In combination with the durable input data stream, this lets the leader/follower log be lightweight and non-durable.
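Put together, a follower's steady-state work reduces to applying a few kinds of events from the replication stream. The event names and fields below are assumptions made for illustration, but they show why the follower does only a small fraction of the leader's work:

```python
# Illustrative follower loop for the replication stream. Event names and
# fields are assumed for this sketch, not Rockset's actual protocol.

def apply_replication_event(follower: dict, event: dict) -> None:
    kind = event["kind"]
    if kind == "put":
        # Logical key/value change: insert straight into the follower's memtable.
        follower["memtable"][event["key"]] = event["value"]
    elif kind == "memtable_flush":
        # The leader already wrote the file to shared hot storage; the
        # follower just starts referencing it and resets its memtable.
        follower["files"].append(event["sst_path"])
        follower["memtable"].clear()
    elif kind == "compaction_done":
        # Swap in the leader's compaction output without redoing the merge work.
        follower["files"] = list(event["new_sst_paths"])
```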
Bootstrapping a leader
The input log position from the durable logical redo log is stored in a RocksDB key to ensure exactly-once processing of the input stream. This means that it is safe to bootstrap a leader from any recent valid RocksDB state.
Adding a follower
A follower uses the leader’s cookie to find the most recent RocksDB snapshot in shared hot storage and subscribes to the leader’s replication log. It then constructs a memtable with the most recently generated data from the leader’s replication log.
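A new follower's startup can thus be summarized in two steps: locate the current leader's newest snapshot on shared hot storage using the leader's cookie, then tail the replication log from that point. A rough sketch, with assumed helper names (list_snapshots, subscribe) and snapshot fields:

```python
# Illustrative follower bootstrap; helper names and fields are assumptions.

def bootstrap_follower(shared_storage, replication_log, leader_cookie: str):
    # 1. Find the leader's most recent RocksDB snapshot in shared hot storage.
    snapshots = shared_storage.list_snapshots(prefix=f"leader-{leader_cookie}/")
    snapshot = max(snapshots, key=lambda s: s["log_position"])
    follower = {"files": list(snapshot["sst_paths"]), "memtable": {}}

    # 2. Tail the replication log from the snapshot's position so the memtable
    #    is rebuilt with the most recently generated data (runs continuously).
    for event in replication_log.subscribe(from_position=snapshot["log_position"]):
        apply_replication_event(follower, event)  # as in the earlier sketch
```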
Real-World Implications
We’ve walked through the implementation of compute-compute separation and how it solves for:
Streaming ingest and query compute isolation: The problem of a data flash flood monopolizing your compute and jeopardizing your queries is solved with isolation, and the same holds on the query side if you have a burst of users on your application. Each side scales independently, so you can right-size the virtual instance for your ingest or query workload.
Multiple applications on shared real-time data: You can spin up or down any number of virtual instances to segregate application workloads. Multiple production applications can share the same dataset, eliminating the need for replicas.
Linear concurrency scaling: You right-size the virtual instance for your query latency target based on single-query performance. Then you can autoscale for concurrency, spinning up additional virtual instances of the same size for linear scaling.
We just scratched the surface on Rockset’s compute-compute architecture for real-time analytics. You can learn more by watching the tech talk or seeing how the architecture works in a step-by-step product demonstration.