Why Mutability Is Essential for Real-Time Data Analytics
March 10, 2022
This is the first post in a series by Rockset's CTO Dhruba Borthakur on Designing the Next Generation of Data Systems for Real-Time Analytics. We'll be publishing more posts in the series in the near future, so subscribe to our blog so you don't miss them!
The second post in this series is now available at Handling Out-of-Order Data in Real-Time Analytics Applications.
Dhruba Borthakur is CTO and co-founder of Rockset and is responsible for the company's technical direction. He was an engineer on the database team at Facebook, where he was the founding engineer of the RocksDB data store. Earlier at Yahoo, he was one of the founding engineers of the Hadoop Distributed File System. He was also a contributor to the open source Apache HBase project.
Successful data-driven companies like Uber, Facebook and Amazon rely on real-time analytics. Personalizing customer experiences for e-commerce, managing fleets and supply chains, and automating internal operations all require instant insights on the freshest data.
To deliver real-time analytics, companies need a modern technology infrastructure that includes these three things:
- A real-time data source such as web clickstreams, IoT events produced by sensors, etc.
- A platform such as Apache Kafka/Confluent, Spark or Amazon Kinesis for publishing that stream of event data.
- A real-time analytics database capable of continuously ingesting large volumes of real-time events and returning query results within milliseconds.
Event streaming/stream processing has been around for almost a decade. It’s well understood. Real-time analytics is not. One of the technical requirements for a real-time analytics database is mutability. Mutability is the superpower that enables updates, or mutations, to existing records in your data store.
Differences Between Mutable and Immutable Data
Before we talk about why mutability is key to real-time analytics, it’s important to understand what it is.
Mutable data is data stored in a table record that can be erased or updated with newer data. For instance, in a database of employee addresses, let’s say that each record has the name of the person and their current residential address. The current address information would be overwritten if the employee moves residences from one place to another.
Traditionally, this information would be stored in transactional databases — Oracle Database, MySQL, PostgreSQL, etc. — because they allow for mutability: Any field stored in these transactional databases is updatable. For today’s real-time analytics, there are many additional reasons why we need mutability, including data enrichment and backfilling data.
Immutable data is the opposite — it cannot be deleted or modified. Rather than writing over existing records, updates are append-only. This means that updates are inserted into a different location or you’re forced to rewrite old and new data to store it properly. More on the downsides of this later. Immutable data stores have been useful in certain analytics scenarios.
The Historic Usefulness of Immutability
Data warehouses popularized immutability because it eased scalability, especially in a distributed system. Analytical queries could be accelerated by caching heavily-accessed read-only data in RAM or SSDs. If the cached data was mutable and potentially changing, it would have to be continuously checked against the original source to avoid becoming stale or erroneous. This would have added to the operational complexity of the data warehouse; immutable data, on the other hand, created no such headaches.
Immutability also reduces the risk of accidental data deletion, a significant benefit in certain use cases. Take health care and patient health records. Something like a new medical prescription would be added rather than written over existing or expired prescriptions so that you always have a complete medical history.
More recently, companies tried to pair stream publishing systems such as Kafka and Kinesis with immutable data warehouses for analytics. The event systems captured IoT and web events and stored them as log files. These streaming log systems are difficult to query, so one would typically send all the data from a log to an immutable data system such as Apache Druid to perform batch analytics.
The data warehouse would append newly-streamed events to existing tables. Since past events, in theory, do not change, storing data immutably seemed to be the right technical decision. And while an immutable data warehouse could only write data sequentially, it did support random data reads. That enabled analytical business applications to efficiently query data whenever and wherever it was stored.
The Problems with Immutable Data
Of course, users soon discovered that for many reasons, data does need to be updated. This is especially true for event streams because multiple events can reflect the true state of a real-life object. Or network problems or software crashes can cause data to be delivered late. Late-arriving events need to be reloaded or backfilled.
Companies also began to embrace data enrichment, where relevant data is added to existing tables. Finally, companies started having to delete customer data to fulfill consumer privacy regulations such as GDPR and its “right to be forgotten.”
Immutable database makers were forced to create workarounds in order to insert updates. One popular method used by Apache Druid and others is called copy-on-write. Data warehouses typically load data into a staging area before it is ingested in batches into the data warehouse where it is stored, indexed and made ready for queries. If any events arrive late, the data warehouse will have to write the new data and rewrite already-written adjacent data in order to store everything correctly in the right order.
Another poor solution to deal with updates in an immutable data system is to keep the original data in Partition A (above) and write late-arriving data to a different location, Partition B. The application, and not the data system, will have to keep track of where all linked-but-scattered records are stored, as well as any resulting dependencies. This process is called referential integrity and has to be implemented by the application software.
Both workarounds have significant problems. Copy-on-write requires data warehouses to expend a significant amount of processing power and time — tolerable when updates are few, but intolerably costly and slow as the number of updates rise. That creates significant data latency that can rule out real-time analytics. Data engineers must also manually supervise copy-on-writes to ensure all the old and new data is written and indexed accurately.
An application implementing referential integrity has its own issues. Queries must be double-checked that they are pulling data from the right locations or run the risk of data errors. Attempting any query optimizations, such as caching data, also becomes much more complicated when updates to the same record are scattered in multiple places in the data system. While these may have been tolerable at slower-paced batch analytic systems, they are huge problems when it comes to mission-critical real-time analytics.
Mutability Aids Machine Learning
At Facebook, we built an ML model that scanned all-new calendar events as they were created and stored them in the event database. Then, in real-time, an ML algorithm would inspect this event, and decide whether it is spam. If it is categorized as spam, then the ML model code would insert a new field into that existing event record to mark it as spam. Because so many events were flagged and immediately taken down, the data had to be mutable for efficiency and speed. Many modern ML-serving systems have emulated our example and chosen mutable databases.
This level of performance would have been impossible with immutable data. A database using copy-on-write would quickly get bogged down by the number of flagged events it would have to update. If the database stored the original events in Partition A and appended flagged events to Partition B, this would require additional query logic and processing power, as every query would have to merge relevant records from both partitions. Both workarounds would have created an intolerable delay for our Facebook users, heightened the risk of data errors and created more work for developers and/or data engineers.
How Mutability Enables Real-Time Analytics
At Facebook, I helped design mutable analytics systems that delivered real-time speed, efficiency and reliability.
One of the technologies I founded was open source RocksDB, the high-performance key-value engine used by MySQL, Apache Kafka and CockroachDB. RocksDB’s data format is a mutable data format, which means that you can update, overwrite or delete individual fields in a record. It’s also the embedded storage engine at Rockset, a real-time analytics database I founded with fully mutable indexes.
By tuning open source RocksDB, it’s possible to enable SQL queries on events and updates arriving mere seconds before. Those queries can be returned in the low hundreds of milliseconds, even when complex, ad hoc and high concurrency. RocksDB’s compaction algorithms also automatically merge old and updated data records to ensure that queries access the latest, correct version, as well as prevent data bloat that would hamper storage efficiency and query speeds.
By choosing RocksDB, you can avoid the clumsy, expensive and error-creating workarounds of immutable data warehouses such as copy-on-writes and scattering updates across different partitions.
To sum up, mutability is key for today’s real-time analytics because event streams can be incomplete or out of order. When that happens, a database will need to correct and backfill missing and erroneous data. To ensure high performance, low cost, error-free queries and developer efficiency, your database must support mutability.
If you want to see all of the key requirements of real-time analytics databases, watch my recent talk at the Hive on Designing the Next Generation of Data Systems for Real-Time Analytics, available below.
The second post in this series is now available at Handling Out-of-Order Data in Real-Time Analytics Applications
More from Rockset
Sign Up for Free
Get started with $300 in free credits. No credit card required.