How Rockset Handles Data Deduplication
May 3, 2022
There are two major problems with distributed data systems. The second is out-of-order messages, the first is duplicate messages, the third is off-by-one errors, and the first is duplicate messages.
This joke inspired Rockset to confront the data duplication issue through a process we call deduplication.
As data systems become more complex and the number of systems in a stack increases, data deduplication becomes more challenging, because duplication can occur in a multitude of ways. Whenever another distributed data system is added to the stack, organizations become wary of the operational tax on their engineering team. This blog post discusses data duplication, how it plagues teams adopting real-time analytics, and the deduplication solutions Rockset provides to resolve it.
Rockset addresses the issue of data duplication in a simple way and helps free teams from the complexities of deduplication, which include untangling where duplication is occurring, setting up and managing extract, transform, load (ETL) jobs, and attempting to solve duplication at query time.
The Duplication Problem
In distributed systems, messages are passed back and forth between many workers, and it’s common for messages to be generated two or more times. A system may create a duplicate message because:
- A confirmation was not sent.
- The message was replicated before it was sent.
- The message confirmation comes after a timeout.
- Messages are delivered out of order and must be resent.
By the time a message arrives at a database management system, it may have been received multiple times with the same information. Therefore, your system must ensure that duplicate records aren't created. Duplicate records are costly and take up storage unnecessarily, so duplicated messages must be consolidated into a single record.
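To make the lost-confirmation case concrete, here is a minimal sketch of a sender that retries when it does not see an acknowledgment. The function name, the `ack_lost` callback, and the message fields are hypothetical and chosen only for illustration; they do not model any particular messaging system.

```python
def send_with_retry(message, inbox, ack_lost, max_attempts=3):
    """Deliver `message` to `inbox`, resending while the ack is reported lost.

    `ack_lost(attempt)` is a hypothetical callback that returns True when the
    acknowledgment for that attempt never reaches the sender.
    """
    for attempt in range(1, max_attempts + 1):
        inbox.append(message)        # the receiver stores the message
        if not ack_lost(attempt):    # the ack arrived, so the sender stops
            return attempt
        # The ack was lost or timed out. The sender cannot tell whether the
        # message arrived, so it sends the same message again.
    return max_attempts


inbox = []
# Assumption for this example: the first ack is lost, the second arrives.
attempts = send_with_retry(
    {"order_id": 42, "status": "paid"},
    inbox,
    ack_lost=lambda attempt: attempt == 1,
)
print(attempts, len(inbox))  # 2 2
```

The receiver did nothing wrong here: it stored every message it was given. The duplicate exists because the sender had no way to distinguish "message lost" from "confirmation lost".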
Before Rockset, there were three general deduplication methods:
- Stop duplication before it happens.
- Stop duplication during ETL jobs.
- Stop duplication at query time.
Kafka was one of the first systems to create a solution for duplication. Kafka's exactly-once semantics guarantee that a message is delivered once and only once. However, if the duplication occurs upstream of Kafka, Kafka will treat the messages as non-duplicates and deliver them with different timestamps. Therefore, exactly-once semantics do not always solve duplication issues, and the duplicates that get through can negatively impact downstream workloads.
Stop Duplication Before it Happens
Some platforms attempt to stop duplication before it happens. This seems ideal, but this method requires difficult and costly work to identify the location and causes of the duplication.
Duplication is commonly caused by any of the following:
- A switch or router.
- A failing consumer or worker.
- A problem with gRPC connections.
- Too much traffic.
- A window size that is too small for packets.
Note: Keep in mind this is not an exhaustive list.
This deduplication approach requires in-depth knowledge of the system network, as well as the hardware and framework(s). It is very rare, even for a full-stack developer, to understand the intricacies of all the layers of the OSI model and their implementation at a company. In an organization of any substantial size, the data storage, data pipelines, data transformations, and application internals are beyond the scope of a single individual, which is why organizations have specialized job titles. Troubleshooting and identifying every location where messages can be duplicated requires a depth of knowledge that is unreasonable to expect of an individual, or even of a cross-functional team. Although the cost and expertise requirements are very high, this approach offers the greatest reward.
Stop Duplication During ETL Jobs
Running stream-processing ETL jobs is another deduplication method. This involves deduplicating during data stream consumption, for example by creating a compacted topic and/or introducing an ETL job with a common batch processing tool (e.g., Fivetran, Airflow, or Matillion). ETL jobs come with additional overhead to manage, require additional computing costs, are potential failure points with added complexity, and introduce latency to a system that may need high throughput.
For deduplication with stream-processing ETL jobs to be effective, you must ensure the ETL jobs run throughout your system. Since data duplication can occur anywhere in a distributed system, it is paramount that the architecture deduplicates in every place messages are passed.
Stream processors can have an active processing window (open for a specific time) in which duplicate messages can be detected and compacted, and out-of-order messages can be reordered. A duplicate that arrives after the window has closed is not detected. Furthermore, these stream processors must be maintained and can consume considerable compute resources and operational overhead.
Note: Messages received outside of the active processing window can be duplicated. We do not recommend solving deduplication issues using this method alone.
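The window limitation can be illustrated with a small sketch. It is a simplified model under stated assumptions, not the behavior of any specific stream processor: the function name, the `(arrival_time, key)` message shape, and the eviction rule are all hypothetical.

```python
def dedupe_with_window(messages, window_seconds):
    """Drop a message if the same key was first seen within `window_seconds`.

    `messages` is an iterable of (arrival_time_seconds, key) in arrival order.
    """
    first_seen = {}  # key -> arrival time of the message that opened its window
    emitted = []
    for arrival, key in messages:
        # Forget keys whose window has closed, as a processor with bounded
        # state would have to do.
        expired = [k for k, t in first_seen.items() if arrival - t > window_seconds]
        for k in expired:
            del first_seen[k]
        if key in first_seen:
            continue  # duplicate inside the window: dropped
        first_seen[key] = arrival
        emitted.append((arrival, key))
    return emitted


stream = [(0, "a"), (2, "a"), (3, "b"), (30, "a")]
result = dedupe_with_window(stream, window_seconds=10)
print(result)  # [(0, 'a'), (3, 'b'), (30, 'a')]
```

The repeat of "a" at t=2 is caught, but the repeat at t=30 arrives after the window has closed and is emitted again. Widening the window catches more duplicates at the price of more state to hold.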
Stop Duplication at Query Time
Another deduplication method is to attempt to solve it at query time. However, this increases the complexity of your queries, which is risky because a more complex query is more likely to generate errors.
For example, if your solution tracks messages using timestamps, and the duplicate messages are delayed by one second (instead of the expected 50 milliseconds), the timestamps on the duplicate messages will fall outside what your query expects, causing an error.
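A rough sketch of timestamp-based deduplication at read time shows how fragile this is. The function, the tolerance parameter, and the row layout are assumptions made for illustration only. In this sketch the failure mode is a duplicate that slips into the result rather than a raised error.

```python
def dedupe_at_query_time(rows, tolerance_ms):
    """Return rows with near-simultaneous repeats of the same payload removed.

    `rows` is a list of (timestamp_ms, payload). Two rows with the same payload
    count as duplicates only if they are at most `tolerance_ms` apart.
    """
    kept = []
    last_kept_ts = {}  # payload -> timestamp of the last row kept for it
    for ts, payload in sorted(rows):
        prev = last_kept_ts.get(payload)
        if prev is not None and ts - prev <= tolerance_ms:
            continue  # close enough in time: treated as a duplicate
        last_kept_ts[payload] = ts
        kept.append((ts, payload))
    return kept


rows = [
    (10_000, "order-42 paid"),
    (10_030, "order-42 paid"),  # duplicate delayed by 30 ms
    (11_000, "order-42 paid"),  # duplicate delayed by one second
]
result = dedupe_at_query_time(rows, tolerance_ms=50)
print(len(result))  # 2
```

The 30 ms duplicate is removed, but the one-second duplicate falls outside the tolerance the query was written for and is returned as if it were a new event. Every query that reads this data has to carry, and correctly tune, the same logic.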
How Rockset Solves Duplication
Rockset solves the duplication problem through unique SQL-based transformations at ingest time.
Rockset is a Mutable Database
Rockset is a mutable database and allows for duplicate messages to be merged at ingest time. This system frees teams from the many cumbersome deduplication options covered earlier.
Each document has a unique identifier called _id that acts like a primary key. Users can specify this identifier at ingest time (e.g., during updates) using SQL-based transformations. When a new document arrives with the same _id, the duplicate message merges into the existing record. This offers users a simple solution to the duplication problem.
When you bring data into Rockset, you can build your own complex _id key using SQL transformations that:
- Identify a single key.
- Identify a composite key.
- Extract data from multiple keys.
Rockset is fully mutable without an active window. As long as you specify messages with _id or identify _id within the document you are updating or inserting, incoming duplicate messages will be deduplicated and merged together into a single document.
Rockset Enables Data Mobility
Other analytics databases store data in fixed data structures, which require compaction, resharding and rebalancing. Any time there is a change to existing data, a major overhaul of the storage structure is required. Many data systems use active windows to avoid such overhauls. As a result, if you map _id to a record outside the active window, the update to that record will fail. In contrast, Rockset users have a lot of data mobility and can update any record in Rockset at any time.
A Customer Win With Rockset
While we've spoken about the operational challenges with data deduplication in other systems, there's also a compute-spend element. Attempting deduplication at query time or with ETL jobs can be computationally expensive for many use cases.
Rockset can handle data changes, and it supports inserts, updates and deletes that benefit end users. Here’s an anonymized story of one of the users that I’ve worked closely with on their real-time analytics use case.
A customer had a massive amount of data changes that created duplicate entries within their data warehouse. Every database change resulted in a new record, although the customer only wanted the current state of the data.
If the customer wanted to put this data into a data warehouse that cannot map _id, the customer would’ve had to cycle through the multiple events stored in their database. This includes running a base query followed by additional event queries to get to the latest value state. This process is extremely computationally expensive and time consuming.
Rockset provided a more efficient deduplication solution to their problem. Rockset maps _id so only the latest states of all records are stored, and all incoming events are deduplicated. Therefore, the customer only needed to query the latest state. Thanks to this functionality, Rockset enabled this customer to reduce both the compute required and the query processing time, efficiently delivering sub-second queries.
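The difference between the two approaches in this story can be sketched as follows. The event shape and the `seq` ordering field are hypothetical, and the code is only a model of the idea, not the customer's pipeline or Rockset's implementation.

```python
def latest_state_by_replay(event_log, key):
    """Append-only store: scan every event for `key` and keep the newest."""
    latest = None
    for event in event_log:
        if event["key"] == key and (latest is None or event["seq"] > latest["seq"]):
            latest = event
    return latest


def apply_at_ingest(state, event):
    """Mutable store: overwrite the record for the key unless the event is older."""
    current = state.get(event["key"])
    if current is None or event["seq"] >= current["seq"]:
        state[event["key"]] = event


events = [
    {"key": "user-1", "seq": 1, "plan": "free"},
    {"key": "user-2", "seq": 1, "plan": "free"},
    {"key": "user-1", "seq": 2, "plan": "pro"},
    {"key": "user-1", "seq": 2, "plan": "pro"},  # duplicate delivery
    {"key": "user-1", "seq": 3, "plan": "team"},
]

state = {}
for event in events:
    apply_at_ingest(state, event)

print(latest_state_by_replay(events, "user-1")["plan"])  # team
print(state["user-1"]["plan"], len(events), len(state))  # team 5 2
```

Both paths give the same answer, but the replay path touches all five events on every read, while the ingest-time path stores two records and answers with a single lookup.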