How Real-Time Updates Work on Rockset with Kshitij
Kshitij is a software engineer here at Rockset. This video will talk about how real-time updates work on Rockset via Rockset's Patch API and remote compaction for writes.
Show Notes
Kshitij Wadhwa:
Hi. I'm Kshitij, a software engineer here at Rockset, and today I'm going to talk about how Rockset enables developers to index and query real-time updates. I primarily work on the platform team here. We build the Patch API and remote compaction for writes, which helps in separating compaction compute from storage for greater efficiency. Later in the slides, I will go more in depth on these. A quick overview for those of you not familiar with Rockset: Rockset is a real-time indexing database in the cloud for serving low-latency, high-concurrency queries at scale. Rockset indexes data continuously from your OLTP database, stream, or data lake using secure built-in connectors that consume change streams. Data is ingested schemalessly, a schema is automatically generated based on the exact fields and types present, and new data is reflected in queries with a p95 of two seconds.
Kshitij Wadhwa:
Using Rockset, you can create personalized user experiences, real-time decision systems, or serve healthy applications. As part of this presentation, I will start with why real-time updates matter to developers and what the problems are with traditional data systems when processing updates. Finally, I'll talk about how Rockset enables low-latency, real-time updates for developers. Traditionally, data processing consisted of batch processing, where jobs were run at regular intervals, say an hour or more, to process large amounts of data before developers could make use of it. This means the data is already stale by the time it arrives, which is unacceptable for real-time application needs.
Kshitij Wadhwa:
In contrast, near real-time processing adds a latency of a few minutes before fresh data is available for querying, whereas real-time means your application will see the new data within a few seconds of the events taking place.
Kshitij Wadhwa:
The ability to get the changes that happen in an operational database and make them available for real-time applications is a core capability for many organizations. Change Data Capture (CDC), a design pattern to track data changes, is one such approach to monitoring and capturing events in a system. Businesses use CDC from operational databases to power real-time applications and various microservices that demand low data latency. Examples include fraud prevention, game leaderboard APIs, and personalized recommendation APIs. In these cases, if the data comes in too late, the company and its customers are adversely affected.
Kshitij Wadhwa:
From the database perspective, building multiple indices is necessary for serving low-latency queries, especially complex queries involving joins and aggregations. The trade-off here is that every single update now needs to write to every index, causing many random database writes. Traditional data systems are based on B-trees, where random writes to the database translate to random writes to storage, resulting in additional compute and I/O.
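To make the trade-off concrete, here is a minimal Python sketch of write fan-out across multiple indices. It is not Rockset's implementation: the `MultiIndexStore` class, its field names, and the `index_writes` counter are all made up for illustration. It only shows that one logical insert turns into one write per index.

```python
class MultiIndexStore:
    """Toy document store with a primary index plus one secondary index per field."""

    def __init__(self, indexed_fields):
        self.primary = {}  # doc_id -> document
        # field name -> {field value -> set of doc_ids}
        self.secondary = {field: {} for field in indexed_fields}
        self.index_writes = 0  # number of individual index entries written

    def insert(self, doc_id, doc):
        # One write to the primary index...
        self.primary[doc_id] = doc
        self.index_writes += 1
        # ...plus one write per secondary index that covers a field in the document.
        for field, index in self.secondary.items():
            if field in doc:
                index.setdefault(doc[field], set()).add(doc_id)
                self.index_writes += 1

    def find(self, field, value):
        """Return the ids of documents whose `field` equals `value`."""
        return sorted(self.secondary[field].get(value, set()))


store = MultiIndexStore(indexed_fields=["city", "status", "plan"])
store.insert("u1", {"city": "SF", "status": "active", "plan": "pro"})
print(store.index_writes)        # 4: one primary write plus three secondary writes
print(store.find("city", "SF"))  # ['u1']
```

In a B-tree-backed system, each of those index writes can land on a different page, which is where the random I/O comes from.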
Kshitij Wadhwa:
This means a system receiving real-time updates at a high write rate won't be able to keep up, resulting in increased data latency. As I mentioned in the previous slide, the key problem with multiple indices is updating these indices with very low data latency and operational management. Now I will discuss how Rockset solves it using RocksDB-Cloud. At Rockset, we use RocksDB-Cloud as one of the building blocks of Rockset's distributed Converged Index. RocksDB-Cloud is open source and is fully compatible with RocksDB, which is an LSM storage engine. LSM trees are optimized for writes because they turn random writes to different indices into sequential writes to the underlying storage.
Kshitij Wadhwa:
To quickly recap, sequential writes to the underlying storage are faster than random writes because they don't result in the additional I/O that can be caused by moving data. Because it is an LSM storage engine, a set of background threads is used for compaction. Compaction is the process of combining a set of files and generating new files with overwritten keys purged from the output files.
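The write path and compaction described above can be sketched with a toy LSM structure in Python. This is an illustrative model only, not how RocksDB or RocksDB-Cloud is implemented: `TinyLSM`, `memtable_limit`, and `TOMBSTONE` are hypothetical names, and real engines use leveled files on disk rather than in-memory lists. Writes go to an in-memory table, which is flushed as a sorted, immutable run; compaction merges runs and drops overwritten and deleted keys.

```python
TOMBSTONE = object()  # marker recorded in place of a value when a key is deleted


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}  # most recent writes, held in memory
        self.runs = []      # flushed sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def flush(self):
        # Write the whole memtable out as one sorted run (a sequential write).
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Check the newest data first: memtable, then runs from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for run in self.runs:
            for k, value in run:
                if k == key:
                    return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Merge all runs oldest-first so newer values overwrite older ones,
        # then purge deleted keys from the single output run.
        merged = {}
        for run in reversed(self.runs):
            merged.update(run)
        live = sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
        self.runs = [live] if live else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, flushed as the first run
db.put("a", 3)   # overwrites "a" in a newer run
db.delete("b")   # tombstone for "b", flushed as the second run
runs_before = len(db.runs)
db.compact()
print(runs_before, db.runs)  # 2 [[('a', 3)]]
```

Before compaction, a read may have to look through every run; afterwards there is a single run holding only the live keys.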
Kshitij Wadhwa:
Compaction needs a lot of compute resources. The higher the write rate into the database, the more compute resources are needed for compaction, because the system is only stable if compaction is able to keep up with the new writes to your database. A RocksDB-Cloud server encapsulates a compaction job with a set of cloud objects and sends the request to a remote stateless server, thereby separating compaction compute from storage. This also ensures your queries don't suffer in case of high writes to your collections. All documents in a Rockset collection are mutable and can be updated at the field level, even if these fields are deeply nested inside arrays and objects. With other data systems, every update to a single field in a document results in re-indexing the whole document, leading to high compute and I/O. The Patch API solves the problem that other databases have with re-indexing the whole document.
Kshitij Wadhwa:
At Rockset, we use the Patch API to incrementally index the collection. This means that for any update a collection receives, only the affected fields in a document are re-indexed, while the rest of the fields in that document are left untouched. This results in Rockset being orders of magnitude more efficient with compute and I/O compared to other databases.
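The idea of field-level updates with incremental indexing can be sketched in Python as follows. This is a simplified model under stated assumptions, not the Rockset Patch API: `PatchableCollection`, `patch_replace`, and the slash-separated path format are hypothetical, and only replacing an existing scalar field is supported. The index maps each (field path, value) pair to document ids, so a patch touches only the entries for the one path that changed.

```python
def flatten(doc, prefix=""):
    """Yield (path, value) for every scalar leaf in a nested dict/list document."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, f"{prefix}/{key}")
    elif isinstance(doc, list):
        for i, value in enumerate(doc):
            yield from flatten(value, f"{prefix}/{i}")
    else:
        yield prefix, doc


class PatchableCollection:
    def __init__(self):
        self.docs = {}         # doc_id -> document
        self.index = {}        # (path, value) -> set of doc_ids
        self.index_writes = 0  # number of index entries added or removed

    def _index_add(self, doc_id, path, value):
        self.index.setdefault((path, value), set()).add(doc_id)
        self.index_writes += 1

    def _index_remove(self, doc_id, path, value):
        ids = self.index.get((path, value))
        if ids is not None:
            ids.discard(doc_id)
        self.index_writes += 1

    def insert(self, doc_id, doc):
        # A full insert indexes every leaf field of the document.
        self.docs[doc_id] = doc
        for path, value in flatten(doc):
            self._index_add(doc_id, path, value)

    def patch_replace(self, doc_id, path, value):
        """Replace one existing scalar field and re-index only that field."""
        parts = path.strip("/").split("/")
        node = self.docs[doc_id]
        for part in parts[:-1]:
            node = node[int(part)] if isinstance(node, list) else node[part]
        last = int(parts[-1]) if isinstance(node, list) else parts[-1]
        old = node[last]
        node[last] = value
        self._index_remove(doc_id, path, old)
        self._index_add(doc_id, path, value)

    def find(self, path, value):
        return sorted(self.index.get((path, value), set()))


users = PatchableCollection()
users.insert("u1", {"name": "Ada",
                    "address": {"city": "SF", "zip": "94105"},
                    "tags": ["a", "b"]})
writes_after_insert = users.index_writes
users.patch_replace("u1", "/address/city", "NYC")
writes_for_patch = users.index_writes - writes_after_insert
print(writes_after_insert, writes_for_patch)  # 5 2
```

Re-inserting the whole document would rewrite all five index entries, while the patch touches only the old and new entries for `/address/city`.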
Kshitij Wadhwa:
In summary, Rockset makes it possible to provide low latency at high write rates with RocksDB-Cloud and remote compaction. Developers also save on compute and I/O with incremental indexing using the Patch API. Overall, real-time insights with millisecond query latency allow applications to move faster. Thanks for listening. If you think Rockset might help you navigate your data better, check out talks.rockset.com to try out a quick start guide with a free developer account.