CDC on DynamoDB
May 10, 2022
DynamoDB is a popular NoSQL database available in AWS. It is a managed service with minimal setup and pay-as-you-go pricing. Developers can quickly create databases that store complex objects with flexible schemas that can change over time. DynamoDB is resilient and scalable because it shards data across partitions automatically. This seamless, horizontal scaling is a huge advantage that allows developers to move from a proof of concept to a production service very quickly.
However, DynamoDB, like many other NoSQL databases, is great for scalable data storage and single row retrieval but leaves a lot to be desired when it comes to analytics. With SQL databases, analysts can quickly join, group and search across historical data sets. With NoSQL, the language for performing these types of queries is often more cumbersome and proprietary, and joining data is either not possible or not recommended due to performance constraints.
To overcome this, Change Data Capture (CDC) techniques are often used to copy changes from the NoSQL database into an analytics database where analysts can perform more computationally heavy tasks across larger datasets. In this post, we’ll look at how CDC works with DynamoDB and its potential use cases.
How Change Data Capture Works on DynamoDB
We have previously discussed the many different CDC techniques available. DynamoDB uses a push-type model: every change to a table is written as an event to a DynamoDB stream, which downstream targets such as a queue or a direct consumer can then read.
Usually, push-based CDC patterns are more complex as they often require another service to act as the middleman between the producer and consumer of the changes. However, DynamoDB streams are built into DynamoDB and, like the database itself, are a managed service within AWS, so they can be enabled on a table with a few clicks. That leaves only two things for you to set up: a consumer for the stream and an alternative data store to receive the changes.
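To make the consumer side more concrete, here is a minimal sketch of a function that receives a batch of stream records, for example as an AWS Lambda handler. It assumes the standard stream record layout (an `eventName` of INSERT, MODIFY or REMOVE, plus `Keys`, `NewImage` and `OldImage` in DynamoDB's typed JSON). The function names `decode_attribute`, `decode_item` and `handler` are our own, and which images are present depends on the stream view type you chose when enabling the stream.

```python
from decimal import Decimal


def decode_attribute(value):
    """Turn one DynamoDB-typed attribute, e.g. {"S": "abc"}, into a plain Python value."""
    type_tag, raw = next(iter(value.items()))
    if type_tag == "N":
        return Decimal(raw)  # numbers arrive as strings
    if type_tag == "NULL":
        return None
    if type_tag == "M":
        return {k: decode_attribute(v) for k, v in raw.items()}
    if type_tag == "L":
        return [decode_attribute(v) for v in raw]
    if type_tag == "NS":
        return {Decimal(n) for n in raw}
    if type_tag == "SS":
        return set(raw)
    return raw  # S, BOOL and binary types pass through unchanged


def decode_item(image):
    """Decode a whole item image; images may be absent depending on the stream view type."""
    if image is None:
        return None
    return {name: decode_attribute(value) for name, value in image.items()}


def handler(event, context=None):
    """Flatten a batch of stream records into simple change dictionaries."""
    changes = []
    for record in event.get("Records", []):
        data = record["dynamodb"]
        changes.append({
            "operation": record["eventName"],  # INSERT, MODIFY or REMOVE
            "keys": decode_item(data["Keys"]),
            "new": decode_item(data.get("NewImage")),
            "old": decode_item(data.get("OldImage")),
        })
    return changes
```

In a real pipeline the returned changes would be written to the alternative data store rather than returned; that part is left out because it depends entirely on the target.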
Use Cases for CDC on DynamoDB
Let's take a look at some use cases for why you would need a CDC solution in the first place.
Archiving Historical Data
Due to its scalability and schemaless nature, DynamoDB is often used to store time-series data such as IoT data or weblogs. The schema of the data in these sources can change depending on what is being logged at any point in time and they often write data at variable speeds depending on current use. This makes DynamoDB a great fit for storing this data as it can handle the flexible schemas and can also scale up and down on demand based on the throughput of data.
However, the utility of this data diminishes over time as the data becomes old and out of date. With pay-as-you-go pricing, the more data stored in DynamoDB the more it costs. This means you only want to use DynamoDB as a hot data store for frequently used data sets. Old and stale data should be removed to save cost and also help with efficiency. Often, companies don't want to simply delete this data and instead want to move it elsewhere for archival.
A DynamoDB stream is a good way to solve this. Changes are captured on the stream and archived in S3 or another data store, while a data retention policy on the DynamoDB table (its Time to Live feature, for example) automatically deletes items after a certain period of time. This reduces storage costs in DynamoDB as the cold data is offloaded to a cheaper storage platform.
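One possible way to wire this up is to archive items at the moment the retention policy removes them. The sketch below assumes the table uses DynamoDB's Time to Live feature and that the stream includes old images; deletions performed by TTL show up as REMOVE records whose `userIdentity` names the DynamoDB service. The helper names `is_ttl_delete` and `to_archive_lines` are hypothetical, and the output is newline-delimited JSON ready to be written to an archive.

```python
import json

# Principal recorded on stream records for deletions performed by TTL.
TTL_PRINCIPAL = "dynamodb.amazonaws.com"


def is_ttl_delete(record):
    """True when the record is a REMOVE carried out by the DynamoDB service itself."""
    identity = record.get("userIdentity") or {}
    return (
        record.get("eventName") == "REMOVE"
        and identity.get("type") == "Service"
        and identity.get("principalId") == TTL_PRINCIPAL
    )


def to_archive_lines(event):
    """Return expired items as newline-delimited JSON, one item per line."""
    lines = []
    for record in event.get("Records", []):
        if not is_ttl_delete(record):
            continue  # ignore inserts, updates and deletes made by applications
        old_image = record["dynamodb"].get("OldImage")
        if old_image is not None:  # absent if the stream omits old images
            lines.append(json.dumps(old_image, sort_keys=True))
    return "\n".join(lines)
```

The resulting string could then be uploaded to S3, for instance with the `put_object` call on a boto3 S3 client. Archiving every change as it happens, rather than only on expiry, is an equally valid design and only changes the filter.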
Real-Time Analytics on DynamoDB
As stated previously, DynamoDB is great at retrieving data fast but isn't designed for large-scale data retrieval or complex queries. For example, let's say you have a game that stores user events for each interaction and these events are being written to DynamoDB. Depending on the number of users playing at any time, you need to quickly scale your storage solution to deal with the current throughput, which makes DynamoDB a great choice.
However, you now want to build a leaderboard that provides statistics for each of these interactions and shows the top ten players based on a particular metric. This leaderboard would need to update in real time as new events are captured. DynamoDB does not natively support real-time aggregations of data so this is another use case for using CDC out to an analytics platform.
Rockset, a real-time analytics database, is an ideal fit for this scenario. It has a built-in connector for DynamoDB that automatically configures the DynamoDB stream so changes are ingested into Rockset in near real time. The data is automatically indexed in Rockset, so aggregations and calculations across it can be expressed in SQL and run as fast analytical queries.
Millisecond-latency queries can be set up to constantly retrieve the latest version of the leaderboard as new data is ingested. Like DynamoDB, Rockset is a fully serverless solution providing the same scaling and hands-free infrastructure benefits.
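To illustrate the kind of aggregation the leaderboard needs, here is a small in-memory sketch that folds stream records into running totals and returns the top players. It is not how Rockset works internally; it only shows the computation that has to happen somewhere downstream of the stream. The attribute names `player_id` and `points` and the `Leaderboard` class are assumptions made for the example, and it ignores duplicate deliveries and restarts.

```python
import heapq
from collections import defaultdict


class Leaderboard:
    """Keeps a running points total per player from INSERT stream records."""

    def __init__(self):
        self.totals = defaultdict(int)

    def apply(self, record):
        """Add the points from one newly inserted game event."""
        if record["eventName"] != "INSERT":
            return  # this sketch only counts new events
        image = record["dynamodb"]["NewImage"]
        player = image["player_id"]["S"]      # hypothetical attribute name
        points = int(image["points"]["N"])    # hypothetical attribute name
        self.totals[player] += points

    def top(self, n=10):
        """Return the n highest (player, total) pairs, best first."""
        return heapq.nlargest(n, self.totals.items(), key=lambda kv: kv[1])
```

Calling `apply` for each record in a batch and then `top()` gives the current top ten. An analytics database replaces this hand-written state with an indexed, queryable copy of the events.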
Joining Datasets Together
Similar to its lack of analytics capabilities, DynamoDB doesn’t support the joining of tables in queries. NoSQL databases in general tend to lack this capability as data is stored in more complex structures instead of in flat, relational schemas. However, there are times when joining data together for analytics is critical.
Going back to our real-time gaming leaderboard, rather than just using data from one DynamoDB table, what if we wanted our leaderboard to contain other metadata about a user that comes from a different data source altogether? What if we also wanted to show past performance? These use cases would require queries with table joins.
Again, we could continue to use Rockset in this scenario. Rockset has multiple connectors available for databases like MySQL, Postgres, MongoDB, flat files and many more. We could set up connectors to update the data in real time and then amend our leaderboard SQL query to join this data, along with a subquery of past performance, so that both are shown alongside the current leaderboard scores.
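As a rough illustration of what such a joined leaderboard query looks like, the sketch below runs one against an in-memory SQLite database from Python. SQLite is only a stand-in so the example is runnable; the table names `game_events` and `players`, their columns and the sample rows are all made up, and the SQL dialect of your analytics database may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE game_events (player_id TEXT, points INTEGER);
CREATE TABLE players (player_id TEXT PRIMARY KEY, display_name TEXT, country TEXT);
INSERT INTO game_events VALUES ('p1', 5), ('p2', 3), ('p1', 4);
INSERT INTO players VALUES ('p1', 'Ann', 'UK'), ('p2', 'Bob', 'US');
""")

# Aggregate events per player, then enrich each row with metadata from a second table.
LEADERBOARD_SQL = """
SELECT p.display_name, p.country, SUM(e.points) AS total_points
FROM game_events AS e
JOIN players AS p ON p.player_id = e.player_id
GROUP BY p.player_id, p.display_name, p.country
ORDER BY total_points DESC
LIMIT 10
"""

rows = conn.execute(LEADERBOARD_SQL).fetchall()
for display_name, country, total_points in rows:
    print(display_name, country, total_points)
```

Here `game_events` plays the role of the data synced from DynamoDB and `players` the metadata from another source. A past-performance subquery would be joined in the same way.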
Search
Another use case for implementing CDC with DynamoDB streams is search. As we know, DynamoDB is great for fast document lookups using indexes but searching and filtering large data sets is typically slow.
For searching documents with lots of text, AWS offers CloudSearch, a managed search solution that provides flexible indexing to provide fast search results with custom, weighted ordering. It is possible to sync DynamoDB data into CloudSearch; however, the solution currently does not make use of DynamoDB Streams and requires a manual technical solution to sync the data.
On the other hand, with Rockset you can use its DynamoDB connector to sync data in near real time into Rockset, where for a simple search you can use standard SQL where clauses. For more complex search, Rockset offers search functions to look for specific terms, boost certain results and also perform proximity matching. This could be a viable alternative to AWS CloudSearch if you aren’t searching through large amounts of text, and it is also easier to set up because it uses the DynamoDB streams CDC method. The data also becomes searchable in near real time and is indexed automatically. CloudSearch, by contrast, has limitations on data size and upload frequency in a 24-hour period.
A Flexible and Future-Proofed Solution
It is clear that AWS DynamoDB is a great NoSQL database offering. It is fully managed, easily scalable and cost-effective for developers building solutions that require fast writes and fast single row lookups. For use cases outside of this, you will probably want to implement a CDC solution to move the data into an alternative data store that is more suited to the use case. DynamoDB makes this easy with the use of DynamoDB streams.
Rockset takes advantage of DynamoDB streams by providing a built-in connector that can capture changes in seconds. As I have described, many of the common use cases for implementing a CDC solution for DynamoDB can be covered by Rockset. Being a fully managed service, it removes infrastructure burdens from developers. Whether your use case is real-time analytics, joining data and/or search, Rockset can provide all three on the same datasets, meaning you can solve more use cases with fewer architectural components.
This makes Rockset a flexible and future-proofed solution for many real-time analytic use cases on data stored in DynamoDB.