Mutable Data in Rockset
January 19, 2024
Data mutability is the ability of a database to support mutations (updates and deletes) to the data that’s stored inside it. It’s a critical feature, especially in real-time analytics where data constantly changes and you need to present the latest version of that data to your customers and end users. Data can arrive late, it can be out of order, it can be incomplete or you might have a scenario where you need to enrich and extend your datasets with additional information for them to be complete. In either case, the ability to change your data is very important.
Rockset is fully mutable
Rockset is a fully mutable database. It supports frequent updates and deletes on document level, and is also very efficient at performing partial updates, when only a few attributes (even those deeply nested ones) in your documents have changed. You can read more about mutability in real-time analytics and how Rockset solves this here.
Being fully mutable means that common problems, like late arriving data, duplicated or incomplete data can be handled gracefully and at scale within Rockset.
There are three different ways how you can mutate data in Rockset:
- You can mutate data at ingest time through SQL ingest transformations, which act as a simple ETL (Extract-Transform-Load) framework. When you connect your data sources to Rockset, you can use SQL to manipulate data in-flight and filter it, add derived columns, remove columns, mask or manipulate personal information by using SQL functions, and so on. Transformations can be done on data source level and on collection level and this is a great way to put some scrutiny to your incoming datasets and do schema enforcement when needed. Read more about this feature and see some examples here.
- You can update and delete your data through dedicated REST API endpoints. This is a great approach if you prefer programmatic access or if you have a custom process that feeds data into Rockset.
- You can update and delete your data by executing SQL queries, as you normally would with a SQL-compatible database. This is well suited for manipulating data on single documents but also on sets of documents (or even on whole collections).
In this blog, we’ll go through a set of very practical steps and examples on how to perform mutations in Rockset via SQL queries.
Using SQL to manipulate your data in Rockset
There are two important concepts to understand around mutability in Rockset:
- Every document that is ingested gets an
_id
attribute assigned to it. This attributes acts as a primary key that uniquely identifies a document within a collection. You can have Rockset generate this attribute automatically at ingestion, or you can supply it yourself, either directly in your data source or by using an SQL ingest transformation. Read more about the_id
field here. - Updates and deletes in Rockset are treated similarly to a CDC (Change Data Capture) pipeline. This means that you don’t execute a direct
update
ordelete
command; instead, you insert a record with an instruction to update or delete a particular set of documents. This is done with theinsert into select
statement and the_op
field. For example, instead of writingdelete from my_collection where id = '123'
, you would write this:insert into my_collection select '123' as _id, 'DELETE' as _op
. You can read more about the_op
field here.
Now that you have a high level understanding of how this works, let’s dive into concrete examples of mutating data in Rockset via SQL.
Examples of data mutations in SQL
Let’s imagine an e-commerce data model where we have a user
collection with the following attributes (not all shown for simplicity):
_id
name
surname
email
date_last_login
country
We also have an order
collection:
_id
user_id
(reference to theuser
)order_date
total_amount
We’ll use this data model in our examples.
Scenario 1 - Update documents
In our first scenario, we want to update a specific user’s e-mail. Traditionally, we would do this:
update user
set email = 'new_email@company.com'
where _id = '123';
This is how you would do it in Rockset:
insert into user
select
'123' as _id,
'UPDATE' as _op,
'new_email@company.com' as email;
This will update the top-level attribute email
with the new e-mail for the user 123
. There are other _op
commands that can be used as well - like UPSERT
if you want to insert the document in case it doesn’t exist, or REPLACE
to replace the full document (with all attributes, including nested attributes), REPSERT
, etc.
You can also do more complex things here, like perform a join, include a where
clause, and so on.
Scenario 2 - Delete documents
In this scenario, user 123
is off-boarding from our platform and so we need to delete his record from the collection.
Traditionally, we would do this:
delete from user
where _id = '123';
In Rockset, we will do this:
insert into user
select
'123' as _id,
'DELETE' as _op;
Again, we can do more complex queries here and include joins and filters. In case we need to delete more users, we could do something like this, thanks to native array support in Rockset:
insert into user
select
_id,
'DELETE' as _op
from
unnest(['123', '234', '345'] as _id);
If we wanted to delete all records from the collection (similar to a TRUNCATE
command), we could do this:
insert into user
select
_id,
'DELETE' as _op
from
user;
Scenario 3 - Add a new attribute to a collection
In our third scenario, we want to add a new attribute to our user
collection. We’ll add a fullname
attribute as a combination of name
and surname
.
Traditionally, we would need to do an alter table add column
and then either include a function to calculate the new field value, or first default it to null
or empty string, and then do an update
statement to populate it.
In Rockset, we can do this:
insert into user
select
_id,
'UPDATE' as _op,
concat(name, ' ', surname) as fullname
from
user;
Scenario 4 - Remove an attribute from a collection
In our fourth scenario, we want to remove the email
attribute from our user
collection.
Again, traditionally this would be an alter table remove column
command, and in Rockset, we will do the following, leveraging the REPSERT operation which replaces the whole document:
insert into user
select
*
except(email), --we are removing the email atttribute
'REPSERT' as _op
from
user;
Scenario 5 - Create a materialized view
In this example, we want to create a new collection that will act as a materialized view. This new collection will be an order summary where we track the full amount and last order date on country level.
First, we will create a new order_summary
collection - this can be done via the Create Collection API or in the console, by choosing the Write API data source.
Then, we can populate our new collection like this:
insert into order_summary
with
orders_country as (
select
u.country,
o.total_amount,
o.order_date
from
user u inner join order o on u._id = o.user_id
)
select
oc.country as _id, --we are tracking orders on country level so this is our primary key
sum(oc.total_amount) as full_amount,
max(oc.order_date) as last_order_date
from
orders_country oc
group by
oc.country;
Because we explicitly set _id
field, we can support future mutations to this new collection, and this approach can be easily automated by saving your SQL query as a query lambda, and then creating a schedule to run the query periodically. That way, we can have our materialized view refresh periodically, for example every minute. See this blog post for more ideas on how to do this.
Conclusion
As you can see throughout the examples in this blog, Rockset is a real-time analytics database that is fully mutable. You can use SQL ingest transformations as a simple data transformation framework over your incoming data, REST endpoints to update and delete your documents, or SQL queries to perform mutations at the document and collection level as you would in a traditional relational database. You can change full documents or just relevant attributes, even when they are deeply nested.
We hope the examples in the blog are useful - now go ahead and mutate some data!