Smart Schema: Enabling SQL Queries on Semi-Structured Data
November 19, 2020
Rockset is a real-time indexing database in the cloud for serving low-latency, high-concurrency queries at scale. It is particularly well-suited for serving the real-time analytical queries that power apps, such as personalization or recommendation engines, location search, and so on.
In this blog post, we show how Rockset’s Smart Schema feature lets developers use real-time SQL queries to extract meaningful insights from raw semi-structured data ingested without a predefined schema.
Challenges with Semi-Structured Data
Interrogating underlying data to frame questions about it is rather challenging if you don't understand the shape of the data.
This is particularly true given the nature of real-world data. Developers often find themselves working with data sets that are messy, with no fixed schema. For example, they will often include heavily nested JSON data with several deeply nested arrays and objects, with mixed data types and sparse fields.
In addition, you may need to continuously sync new data or pull data from different data sources over time. As a result, the shape of the underlying data will change continuously.
Problems with Current Data Systems
Most of the current data systems fail to address these pain points without introducing additional preprocessing steps that are, in themselves, painful.
In SQL-based systems, the data is strongly and statically typed. All the values in the same column have to be of the same type, and, in general, the data must follow a fixed schema that cannot be easily modified. Ingesting semi-structured data into SQL data systems is not an easy task, especially early on when the data model is still evolving. As a result, organizations usually have to build hard-to-maintain ETL pipelines to feed semi-structured data into their SQL systems.
In NoSQL systems, data is strongly typed but dynamically so. The same field can hold values of different types across documents. NoSQL systems are designed to simplify data writes, requiring no schema and little or no upfront data transformation.
However, while schemaless or schema-unaware NoSQL systems make it simple to ingest semi-structured data into the system without ETL pipelines, without a known data model, reading data out in a meaningful way is more complicated. They are also not as powerful at analytical queries as SQL systems due to their inability to perform complex joins and aggregations. Thus, with its rigid data typing and schemas, SQL continues to be a powerful and popular query language for real-time analytical queries.
Rockset Provides Data and Query Flexibility
At Rockset, we have built an SQL database that is dynamically typed but schema-aware. In this way, our customers benefit from the best of both data-system approaches: the flexibility of NoSQL without sacrificing any of the analytical powers of SQL.
To allow complex data to be written as easily as possible, Rockset supports schemaless ingestion of your raw semi-structured data. The schema does not need to be known or defined ahead of time, and no clunky ETL pipelines are required. Rockset then allows you to query this raw data using SQL—including complex analytical queries—by supporting fast joins and aggregations out of the box.
In other words, Rockset does not require a schema but is nevertheless schema-aware, coupling the flexibility of schemaless ingest at write time with the ability to infer the schema at read time.
Smart Schema: Concept and Architecture
Rockset automatically and continuously infers the schema based on the exact fields and types present in the ingested data. Note that Rockset generates the schema based on the entire data set, not just a sample of the data. Smart Schema evolves to fit new fields and types as new semi-structured data is schemalessly ingested.
Figure 1: Example of Smart Schema generated for a collection
Figure 1 shows on the left a collection of documents that have the fields "name," "age," and "zip." In this collection, there are both missing fields and fields with mixed types. On the right, you see the Smart Schema that would be built and maintained for this collection. For each field, you have all of its corresponding types, the occurrences of each field type, and the total number of documents in the collection. This helps us understand exactly what fields are present in the data set, what types they are, and how dense or sparse they may be.
For example, “zip” has a mixed data type: It is a string in three out of the six documents in the collection, a float in one, and an integer in one. It is also missing in one of the documents. Similarly “age” occurs four times as an integer and is missing in two of the documents.
So even without upfront knowledge of this collection’s schema, Smart Schema provides a good summary of how the data is shaped and what you can expect from the collection.
Smart Schema in Action: Movie Recommendations
This demo shows how the data from two ingested JSON data sets (commons.movie_ratings
and commons.movies
) can be navigated and used to construct SQL queries for a movie recommendation engine.
Understanding Shape of the Data
The first step is to use the Smart Schemas to understand the shape of the data sets, which were ingested as semi-structured data, without specifying a schema.
Figure 2: Smart Schema for an ingested collection
The automatically generated schema will appear on the left. Figure 2 gives a partial view of the list of fields that belong to the movie_ratings
collection, and when you hover over a field, you see the distribution of its underlying field types and the field’s overall occurrence within the collection.
The movieId
field, for example, is always a string, and it occurs in 100% of the documents in the collection. The rating
field, on the other hand, is of mixed types: 78% int and 22% float:
If you run the following query:
DESCRIBE movie-ratings;
you will see the schema for the movie_ratings
collection as a table in the Results panel as shown in Figure 3.
Figure 3: Smart Schema table for movie_ratings
Similarly, in the movies
collection, we have a list of fields, such as genres
, which is an array type with nested objects, each of which has id
, which is of type int, and name
, which is of type string.
So, you can think of the movies
and the movie_ratings
collections as dimension and fact collections, and now that we understand how to explore the shape of the data at a high level, let's start constructing SQL queries.
Constructing SQL Queries
Let's start by getting a list from the movie_ratings
collection of the movieId
of the top 5 movies in descending order of their average rating
. To do this, we use the SQL Editor in the Rockset Console to write a simple aggregation query as follows:
If you want to make sure that the average rating is based on a reasonable number of reviewers, you can add an additional predicate using the HAVING
clause, where the rating
count must be equal to or greater than 5.
When you run the query, here is the result:
If you want to list the top five movies by name instead of ID, you simply join the movie_ratings
collection with the movies
collection and extract the field title
from the output of that join. To do this, we copy the previous query and change it with an INNER JOIN
on the collection movies
(alias mv
)and update the qualifying fields (circled below) accordingly:
Now when you run the query, you get a list of movie titles instead of IDs:
And finally, let's say you also want to list the names of the genres that these movies belong to. The field genres
is an array of nested objects. In order to extract the field genres.name
, you have to flatten the array, i.e., unnest it. Copying (and formatting) the same query, you use UNNEST
to flatten the genres
array from the movies
collection (mv.genres
), giving it an alias g
and then extracting the genre name (g.name
) in the GROUP BY
clause:
And if you want to list the top five movies in a particular genre, you do it simply by adding a WHERE
clause under g.name
(in the example shown below, Thriller
):
Now you will get the top five movies in the genre Thriller, as shown below:
And That’s Not All...
If you want your application to give movie recommendations based on user-specified genres, ratings, and other such fields, this can be achieved by Rockset’s Query Lambdas feature, which lets you parameterize queries that can then be invoked by your application from a dedicated REST endpoint.
Check out our video where we talk about all Smart Schema, and let us know what you think.
Embedded content: https://www.youtube.com/watch?v=2fjO2qSRduc