Smart Schema: Enabling SQL Queries on Semi-Structured Data

November 19, 2020

Shawn Adams

Solutions Engineer, Rockset

Rockset is a real-time indexing database in the cloud for serving low-latency, high-concurrency queries at scale. It is particularly well-suited for serving the real-time analytical queries that power apps, such as personalization or recommendation engines, location search, and so on.

In this blog post, we show how Rockset’s Smart Schema feature lets developers use real-time SQL queries to extract meaningful insights from raw semi-structured data ingested without a predefined schema.

smart-schema-rockset

Challenges with Semi-Structured Data

Interrogating underlying data to frame questions about it is rather challenging if you don't understand the shape of the data.

This is particularly true given the nature of real-world data. Developers often find themselves working with data sets that are messy, with no fixed schema. For example, they will often include heavily nested JSON data with several deeply nested arrays and objects, with mixed data types and sparse fields.

In addition, you may need to continuously sync new data or pull data from different data sources over time. As a result, the shape of the underlying data will change continuously.

Problems with Current Data Systems

Most of the current data systems fail to address these pain points without introducing additional preprocessing steps that are, in themselves, painful.

In SQL-based systems, the data is strongly and statically typed. All the values in the same column have to be of the same type, and, in general, the data must follow a fixed schema that cannot be easily modified. Ingesting semi-structured data into SQL data systems is not an easy task, especially early on when the data model is still evolving. As a result, organizations usually have to build hard-to-maintain ETL pipelines to feed semi-structured data into their SQL systems.

In NoSQL systems, data is strongly typed but dynamically so. The same field can hold values of different types across documents. NoSQL systems are designed to simplify data writes, requiring no schema and little or no upfront data transformation.

However, while schemaless or schema-unaware NoSQL systems make it simple to ingest semi-structured data into the system without ETL pipelines, without a known data model, reading data out in a meaningful way is more complicated. They are also not as powerful at analytical queries as SQL systems due to their inability to perform complex joins and aggregations. Thus, with its rigid data typing and schemas, SQL continues to be a powerful and popular query language for real-time analytical queries.

Rockset Provides Data and Query Flexibility

At Rockset, we have built an SQL database that is dynamically typed but schema-aware. In this way, our customers benefit from the best of both data-system approaches: the flexibility of NoSQL without sacrificing any of the analytical powers of SQL.

To allow complex data to be written as easily as possible, Rockset supports schemaless ingestion of your raw semi-structured data. The schema does not need to be known or defined ahead of time, and no clunky ETL pipelines are required. Rockset then allows you to query this raw data using SQL—including complex analytical queries—by supporting fast joins and aggregations out of the box.

In other words, Rockset does not require a schema but is nevertheless schema-aware, coupling the flexibility of schemaless ingest at write time with the ability to infer the schema at read time.

Smart Schema: Concept and Architecture

Rockset automatically and continuously infers the schema based on the exact fields and types present in the ingested data. Note that Rockset generates the schema based on the entire data set, not just a sample of the data. Smart Schema evolves to fit new fields and types as new semi-structured data is schemalessly ingested.

smart-schema-ex Figure 1: Example of Smart Schema generated for a collection

Figure 1 shows on the left a collection of documents that have the fields "name," "age," and "zip." In this collection, there are both missing fields and fields with mixed types. On the right, you see the Smart Schema that would be built and maintained for this collection. For each field, you have all of its corresponding types, the occurrences of each field type, and the total number of documents in the collection. This helps us understand exactly what fields are present in the data set, what types they are, and how dense or sparse they may be.

For example, “zip” has a mixed data type: It is a string in three out of the six documents in the collection, a float in one, and an integer in one. It is also missing in one of the documents. Similarly “age” occurs four times as an integer and is missing in two of the documents.

So even without upfront knowledge of this collection’s schema, Smart Schema provides a good summary of how the data is shaped and what you can expect from the collection.

Smart Schema in Action: Movie Recommendations

This demo shows how the data from two ingested JSON data sets (commons.movie_ratings and commons.movies) can be navigated and used to construct SQL queries for a movie recommendation engine.

Understanding Shape of the Data

The first step is to use the Smart Schemas to understand the shape of the data sets, which were ingested as semi-structured data, without specifying a schema.

smart-schema-console Figure 2: Smart Schema for an ingested collection

The automatically generated schema will appear on the left. Figure 2 gives a partial view of the list of fields that belong to the movie_ratings collection, and when you hover over a field, you see the distribution of its underlying field types and the field’s overall occurrence within the collection.

The movieId field, for example, is always a string, and it occurs in 100% of the documents in the collection. The rating field, on the other hand, is of mixed types: 78% int and 22% float:

If you run the following query:

DESCRIBE movie-ratings;

you will see the schema for the movie_ratings collection as a table in the Results panel as shown in Figure 3.

smart-schema-movie-ratings Figure 3: Smart Schema table for movie_ratings

Similarly, in the movies collection, we have a list of fields, such as genres, which is an array type with nested objects, each of which has id, which is of type int, and name, which is of type string.

smart-schema-movies

So, you can think of the movies and the movie_ratings collections as dimension and fact collections, and now that we understand how to explore the shape of the data at a high level, let's start constructing SQL queries.

Constructing SQL Queries

Let's start by getting a list from the movie_ratings collection of the movieId of the top 5 movies in descending order of their average rating. To do this, we use the SQL Editor in the Rockset Console to write a simple aggregation query as follows:

smart-schema-sql-top5

If you want to make sure that the average rating is based on a reasonable number of reviewers, you can add an additional predicate using the HAVING clause, where the rating count must be equal to or greater than 5.

smart-schema-sql-top5-2

When you run the query, here is the result:

smart-schema-top5-id

If you want to list the top five movies by name instead of ID, you simply join the movie_ratings collection with the movies collection and extract the field title from the output of that join. To do this, we copy the previous query and change it with an INNER JOIN on the collection movies (alias mv)and update the qualifying fields (circled below) accordingly:

smart-schema-sql-top5-titles

Now when you run the query, you get a list of movie titles instead of IDs:

smart-schema-top5-titles

And finally, let's say you also want to list the names of the genres that these movies belong to. The field genres is an array of nested objects. In order to extract the field genres.name, you have to flatten the array, i.e., unnest it. Copying (and formatting) the same query, you use UNNEST to flatten the genres array from the movies collection (mv.genres), giving it an alias g and then extracting the genre name (g.name) in the GROUP BY clause:

smart-schema-sql-top5-genres

And if you want to list the top five movies in a particular genre, you do it simply by adding a WHERE clause under g.name (in the example shown below, Thriller):

smart-schema-sql-top5-thriller

Now you will get the top five movies in the genre Thriller, as shown below:

smart-schema-top5-thriller

And That’s Not All...

If you want your application to give movie recommendations based on user-specified genres, ratings, and other such fields, this can be achieved by Rockset’s Query Lambdas feature, which lets you parameterize queries that can then be invoked by your application from a dedicated REST endpoint.

Check out our video where we talk about all Smart Schema, and let us know what you think.

Embedded content: https://www.youtube.com/watch?v=2fjO2qSRduc