Using Smart Schema to Accelerate Insights from Nested JSON

February 21, 2019

,
See Rockset
in action

Get a product tour with a Rockset engineer

Developers often need to work with datasets without a fixed schema, like heavily nested JSON data with several deeply nested arrays and objects, mixed data types, null values, and missing fields. In addition, the shape of the data is prone to change when continuously syncing new data. Understanding the shape of a dataset is crucial to constructing complex queries for building applications or performing data science investigations.

This blog walks through how Rockset's Smart Schema feature automates schema inference at read time, enabling us to go from complex JSON data, with nested objects and arrays, to insights without any friction.

Using Smart Schema to Understand Your Data

On Grammy night, as I was watching the award ceremony live, I decided to start poking around the live Twitter stream to see how the Twitterverse was reacting to it. To do this, I ingested the live Twitter stream into a Rockset collection called twitter_collection to list the top 5 trending hashtags.

Without any upfront knowledge of what the Twitter data looks like, let's call DESCRIBE on the collection to understand the shape of the data.

The output of DESCRIBE has the following fields:

  • field: Every distinct field name in the collection
  • type: The data type of the field
  • occurrences: The number of documents that have this field in the given type
  • total: Total number of documents in the collection for top level fields, and total number of documents that have the parent field for nested fields

This output is what we refer to as Smart Schema. It tells us what fields are in the dataset, what types they are, and how dense or sparse they may be. Here is a snippet of the Smart Schema for twitter_collection.

rockset> DESCRIBE twitter_collection;

+-----------------------------------------------+---------------+---------+-----------+
| field                                         | occurrences   | total   | type      |
|-----------------------------------------------+---------------+---------+-----------|
| ['id']                                        | 4181419       | 4181419 | string    |
| ['_event_time']                               | 4181419       | 4181419 | timestamp |
| ['coordinates']                               | 4178582       | 4181419 | null_type |
| ['coordinates']                               | 2837          | 4181419 | object    |
| ['coordinates', 'type']                       | 2837          | 2837    | string    |
| ['coordinates', 'coordinates']                | 2837          | 2837    | array     |
| ['coordinates', 'coordinates', '*']           | 5673          | 5674    | float     |
| ['coordinates', 'coordinates', '*']           | 1             | 5674    | int       |
| ['created_at']                                | 4181419       | 4181419 | string    |
| ['display_text_range']                        | 228832        | 4181419 | array     |
| ['display_text_range', '*']                   | 457664        | 457664  | int       |
| ['entities']                                  | 4181419       | 4181419 | object    |
| ['entities', 'hashtags']                      | 4181419       | 4181419 | array     |
| ['entities', 'hashtags', '*']                 | 1301581       | 1301581 | object    |
| ['entities', 'hashtags', '*', 'indices']      | 1301581       | 1301581 | array     |
| ['entities', 'hashtags', '*', 'indices', '*'] | 2603162       | 2603162 | int       |
| ['entities', 'hashtags', '*', 'text']         | 1301581       | 1301581 | string    |
| ['entities', 'user_mentions']                 | 4181419       | 4181419 | array     |
+-----------------------------------------------+---------------+---------+-----------+

We can infer from this Smart Schema that the data appears to have JSON documents with nested objects, arrays, and scalars. In addition, it has sparse fields and fields of mixed types.

The field that looks most relevant here is entities.hashtags, which is nested inside an object called entities. Note that to access nested fields inside objects, we concatenate the field names with a . (dot) as a separator. Let's explore the array field entities.hashtags further to understand its shape.

entities.hashtags is an array of objects. Each of these objects has a field called indices, which is an array of integers, and a field called text, which is a string. Also, not all the documents that have the entities.hashtags array have nested objects within it, as is evident from the occurrences of the nested objects inside entities.hashtags being lesser than the occurrences of entities.hashtags.

Here are 2 sample hashtags objects from 2 documents in the collection:

{ 
    "hashtags": [ 
                    { "text": "AmazonMusic", 
                      "indices": [ 15, 27 ] 
                    }, 
                    { "text": "ジョニ・ミッチェル", 
                      "indices": [ 33, 43 ] 
                    }, 
                    { "text": "Blue", 
                      "indices": [ 46, 51 ] 
                    } 
                 ] 
 }
 
 { 
    "hashtags": [] 
 }

One document has the field hashtags with an array of nested objects, and the other document has hashtags with an empty array.

The field text nested inside the entities.hashtags array is the one we are interested in. Note that text is a SQL NULL or undefined in documents where entities.hashtags is an empty array. We can use the IS NOT NULL predicate to filter out all such values.

So What's Trending at the Grammys?

Now that we know what the data looks like, let's build a simple query to get a few text fields in the hashtags. Rockset treats arrays as virtual collections. When using a nested array as a target collection in queries we use the delimiter : (colon) as a separator between the root collection and the nested fields. We can use the field entities.hashtags, which is an array, as a target collection in the following query:

rockset> SELECT 
             text
         FROM
             twitter_collection:entities.hashtags AS hashtags 
         WHERE 
             text IS NOT NULL
         LIMIT 5;          

+----------------+
| text           |
|----------------|
| Grammys        |
| TearItUpBTS    |
| BLINK          |
| daSnakZ        |
| SNKZ           |
+----------------+

Great! Building from here, a query that lists 5 hashtags in the decreasing order of their counts—basically the top 5 trending hashtags—would look like this:

rockset> SELECT 
             text AS hashtag
         FROM 
             twitter_collection:entities.hashtags AS hashtags
         WHERE 
             text IS NOT NULL 
         GROUP BY 
             text 
         ORDER BY  
             COUNT(*) DESC 
         LIMIT 5; 

+-----------------+
| hashtag         |
|-----------------|
| GRAMMYs         |
| TearItUpBTS     |
| Grammys         |
| ROSÉ            |
| music           |
+-----------------+

Clearly, there was a lot of talk about the Grammys on Twitter and BTS seemed to be tearing it up!

Next, I was curious whom the Twitterverse was backing at the Grammys. I thought that would correlate with the most popular user mentions.

With a quick peek at the Smart Schema snippet above, I see an array field called entities.user_mentions that looks relevant.

Let's explore the nested array entities.user_mentions further using DESCRIBE.

rockset> DESCRIBE twitter_collection:entities.user_mentions;

+-----------------------+---------------+----------+-----------+
| field                 | occurrences   | total    | type      |
|-----------------------+---------------+----------+-----------|
| ['*']                 | 1531518       | 1531518  | object    |
| ['*', 'id']           | 329           | 1531518  | null_type |
| ['*', 'id']           | 1531189       | 1531518  | int       |
| ['*', 'id_str']       | 1531189       | 1531518  | string    |
| ['*', 'id_str']       | 329           | 1531518  | null_type |
| ['*', 'indices']      | 1531518       | 1531518  | array     |
| ['*', 'indices', '*'] | 3063036       | 3063036  | int       |
| ['*', 'name']         | 1531189       | 1531518  | string    |
| ['*', 'name']         | 329           | 1531518  | null_type |
| ['*', 'screen_name']  | 1531518       | 1531518  | string    |
+-----------------------+---------------+----------+-----------+

entities.user_mentions is an array of nested objects as we can see above. The most relevant fields in these nested objects appear to be name and screen_name. Let's stick with name for this analysis. From the Smart Schema above, we can see that while name is of type 'string' in most documents, it is a JSON NULL(null_type) in a few documents. A JSON NULL is not the same as a SQL NULL. We can filter the JSON NULLs out by using Rockset's typeof function.

Here is a simple query that lists 5 user mention names.

rockset> SELECT      
             col.name
         FROM
             twitter_collection:entities.user_mentions AS col
         WHERE 
             typeof(col.name) = 'string'
         LIMIT 5;

+------------------------------------+
| name                               |
|------------------------------------|
| Nina Dobrev                        |
| H.E.R.                             |
| nctea                              |
| StopVientresAlquiler               |
| 小林由依1st写真集_3月13日発売_公式     |
+------------------------------------+

To list the 5 most popular user mentions, I'll demonstrate another method that involves using UNNEST. I constructed the target collection by expanding the user_mentions array using UNNEST and joining it with twitter_collection. Here is the fully fleshed out query:

rockset> SELECT
             user.user.name
         FROM
             twitter_collection AS col, 
             UNNEST(col.entities.user_mentions AS user) AS user
         WHERE
             typeof(user.user.name) = 'string'     
         GROUP BY
             user.user.name
         ORDER BY
             COUNT(*) DESC
         LIMIT 5; 

+---------------------+
| name                |
|---------------------|
| 방탄소년단             |
| Michelle Obama      |
| H.E.R.              |
| lego                |
| BT21                |
+---------------------+

I needed some help from Google to translate "방탄소년단" for me.

Screen Shot 2019-02-20 at 4.35.44 PM

Even though they did not win at the Grammys, BTS had clearly won over the Twitterverse!

And we've gone from data to insights in no time, using Smart Schema to help us understand what our data is all about. No data prep, no schema modeling, no ETL pipelines.