Using Smart Schema to Accelerate Insights from Nested JSON
February 21, 2019
Developers often need to work with datasets without a fixed schema, like heavily nested JSON data with several deeply nested arrays and objects, mixed data types, null values, and missing fields. In addition, the shape of the data is prone to change when continuously syncing new data. Understanding the shape of a dataset is crucial to constructing complex queries for building applications or performing data science investigations.
This blog walks through how Rockset's Smart Schema feature automates schema inference at read time, enabling us to go from complex JSON data, with nested objects and arrays, to insights without any friction.
Using Smart Schema to Understand Your Data
On Grammy night, as I was watching the award ceremony live, I decided to start poking around the live Twitter stream to see how the Twitterverse was reacting to it. To do this, I ingested the live Twitter stream into a Rockset collection called twitter_collection
to list the top 5 trending hashtags.
Without any upfront knowledge of what the Twitter data looks like, let's call DESCRIBE on the collection to understand the shape of the data.
The output of DESCRIBE has the following fields:
- field: Every distinct field name in the collection
- type: The data type of the field
- occurrences: The number of documents that have this field in the given type
- total: Total number of documents in the collection for top level fields, and total number of documents that have the parent field for nested fields
This output is what we refer to as Smart Schema. It tells us what fields are in the dataset, what types they are, and how dense or sparse they may be. Here is a snippet of the Smart Schema for twitter_collection
.
rockset> DESCRIBE twitter_collection;
+-----------------------------------------------+---------------+---------+-----------+
| field | occurrences | total | type |
|-----------------------------------------------+---------------+---------+-----------|
| ['id'] | 4181419 | 4181419 | string |
| ['_event_time'] | 4181419 | 4181419 | timestamp |
| ['coordinates'] | 4178582 | 4181419 | null_type |
| ['coordinates'] | 2837 | 4181419 | object |
| ['coordinates', 'type'] | 2837 | 2837 | string |
| ['coordinates', 'coordinates'] | 2837 | 2837 | array |
| ['coordinates', 'coordinates', '*'] | 5673 | 5674 | float |
| ['coordinates', 'coordinates', '*'] | 1 | 5674 | int |
| ['created_at'] | 4181419 | 4181419 | string |
| ['display_text_range'] | 228832 | 4181419 | array |
| ['display_text_range', '*'] | 457664 | 457664 | int |
| ['entities'] | 4181419 | 4181419 | object |
| ['entities', 'hashtags'] | 4181419 | 4181419 | array |
| ['entities', 'hashtags', '*'] | 1301581 | 1301581 | object |
| ['entities', 'hashtags', '*', 'indices'] | 1301581 | 1301581 | array |
| ['entities', 'hashtags', '*', 'indices', '*'] | 2603162 | 2603162 | int |
| ['entities', 'hashtags', '*', 'text'] | 1301581 | 1301581 | string |
| ['entities', 'user_mentions'] | 4181419 | 4181419 | array |
+-----------------------------------------------+---------------+---------+-----------+
We can infer from this Smart Schema that the data appears to have JSON documents with nested objects, arrays, and scalars. In addition, it has sparse fields and fields of mixed types.
The field that looks most relevant here is entities.hashtags
, which is nested inside an object called entities
. Note that to access nested fields inside objects, we concatenate the field names with a . (dot) as a separator. Let's explore the array field entities.hashtags
further to understand its shape.
entities.hashtags
is an array of objects. Each of these objects has a field called indices
, which is an array of integers, and a field called text
, which is a string. Also, not all the documents that have the entities.hashtags
array have nested objects within it, as is evident from the occurrences of the nested objects inside entities.hashtags
being lesser than the occurrences of entities.hashtags
.
Here are 2 sample hashtags
objects from 2 documents in the collection:
{
"hashtags": [
{ "text": "AmazonMusic",
"indices": [ 15, 27 ]
},
{ "text": "ジョニ・ミッチェル",
"indices": [ 33, 43 ]
},
{ "text": "Blue",
"indices": [ 46, 51 ]
}
]
}
{
"hashtags": []
}
One document has the field hashtags
with an array of nested objects, and the other document has hashtags
with an empty array.
The field text
nested inside the entities.hashtags
array is the one we are interested in. Note that text
is a SQL NULL or undefined in documents where entities.hashtags
is an empty array. We can use the IS NOT NULL predicate to filter out all such values.
So What's Trending at the Grammys?
Now that we know what the data looks like, let's build a simple query to get a few text fields in the hashtags. Rockset treats arrays as virtual collections. When using a nested array as a target collection in queries we use the delimiter : (colon) as a separator between the root collection and the nested fields. We can use the field entities.hashtags
, which is an array, as a target collection in the following query:
rockset> SELECT
text
FROM
twitter_collection:entities.hashtags AS hashtags
WHERE
text IS NOT NULL
LIMIT 5;
+----------------+
| text |
|----------------|
| Grammys |
| TearItUpBTS |
| BLINK |
| daSnakZ |
| SNKZ |
+----------------+
Great! Building from here, a query that lists 5 hashtags in the decreasing order of their counts—basically the top 5 trending hashtags—would look like this:
rockset> SELECT
text AS hashtag
FROM
twitter_collection:entities.hashtags AS hashtags
WHERE
text IS NOT NULL
GROUP BY
text
ORDER BY
COUNT(*) DESC
LIMIT 5;
+-----------------+
| hashtag |
|-----------------|
| GRAMMYs |
| TearItUpBTS |
| Grammys |
| ROSÉ |
| music |
+-----------------+
Clearly, there was a lot of talk about the Grammys on Twitter and BTS seemed to be tearing it up!
Next, I was curious whom the Twitterverse was backing at the Grammys. I thought that would correlate with the most popular user mentions.
With a quick peek at the Smart Schema snippet above, I see an array field called entities.user_mentions
that looks relevant.
Let's explore the nested array entities.user_mentions
further using DESCRIBE.
rockset> DESCRIBE twitter_collection:entities.user_mentions;
+-----------------------+---------------+----------+-----------+
| field | occurrences | total | type |
|-----------------------+---------------+----------+-----------|
| ['*'] | 1531518 | 1531518 | object |
| ['*', 'id'] | 329 | 1531518 | null_type |
| ['*', 'id'] | 1531189 | 1531518 | int |
| ['*', 'id_str'] | 1531189 | 1531518 | string |
| ['*', 'id_str'] | 329 | 1531518 | null_type |
| ['*', 'indices'] | 1531518 | 1531518 | array |
| ['*', 'indices', '*'] | 3063036 | 3063036 | int |
| ['*', 'name'] | 1531189 | 1531518 | string |
| ['*', 'name'] | 329 | 1531518 | null_type |
| ['*', 'screen_name'] | 1531518 | 1531518 | string |
+-----------------------+---------------+----------+-----------+
entities.user_mentions
is an array of nested objects as we can see above.
The most relevant fields in these nested objects appear to be name
and screen_name
. Let's stick with name
for this analysis. From the Smart Schema above, we can see that while name
is of type 'string' in most documents, it is a JSON NULL(null_type) in a few documents. A JSON NULL is not the same as a SQL NULL. We can filter the JSON NULLs out by using Rockset's typeof function.
Here is a simple query that lists 5 user mention names.
rockset> SELECT
col.name
FROM
twitter_collection:entities.user_mentions AS col
WHERE
typeof(col.name) = 'string'
LIMIT 5;
+------------------------------------+
| name |
|------------------------------------|
| Nina Dobrev |
| H.E.R. |
| nctea |
| StopVientresAlquiler |
| 小林由依1st写真集_3月13日発売_公式 |
+------------------------------------+
To list the 5 most popular user mentions, I'll demonstrate another method that involves using UNNEST
. I constructed the target collection by expanding the user_mentions
array using UNNEST and joining it with twitter_collection
. Here is the fully fleshed out query:
rockset> SELECT
user.user.name
FROM
twitter_collection AS col,
UNNEST(col.entities.user_mentions AS user) AS user
WHERE
typeof(user.user.name) = 'string'
GROUP BY
user.user.name
ORDER BY
COUNT(*) DESC
LIMIT 5;
+---------------------+
| name |
|---------------------|
| 방탄소년단 |
| Michelle Obama |
| H.E.R. |
| lego |
| BT21 |
+---------------------+
I needed some help from Google to translate "방탄소년단" for me.
Even though they did not win at the Grammys, BTS had clearly won over the Twitterverse!
And we've gone from data to insights in no time, using Smart Schema to help us understand what our data is all about. No data prep, no schema modeling, no ETL pipelines.