Flattening a JSON Object So It’s Queryable Using Rockset
April 27, 2021
Many developers use NoSQL databases in order to ingest unstructured and schemaless data. When it comes to understanding the data by writing queries that join, aggregate, and search, it becomes more challenging. This is where Rockset becomes a great partner not only in understanding your unstructured data but in returning queries that join, aggregate, and search within milliseconds at scale. Rockset is a real-time indexing database built for the cloud that acts as an external indexing layer on top of your data lakes, data streams, transactional databases, and data warehouses.
In this twitch stream, we created a MongoDB Atlas instance. After the instance is created, you have the option to use the MongoDB preseeded databases. Here I used the database called netflix and the collection called movies.
After we configure the instance, we created an integration on Rockset with MongoDB, by using the built-in data connector for MongoDB. We provide limited credentials, so Rockset can read the data from MongoDB. The instructions to configure Atlas and create the Rockset integration can be found here — or you can watch the stream below!
Inspecting the data
Once the data is in Rockset, it will look something like this:
Embedded content: https://gist.github.com/nfarah86/ef1cc9da88e56226c4c46fd0e3c8e16e
If you noticed the field genres
looks like this:
"genres": "[{'id': 80, 'name': 'Crime'}]"
... Strings, Strings, everywhere…
Basically, we have a string type as a value, when it should be an array of objects. Let’s say you wanted to see all the genre’s names without the id key; you wouldn’t be able to write a query that can do this, as it’s currently formatted.
Transforming Genres from a JSON String → to an ARRAY
Rockset has a function called UNNEST, that can be used to expand array of values or documents to be queried (aka flattening the JSON object). Assuming no errors in how genres is formatted as a string, we can accomplish this in 2 steps:
Here, you can use JSON_PARSE, which parses a given JSON string as a JSON object:
SELECT JSON_PARSE("[{"id":3, "name":"thriller"}]");
When you run that in the Query Editor, you should get this back:
-- get an array of objects back
[{"id":3, "name":"thriller"}]
Keep in mind, our string is currently formatted like this:
“[{'id': 80,'name': 'Crime'}]"
- Expand the array and flatten the JSON object:
Use UNNEST:
SELECT
genres.value.name
FROM
yourCollectionName,
UNNEST(yourCollectionName.genres AS value) AS genres
GROUP BY
genres.value.name
;
When you run this query, you should get:
-- result of UNNEST where we return genres.name
[{"name": "Crime”}]
In the following recorded twitch stream, we actually got a curveball ball 🎾, where we couldn’t JSON_PARSE(genres). A parsing error was thrown because the string in the data is malformatted. In this case, we added an extra step to solve this. Check out the stream 👇 to see how we resolved the error-- (and don’t forget to follow us!)
TLDR: you can find all the resources you need to get started on Rockset in the developer corner.