Running SQL on Nested JSON

When I run SQL on nested JSON during a demo, a viewer will sometimes remark, "It's like magic!” It's definitely awesome to take raw JSON, without any ETL, data pipelines, or fixed schema, and make it immediately queryable via SQL, but it's far from magic. Instead, it has everything to do with how Rockset is designed.

The Challenge of SQL on JSON

When we surveyed the market, we saw the need for a solution that could perform fast SQL queries on fluid JSON data:

Some form of ETL to transform JSON to tables in SQL databases may be workable for basic JSON data with fixed fields that are known up front. However, JSON with nested objects or new fields that “can spring up every 2-4 weeks,” as the original Stack Overflow poster put it, is impossible to handle in such a rigid manner.

Relational databases offer alternative approaches to accommodate more complex JSON data. SQL Server stores JSON in varchar columns, while Postgres and MySQL have JSON data types. In these scenarios, users can ingest JSON data without conversion to SQL fields, but take a performance hit when querying the data because these columns support minimal indexing at best.

But what if there was a better way? What if we could ingest JSON, with its arrays and nested objects, without any upfront schema, and provide millisecond-latency SQL queries on the data at the same time?

SQL on Nested JSON Using Rockset

With lots of fields that change, get added/removed, etc, it can be rather cumbersome to maintain ETL pipelines. Rockset was designed to help with this problem—by indexing all fields in JSON documents, along with all type information, and exposing a SQL API on top of it.

For example, with a Rockset collection named new_collection, I can start by adding a single document to an empty collection that looks like:

{
    "my-field": "doc1",
    "my-other-field": "some text"
}

... and then query it.

rockset> select "my-field", "my-other-field" 
         from new_collection;

+------------+------------------+
| my-field   | my-other-field   |
+------------+------------------|
| doc1       | some text        |
+------------+------------------+

Now, if a new JSON document comes in with some new fields - maybe with some arrays, nested JSON objects, etc, I can still query it with SQL.

{
    "my-field": "doc2",
    "my-other-field":[
        {
            "c1": "this",
            "c2": "field",
            "c3": "has",
            "c4": "changed"
        }
    ]
}

I add that to the same collection and can query it just as before.

rockset> select "my-field", "my-other-field" 
         from new_collection;

+------------+---------------------------------------------------------------+
| my-field   | my-other-field                                                |
|------------+---------------------------------------------------------------|
| doc1       | some text                                                     |
| doc2       | [{'c1': 'this', 'c2': 'field', 'c3': 'has', 'c4': 'changed'}] |
+------------+---------------------------------------------------------------+

I can further flatten nested JSON objects and array fields at query time and construct the table I want to get to - without having to do any transformations beforehand.

rockset> select mof.* 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+------+-------+------+---------+
| c1   | c2    | c3   | c4      |
|------+-------+------+---------|
| this | field | has  | changed |
+------+-------+------+---------+

In addition to this, there is strong type information stored, which means I won't get tripped up by having mixed types, etc. Adding a third document:

{
    "my-field": "doc3",
    "my-other-field":[
        {
            "c1": "unexpected",
            "c2": 99,
            "c3": 100,
            "c4": 101
        }
    ]
}

It still adds my document as expected.

rockset> select mof.* 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+------------+-------+------+---------+
| c1         | c2    | c3   | c4      |
|------------+-------+------+---------|
| unexpected | 99    | 100  | 101     |
| this       | field | has  | changed |
+------------+-------+------+---------+

... and the fields are strongly typed.

rockset> select typeof(mof.c2) 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+-----------+
| ?typeof   |
|-----------|
| int       |
| string    |
+-----------+

If all this sounds like a great way to run SQL on your nested JSON, you should give Rockset a try.

Real-time SQL on raw data

Related Posts

From Schemaless Ingest to Smart Schema: Enabling SQL on Raw Data

Rockset's schemaless SQL platform automatically infers schema at read time, allowing you to analyze messy data using SQL.

The Path to Better Pollution Forecasting Goes Through Nested JSON

Pittsburgh-based developer Doug Balog collects and analyzes nested JSON weather data to improve pollution forecasts in his community.

How to Build a Facebook Messenger Chatbot Powered by Fast SQL on CSV

Build a chatbot that provides instant responses, leveraging fast SQL queries on CSV data.