From Schemaless Ingest to Smart Schema: Enabling SQL on Raw Data

March 27, 2019

,

You have complex, semi-structured data—nested JSON or XML, for instance, containing mixed types, sparse fields, and null values. It's messy, you don't understand how it's structured, and new fields appear every so often. The application you're implementing needs to analyze this data, combining it with other datasets, to return live metrics and recommended actions. But how can you interrogate the data and frame your questions correctly if you don't understand the shape of your data? Where do you begin?

Schemaless Ingest of Raw Data

With such unwieldy data, and with so many unknowns, it would be easiest to use a data management system that offers enormous flexibility at write time. SQL databases do not fit the bill; they generally require that data adhere to a fixed schema that cannot be easily modified. Organizations will typically build hard-to-maintain ETL pipelines to feed data into their SQL systems.

NoSQL systems, on the other hand, are designed to simplify data writes and may require no schema, along with minimal or no upfront data transformation. Taking a similar approach, to allow complex data to be written as easily as possible, Rockset supports the schemaless ingest of your raw data.

Smart Schema to Enable SQL Queries

While NoSQL systems make it simple to write data into the system, reading data out in a meaningful way is more complicated. Without a known schema, it would be difficult to adequately frame the questions you want to ask of the data. And, somewhat obviously, querying with standard SQL is not an option in the case of NoSQL systems.

In contrast, querying SQL systems, which require fixed schemas, is straightforward and well-understood. These systems also have the benefit of better performance on analytic queries.

Recognizing that having a schema is helpful, Rockset couples the flexibility of schemaless ingest at write time with the efficiency of Smart Schema at read time. Think of Smart Schema as Rockset's automatic generation of a schema based on the exact fields and types present in the ingested data. It can represent semi-structured data, nested objects and arrays, mixed types, and nulls, and enable relational SQL queries over all these constructs.

Using Smart Schema to Analyze Raw Data

In Rockset, semi-structured data formats such as JSON, XML, Parquet, CSV, XLSX, and PDF are intermediate data representation formats; they are neither a row type nor a column type, in contrast to other systems that put all JSON values, for example, into a single column and give you no visibility into it. With Rockset, the data automatically gets stored as a scalar type, an object, or an array. Though Rockset lets you ingest and query raw data composed of mixed types, all fields are dynamically typed and all field values are strongly typed. This enables Rockset to generate a Smart Schema on the data.

With Smart Schema, you can query the underlying schema of data ingested in its raw form to get all the field names and their types across the dataset. Additionally, you can also get the frequency distribution of each field across its various mixed types to help get a sense of which fields are sparse and which ones can potentially co-occur. This ability to fully understand the shape of the data helps users craft complex queries to discover meaningful insights from their data.

Rockset lets you call DESCRIBE on an ingested collection to understand the underlying schema.

Usage: DESCRIBE <collection_name>

The output of DESCRIBE has the following fields:

  • field: Every distinct field name in the collection
  • type: The data type of the field
  • occurrences: The number of documents that have this field in the given type
  • total: Total number of documents in the collection for top level fields, and total number of documents that have the parent field for nested fields

Let's look at a sample JSON dataset that lists movies and their ratings across websites such as IMDB and Rotten Tomatoes (source: https://www.kaggle.com/afzale/rating-vs-gross-collector/version/2#2018-2-4.json)

{
    "12 Strong": {
        "Genre": "Action",
        "Gross": "$1,465,000",
        "IMDB Metascore": "54",
        "Popcorn Score": 72,
        "Rating": "R",
        "Tomato Score": 54
    },
    "A Ciambra": {
        "Genre": "Drama",
        "Gross": "unknown",
        "IMDB Metascore": "70",
        "Popcorn Score": "unknown",
        "Rating": "unrated",
        "Tomato Score": "unkown"
    },
    "The Final Year": {
        "popcornscore": 48,
        "rating": "NR",
        "tomatoscore": 84
    }
}

This dataset has objects with nested fields, fields with mixed types, and missing fields.

The shape of this dataset is succinctly captured below:

rockset> DESCRIBE movie_ratings

+--------------------------------------------+---------------+---------+-----------+
| field                                      | occurrences   | total   | type      |
|--------------------------------------------+---------------+---------+-----------|
| ['12 Strong']                              | 1             | 3       | object    |
| ['12 Strong', 'Genre']                     | 1             | 1       | string    |
| ['12 Strong', 'Gross']                     | 1             | 1       | string    |
| ['12 Strong', 'IMDB Metascore']            | 1             | 1       | string    |
| ['12 Strong', 'Popcorn Score']             | 1             | 1       | int       |
| ['12 Strong', 'Rating']                    | 1             | 1       | string    |
| ['12 Strong', 'Tomato Score']              | 1             | 1       | int       |
| ['A Ciambra']                              | 1             | 3       | object    |
| ['A Ciambra', 'Genre']                     | 1             | 1       | string    |
| ['A Ciambra', 'Gross']                     | 1             | 1       | string    |
| ['A Ciambra', 'IMDB Metascore']            | 1             | 1       | string    |
| ['A Ciambra', 'Popcorn Score']             | 1             | 1       | string    |
| ['A Ciambra', 'Rating']                    | 1             | 1       | string    |
| ['A Ciambra', 'Tomato Score']              | 1             | 1       | string    |
| ['The Final Year']                         | 1             | 3       | object    |
| ['The Final Year', 'popcornscore']         | 1             | 1       | int       |
| ['The Final Year', 'rating']               | 1             | 1       | string    |
| ['The Final Year', 'tomatoscore']          | 1             | 1       | int       |
+--------------------------------------------+---------------+---------+-----------+

Learn how Smart Schema, and the DESCRIBE command, helps you understand and utilize more complex data, in the context of collections that have documents with each of the following properties:

If you're interested to see Smart Schema in action, don't forget to check out our other blog, Using Smart Schema to Accelerate Insights from Nested JSON.