Dynamic Typing in SQL

As Peter Bailis put it in his post, querying unstructured data using SQL is a painful process. Moreover, developers frequently prefer dynamic programming languages, so interacting with the strict type system of SQL is a barrier.

We at Rockset have built the first schemaless SQL data platform. In this post and a few others that follow, we'd like to introduce you to our approach. We'll walk you through our motivations, a few examples, and some interesting technical challenges that we discovered while building our system.

What Is Strong Dynamic Typing?

Many of us at Rockset are fans of the Python programming language. We like its pragmatism, its no-nonsense “There should be one — and preferably only one — obvious way to do it” attitude (The Zen of Python), and, importantly, its simple but powerful type system.

Python is strongly and dynamically typed:

  • Strong, because values have one specific type (or None), and values of incompatible types don't automatically convert to each other. Strings are strings, numbers are numbers, booleans are booleans, and they do not mix except in clear, well-defined ways. Contrast this with JavaScript, which is weakly typed: JavaScript allows (for example) addition and comparison between numbers and strings, with confusing results.
  • Dynamic, because variables acquire type information at runtime, and the same variable can, at different points in time, hold values of different types. a = 5 makes a hold an integer; a subsequent assignment a = 'hello' makes a hold a string. Contrast this with Java and C, which are statically typed: variables must be declared, and they may only hold values of the type specified at declaration.

Of course, no single language falls neatly into one of these categories, but they nevertheless form a useful classification for a high-level understanding of type systems.
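
Both properties are easy to demonstrate in a short Python session:

```python
# Dynamic: the same name can be rebound to values of different types.
a = 5
print(type(a))   # <class 'int'>
a = 'hello'
print(type(a))   # <class 'str'>

# Strong: incompatible types don't mix implicitly.
try:
    'age: ' + 42
except TypeError as e:
    print(e)     # can only concatenate str (not "int") to str
```

JavaScript, by contrast, happily evaluates 'age: ' + 42 to the string 'age: 42'.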

Most SQL databases, in contrast, are strongly and statically typed. Values in the same column always have the same type, and the type is defined at the time of table creation and is difficult to modify later.

What's Wrong with SQL's Static Typing?

This impedance mismatch between dynamically typed languages and SQL's static typing has driven development away from SQL databases and towards NoSQL systems. It's easier to build apps on NoSQL systems, especially early on, before the data model stabilizes. Of course, dropping traditional SQL databases means you also tend to lose efficient indexes and the ability to perform complex queries and joins.

Also, modern data sets are often in a semi-structured form (JSON, XML, YAML) and don't follow a well-defined static schema. One often has to build a pre-processing pipeline to determine the correct schema to use, clean up the input data, and transform it to match the schema, and such pipelines are brittle and error-prone.

What's more, SQL traditionally doesn't deal well with deeply nested data (JSON arrays of arrays of objects containing arrays...). The data pipeline then has to flatten the data, or at least the fields that need to be accessed quickly, which adds even more complexity to the process.

What's the Alternative?

What if we tried to build a SQL database that is dynamically typed from the ground up, without sacrificing any of the power of SQL?

Rockset's data model is similar to JSON: values are either

  • scalars (numbers, booleans, strings, etc.)
  • arrays, containing any number of arbitrary values
  • maps (which, borrowing from JSON, we call “objects”), mapping string keys to arbitrary values

We extend JSON's data model to support other scalar types as well (such as types related to date and time), but more on that in a future post.

Crucially, documents don't have to have the same fields. It's perfectly okay if a field occurs in (say) 10% of documents; queries will behave as if that field were NULL in the other 90%.

Different documents may have values of different types in the same field. This is important; many real data sets are not clean, and you'll find (for example) ZIP codes that are stored as integers in some parts of the data set and as strings in others. Rockset will let you ingest and query such documents. Depending on the query, values of unexpected types can be ignored, treated as NULL, or reported as errors.

There is a slight performance cost to the dynamic nature of the type system. It is easier to write efficient code if you know that you're processing a large chunk of integers, for instance, rather than having to type-check every value. But in practice, truly mixed-type data is rare; a column of integers may contain a few outlier strings, so type-checks can in practice be hoisted out of critical code paths. This is, at a high level, similar to what just-in-time compilers do for dynamic languages today: yes, variables may change types at runtime, but they usually don't, so it's worth optimizing for the common case.
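
As an illustrative sketch of what hoisting a type-check looks like (our own Python toy, not Rockset's engine): check a chunk's types once up front, run a tight loop over the common all-integer case, and fall back to a value-by-value path only for mixed chunks.

```python
def sum_column(chunk):
    # Fast path: one type-check for the whole chunk, then a tight
    # loop over data known to be all integers.
    if all(type(v) is int for v in chunk):
        return sum(chunk)
    # Slow path: mixed-type chunk; inspect each value and treat
    # non-numeric values as NULL (i.e., skip them).
    total = 0
    for v in chunk:
        if isinstance(v, (int, float)):
            total += v
    return total

print(sum_column([1, 2, 3]))            # 6   (fast path)
print(sum_column([1, 'oops', 2.5]))     # 3.5 (slow path)
```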

Traditional relational databases originated in a time when storage was expensive, so they optimized the representation of every single byte on disk. Thankfully, this is no longer the case, which opens the door to internal representation formats that prioritize features and flexibility over space usage. We believe this is a worthwhile trade-off.

A Simple Example

I'd like to walk you through a simple example of how you can use dynamic types in Rockset SQL. We'll start with a trivially small data set, consisting of basic biographical information for six imaginary people, given as a file with one JSON document per line (which is a format that Rockset supports natively):

{"name": "Tudor", "age": 40, "zip": 94542}
{"name": "Lisa", "age": 21, "zip": "91126"}
{"name": "Hana"}
{"name": "Igor", "zip": 94110.0}
{"name": "Venkat", "age": 35, "zip": "94020"}
{"name": "Brenda", "age": 44, "zip": "90210"}

As is often the case with real-world data, this data set is not clean. Some documents are missing certain fields, and the zip code field (which should be a string) is an int for some documents, and a float for others.
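
Without dynamic typing, a pre-processing pipeline would have to normalize the field before load. A minimal sketch of that cleanup step in Python (the coercion rules here are our own assumption, one of many a pipeline author would have to decide on):

```python
import json

def normalize_zip(value):
    """Coerce a zip value of unknown type to a string, or None if absent."""
    if value is None:
        return None
    if isinstance(value, float):
        value = int(value)   # 94110.0 -> 94110
    return str(value)

docs = [json.loads(line) for line in [
    '{"name": "Tudor", "age": 40, "zip": 94542}',
    '{"name": "Igor", "zip": 94110.0}',
    '{"name": "Hana"}',
]]
for doc in docs:
    print(doc['name'], normalize_zip(doc.get('zip')))
# Tudor 94542
# Igor 94110
# Hana None
```

With Rockset, this step is unnecessary: the documents are ingested as-is.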

Rockset ingests this data set with no problem:

$ rock add tudor_example1 /tmp/example_docs
 COLLECTION       ID                                      STATUS   ERROR
tudor_example1   3e117812-4b50-4e55-b7a6-de03274fc7df-1   ADDED    None
tudor_example1   3e117812-4b50-4e55-b7a6-de03274fc7df-2   ADDED    None
tudor_example1   3e117812-4b50-4e55-b7a6-de03274fc7df-3   ADDED    None
tudor_example1   3e117812-4b50-4e55-b7a6-de03274fc7df-4   ADDED    None
tudor_example1   3e117812-4b50-4e55-b7a6-de03274fc7df-5   ADDED    None
tudor_example1   3e117812-4b50-4e55-b7a6-de03274fc7df-6   ADDED    None

and we can see that it preserved the original types of the fields:

$ rock sql
> describe tudor_example1;
+-----------+---------------+---------+--------+
| field     | occurrences   | total   | type   |
|-----------+---------------+---------+--------|
| ['_meta'] | 6             | 6       | object |
| ['age']   | 4             | 6       | int    |
| ['name']  | 6             | 6       | string |
| ['zip']   | 1             | 6       | float  |
| ['zip']   | 1             | 6       | int    |
| ['zip']   | 3             | 6       | string |
+-----------+---------------+---------+--------+

Note that the zip field exists in 5 out of the 6 documents, and is a float in one document, an int in another, and a string in the other three.

Rockset treats the documents where the zip field does not exist as if the field were NULL:

> select name, zip from tudor_example1;
+--------+---------+
| name   | zip     |
|--------+---------|
| Brenda | 90210   |
| Lisa   | 91126   |
| Venkat | 94020   |
| Tudor  | 94542   |
| Hana   | <null>  |
| Igor   | 94110.0 |
+--------+---------+

> select name from tudor_example1 where zip is null;
+--------+
| name   |
|--------|
| Hana   |
+--------+

And Rockset supports a variety of cast and type introspection functions that let you query across types:

> select name, zip, typeof(zip) as type from tudor_example1
  where typeof(zip) <> 'string';
+--------+--------+---------+
| name   | type   | zip     |
|--------+--------+---------|
| Igor   | float  | 94110.0 |
| Tudor  | int    | 94542   |
+--------+--------+---------+

> select name, zip::string as zip_str from tudor_example1;
+--------+-----------+
| name   | zip_str   |
|--------+-----------|
| Hana   | <null>    |
| Venkat | 94020     |
| Tudor  | 94542     |
| Igor   | 94110     |
| Lisa   | 91126     |
| Brenda | 90210     |
+--------+-----------+

> select name, zip::string zip from tudor_example1
  where zip::string = '94542';
+--------+-------+
| name   | zip   |
|--------+-------|
| Tudor  | 94542 |
+--------+-------+

Querying Nested Data

Rockset also enables you to query deeply nested data efficiently by treating nested arrays as top-level tables, and letting you use full SQL syntax to query them.

Let's augment the same data set, and add information about where these people work:

{"name": "Tudor", "age": 40, "zip": 94542, "jobs": [{"company":"FB", "start":2009}, {"company":"Rockset", "start":2016}] }
{"name": "Lisa", "age": 21, "zip": "91126"}
{"name": "Hana"}
{"name": "Igor", "zip": 94110.0, "jobs": [{"company":"FB", "start":2013}]}
{"name": "Venkat", "age": 35, "zip": "94020", "jobs": [{"company": "ORCL", "start": 2000}, {"company":"Rockset", "start":2016}]}
{"name": "Brenda", "age": 44, "zip": "90210"}

Add the documents to a new collection:

$ rock add tudor_example2 /tmp/example_docs
 COLLECTION       ID                                      STATUS   ERROR
tudor_example2   a176b351-9797-4ea1-9869-1ec6205b7788-1   ADDED    None
tudor_example2   a176b351-9797-4ea1-9869-1ec6205b7788-2   ADDED    None
tudor_example2   a176b351-9797-4ea1-9869-1ec6205b7788-3   ADDED    None
tudor_example2   a176b351-9797-4ea1-9869-1ec6205b7788-4   ADDED    None
tudor_example2   a176b351-9797-4ea1-9869-1ec6205b7788-5   ADDED    None
tudor_example2   a176b351-9797-4ea1-9869-1ec6205b7788-6   ADDED    None

We support the semi-standard UNNEST SQL table function that can be used in a join or subquery to “explode” an array field:

> select p.name, j.company, j.start from
  tudor_example2 p cross join unnest(p.jobs) j 
  order by j.start, p.name;
+-----------+--------+---------+
| company   | name   | start   |
|-----------+--------+---------|
| ORCL      | Venkat | 2000    |
| FB        | Tudor  | 2009    |
| FB        | Igor   | 2013    |
| Rockset   | Tudor  | 2016    |
| Rockset   | Venkat | 2016    |
+-----------+--------+---------+

Testing for existence can be done with the usual semijoin (IN / EXISTS subquery) syntax. Our optimizer recognizes that you are querying a nested field on the same collection and can execute the query efficiently. Let's get the list of people who worked at Facebook:

> select name from tudor_example2 
  where 'FB' in (select company from unnest(jobs) j);
+--------+
| name   |
|--------|
| Tudor  |
| Igor   |
+--------+

If you only care about nested arrays (but don't need to correlate with the parent collection), we have special syntax for this; any nested array of objects can be exposed as a top-level table:

> select * from tudor_example2.jobs j;
+-----------+---------+
| company   | start   |
|-----------+---------|
| ORCL      | 2000    |
| Rockset   | 2016    |
| FB        | 2009    |
| Rockset   | 2016    |
| FB        | 2013    |
+-----------+---------+

I hope that you can see the benefits of Rockset’s ability to ingest raw data, without any preparation or schema modeling, and still power strongly typed SQL efficiently.

In future posts, we'll shift gears and dive into the details of some interesting challenges that we encountered while building Rockset. Stay tuned!