IValue: efficient representation of dynamic types in C++

Introduction

In traditional SQL systems, a column's type is determined when the table is created, and never changes while executing a query. If you create a table with an integer-valued column, the values in that column will always be integers (or possibly NULL).

Rockset, however, is dynamically typed, which means that we often don't know the type of a value until we actually execute the query. This is similar to other dynamically typed programming languages, where the same variable may contain values of different types at different points in time:

$ python3
>>> a = 3
>>> type(a)
<class 'int'>
>>> a = 'foo'
>>> type(a)
<class 'str'>

Rockset's type system was initially based on JSON, and has since been extended to support other types as well:

  • bytes: taking a cue from Python, we distinguish between sequences of valid Unicode characters (string, which is internally represented as UTF-8) and sequences of arbitrary bytes (bytes)
  • date- and time-specific types (date, time, datetime, timestamp, microsecond_interval, month_interval)

There are other types that we use internally (and are never exposed to our users); also, the type system is extensible, with planned support for decimal (base-10 floating-point), geometry / geography types, and others.

In the following example, collection ivtest has documents containing one field a, which takes a variety of types:

$ rock create collection ivtest
Collection "ivtest" was created successfully in workspace "commons".

$ cat /tmp/a.docs
{"a": 2}
{"a": "hello"}
{"a": null}
{"a": {"b": 10}}
{"a": [2, "foo"]}

$ rock upload ivtest /tmp/a.docs
{
 "file_name":"a.docs",
 "file_upload_id":"c5ccc261-0096-4a73-8dfe-d6db8b8d130e",
 "uploaded_at":"2019-06-05T18:12:46Z"
}

$ rock sql
> select typeof(a), a from ivtest order by a;
+-----------+------------+
| ?typeof   | a          |
|-----------+------------|
| null_type | <null>     |
| int       | 2          |
| string    | hello      |
| array     | [2, 'foo'] |
| object    | {'b': 10}  |
+-----------+------------+
Time: 0.014s

This post shows one of many challenges that we encountered while building a fully dynamically typed SQL database: how we manipulate values of unknown types in our query execution backend (written in C++), while approaching the performance of using native types directly.

At first, we used protocol buffers similar to the definition below (simplified to only show integers, floats, strings, arrays, and objects; the actual oneof that we use has a few extra fields):

message Value {
  oneof value_union {
    int64 int_value = 1;
    double float_value = 2;
    string string_value = 3;
    ArrayValue array_value = 4;
    ObjectValue object_value = 5;
  }
}

message ArrayValue {
  repeated Value values = 1;
}

message ObjectValue {
  repeated KeyValue kvs = 1;
}

message KeyValue {
  string key = 1;
  Value value = 2;
}   

But we quickly realized that this is inefficient, both in terms of speed and in terms of memory usage. First, protobuf requires a heap memory allocation for every object; creating a Value that contains an array of 10 integers would perform:

  • a memory allocation for the top-level Value
  • an allocation for the array_value member
  • an allocation for the list of values (ArrayValue.values, which is a RepeatedPtrField)
  • an allocation for each of the 10 values in the array

for a total of 13 memory allocations.

Also, the 10 values in the array are not allocated contiguously in memory, which causes a further decrease in performance due to cache locality.

It was quickly clear that we needed something better, which we called IValue. Compared to the protobuf version, IValue is:

  • More memory efficient: while not as efficient as using native types directly, IValue must be small, and must avoid heap allocations wherever possible. IValue is always 16 bytes, and does not allocate heap memory for integers, booleans, floating-point numbers, and short strings.
  • Faster: arrays of scalar IValues are allocated contiguously in memory, leading to better cache locality. This is not as efficient as using native types directly, but it is a significant improvement over protobuf.

Most of Rockset's query execution engine operates on IValues (there are some parts that have specialized implementation for specific types, and this is an area of active improvement).

We'd like to share an overview of the IValue design. Note that IValue is optimized for Rockset's needs and is not meant to be portable — we use Linux and x86_64-specific tricks, and assume a little-endian memory layout.

The idea is in itself not novel; the techniques that we use date back to at least 1993, as surveyed in “Representing Type Information in Dynamically Typed Languages”. We decided to make IValue 128 bits instead of 64, as it allows us to avoid heap allocations in more cases (including all 64-bit integers); using the taxonomy defined in the paper, IValue is a double-wrapper scheme with qualifiers.

Internally, IValue is represented as a 128-bit (16-byte) value, consisting of:

  • a 64-bit field (called data)
  • a 48-bit field (called pointer, as it often, but not always, stores a pointer)
  • two 8-bit discriminator fields (called tag0 and tag1)

IValue overview (1)

tag1 indicates the type of the value. tag0 is usually a subtype, and the meaning of the other two fields changes depending on type. The pointer field is often a pointer to some other data structure, allocated on the heap, for the cases where heap allocations can't be avoided; as pointers are only 48 bits on x86_64, we are able to fit a pointer and the two discriminator fields in the same uint64_t.

We recognize two types of IValues:

  • immediate values: those that can fit within the 16 bytes of the IValue itself (while still leaving room for the 16-bit tag):

    • all scalars that fit within 128 - 16 = 112 bits: NULL, integers, floating-point values, booleans, date, time, etc
    • strings shorter than 16 bytes
  • non-immediate values, which require heap allocation.

tag1 has bit 7 clear (tag1 < 0x80) for all immediate values, and set (tag1 >= 0x80) for all non-immediate values. This allows us to distinguish between immediate and non-immediate values very quickly, using one simple bit operation. We can then copy, hash, and compare for equality immediate values by treating them as a pair of uint64_t integers.

Scalar Types

The representation for most scalar types is straightforward: tag0 is usually zero, tag1 identifies the type, pointer is usually zero, and data contains the value.

SQL NULL is all zeros, which is convenient (memset()ing a chunk of memory to zero makes it NULL when interpreted as IValue): null

Booleans have data = 0 for false and data = 1 for true, tag1 = 0x01 boolean

Integers have the value stored in data (as int64_t) and tag1 = 0x02 integer

And so on. The layouts for other scalar types (floating point, date / time, etc) are similar.

Strings

We handle character strings and byte strings similarly; the value of tag1 is the only difference. For the rest of the section, we'll only focus on character strings.

IValue strings are immutable, maintain the string's length explicitly, and are not null-terminated. In line with our goal to minimize heap allocations, IValue doesn't use any external memory for short strings (less than 16 bytes).

Instead, we implement the small string optimization: we store the string contents (padded with nulls) in the data, pointer, and tag0 fields; we store the string length in the tag1 field: tag1 is 0x1n, where n is the string's length.

An empty string has tag1 = 0x10 and all other bytes zero: empty string (1)

And, for example, the 11-byte string “Hello world” has tag1 = 0x1b (note the little-endian representation; the byte 'H' is first): hello world (1)

Strings longer than 15 bytes are stored out-of-line: tag1 is 0x80, pointer points to the beginning of the string (allocated on the heap using malloc()), and data contains the string length. (There is also the possibility of referencing a “foreign” string, where IValue doesn't own the memory but points inside a preallocated buffer, but that is beyond the scope of this post.) long string (1)

For example, the 19-byte string “Rockset is awesome!”: long string (2)

Vectors

Vectors (which we call “arrays”, adopting JSON's terminology) are similarly allocated on the heap: they are similar to vectors in most programming languages (including C++'s std::vector). tag1 is 0x82, pointer points to the beginning of the vector (allocated on the heap using malloc()), and data contains the vector's size and capacity (32 bits each). The vector itself is a contiguously allocated block of capacity() IValues (capacity() * 16 bytes); when reallocation is needed, the vector grows exponentially (with a factor that is less than 2, for the reasons described in Facebook's fbvector implementation.) vector (3)

Hash Maps

Maps (which we call “objects”, adopting JSON's terminology) are also allocated on the heap. We represent objects as open-addressing hash tables with quadratic probing; the size of the table is always a power of two, which simplifies probing. We probe with triangular numbers, just like Google's sparsehash, which. as Knuth tells us in The Art of Computer Programming (volume 3, chapter 6.4, exercise 20), automatically covers all slots.

Each hash table slot is 32 bytes — two IValues, one for the key, one for the value. As is usually the case with open-addressing hash tables, we need two special keys — one to represent empty slots, and one to represent deleted elements (tombstones). We reserve two values of tag1 for that purpose (0x06 and 0x05, respectively).

The pointer field points to the beginning of the hash table (a contiguous array of slots, allocated on the heap using malloc().) We store the current size of the hash table in the least-significant 32 bits of the data field. The tag0 field contains the number of allocated slots (as it's always a power of two, we store log2(number of slots) + 1, or zero if the table is empty).

The capacity field (most significant 32 bits of data) deserves further interest: it is the number of slots available for storing user data. Initially, it is the same as the total number of slots, but, as in all open-addressing hash tables, erasing an element from the table marks the slot as “deleted” and renders it unusable until the next rehash. So erasing an element actually decreases the table's capacity. hashtable (1)

Performance

IValue gives a substantial performance improvement over the old protobuf-based implementation:

  • creating arrays of strings is between 2x and 7x faster (depending on the string size; because of the small-string optimization, IValue is significantly faster for small strings)
  • creating arrays of integers is also 7x faster (because we no longer allocate memory for every individual array element)
  • iterating over large arrays of integers is 3x faster (because the values in the array are now allocated contiguously)

Future Work

Even though Rockset documents are allowed to contain data of multiple types in the same field, the situation shown in the introduction is relatively rare. In practice, most of the data is of the same type (or NULL), and, to recognize this, we are extending IValue to support homogeneous arrays.

All elements in a homogeneous array are of the same type (or NULL). The structure is similar to the regular (heterogeneous) arrays (described above), but the pointer field points directly to an array of the native type (int64_t for an array of integers, double for an array of floating-point values, etc). Similar to systems like Apache Arrow, we also maintain an optional bitmap that indicates whether a specific value is NULL or not.

The query execution code recognizes the common case where it produces a column of values of the same type, in which case it will generate a homogeneous array. We have efficient, vectorized implementations of common database operations on homogeneous arrays, allowing us significant performance improvements in the common case.

This is still an area of active work, and benchmark results are forthcoming.

Conclusion

We hope that you enjoyed a brief look under the hood of Rockset's engine. In the future, we'll share more details about our approaches to building a fully dynamically typed SQL database; if you'd like to give us a try, sign up for an account; if you'd like to help build this, we're hiring!

Real-time SQL on raw data

Related Posts

Building a SQL Development Environment for Messy, Semi-Structured Data

Learn how and why Rockset developed a new SQL development environment for messy, semi-structured data.

Converged Indexing: The Secret Sauce Behind Rockset's Fast Queries

Learn how Rockset delivers low-latency SQL for search and analytics using a combination of row, column, and search indexes.

Analytics on DynamoDB: Comparing Athena, Spark and Elastic

We compare options for real-time analytics on DynamoDB- Athena, Spark and Elastic - in terms of ease of setup, maintenance, query capability, latency. We also evaluate which use cases each of them are best suited for.