Data Types

This page covers the type system in Rockset.

Every value stored in Rockset is strongly typed, having one of the types below. A Rockset document itself has type object.

  • int
  • float
  • bool
  • string
  • bytes
  • null
  • u256
  • array
  • object
  • date
  • datetime
  • time
  • timestamp
  • month_interval
  • microsecond_interval
  • geography

These are the type names used in type functions and in the result of the DESCRIBE command. More information regarding each type is given below.

πŸ“˜

JSON Formatting

Check out our documentation on Rockset Extended JSON for information on how Rockset parses data types in JSON.

Basic Types

Integer

int

Integers are numeric values that do not have fractional components. In Rockset, integers must be in the range between -2^63 and 2^63-1, inclusive.

Aliases are integer and int64.

SELECT CAST()

Float

float

Floats are numeric values with fractional components. In Rockset, floats have 64-bit precision.

Aliases are double and double precision.

Boolean

bool

A boolean is either true or false.

Alias is boolean.

String

string

A string represents character data of any length.

In SQL statements, string literals should be enclosed with single quotes (such as 'hello'), not double quotes, which are interpreted as identifiers like field or collection names.

You can escape a single quote in a string literal by doubling up the single quote (ie. SELECT 'it''s' will return the string literal 'it's').

Aliases are char, varchar, and text.

Bytes

bytes

Values of type bytes contain arbitrary byte data of any length.

Aliases are binary and varbinary.

Null

null

We distinguish between two different forms of nulls: null and undefined. undefined is returned when selecting a collection field that does not exist, dereferencing an object field that does not exist, or when subscripting an array with an out-of-range index. null is returned when selecting a field whose value has been explicitly set to null.

Both null and undefined behave exactly the same in almost all cases, with the following exceptions:

  • Functions such as typeof and json_format.
  • A predicate IS UNDEFINED returns true for undefined and false for null. IS NULL
    returns true for both.
  • A distinction between undefined and null is preserved in the result set and in the JSON
    serialization.
  • Objects where fields have a null value are not equal to objects with a missing fields, i.e.
    {"a": 1, "b": null} != {"a": 1}.

Most functions will return null when one of the arguments is null or undefined. Most aggregate functions will ignore both null and undefined inputs.

To emphasize the difference between null and undefined, suppose we have a collection with the following documents:

{"name": {"first": "John", "last": "HopCroft"}},
{"name": {"first": "Robert", "last": "Tarjan"}},
{"name": {"first": "Alan", "middle": "Curtis", "last": "Kay"}},
{"name": {"first": "Edsger", "middle": null, "last": "Dijkstra"}}

In that case, the field name.middle is a null in the last document as opposed to undefined in the first 2 documents that it is absent in.

u256

u256

u256s are unsigned integers with 256 bits. In Rockset, u256s must be in the range between 0 and 2^256 - 1, inclusive.

u256 supports the following operations and functions:

  • Arithmetic operations +, -, *, /, and %.
    • u256 does not support arithmetic operations with other numeric types.
  • Comparison operations =, <>, !=, <, >=, >, and >=.
    • Comparing other numeric types with u256 will return false for equality and produce an error for inequality. The other comparison operations have certain behaviors outlined in the next section.
  • Bitwise operations AND, OR, NOT, and XOR.
  • Aggregation functions AVG and SUM.
    • If the input mixes u256 and other numeric types in a single grouping, AVG and SUM will throw an exception.
  • Aggregation functions MIN, MAX, MIN_BY, MAX_BY, and COUNT.
    • If the values are entirely int/float or entirely u256, these functions compare the values arithmetically.
    • If the values are a mixture of int/float and u256, u256 values are considered greater than all int and float values for MAX and MAX_BY and less than for MIN and MIN_BY.
  • U256_FROM_BASE(string, base)creates u256 values from any base strings between 2 and 36, inclusive. We also support casting from base 10 strings.
  • PACK_U256(int, int, int, int) creates u256 values from 4 integers.

u256 has certain behaviors depending on the commands and operations:

  • GROUP BY and JOINs will consider u256s different from ints or floats, even if they are arithmetically the same.
  • ORDER BY orders all u256s after all ints and floats, while ints and floats are interleaved.
  • Casting between u256 and int, float, and string.
    • int: Value must be in the range of the target type. Otherwise, casting will throw a domain error.
    • float: Value must be in the range of the target type. Otherwise, casting will throw a domain error. Casting can cause some potential loss of precision in either direction.
    • string: Casting assumes base 10 encoding.
  • select * from foo where x = U256_FROM_BASE('5', 10) yields documents where x is a u256 with a value of 5. If the column contains ints or floats, the query will not return those documents.
  • select * from foo where x = 5 yields documents where x is an int or float with a value of 5. If the column contains u256s, the query will not return those documents.
  • select * from foo where x > U256_FROM_BASE('5', 10) yields documents where x is a u256 with a value greater than 5. If the column contains ints or floats, the query will not return those documents or throw an invalid comparison error.
  • select * from foo where x > 5 yields documents where x is an int or float with a value greater than 5. If the column contains u256s, the query will not return those documents or throw an invalid comparison error.

Nested Types

Rockset supports arrays and objects, which can contain any value and hence be nested recursively. The number of levels of a nesting of a field in Rockset is capped at 32. Any given level may have cardinality (keys in object or elements in array) of up to 2^32.

Array

array

An array represents an ordered list of zero or more values, each of any of Rockset type. For example, [], [1, 2, 3], and [false, 42, ['hello', 'world']] are all valid arrays.

Arrays can be created inline with ARRAY [1,2,3] or simply [1,2,3].

Values inside an array foo can be accessed in SQL statements using the square bracket notation shown below.

-- foo = [false, 42, ['hello', 'world']]
foo[2]     -- has value 42
foo[3][1]  -- has value 'hello'

In Rockset, array indexing starts at 1.

Object

object

An object represents a dictionary whose keys are strings and values are of any Rockset type. For example, {}, {'a': 10, 'b': 20}, and {'a': 10, 'b': {'c': true}} are all valid objects.

Values inside an object can be accessed in SQL statements using the dot notation shown below.

-- foo = {'a': 10, 'b': {'c': true}}
foo.a      -- has value 10
foo.b      -- has value {'c': true}
foo.b.c    -- has value true
foo."b".c  -- also works, field names can be individually escaped with double quotes

Alias is map.

Date and Time Types

Refer also to the page on date and time functions for functions to construct and manipulate date and time values.

When data is ingested into or returned from Rockset in JSON form, these types can be specified in a special format shown below.

{
  "_id": "foo"
  "_event_time": {"__rockset_type": "timestamp", "value": "42"}
}

Note that field _event_time is parsed not as object but rather timestamp with value 42.

Date

date

A date value represents a logical calendar date (year, month, day) independent of time zone. A date does not represent a specific time period; it can differ based on timezones. To represent an absolute point in time, use a timestamp instead.

A date literal in SQL syntax is formatted as follows.

DATE 'YYYY-[M]M-[D]D'

This consists of:

  • YYYY: Four-digit year
  • [M]M: One or two digit month
  • [D]D: One or two digit day
DATE '2018-01-01'  -- example literal
DATE(2018, 1, 1)   -- constructor function

Datetime

datetime

A datetime value represents a point in time (year, month, day, hour, minute, second, microsecond). It does not refer to an absolute instance in time, unlike timestamp. Instead, it is the civil time; the time that a user would see on a watch or calendar.

A date literal in SQL syntax is formatted as follows.

DATETIME 'YYYY-[M]M-[D]D[( )[H]H:[M]M:[S]S[.DDDDDD]]'

This consists of:

  • YYYY: Four-digit year
  • [M]M: One or two digit month
  • [D]D: One or two digit day
  • ( ): A space separator
  • [H]H: One or two digit hour (valid values from 00 to 23)
  • [M]M: One or two digit minutes (valid values from 00 to 59)
  • [S]S: One or two digit seconds (valid values from 00 to 59)
  • [.DDDDDD]: Up to six fractional digits
DATETIME '2018-01-01 9:30:45.456'     -- example literal
DATETIME(2018, 1, 1, 9, 30, 45, 456)  -- constructor function

Time

time

A time value represents the time of the day (hour, minute, second, millisecond) independent of a specific date.

A time literal in SQL syntax is formatted as follows.

TIME '[H]H:[M]M:[S]S[.DDDDDD]'

This consists of:

  • [H]H: One or two digit hour (valid values from 00 to 23)
  • [M]M: One or two digit minutes (valid values from 00 to 59)
  • [S]S: One or two digit seconds (valid values from 00 to 59)
  • [.DDDDDD: Up to six fractional digits
TIME '09:30:45.456'   -- example literal
TIME(9, 30, 45, 456)  -- constructor function

Timestamp

timestamp

A timestamp value represents absolute date and time values independent of any time zone.

A timestamp literal in SQL syntax is formatted as follows.

TIMESTAMP 'YYYY-[M]M-[D]D[( |T)[H]H:[M]M[:[S]S[.DDDDDD]]][time zone]'

This consists of:

  • YYYY: Four-digit year
  • [M]M: One or two digit month
  • [D]D: One or two digit day
  • ( ): A space separator
  • [H]H: One or two digit hour (valid values from 00 to 23)
  • [M]M: One or two digit minutes (valid values from 00 to 59)
  • [S]S: One or two digit seconds (valid values from 00 to 59)
  • [.DDDDDD]: Up to six fractional digits
  • [time zone]: Offset from Coordinated Universal Time (UTC). When a time zone is not explicitly
    specified, the default time zone, UTC, is used. The offset is formatted as (+|-)H[H][:M[M]], Z (synonym for UTC), or UTC. When using this format, there is a space between the time zone and the rest of the timestamp only for the UTC suffix.
TIMESTAMP '2018-01-01 09:30:45.456-05:00'  -- example literal

Month Interval

month_interval

A month interval refers to a specific number of months.

As months have different lengths, month intervals may only be added to or subtracted from date or datetime values.

Examples of month intervals are shown below.

INTERVAL 3 MONTH;
INTERVAL 2 YEAR;
INTERVAL '2-3' YEAR TO MONTH;

Microsecond Interval

microsecond_interval

A microsecond interval refers to a fixed amount of time with microsecond precision.

Microsecond intervals may be added to or subtracted from dates, times, datetimes, and timestamps. Also, you get a microsecond interval when you subtract two dates, times, datetimes, or timestamps (indicating the length of time between the two time points).

Examples of microsecond intervals are shown below.

INTERVAL 2 DAY;
INTERVAL 3 HOUR;
INTERVAL 5 MINUTE;
INTERVAL 10 SECOND;
INTERVAL '2 10:23:45.56' DAY TO SECOND;

Geography

geography

A geography refers to a shape on the surface of a sphere (Earth). It can be either a point (latitude and longitude), a linestring (an ordered list of connected points), or a polygon (a set of loops which enclose some interior space). Rockset internally uses Google's S2 Geometry library to manipulate geographies. A few aspects of using geographies can be counterintuitive:

  • Polygons consist of one or more outer shells, which may contain holes. For more details, read
    about GeoJSON.
  • Polygons contain whichever side of their boundary which has less area. For instance, an arc making
    a circle around the earth at 1 degree north latitude would contain most of the northern hemisphere, and an arc making a circle around the earth at 1 degree south latitude would contain most of the southern hemisphere. It is not possible to construct a polygon which contains more than half the earth.
  • Polygons must follow all the
    restrictions described by S2. They may not intersect themselves, contain duplicate points, or be otherwise degenerate.
  • All segments connecting points are geodesics. A geodesic is the shortest path over the surface of
    a sphere, which is not a straight line in latitude/longitude space.
  • The earth is modeled as a sphere, not an ellipsoid. Because of this, results may be distorted by
    up to 0.56%.
  • Geographies are not stored internally as latitude/longitude pairs, but rather S2's internal 3
    dimensional representation. Therefore importing and exporting geographies may lead to small rounding errors due to floating point precision.
  • Do not construct geographies vulnerable to numerical precision issues. For example, edges with
    nearly antipodal endpoints may form a geodesic in a different direction than you expect.

This is how you would create a point at 34N 12E:

SELECT ST_GEOGPOINT(12, 34)
{"__rockset_type":"GEOGRAPHY","value":{"type":"Point","coordinates":[12,34]}}

Here is another way to construct the same point from a string in well known text format.

SELECT ST_GEOGFROMTEXT('POINT(12 34)')
{"__rockset_type":"GEOGRAPHY","value":{"type":"Point","coordinates":[12,34]}}

This example creates a linestring from 34N 12E to 78N 56E to 42N 42E:

SELECT ST_GEOGFROMTEXT('LINESTRING(12 34, 56 78, 42 42)')
{"__rockset_type":"GEOGRAPHY","value":{"type":"LineString","coordinates":[[12,34],[56,78],[42,42]]}}

Here is an example of a roughly square polygon around the west coast of Africa:

SELECT ST_GEOGFROMTEXT('POLYGON((0 0, 10 0, 10 10, 0 10))')
{"__rockset_type":"GEOGRAPHY","value":{"type":"Polygon","coordinates":[[[0,0],[10,0],[10,10],[0,10],[0,0]]]}}

You can import geography types into Rockset by applying either ST_GEOGPOINT or ST_GEOGFROMTEXT in an ingest transformation. Refer also to the geographic functions page for functions to construct and manipulate geographic values.

⚠️

Depending upon the data source, your geographic data may not be validated by default.

Syntactically correct, but semantically invalid geographic values may be inserted into your collections.
Ingest transformations provide a mechanism to control the behavior of semantically invalid geographic data written to your collections.