Working with Mixed Data Types within a Field Using Rockset

May 5, 2021

,

Register for
Index Conference

Hear talks on search and AI from engineers at Netflix, DoorDash, Uber and more.

So... you think all your data in a particular field are a string type, but when you try to run your query, you get some errors. Doing more investigation, it looks like you have some int and undefined types as well. Bummer...

panda crying

Despair not! We can actually work around this (without data prep 😉). To recap, in our first blog, we created an integration with MongoDB on Rockset, so Rockset can read and [update] the data coming in MongoDB. Once the data is in Rockset, we can run SQL on schemaless and unstructured data.

The data:

Embedded content: https://gist.github.com/nfarah86/ef1cc9da88e56226c4c46fd0e3c8e16e

We are interested in the release_date field: "release_date": "1991-06-07".

The query:

Rockset has a function called DATE_PARSE(), which allows you to turn a string formatted date into a date object. If you want to order movies by just the year, you can use EXTRACT().

Essentially, once you turn your string formatted date into a date object, you can then extract the year.

At first glance, this seems pretty easy to solve— if you wanted to order all the movie titles by the release year, you can write something like this:

SELECT
      t.title, t.release_date
FROM
    commons.TwtichMovies t
ORDER BY
    EXTRACT(
         YEAR
         FROM
         DATE_PARSE(t.release_date, '%Y-%m-%d')
    ) DESC
;

When running this query, we get a timestamp parsing error: Error [Query] Timestamp parse error:

This could mean you’re working with other data types that are not strings. To inspect, you can write something like this:

SELECT
      t.title, TYPEOF(t.release_date)
FROM
    commons.TwtichMovies t
WHERE 
    TYPEOF(t.release_date) != 'string'
;

This is what we get back:

sql result

Now, that we know what’s causing the error, we can re-write the query to discard anything that’s not a string type— right 🤗?

SELECT
      t.title, t.release_date
FROM
    commons.TwtichMovies t
WHERE
    TYPEOF(t.release_date) = 'string';
ORDER BY
     EXTRACT(
          YEAR
          FROM
              DATE_PARSE(t.release_date, '%Y-%m-%d')
     )DESC
;

WRONG 🥺! This actually returns a timestamp parsing error as well:
Error [Query] Timestamp parse error

You're probably saying to yourself, “what the heck.” One case we didn’t take into consideration earlier is that there could be empty strings 🤯- If we run the following query:

SELECT DATE_PARSE('', '%Y-%m-%d');

We get the same timestamp parsing error back:
Error [Query] Timestamp parse error

Aha.

How do we actually write this query to avoid the timestamp parsing errors? Here, we can actually check the LENGTH() of the string and filter out everything that doesn’t meet the length requirement— so something like this:

WHERE LENGTH(t.release_date) = 10

We can also TRY_CAST() t.release_date to a string. If the field value can’t be turned into a string, a null value is returned (i.e. it won’t error out). Putting this all together, we can technically write something like this:

SELECT
    t.title,
    t.release_date
FROM
    commons.TwtichMovies t
WHERE
    TRY_CAST(t.release_date AS string) is not null
    AND LENGTH(TRY_CAST(t.release_date AS string)) = 10
ORDER BY
    EXTRACT(
         YEAR
         FROM
         DATE_PARSE(t.release_date, '%Y-%m-%d')
    )
;

Voila! it works!

panda happy

During the stream, I actually wrote a more complicated version of this query. The above query and the query in the stream are equivalent. We also wrote queries that aggregate! You can catch the full breakdown of the session below:

Embedded content: https://youtu.be/PGpEsg7Qw7A

TLDR: you can find all the resources you need to get started on Rockset in the developer corner.