Working with Mixed Data Types within a Field Using Rockset
May 5, 2021
So... you think all your data in a particular field are a string type, but when you try to run your query, you get some errors. Doing more investigation, it looks like you have some int and undefined types as well. Bummer...
Despair not! We can actually work around this (without data prep 😉). To recap, in our first blog, we created an integration with MongoDB on Rockset, so Rockset can read and [update] the data coming in MongoDB. Once the data is in Rockset, we can run SQL on schemaless and unstructured data.
The data:
Embedded content: https://gist.github.com/nfarah86/ef1cc9da88e56226c4c46fd0e3c8e16e
We are interested in the release_date
field: "release_date": "1991-06-07".
The query:
Rockset has a function called DATE_PARSE(), which allows you to turn a string formatted date into a date object. If you want to order movies by just the year, you can use EXTRACT().
Essentially, once you turn your string formatted date into a date object, you can then extract the year.
At first glance, this seems pretty easy to solve— if you wanted to order all the movie titles by the release year, you can write something like this:
SELECT
t.title, t.release_date
FROM
commons.TwtichMovies t
ORDER BY
EXTRACT(
YEAR
FROM
DATE_PARSE(t.release_date, '%Y-%m-%d')
) DESC
;
When running this query, we get a timestamp parsing error: Error [Query] Timestamp parse error:
This could mean you’re working with other data types that are not strings. To inspect, you can write something like this:
SELECT
t.title, TYPEOF(t.release_date)
FROM
commons.TwtichMovies t
WHERE
TYPEOF(t.release_date) != 'string'
;
This is what we get back:
Now, that we know what’s causing the error, we can re-write the query to discard anything that’s not a string type— right 🤗?
SELECT
t.title, t.release_date
FROM
commons.TwtichMovies t
WHERE
TYPEOF(t.release_date) = 'string';
ORDER BY
EXTRACT(
YEAR
FROM
DATE_PARSE(t.release_date, '%Y-%m-%d')
)DESC
;
WRONG 🥺! This actually returns a timestamp parsing error as well:
Error [Query]
Timestamp parse error
You're probably saying to yourself, “what the heck.” One case we didn’t take into consideration earlier is that there could be empty strings 🤯- If we run the following query:
SELECT DATE_PARSE('', '%Y-%m-%d');
We get the same timestamp parsing error back:
Error [Query]
Timestamp parse error
Aha.
How do we actually write this query to avoid the timestamp parsing errors? Here, we can actually check the LENGTH() of the string and filter out everything that doesn’t meet the length requirement— so something like this:
WHERE LENGTH(t.release_date) = 10
We can also TRY_CAST() t.release_date
to a string. If the field value can’t be turned into a string, a null
value is returned (i.e. it won’t error out). Putting this all together, we can technically write something like this:
SELECT
t.title,
t.release_date
FROM
commons.TwtichMovies t
WHERE
TRY_CAST(t.release_date AS string) is not null
AND LENGTH(TRY_CAST(t.release_date AS string)) = 10
ORDER BY
EXTRACT(
YEAR
FROM
DATE_PARSE(t.release_date, '%Y-%m-%d')
)
;
Voila! it works!
During the stream, I actually wrote a more complicated version of this query. The above query and the query in the stream are equivalent. We also wrote queries that aggregate! You can catch the full breakdown of the session below:
Embedded content: https://youtu.be/PGpEsg7Qw7A
TLDR: you can find all the resources you need to get started on Rockset in the developer corner.