Index Scan: Using Rockset's Search Index to Speed up Range Scans Over a Specific Field
April 28, 2020
Recently, InfoWorld’s Martin Heller described Rockset as a "one-of-a-kind database for operational analytics." After testing Rockset with a variety of queries on a large collection, Heller rated Rockset 4.5 out of 5 stars. Heller’s review of Rockset can be found here.
Only one of the test queries timed out:
SELECT * FROM commons."twitter-firehose"
ORDER BY "twitter-firehose".favorite_count DESC LIMIT 10
For context, twitter-firehose
is one of Rockset’s demo collections. It contains 30 million documents and represents Twitter posts over one month. Heller explained that the query’s timeout was unavoidable because, “there is no way to make that query run fast with Rockset’s converged index, or any indexing scheme I can think of: It requires a full scan and a global sort.“
But what if I told you that there IS a way to make this query run fast? Interested? Flummoxed? Keep reading to learn how to best leverage Rockset’s Converged Indexing and make queries run faster.
Behold, Index Scan
If you don’t already know how Converged Indexing works, check out our previous blog, Converged Indexing: The Secret Sauce Behind Rockset’s Fast Queries. At a high level, Rockset indexes each column of every document in multiple ways. To speed up this particular query, the answer lies in how Rockset utilizes the search index.
The search index is optimized for locating which documents contain field foo
with value bar
. Since the search index is sorted by (field, value)
, then it is also optimized for range scans over a particular field. A range scan is when index entries are read sequentially, beginning at some start value and ending at some end value.
Rockset’s Index Scan access path is specifically designed for performing range scans over the search index. Using Index Scan, we can easily find the documents where field foo
has values within a certain range. Naturally, we can also use Index Scan to scan over all values of field foo
in ascending or descending order; this is the same as scanning over the range [MIN, MAX]
.
Bringing the focus back to our original query, we can drastically speed up the query by using Index Scan to fetch the values for the field favorite_count
in descending order. This way, we will only need to fetch 10 rows from the collection twitter-firehose
, instead of performing a full scan of the collection and doing a global sort. With the document IDs retrieved by Index Scan, we can fetch the values for the remaining fields in the collection from either the row-based store or column-based store to complete the query result set. Thus, using the Index Scan access path minimizes the amount of data fetched from the index, which dramatically cuts the query latency. Query compute cost is also reduced as a result of applying Index Scan to this query, which is pretty awesome.
Unlocking the Index Scan Magic
To unlock the Index Scan magic, we need to add a hint to the query so it becomes:
SELECT * FROM commons."twitter-firehose"
ORDER BY "twitter-firehose".favorite_count DESC LIMIT 10
HINT(access_path=index_scan, index_scan_sort_field=favorite_count)
Index Scan is relatively new and has not been incorporated into Rockset’s query optimizer yet. So for now, we must specify through query hint that we want to use the Index Scan access path, and we must specify the field that Index Scan will scan over. In the future, the query optimizer will decide the optimal access path to use, so this query hint will not be necessary.
What Kinds of Queries Benefit from Index Scan?
Essentially, any query that needs to do a range scan over a specific field can benefit from using Index Scan.
For example, consider running an ecommerce site. On the page where the items being sold are displayed, users can choose to apply viewing filters. A common filter is to allow users to view the items in either ascending or descending order of price. In this case, a query using Index Scan can quickly fetch the needed results:
SELECT item_name, price FROM items
ORDER BY price ASC -- or DESC
LIMIT 30
HINT(access_path=index_scan, index_scan_sort_field=price)
Another example can be found in the backend of an IoT application. Wearable fitness devices track a person’s heart rate, number of steps taken, and other useful metrics for assessing a person’s physical activity over time. Identifying patterns in these metrics can provide insights into how to improve a person’s fitness routine. One useful pattern would be to identify the times of day when a person works out (has higher heart rate).
SELECT time, heart_rate FROM items
WHERE heart_rate > 72 -- resting heart rate
HINT(access_path=index_scan, index_scan_sort_field=heart_rate)
To reiterate, Rockset’s Index Scan access path can dramatically decrease query latency and compute costs for queries doing range scans over a specific field. This is just one of the many improvements we have been making to Rockset’s query engine, so stay tuned for more to come!