Indexing Amazon S3 for Real-Time Analytics on Data Lakes

February 9, 2021

Shawn Adams

Solutions Engineer, Rockset

Amazon Simple Storage Service (Amazon S3) is one of the leading cloud object storage services available. It uses an HTTP interface, making it easy for application developers to integrate S3 into their applications.

Athena is a serverless query service provided by Amazon to query the data stored in Amazon S3 using standard SQL. Because it integrates easily with S3, is serverless, and uses a familiar language, Athena has become the default service for most business intelligence (BI) decision makers to query the large amounts of (usually streaming) data coming into their object stores.

Though it’s powerful enough to support massive batch analytics, Athena falls short when it comes to real-time analytics applications.

Limitations of Using S3 and Athena for Real-Time Analytics

The way Athena is built makes it clear that it’s not intended to be used for real-time analytics.

For example, when you run an Athena query, the query is submitted to a queue rather than being run immediately. When it’s time to run that query, the data is fetched from S3. Once the result is available, it is uploaded back to S3, in the designated path, where the application can finally access the result.

Furthermore, when querying S3 data from Athena, it has to query the complete dataset every time a query is run. You can create partitions when setting up the S3 bucket and the data path to limit the amount of data being queried, but once you set up the directory structure and the data is saved in that path, you can’t change it unless you’re ready to populate the data again. Additionally, the partition is limited only to timestamps, so you can’t have a custom partition, such as customer ID or zip code.

Another drawback is that there’s no way to index the data being populated in S3, meaning there’s no way to optimize query performance. You just have to hope that the dataset being queried is small enough that it doesn’t take too long to return with the results. You can build an effective analytics or reporting dashboard using the S3 and Athena combo, but if you try to build a real-time application you’ll find the latency is too high for it to be performant. Additionally, you can’t have more than a few concurrent connections to Athena. This will quickly become a bottleneck.

Because Athena is limited to running only five queries in parallel at any time by default, there’s no guarantee that your query will be executed immediately. It might work if you’re a small team or an individual. But if Athena is already integrated into an application with real users, they may have to wait minutes to get a response. This is definitely not a good user experience.

Athena is best for batch processing and applications where the latency of the result is not crucial. Athena also works well for data and business intelligence engineers who run a lot of ad hoc queries on the data during development. Once you’re ready to implement an application with low latency and high concurrency requirements though, you should start looking for alternatives.

Building Real-Time Analytics on S3 Using Rockset

Rockset was built with real-time analytics in mind. Rockset’s advanced indexes make it possible to serve results up to 125x faster than Athena, while making data ready to be queried in under a second of being ingested. For instance, you could have one application writing data to S3 while another application is querying for the same data in near-real time.

Athena is not a datastore by itself, it’s just a query engine for the datastore in S3. If you have JSON or CSV files in S3, they are going to be columnar in nature, and there’s only so much you can do with that kind of data. Rockset, however, takes that data and creates different types of indexes on it, thereby making queries as efficient as possible.

S3-Rockset Figure 1: Using Rockset to index data in Amazon S3 for real-time analytics

Converged Index

Rockset creates more than just one index for a piece of data coming into the database. For example, suppose you have JSON data coming into S3 with a field called “name” in it. Rockset sees this field and creates different types of key-value stores on this field. This feature is called converged indexing, and it comes with the following indexes:

Row store
Columnar store
Search index

converged-index Figure 2: Example of converged indexing

As you can see from Figure 3 below, these indexes are used for different purposes based on the query you’re running. For example, if you run a query to find the average value or to sum the values of a particular field, Rockset will optimize for this request and automatically use the columnar store to fetch the results. Similarly, if you are trying to filter your data based on the value of a particular field, Rockset will again optimize for that request and automatically use the search index.

converged-index-different-queries Figure 3: Different indexes are used for different types of queries

Having different types of indexes and letting Rockset decide which is best for a given query means you can stop worrying about optimizing your query and focus on building your feature.

Query Latency

Because Rockset automatically maintains these extensive indexes, less data has to be scanned to get the results of a query. This drastically reduces latency so that Rockset can be used in real-time applications.

This is possible because Rockset decides which index should be used on the fly based on the query. If required, Rockset can use multiple indexes for a single query.

Concurrent Queries

When many users are using your application and frequently querying the database, you need to have a large number of concurrent queries running. This is why Athena’s default limitation of five queries running in parallel can cause a bottleneck, and it’s not straightforward how to increase that number.

Conversely, Rockset supports 1000s of QPS (queries per second) by taking advantage of cloud elasticity and autoscaling compute as needed to handle large query volumes.

Mutability of Data and Schema

In Athena, if you want to change the schema, say to add or remove a field, you have to go to Hive or Glue to make that change. It’s very explicit and involves manual intervention. But with Rockset, it’s all dynamic.

Because Rockset creates indexes based on the data coming in, it automatically adjusts to the schema of the incoming data. This can be a huge timesaver when you have a variety of data coming in from many sources. With Rockset, the data becomes available for queries as soon as it is received, without the need for a predetermined schema.

Developer Productivity

Rockset offers a stored procedure-like feature called Query Lambdas. It is a named, parameterized SQL query saved on Rockset.

Query Lambdas are serverless stored queries in Rockset that use RESTful APIs for interfacing. They take parameters in the API request to be used in the query that will ultimately be run. The query result then comes back in the response of that API request.

The advantage of using Query Lambdas is that you can keep your application code free of hard-coded SQL queries. Based on your needs, you can easily change the query independently of the application and update the Query Lambda in the backend. This does not require any app updates on the user’s end, and they will continue to get the updated results.

Because the interface to Query Lambdas is RESTful APIs, it’s convenient for developers to get started. This also means that a backend team can be writing and updating queries on the Rockset end while frontend developers can simply consume the APIs and focus on improving the application, without having to write complex SQL queries.

Making Real-Time Analytics Possible on Data Lakes

While the S3 and Athena combination is adequate for asynchronous querying use cases, it is less well suited to real-time analytics. Athena was, after all, designed primarily for infrequent queries that could tolerate high variability in latency.

Real-time applications, on the other hand, demand a different type of architecture that optimizes for speed, concurrency, and schema flexibility. If you have a requirement to build more demanding applications on data in S3, Rockset offers a purpose-built solution for real-time analytics.

To learn more, view the Rockset Real-Time Analytics on Data Lakes tech talk with CTO, Dhruba Borthakur, for a more in-depth discussion of key considerations when building applications on S3 data.

To learn more, view the Rockset tech talk below with CTO, Dhruba Borthakur, for a more in-depth discussion of key considerations when building applications on S3 data.

Embedded content: https://youtu.be/9Ytmo6PCBHc