Tech Talk - Real-Time Analytics on Data Lakes: Indexing Amazon S3 for up to 125x Faster Queries

In this talk, Rockset co-founder and CTO Dhruba Borthakur explains how real-time indexing on data lakes provides up to 125x faster queries than Amazon Athena.

More Details

While Athena is widely used for querying data in S3, it cannot provide the performance needed for real-time analytics use cases such as customer 360s, personalization and IoT applications. Drawing on his experience as a founding engineer of RocksDB, Dhruba explains how to use real-time indexing on your data lake to deliver the real-time analytics that power high-performance applications.

Indexing vs. scanning - up to 125x faster than Athena. Rockset automatically builds a search index, a column index and a row index on data ingested from S3 to accelerate the types of queries that are common in applications. Athena scans the data each time it is queried, and its pricing is based on the amount of data scanned, so it is better suited to occasional ad hoc queries than to high-performance real-time analytics.
High concurrency - 1000x concurrency vs. Athena. Rockset has a distributed cloud-native architecture that allows its ingest, storage and query tiers to scale independently in response to workload. The ability to scale query compute as needed allows Rockset to support large numbers of concurrent users without performance degradation. In contrast, Athena can only execute 5 concurrent queries and queues any additional queries.
Real-time analytics - 1 second end-to-end latency vs. hours with Athena. Rockset allows queries on data in JSON, Avro and Parquet formats without any schema or table definition. It supports schemaless ingestion and automatically generates schemas based on the exact fields and types present in the ingested data, so users can run SQL on their raw data. Athena requires schemas and tables to be created before users can query the data, resulting in delays whenever the schema changes.
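
To make the indexing-versus-scanning and schemaless-ingestion points above concrete, here is a minimal toy sketch in Python. It is not Rockset's or Athena's implementation: the sample documents, the function names (scan, build_index, lookup) and the in-memory dictionary index are all assumptions made for illustration. It only shows why an index lookup touches fewer records than a scan, and how field types can be recorded as documents arrive without a predeclared schema.

```python
from collections import defaultdict

# Hypothetical sample documents; no schema is declared up front.
DOCS = [
    {"user": "ana", "country": "BR", "clicks": 3},
    {"user": "bo", "country": "US", "clicks": 7},
    {"user": "cy", "country": "BR"},                  # "clicks" is absent here
    {"user": "di", "country": "US", "clicks": "12"},  # same field, different type
]


def scan(docs, field, value):
    """Full scan: examine every document to answer the query."""
    examined = 0
    hits = []
    for doc_id, doc in enumerate(docs):
        examined += 1
        if doc.get(field) == value:
            hits.append(doc_id)
    return hits, examined


def build_index(docs):
    """Build a toy inverted index and record the types seen for each field."""
    index = defaultdict(list)        # (field, value) -> list of document ids
    field_types = defaultdict(set)   # field -> set of type names observed
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            index[(field, value)].append(doc_id)
            field_types[field].add(type(value).__name__)
    return index, field_types


def lookup(index, field, value):
    """Index lookup: touch only the entries that match the query."""
    hits = index.get((field, value), [])
    return list(hits), len(hits)


index, field_types = build_index(DOCS)
scan_hits, scan_examined = scan(DOCS, "country", "BR")
index_hits, index_examined = lookup(index, "country", "BR")

print("scan: ", scan_hits, "documents examined:", scan_examined)
print("index:", index_hits, "entries examined:", index_examined)
print("types observed for 'clicks':", sorted(field_types["clicks"]))
```

Both approaches return the same matching documents, but the scan examines all four records while the lookup examines only the two that match. The gap grows with the size of the dataset, at the cost of building and maintaining the index at ingest time.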

About the Speaker

Dhruba is the CTO and co-founder of Rockset. He was an engineer on the database team at Facebook, where he was the founding engineer of the RocksDB data store. Earlier at Yahoo, he was one of the founding engineers of the Hadoop Distributed File System.