Using the Amazon MSK Native Connector to Simplify Real-Time Analytics on Kafka

December 14, 2022

Rockset’s native connector for Amazon Managed Streaming for Apache Kafka (MSK) makes it simpler and faster to ingest streaming data for real-time analytics. Amazon MSK is a fully managed AWS service that lets users build and run applications on Apache Kafka. Amazon MSK provides control-plane operations, such as creating and deleting clusters, while users continue to produce and consume data through the standard Apache Kafka data-plane operations.

With the MSK integration, users do not need to build, deploy or operate any infrastructure components on the Kafka side. Here’s how Rockset is making it easier to ingest streaming data from MSK with this data integration:

  • The integration is managed entirely by Rockset and can be set up with just a few clicks, in keeping with our philosophy of making real-time analytics accessible.
  • The integration is continuous, so any new data in the Kafka topic gets indexed in Rockset, delivering an end-to-end data latency of around two seconds.
  • There is no need to pre-create a schema to run real-time analytics on event streams from Kafka. Rockset indexes the entire data stream, so new fields are exposed and queryable with SQL as soon as they are added.

Under the Hood

Rockset’s Kafka integration is built on the Kafka Consumer API, a low-level, vanilla Java library that can be easily embedded into applications to tail data from a Kafka topic.

When you create a new collection from an Amazon MSK integration and specify one or more topics, Rockset tails those topics using the Kafka Consumer API and consumes data in real time. Rockset handles the heavy lifting, such as progress checkpointing and recovery from common failure cases, with the Aggregator Leaf Tailer (ALT) architecture. Rockset manages the consumption offsets entirely on its side, without saving any information inside the customer’s cluster. During initialization, each ingestion worker receives its own topic-partition assignment and last processed offsets from the ingestion coordinator, and then uses the embedded consumer to fetch Kafka topic data.
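To make the coordinator and worker roles concrete, here is a minimal, self-contained Python sketch of the pattern described above. It is not Rockset’s implementation: the names `IngestionCoordinator` and `run_worker`, the round-robin assignment, and the in-memory `log` that stands in for the Kafka brokers are all assumptions made for illustration. The point it shows is that offsets live with the coordinator, outside the Kafka cluster, and that a worker resumes from the last checkpoint it was handed.

```python
from dataclasses import dataclass, field


@dataclass
class IngestionCoordinator:
    """Hypothetical coordinator: owns partition assignments and checkpoints."""
    num_workers: int
    # (topic, partition) -> next offset to read; kept outside the Kafka cluster.
    checkpoints: dict = field(default_factory=dict)

    def assignment_for(self, worker_id, partitions):
        """Return {(topic, partition): start_offset} for one worker (round-robin)."""
        mine = [tp for i, tp in enumerate(sorted(partitions))
                if i % self.num_workers == worker_id]
        return {tp: self.checkpoints.get(tp, 0) for tp in mine}

    def commit(self, tp, next_offset):
        """Record progress; a checkpoint never moves backwards."""
        self.checkpoints[tp] = max(self.checkpoints.get(tp, 0), next_offset)


def run_worker(coordinator, worker_id, log, batch_size=2):
    """Tail the assigned partitions of an in-memory log and checkpoint progress.

    `log` maps (topic, partition) -> list of records and stands in for the
    brokers; a real worker would fetch through an embedded Kafka consumer.
    """
    consumed = []
    for tp, offset in coordinator.assignment_for(worker_id, log.keys()).items():
        while offset < len(log[tp]):
            batch = log[tp][offset:offset + batch_size]
            consumed.extend(batch)          # stand-in for indexing the batch
            offset += len(batch)
            coordinator.commit(tp, offset)  # checkpoint only after processing
    return consumed
```

With `log = {("events", 0): ["a", "b", "c"], ("events", 1): ["d", "e"]}` and two workers, worker 0 reads `["a", "b", "c"]` on its first run. If `"f"` is then appended to partition 0 and the worker restarts, it reads only `["f"]`, because it starts from the checkpointed offset rather than from the beginning of the partition.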

The main difference between Amazon MSK and Confluent Kafka in Rockset’s Kafka integration is how we authenticate with your cluster. Amazon MSK uses IAM for secure authentication, so we added support for IAM authentication using AWS Cross-Account IAM Roles. When you create a new Amazon MSK integration and provide a Cross-Account IAM role, Rockset authenticates with your MSK cluster using the Amazon MSK Library for IAM.
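For readers who want to see what IAM authentication looks like at the client level, the sketch below assembles the Kafka client properties that the Amazon MSK Library for IAM documents for assuming a role. The property names and class names come from that library; the helper `msk_iam_client_properties`, the example broker address, and the example ARN are hypothetical, and whether Rockset sets exactly these properties internally is an assumption.

```python
def msk_iam_client_properties(bootstrap_servers, role_arn):
    """Build Kafka client properties for IAM auth through an assumed role.

    Hypothetical helper: it only assembles a dict. A Java Kafka client with the
    Amazon MSK Library for IAM on its classpath would consume these settings.
    """
    if ":iam::" not in role_arn or ":role/" not in role_arn:
        raise ValueError(f"not an IAM role ARN: {role_arn!r}")
    return {
        "bootstrap.servers": bootstrap_servers,
        # IAM authentication runs over TLS with a dedicated SASL mechanism.
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "AWS_MSK_IAM",
        # awsRoleArn tells the login module which role to assume.
        "sasl.jaas.config": (
            "software.amazon.msk.auth.iam.IAMLoginModule required "
            f'awsRoleArn="{role_arn}";'
        ),
        "sasl.client.callback.handler.class":
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler",
    }


if __name__ == "__main__":
    props = msk_iam_client_properties(
        "b-1.example.kafka.us-west-2.amazonaws.com:9098",  # placeholder broker
        "arn:aws:iam::123456789012:role/rockset-msk-role",  # placeholder role
    )
    for key, value in props.items():
        print(f"{key}={value}")
```

The role ARN is the only customer-specific credential in this configuration, which is why the integration asks for a role rather than long-lived access keys.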

Amazon MSK and Rockset for Real-Time Analytics

As soon as event data lands in MSK, Rockset automatically indexes it for sub-second SQL queries. You can search, aggregate and join data across Kafka topics and other data sources including data in S3, MongoDB, DynamoDB, Postgres, and more. Then, simply turn the SQL query into an API to serve data in your application.

We have also load tested the new MSK integration with sample data and various load configurations, reaching a maximum throughput of approximately 33 MB/s.


Quick Amazon MSK Setup

Set up the Integration

To set up an Amazon MSK integration, first go to the integrations page in the Rockset console. Select the Amazon MSK option and click “Start” to begin creating the integration, then provide the information Rockset needs to connect to your cluster.


Provide a name for your integration along with an optional description. Create a new IAM policy and attach the policy to a new or existing IAM role to give Rockset read access to your MSK cluster. Provide the role ARN for the IAM role and the bootstrap servers URL from your MSK cluster’s dashboard.
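As an illustration of what a read-only IAM policy for an MSK cluster can look like, the sketch below generates one in Python. The `kafka-cluster:` actions and the ARN formats are the ones Amazon MSK defines for IAM access control, but the exact set of actions Rockset requires is an assumption here; the setup instructions in the Rockset console are the authoritative source. The helper name `msk_read_policy` and all example values are hypothetical.

```python
import json


def msk_read_policy(region, account_id, cluster_name, cluster_uuid, topics):
    """Return an IAM policy document granting read access to the given topics.

    Assumed minimal action set: connect to the cluster, then describe and read
    the listed topics. Your setup may call for additional actions.
    """
    prefix = f"arn:aws:kafka:{region}:{account_id}"
    cluster = f"{cluster_name}/{cluster_uuid}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kafka-cluster:Connect"],
                "Resource": [f"{prefix}:cluster/{cluster}"],
            },
            {
                "Effect": "Allow",
                "Action": [
                    "kafka-cluster:DescribeTopic",
                    "kafka-cluster:ReadData",
                ],
                "Resource": [f"{prefix}:topic/{cluster}/{t}" for t in topics],
            },
        ],
    }


if __name__ == "__main__":
    policy = msk_read_policy(
        "us-west-2", "123456789012",        # placeholder region and account
        "demo-cluster", "abcd1234-ef56",    # placeholder cluster name and UUID
        ["orders", "clicks"],               # placeholder topic names
    )
    print(json.dumps(policy, indent=2))
```

Scoping the topic statement to named topics, rather than a wildcard, keeps the role limited to the data you intend to ingest.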



Create a Collection

A collection in Rockset is similar to a table in the SQL world. To create a collection, add details including the Kafka topic(s) you want Rockset to consume. The starting offset setting lets you either backfill historical data from the beginning of the topic or capture only the latest streams.
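The effect of the starting offset can be sketched in a few lines of Python. The policy names `"earliest"` and `"latest"` mirror the values Kafka uses for its `auto.offset.reset` consumer setting; the functions themselves are hypothetical and only illustrate the choice, not how Rockset implements it.

```python
def resolve_start_offset(policy, earliest_offset, log_end_offset):
    """Pick the first offset to read for one partition.

    "earliest" backfills everything still retained in the partition;
    "latest" skips history and reads only records written from now on.
    """
    if policy == "earliest":
        return earliest_offset
    if policy == "latest":
        return log_end_offset
    raise ValueError(f"unknown starting offset policy: {policy!r}")


def backlog_size(policy, earliest_offset, log_end_offset):
    """Number of existing records that will be ingested before live data."""
    return log_end_offset - resolve_start_offset(
        policy, earliest_offset, log_end_offset)
```

For a partition that retains offsets 100 through 999 (log end offset 1000), `backlog_size("earliest", 100, 1000)` is 900 records of history, while `backlog_size("latest", 100, 1000)` is 0.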


Query Topic Data using SQL

As soon as the data is ingested, Rockset will index the data in a Converged Index for fast analytics at scale. This means you can query semi-structured, deeply nested data using SQL without needing to do any data preparation or performance tuning.

With the integration and collection in place, we can write a SQL query directly against the Amazon MSK data, going from setup to query in a matter of minutes.


We’re excited to continue to make it easy for developers and data teams to analyze streaming data in real time. If you’re a user of Amazon MSK, it’s easier now than ever before with Rockset’s native support for MSK.