9 Lessons Learned Scaling Elasticsearch for Real-Time Analytics
We capture lessons learned from engineering teams that scaled real-time analytics on Elasticsearch, covering common challenges around streaming ingestion, managing relationships, frequently changing data and more.
Elasticsearch is a popular open-source search engine that many engineering teams adopt because it is fast thanks to its indexing, developer-friendly and well documented. Many teams start with Elasticsearch for log analytics and then add on more use cases, including real-time analytics.
In this whitepaper, we cover how teams have scaled Elasticsearch for real-time analytics and overcome challenges at scale including:
- Streaming ingestion: An Elasticsearch cluster can throttle when ingesting change data capture (CDC) streams with frequent inserts, updates and deletes. As a result, many companies batch frequently changing fields using the Bulk API for stability and cost-efficiency.
- Sharding and re-sharding: The number of primary shards is set when an index is created and is chosen based on expected access patterns and load. This can be challenging for multi-tenant applications, where access patterns and load vary.
- Modeling relationships: Elasticsearch is a non-relational database, so teams rely on workarounds for SQL-style joins, including data denormalization, application-side joins, nested objects and parent-child relationships.
- Managing the system at scale: Operating Elasticsearch at scale involves many tuning knobs, so engineering teams need to be well-versed in data management, the Query DSL, data processing and cluster management.
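To make the streaming ingestion point concrete, here is a minimal sketch of how a stream of CDC events might be grouped into batches and serialized into the newline-delimited JSON body that the Bulk API accepts. The event shape (`op`, `id`, `doc`), the helper names `batches` and `to_bulk_body`, and the `orders` index are assumptions made for illustration, not part of any team's actual pipeline.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical CDC event shape: {"op": "insert"|"update"|"delete", "id": str, "doc": dict}
Event = Dict[str, Any]


def batches(events: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Group a stream of events into lists of at most `size` items."""
    batch: List[Any] = []
    for event in events:
        batch.append(event)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_bulk_body(events: Iterable[Event], index: str) -> str:
    """Serialize CDC events into an NDJSON body for the Bulk API."""
    lines: List[str] = []
    for event in events:
        meta = {"_index": index, "_id": event["id"]}
        if event["op"] == "insert":
            # "index" action: metadata line followed by the full document.
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps(event["doc"]))
        elif event["op"] == "update":
            # "update" action: metadata line followed by a partial document.
            lines.append(json.dumps({"update": meta}))
            lines.append(json.dumps({"doc": event["doc"]}))
        elif event["op"] == "delete":
            # "delete" action: metadata line only, no source line.
            lines.append(json.dumps({"delete": meta}))
        else:
            raise ValueError(f"unknown CDC operation: {event['op']!r}")
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

Each body produced this way could then be sent in a single `POST /_bulk` request with the `Content-Type: application/x-ndjson` header, so that many inserts, updates and deletes cost one round trip instead of one request each. The right batch size depends on document size and cluster capacity and would need to be tuned per workload.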
This whitepaper is sponsored by Intel and Rockset. Rockset achieves 84% faster performance with Intel Xeon Scalable processors for real-time analytics in the cloud.