How to Update Documents in Elasticsearch
January 23, 2024
Elasticsearch is an open-source search and analytics engine based on Apache Lucene. When building applications on change data capture (CDC) data using Elasticsearch, you’ll want to architect the system to handle frequent updates or modifications to the existing documents in an index.
In this blog, we’ll walk through the different options available for updates including full updates, partial updates and scripted updates. We’ll also discuss what happens under the hood in Elasticsearch when modifying a document and how frequent updates impact CPU utilization in the system.
Example application with frequent updates
To better understand use cases that have frequent updates, let’s look at a search application for a video streaming service like Netflix. When a user searches for a show, e.g. “political thriller”, they are returned a set of relevant results based on keywords and other metadata.
Let’s look at an example document in Elasticsearch of the show “House of Cards”:
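The fields below are illustrative (a hypothetical schema for this example, not real Netflix data); the client call that would index the document is shown in a comment since it requires a running cluster:

```python
# Hypothetical document for "House of Cards"; field names and values
# are illustrative, not real Netflix data.
house_of_cards = {
    "name": "House of Cards",
    "description": "A ruthless congressman and his ambitious wife "
                   "will stop at nothing to conquer Washington, D.C.",
    "genres": ["drama", "political thriller"],
    "views": 100,
}

# With the official Python client (requires a live cluster):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.index(index="movies", id=1, document=house_of_cards)
```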
The search can be configured in Elasticsearch to use the description as a full-text search field. The views field, which stores the number of views per title, can be used to boost content, ranking more popular shows higher. The views field is incremented every time a user watches an episode of a show or a movie.
When using this search configuration in an application at the scale of Netflix, the number of updates performed can easily cross millions per minute, as suggested by the Netflix Engagement Report. According to the report, users watched ~100 billion hours of content on Netflix between January and July. Assuming an average watch time of 15 minutes per episode or movie, that works out to roughly 1.3 million views per minute on average. With the search configuration above, each of those views requires an update, putting update volume in the millions per minute.
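The arithmetic behind that estimate can be checked directly (the 212-day span for January through July is an assumption):

```python
# Back-of-the-envelope check of the update-rate estimate above.
total_hours = 100e9        # ~100 billion hours watched
avg_view_minutes = 15      # assumed average watch time per view
days = 212                 # January 1 through July 31 (assumed span)

total_views = total_hours * 60 / avg_view_minutes
views_per_minute = total_views / (days * 24 * 60)
print(f"{views_per_minute / 1e6:.1f} million views per minute")  # ~1.3
```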
Many search and analytics applications can experience frequent updates, especially when built on CDC data.
Performing updates in Elasticsearch
Let’s delve into a general example of how to perform an update in Elasticsearch with the code below:
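As a sketch, here is the shape of an update request; the index name "movies", the document id and the field value are all illustrative, and the client call is commented out since it needs a live cluster:

```python
# Payload for a generic update: only the fields being changed go in "doc".
update_request = {
    "index": "movies",
    "id": 1,
    "doc": {"description": "A new description for the title."},
}

# With the official Python client:
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.update(index=update_request["index"], id=update_request["id"],
#           doc=update_request["doc"])
```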
Full updates versus partial updates in Elasticsearch
With the index API, you retrieve the entire document, make your changes and then reindex the full document. With the update API, you send only the fields you wish to modify instead of the entire document. This still results in the document being reindexed, but it minimizes the amount of data sent over the network. The update API is especially useful when documents are large and sending the entire document over the network would be time consuming.
Let’s see how both the index API and the update API work using Python code.
Full updates using the index API in Elasticsearch
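The index-API flow can be sketched as two steps. The client calls appear in comments (they assume a live cluster), and the document fetched in step 1 is stubbed with illustrative values:

```python
# Step 1: GET the current document from Elasticsearch.
# current = es.get(index="movies", id=1)["_source"]
current = {"name": "House of Cards", "views": 100}  # stubbed response

# Step 2: modify it in the application, then re-index the WHOLE document.
current["views"] = current["views"] + 1
# es.index(index="movies", id=1, document=current)

print(current["views"])  # → 101
```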
As you can see in the code above, the index API requires two separate calls to Elasticsearch, which can result in slower performance and higher load on your cluster.
Partial updates using the update API in Elasticsearch
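A sketch of a partial update, assuming the same illustrative "movies" index; Elasticsearch merges the "doc" payload into the stored _source before reindexing, which the last lines simulate locally:

```python
# Only the changed fields travel over the network in "doc".
partial_doc = {"views": 101}
# es.update(index="movies", id=1, doc=partial_doc)

# Elasticsearch merges "doc" into the stored _source and reindexes;
# a local simulation of that merge:
stored_source = {"name": "House of Cards", "views": 100}
stored_source.update(partial_doc)
print(stored_source)  # → {'name': 'House of Cards', 'views': 101}
```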
Partial updates still reindex the entire document internally, but they require only a single network call, which gives better performance.
You can use the update API in Elasticsearch to set a new view count but, by itself, the update API cannot increment the view count based on the previous value. That is because computing the new view count requires reading the old one.
Let’s see how we can fix this using a powerful scripting language, Painless.
Partial updates using Painless scripts in Elasticsearch
Painless is a scripting language designed for Elasticsearch and can be used for query and aggregation calculations, complex conditionals, data transformations and more. Painless also enables the use of scripts in update queries to modify documents based on complex logic.
In the example below, we use a Painless script to perform an update in a single API call and increment the new view count based on the value of the old view count.
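A sketch of the scripted update; the script body is standard Painless, while the index name and id remain illustrative. The last lines simulate locally what the script does server-side:

```python
# Painless script that increments views by params.count in one call.
script = {
    "source": "ctx._source.views += params.count",
    "lang": "painless",
    "params": {"count": 1},
}
# es.update(index="movies", id=1, script=script)

# Local simulation of the script's effect on the stored document:
source = {"name": "House of Cards", "views": 100}
source["views"] += script["params"]["count"]
print(source["views"])  # → 101
```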
The Painless script is intuitive to read: it simply increments the view count by 1 for the document being updated.
Updating a nested object in Elasticsearch
Nested objects in Elasticsearch are a data structure that allows for the indexing of arrays of objects as separate documents within a single parent document. Nested objects are useful when dealing with complex data that naturally forms a nested structure, like objects within objects. In a typical Elasticsearch document, arrays of objects are flattened, but using the nested data type allows each object in the array to be indexed and queried independently.
Painless scripts can also be used to update nested objects in Elasticsearch.
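As an example, suppose each show document carries a nested cast field (a hypothetical field, not part of the earlier schema); a Painless script can loop over the nested objects and modify the one that matches. The Python loop at the end mirrors what the script does server-side:

```python
# Painless script that updates one object inside a nested "cast" array.
script = {
    "source": (
        "for (member in ctx._source.cast) { "
        "if (member.name == params.name) { member.role = params.role; } }"
    ),
    "lang": "painless",
    "params": {"name": "Robin Wright", "role": "Claire Underwood"},
}
# es.update(index="movies", id=1, script=script)

# Local simulation of the loop above:
cast = [{"name": "Robin Wright", "role": ""},
        {"name": "Michael Kelly", "role": "Doug Stamper"}]
for member in cast:
    if member["name"] == script["params"]["name"]:
        member["role"] = script["params"]["role"]
print(cast[0]["role"])  # → Claire Underwood
```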
Adding a new field in Elasticsearch
Adding a new field to a document in Elasticsearch can be accomplished through an index operation.
You can partially update an existing document with the new field using the update API. When dynamic mapping is enabled on the index, introducing a new field is straightforward: simply index a document containing that field, and Elasticsearch will automatically figure out the suitable mapping and add the new field to it.
With dynamic mapping on the index disabled, you will need to use the update mapping API. You can see an example below of how to update the index mapping by adding a “category” field to the movies index.
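A sketch of that mapping update; the "category" field name comes from the text, while the keyword type is an assumption made for illustration:

```python
# New field mapping to add to the existing "movies" index.
new_field = {
    "properties": {
        "category": {"type": "keyword"},  # type chosen for illustration
    }
}

# With the official Python client:
# es.indices.put_mapping(index="movies", properties=new_field["properties"])
# Equivalent REST call: PUT /movies/_mapping with new_field as the body.
```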
Updates in Elasticsearch under the hood
While the code is simple, Elasticsearch internally is doing a lot of heavy lifting to perform these updates because data is stored in immutable segments. As a result, Elasticsearch cannot simply make an in-place update to a document. The only way to perform an update is to reindex the entire document, regardless of which API is used.
Elasticsearch uses Apache Lucene under the hood. A Lucene index is composed of one or more segments. A segment is a self-contained, immutable index structure that represents a subset of the overall index. When documents are added or updated, new Lucene segments are created and older documents are marked for soft deletion. Over time, as new documents are added or existing ones are updated, multiple segments may accumulate. To optimize the index structure, Lucene periodically merges smaller segments into larger ones.
Updates are essentially inserts in Elasticsearch
Since each update operation is a reindex operation, all updates are essentially inserts with soft deletes.
There are cost implications to treating an update as an insert. Soft deletion means that old data is retained for some period of time, bloating the storage and memory of the index. Soft deletes, reindexing and garbage collection (segment merging) also take a heavy toll on CPU, a toll that is exacerbated by repeating these operations on all replicas.
Updates get trickier as your product grows and your data changes over time. To keep Elasticsearch performant, you will need to update the shards, analyzers and tokenizers in your cluster, which requires reindexing the entire cluster. For production applications, this means setting up a new cluster and migrating all of the data over. Migrating clusters is both time intensive and error prone, so it's not an operation to take lightly.
Updates in Elasticsearch
The simplicity of the update operations in Elasticsearch can mask the heavy lifting happening under the hood of the system. Elasticsearch treats each update as an insert with a soft delete, requiring the full document to be recreated and reindexed. For applications with frequent updates, this can quickly become expensive, as we saw in the Netflix example where millions of updates happen every minute. We recommend either batching updates using the Bulk API, which reduces cluster load at the cost of added latency, or evaluating alternative solutions when faced with frequent updates in Elasticsearch.
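Batching with the Bulk API can be sketched as interleaved action and payload lines (the document ids and view counts are illustrative):

```python
# Build one bulk request that carries many partial updates at once.
updates = [(1, 101), (2, 57), (3, 980)]

operations = []
for doc_id, views in updates:
    operations.append({"update": {"_index": "movies", "_id": doc_id}})
    operations.append({"doc": {"views": views}})

# es.bulk(operations=operations)
print(len(operations))  # → 6 (one action line + one doc line per update)
```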
Rockset, a search and analytics database built in the cloud, is a mutable alternative to Elasticsearch. Being built on RocksDB, a key-value store popularized for its mutability, Rockset can make in-place updates to documents. This results in only the value of individual fields being updated and reindexed rather than the entire document. If you’d like to compare the performance of Elasticsearch and Rockset for update-heavy workloads, you can start a free trial of Rockset with $300 in credits.