
Elasticsearch Delete Documents

We share the different methods available in Elasticsearch to delete data and their impact on performance and cluster health using the example of a social networking site.


Introduction

Deleting data in Elasticsearch can be done at different levels, including deleting an individual document, deleting an index, or using deletion queries to remove multiple documents based on specific criteria. In this blog, we share the different methods available in Elasticsearch to delete data and their impact on performance and cluster health using the example of a social networking site.

Deleting Documents

To demonstrate the different ways to delete documents in Elasticsearch, we’ll use an example of a social networking site like LinkedIn. LinkedIn enables users to search for different content across its site, including job posts, newsfeed posts, people and groups, using a combination of keyword search and filters. When designing the search functionality, we want to ensure that deleted posts, deleted jobs and deactivated users do not appear in the search results. Let’s see how this requirement translates to the different deletion methods available in Elasticsearch.

Delete API

The Delete API in Elasticsearch allows you to remove a single document from an index based on its unique ID. This operation is straightforward and is commonly used when you know the specific document you want to delete. If each of the posts has a unique ID associated with it, it is simple to remove the document from the search results.

      import requests

      # Elasticsearch settings
      elasticsearch_url = "http://localhost:9200"
      index_name = "posts"
      post_id = "12345"

      # Construct the URL for the Delete API
      url = f"{elasticsearch_url}/{index_name}/_doc/{post_id}"

      # Send a DELETE request to Elasticsearch
      response = requests.delete(url)
    

Delete By Query API

The Delete By Query API in Elasticsearch allows you to delete all documents in one or more indices that match the criteria specified in a query. For example, a user may deactivate their account on the social networking site and we want all of their authored posts to be removed from the search results. In this case, it is more efficient to delete the posts with a single query than to delete them one by one.

      import requests

      # Elasticsearch settings
      elasticsearch_url = "http://localhost:9200"
      index_name = "my_index"

      # Author ID to match for deletion
      author_id = "77368c91-7b77-4bcc-b31f-d4dd6f943aae"

      # Query to match documents for deletion based on the author field
      query = {
          "query": {
              "match": {
                  "authorId": author_id
              }
          }
      }

      # Construct the URL for the Delete By Query API
      url = f"{elasticsearch_url}/{index_name}/_delete_by_query"

      # Send a POST request with the query
      response = requests.post(url, json=query)
    

Bulk API

Elasticsearch has a Bulk API that efficiently performs many operations, including deletions, in a single request. For the social networking site example, we may want to use the Bulk API to remove bot accounts to ensure the integrity of the site. Once we have the document IDs of the bot accounts and their posts, we can delete all of them using a single API request.

      import requests

      # Elasticsearch server URL
      elasticsearch_url = "http://localhost:9200"

      # Index name
      index_name = "your_index"

      # List of document IDs to delete
      document_ids_to_delete = ["doc_id_1", "doc_id_2", "doc_id_3"]

      # Construct the Bulk API request payload
      bulk_request_body = ""
      for doc_id in document_ids_to_delete:
        bulk_request_body += '{"delete": {"_index": "' + index_name + '", "_id": "' + doc_id + '"}}\n'

      # Make the Bulk API request
      bulk_api_url = f"{elasticsearch_url}/{index_name}/_bulk"
      response = requests.post(bulk_api_url, data=bulk_request_body, headers={"Content-Type": "application/json"})
    

When to use which API?

The examples above show how you can use the Delete API for simple use cases and progress to the Bulk API for more complex use cases. The Delete by Query API is suitable when you have complex rules, such as deleting documents that match certain criteria, that can be expressed in a single query. If we want to collate data from multiple sources and identify documents for deletion based on custom logic, the Bulk API is a good choice as it's a single network call, saving on costs.
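The collate-then-delete pattern described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names, the `posts` index, the sample IDs and the chunk size are all assumptions made for the example. It builds the request bodies with `json.dumps` so that IDs containing quotes are escaped correctly, and splits large ID lists into several smaller bulk requests.

```python
import json
from typing import Iterable, Iterator, List


def collate_ids(*sources: Iterable[str]) -> List[str]:
    """Merge ID lists from several sources, dropping duplicates but keeping order."""
    seen = set()
    merged = []
    for source in sources:
        for doc_id in source:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def bulk_delete_bodies(index_name: str, doc_ids: List[str], chunk_size: int = 1000) -> Iterator[str]:
    """Yield newline-delimited JSON bodies with at most chunk_size delete actions each."""
    for start in range(0, len(doc_ids), chunk_size):
        chunk = doc_ids[start:start + chunk_size]
        lines = [json.dumps({"delete": {"_index": index_name, "_id": doc_id}}) for doc_id in chunk]
        # The Bulk API requires every line, including the last one, to end with a newline.
        yield "\n".join(lines) + "\n"


# Hypothetical inputs: IDs flagged by two separate detection pipelines.
flagged_by_rules = ["post-1", "post-2"]
flagged_by_reports = ["post-2", "post-3"]

ids = collate_ids(flagged_by_rules, flagged_by_reports)
bodies = list(bulk_delete_bodies("posts", ids, chunk_size=2))
```

Each string in `bodies` can then be sent to the `_bulk` endpoint in the same way as the request above.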

Best Practices when deleting documents

Versioning

In a distributed system like Elasticsearch, which supports the concurrent handling of requests across multiple nodes, scenarios can arise where a document is being updated by one entity while another attempts to delete that same document. This concurrent access can lead to version conflicts, resulting in unexpected system behavior and inaccurate outcomes for users. Version conflicts are challenging to diagnose and resolve for developers, primarily because they manifest only under specific conditions—namely, when simultaneous API requests target the same document.

Elasticsearch addresses this challenge through the use of document versioning. Each document is assigned a version number, which is incremented with every update. By specifying the expected version of a document in a delete request, we can ensure that the operation only succeeds if the document has not been altered since that version. If the document's version does not match the specified version (indicating that the document has been updated since the version was last checked), Elasticsearch will flag this as a version conflict, and the delete request will be rejected. This prevents the accidental deletion of documents that have been recently updated by other processes. Note that since Elasticsearch 7.0, this check on internally versioned documents is expressed with the if_seq_no and if_primary_term parameters; the version parameter is accepted only together with external versioning.

      from elasticsearch import Elasticsearch, ConflictError

      # Assuming you have an Elasticsearch instance running locally
      es = Elasticsearch("http://localhost:9200")

      index_name = 'your_index_name'
      document_id = 'your_document_id'
      expected_version = 5

      # Delete the document only if it is still at the expected version.
      # On Elasticsearch 7.0+, internally versioned documents need
      # if_seq_no / if_primary_term instead of version.
      try:
          es.delete(index=index_name, id=document_id, version=expected_version)
          print(f"Document with ID {document_id} deleted successfully.")
      except ConflictError:
          # HTTP 409: the stored version differs from expected_version
          print(f"Version conflict: Document with ID {document_id} has been updated.")
      except Exception as e:
          print(f"Error deleting document: {e}")
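
Elasticsearch 7.0 and later express the same optimistic check through the sequence number and primary term that the Get API returns with each document. The following is a minimal sketch under stated assumptions: `delete_if_unchanged` is a hypothetical helper name, `es` is assumed to be an elasticsearch-py client, and `seen_doc` is assumed to be the response of an earlier `es.get(...)` call.

```python
def delete_if_unchanged(es, index_name: str, seen_doc: dict):
    """Delete a document only if it is still the copy we read earlier.

    es        -- assumed to be an elasticsearch-py client
    seen_doc  -- the response of an earlier es.get(...) call for the document
    """
    return es.delete(
        index=index_name,
        id=seen_doc["_id"],
        # Both values are returned by the Get API alongside the document.
        if_seq_no=seen_doc["_seq_no"],
        if_primary_term=seen_doc["_primary_term"],
    )
```

If another process has modified the document in the meantime, Elasticsearch answers with HTTP 409 and the client raises `ConflictError`, which can be handled as in the example above.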
    

We can also use the version_type parameter to handle version conflicts more flexibly in Elasticsearch. The version_type parameter allows us to dictate how Elasticsearch should respond when there’s a discrepancy between the version we specified in the request and the current version of the document in the index.

The default setting for version_type is internal. In this mode, the delete operation will only succeed if the versions match exactly. When we set the version_type to external, the version number is supplied by the caller rather than maintained by Elasticsearch. With external, the delete operation proceeds only if the specified version is strictly greater than the document's current version; if it is equal or lower, the deletion is rejected with a version conflict. The external_gte variant also accepts a version equal to the current one. This allows for scenarios where an external system or a different part of the application, which may not be perfectly synchronized with Elasticsearch's versioning, dictates the necessity of a deletion based on its logic.
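To make the comparison rules concrete, here is a small model of them in plain Python. It is an illustration only, not Elasticsearch code: the function name is hypothetical, and it assumes the documented semantics that `external` requires a strictly greater version while `external_gte` also accepts an equal one.

```python
def delete_allowed(version_type: str, requested: int, stored: int) -> bool:
    """Return True if a delete carrying `requested` would pass the version check.

    `stored` is the version currently held in the index.
    """
    if version_type == "internal":
        return requested == stored   # versions must match exactly
    if version_type == "external":
        return requested > stored    # must be strictly newer
    if version_type == "external_gte":
        return requested >= stored   # newer or equal
    raise ValueError(f"unknown version_type: {version_type}")


# A document stored at version 5, deleted with version 5 under each mode.
results = {vt: delete_allowed(vt, 5, 5) for vt in ("internal", "external", "external_gte")}
```

Here `results` shows that resending the current version is accepted under `internal` and `external_gte` but rejected under `external`.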

Refreshing shards

Elasticsearch's architecture uses segments, which are immutable, write-once files that store indexed documents. This design choice has significant implications for how Elasticsearch handles data operations such as adding, updating, and deleting documents.

Over time, as more documents are added, updated, or deleted in Elasticsearch, the number of segments in an index grows. Too many segments can lead to increased search latency and resource usage. To mitigate this, Elasticsearch periodically performs a process called segment merging. During this process, smaller segments are combined into larger ones, and any documents marked as deleted are physically removed from the new segment. This not only helps in reclaiming space but also in keeping the search performance optimized.

When a document is deleted in Elasticsearch, it isn't immediately removed from the segment it resides in. Instead, Elasticsearch marks the document as deleted within the segment's metadata. The document continues to occupy disk space until the segment it resides in is merged, but it is filtered out of search results once the deletion becomes visible at the next index refresh, which by default runs periodically (every second on actively searched indices). During the segment merging process, Elasticsearch physically removes documents that have been marked as deleted, effectively purging them from the storage.

In scenarios where immediate removal of a document from search results is necessary, such as when a user deletes a post on a platform, Elasticsearch offers a solution through the refresh parameter. By using this parameter in your delete query, you can force Elasticsearch to refresh the index, making the deletion effective immediately. This operation is resource-intensive and can impact performance if used frequently or on large indices.

      from elasticsearch import Elasticsearch

      # Elasticsearch configuration
      es = Elasticsearch("http://localhost:9200")

      index_name = 'your_index_name'
      document_id = 'your_document_id'

      # Deleting a document with immediate index refresh
      es.delete(index=index_name, id=document_id, refresh='true')
    

Running delete by query asynchronously

Elasticsearch supports asynchronous delete operations where a deletion is initiated by a client and then run as a background task by the cluster, freeing the client from waiting for the operation to complete and allowing for the efficient management of cluster resources.

Asynchronous deletes are advisable when you are deleting a large number of documents that can take minutes to complete. HTTP requests are not meant to take minutes to complete; they’re meant to be fast operations. Long running HTTP requests are a waste of resources for the cluster, client and proxy servers.

We can use the wait_for_completion parameter to run deletes asynchronously.

      import requests

      # Elasticsearch server URL
      elasticsearch_url = "http://localhost:9200"

      # Index name
      index_name = "your_index_name"

      # Query body for delete by query
      query_body = {
          "query": {
              "match": {
                  "field": "value"
              }
          }
      }

      # Run asynchronously: Elasticsearch returns a task ID instead of waiting.
      # The value must be the lowercase string "false", not Python's False.
      params = {"wait_for_completion": "false"}

      # Construct the URL for delete by query operation
      delete_by_query_url = f"{elasticsearch_url}/{index_name}/_delete_by_query"

      # Make the delete by query request using the requests library
      response = requests.post(delete_by_query_url, params=params, json=query_body)
    

We can monitor the asynchronous deletion tasks with the following API call:

      GET /_tasks?actions=*delete*
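
An asynchronous delete by query responds with a task ID, and `GET /_tasks/<task_id>` reports that task's progress. The sketch below shows one way to read such a response. The helper name is hypothetical and the sample response is trimmed, with made-up values; only the `completed` flag and the `total` and `deleted` counters under `task.status` are used.

```python
def summarize_task(task_info: dict) -> str:
    """Turn a GET /_tasks/<task_id> response into a one-line progress summary."""
    status = task_info["task"]["status"]
    deleted, total = status["deleted"], status["total"]
    if task_info["completed"]:
        return f"finished: {deleted} of {total} documents deleted"
    return f"running: {deleted} of {total} documents deleted so far"


# Trimmed sample response with hypothetical values.
sample = {
    "completed": False,
    "task": {
        "action": "indices:data/write/delete/byquery",
        "status": {"total": 5000, "deleted": 2000},
    },
}

print(summarize_task(sample))
```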
    

Throttling delete requests

Delete operations in Elasticsearch can be resource-intensive because they not only involve removing the document but also updating the index and possibly triggering segment merges. These operations consume CPU, I/O, and memory resources, which could otherwise be used for handling search queries or indexing new data.

In many applications, the immediacy of delete operations is less critical than the performance of search queries or the indexing of new data. For example, in the social networking site, ensuring that search results are returned quickly is critical for the user experience, while the slow deletion of data will not be as noticeable to users.

Throttling delete requests in Elasticsearch helps to control the rate at which deletes are processed, ensuring that they do not negatively impact the rest of cluster performance. In Elasticsearch, we can set a maximum rate of deletions to stop them from taking over the cluster.

      from elasticsearch import Elasticsearch

      # Assuming you have an Elasticsearch instance running locally
      es = Elasticsearch("http://localhost:9200")

      index_name = 'your_index_name'
      query_body = {
          "query": {
              "match": {
                  "field": "value"
              }
          }
      }

      # Perform the delete by query operation with throttling.
      # requests_per_second is a request parameter, not part of the query body.
      response = es.delete_by_query(
          index=index_name,
          body=query_body,
          requests_per_second=100  # Adjust this value based on your requirements
      )
    

Throttling does not discard any work. When a delete by query would exceed the maximum rate, Elasticsearch waits between batches so that the overall rate stays at or below the limit; the deletions are delayed, not dropped.
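The arithmetic behind the throttle can be sketched as follows. This follows the way the Elasticsearch documentation describes it: each batch is padded with a wait time equal to the batch size divided by `requests_per_second`, minus the time spent writing. The function name is hypothetical, and the batch size of 1000 used below is the documented default.

```python
def throttle_wait_seconds(batch_size: int, requests_per_second: float, write_seconds: float) -> float:
    """Time Elasticsearch pauses after a batch to stay at the requested rate."""
    target_seconds = batch_size / requests_per_second
    # If the write already took longer than the target, there is no extra wait.
    return max(0.0, target_seconds - write_seconds)


# With the limit of 100 used above and a batch that took 0.5s to write:
wait = throttle_wait_seconds(batch_size=1000, requests_per_second=100, write_seconds=0.5)
```

In this example each batch of 1000 deletions has a 10 second budget, so Elasticsearch would pause for 9.5 seconds before starting the next batch.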

Deleting documents is resource-intensive and may require additional resources to be provisioned.

Preventing accidental deletions

Accidentally deleting data can compromise product functionality and user trust. Elasticsearch offers safeguards to help prevent the accidental deletion of data, including soft deletes, blocking of hard deletes in index settings and role-based access controls.

With soft deletes, documents are not physically removed from the database but rather they are marked for deletion. This method ensures that the data remains in the database and can be recovered or restored. With soft deletes, we can mark the document as deleted using a boolean attribute in the document instead of using the Delete API. This is the easiest and most effective way to block document deletions.
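A minimal sketch of the soft-delete pattern is shown below. The field name `isDeleted` and both helper names are assumptions made for illustration. The first helper builds the body of a partial update (sent to `POST /<index>/_update/<id>`) that flags a document, and the second wraps any search query so that flagged documents are excluded from the results.

```python
def soft_delete_body() -> dict:
    """Body for a partial update that flags a document instead of deleting it."""
    return {"doc": {"isDeleted": True}}


def exclude_soft_deleted(query: dict) -> dict:
    """Wrap a query clause so that soft-deleted documents never match."""
    return {
        "query": {
            "bool": {
                "must": [query],
                "must_not": [{"term": {"isDeleted": True}}],
            }
        }
    }


# Example: a keyword search over post content that skips soft-deleted posts.
search_body = exclude_soft_deleted({"match": {"content": "hiring"}})
```

Restoring a document is then a matter of setting the flag back to false. The trade-off is that every search has to carry the extra filter.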

We can also block deletes at the index level. The index.blocks.write setting rejects all document write operations against an index, so any attempt to delete documents using the Delete API will result in an error. The downside of this option is that the index becomes read-only for documents: no new data can be indexed and stale data cannot be removed until the block is lifted.

      PUT /your_index/_settings
      {
        "settings": {
          "index.blocks.write": "true"
        }
      }
    

We recommend using role-based access controls in Elasticsearch to give fine-grained access to users, ensuring that only authorized users can delete data.

Conclusion

Elasticsearch offers different ways to delete documents, including the Delete API to delete individual documents, the Delete By Query API to delete documents that meet the criteria specified in a query and the Bulk API for large-scale deletions. The simple APIs make it easy to delete data, but behind the scenes deletions can trigger segment merge operations that are resource-intensive and may impact cluster performance. That’s because Elasticsearch is built using immutable Lucene segments and deletions can cause segments to become unbalanced, triggering unpredictable merge operations. Furthermore, deleted documents stop appearing in search results only after the affected shards are refreshed, so a deletion is not visible to searches the instant it is issued.

If you have a workload with frequent deletes, updates or inserts, you may want to consider a mutable alternative, Rockset. Rockset supports field-level mutability as it is built using RocksDB. Compare the performance of deletions in Rockset to Elasticsearch by starting a free trial with $300 in credits.