Performance Optimization

Few people think about an application's performance until it is already very bad. The same can be said about the search engine, so you should think about search performance before issues appear. Unlike previous articles, this one is primarily technical and will be most helpful for application administrators and search engine developers.

Solution

Many search index-related issues may appear during application usage, especially under heavy load. I will describe the most common ones and offer a solution for each.

I want to split search index issues into two groups: application issues and search engine issues.

Application issues appear when the amount of data and the load grow above a reasonable threshold and the search index can no longer handle them. Let us look at two common situations.

The first one is when the application sends too much data to the search index. Many application administrators do not know what data is sent to the search index or how it is used. So, they send all the data they have, which leads to lots of unused data being stored there. The fix is straightforward: get rid of the unused data.

The second issue is that the application may trigger too many indexation requests or refresh data in the search index too often. That leads to search index overhead, consumption of additional computing resources, and many operations that do not change anything. There are several solutions here. You may properly track when the data changes and send an indexation request only then. You may refresh only the changed part of the data instead of all of it. Finally, you may let the administrator trigger indexation manually instead of sending automatic requests.
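
As an illustration of refreshing only the changed part of the data, the sketch below sends a partial update for a single document instead of reindexing everything. The index name products, the document id 42, and the price field are hypothetical:

curl -X POST "localhost:9200/products/_update/42" -H 'Content-Type: application/json' -d'
{
  "doc": {
    "price": 19.99
  }
}
'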

Search engine issues are caused by how the search index processes and stores data. We may split them into three groups: index size issues, heavy CPU load, and high memory consumption.

Index size issues are the easiest to address. You may decrease the amount of data sent to the search index (see the similar application issue above), eliminate duplicated search words and tokens, and remove or skip rarely used search tokens. In other words, you need to keep only the minimum amount of information required for the search to work correctly. Sometimes it may even make sense to refactor some features and use other storage (e.g., a DBMS or a memory cache) instead of the search index to get the required data.
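
For example, a field that is only displayed and never searched can be excluded from indexing in the mapping. Here is a minimal sketch, assuming a hypothetical items index where the description field is never queried:

curl -X PUT "localhost:9200/items" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "description": { "type": "text", "index": false }
    }
  }
}
'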

Heavy CPU load is significantly harder to handle. CPU consumption may be caused either by indexation or by search operations.

The indexation process consumes CPU time because of the indexation algorithms. They process the original data, split it into words and tokens, and convert them into special data structures used by the search index during the search. The best way to fix this issue is to find or write another algorithm that produces the same (or similar) output but consumes less CPU time.

Search operations may consume CPU either if there is too much data or if it is stored in an inefficient way. The ways to decrease the amount of data are already described above. As for the storage, you need to understand how exactly the search index stores your data and then searches against it, check this process for unnecessary or useless operations, and eliminate them. For example, you may skip forced data conversion to UTF-8, avoid splitting data into too many small tokens, or disable fuzzy search.
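
For instance, if fuzzy matching is not actually needed, it can be turned off explicitly in the query. Here is a sketch against the same hypothetical items index; setting fuzziness to 0 keeps only exact token matches:

curl -X GET "localhost:9200/items/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "title": {
        "query": "hello",
        "fuzziness": 0
      }
    }
  }
}
'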

Memory usage issues are usually caused by too much data or by too large structures built during the indexation and search processes. Both of these causes have already been discussed and addressed above. In most cases, high memory usage comes together with high disk or CPU usage. So, the general recommendation is to address the source issue first and then check what else may lead to increased memory consumption.

Implementation

To address index size issues, a developer has to understand what analyzer or tokenizer is used for each full-text search field. In most cases, full-text search fields are responsible for most of the data and load during the indexation, so starting with them is a good idea. A developer may also use the analyze API to check what tokens are generated and whether some unused or useless tokens can be excluded.
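
For example, the analyze API accepts a mapped field name, so you can see exactly which tokens the field's analyzer produces. The items index and title field below are hypothetical:

curl -X GET "localhost:9200/items/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "field": "title",
  "text": "Hello world"
}
'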

CPU-related issues can be caused either by the analyzer and tokenizer for the indexation part or by the search query structure for the search part. It is good to understand what is causing the high CPU load (indexation or search) and then optimize the appropriate part. Indexation optimization may include a more efficient custom analyzer. Search optimization usually starts with checking the query itself, its fields, and how the search engine searches against them.
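
To see where a slow search query actually spends its time, you can run it with profiling enabled. Here is a minimal sketch, again against the hypothetical items index:

curl -X GET "localhost:9200/items/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "profile": true,
  "query": {
    "match": { "title": "hello" }
  }
}
'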

Finally, because memory-related issues are usually related to index size or high CPU load, it is better to address the source issues first and then check what else may consume memory.
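
If you still need to find out what consumes the memory, the node stats API reports JVM heap usage per node (the output is omitted here):

curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"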

Let us have a look at an example of index size optimization. Here is an index with a custom analyzer that uses the ngram filter:

curl -X PUT "localhost:9200/performance-before" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "test_fulltext_analyzer": {
            "tokenizer": "whitespace",
            "filter": [ "ngram" ]
          }
        },
        "filter": {
          "ngram": {
            "type": "nGram",
            "min_gram": 1,
            "max_gram": 100
          }
        }
      },
      "max_ngram_diff": 100
    }
  }
}
'

If we check the tokens generated by this analyzer, we will see that it produces 15 tokens for the single word Hello.

curl -X GET "localhost:9200/performance-before/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "test_fulltext_analyzer",
  "text" : "Hello"
}
'
---
{
  "tokens" : [
    {
      "token" : "H",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "He",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Hel",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Hell",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "e",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "el",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ell",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "l",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ll",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "llo",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "l",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "lo",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "o",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    }
  ]
}

Most of these tokens are never used because people usually search from the beginning of a word. So, we can replace the ngram filter with the edge-ngram filter to eliminate this issue:

curl -X PUT "localhost:9200/performance-after" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "test_fulltext_analyzer": {
            "tokenizer": "whitespace",
            "filter": [ "edge-ngram" ]
          }
        },
        "filter": {
          "edge-ngram": {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 100
          }
        }
      }
    }
  }
}
'

Consequently, if we run the same analyze request against the new index, the list of tokens is significantly shorter and will take less space on disk.
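
curl -X GET "localhost:9200/performance-after/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "test_fulltext_analyzer",
  "text" : "Hello"
}
'
---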

{
  "tokens" : [
    {
      "token" : "H",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "He",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Hel",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Hell",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    }
  ]
}
