Error-tolerant Search

It is wildly known that people often make mistakes and typos when they are entering text. This issue also affects search engine requests. So, you as a customer may want to take care of such use cases. There are two typical ways to address this issue: error-tolerant search and search suggestions. This article explains everything about error-tolerant search and how to use it properly.

Solution

Words and letters

Error-tolerant or fault-tolerant search is an algorithm that allows ignoring certain mistakes in the search requests entered by users. Every customer would like to have such a feature as it saves clicks and time for users.

Advantages of the error-tolerant search are clear: better user experience, easier and quicker navigation, better conversion, and finally, customer satisfaction as a consequence.

There are also several disadvantages. The first one is a less precise search that may lead to lots of irrelevant results. The second one is the fact that error-tolerant search consumes more computing resources (memory, CPU).

However, this type of search is still a valuable solution, especially if done correctly despite these disadvantages. Let us check how to do it.

There are three primary best practices to make error-tolerant search work:

  1. You need to understand where the error-tolerant search is required. For example, if you are publishing information about books, you may use it for the book title and description, but not for book number (ISBN).

  2. You need to tune the search engine to tolerate only specific types or a number of errors. E.g., you may allow only one error to reduce the number of false-positive results.

  3. You need to improve relevance by artificially narrowing down the number of results. There are many ways to do it: set minimum relevance, boost exact match or specific results based on customer preferences, and so on.

Combining these practices allows you to build a search engine that gives the user exactly what she wants and allows making mistakes without significantly impacting the search relevance and application performance.

Implementation

Laptop

One of the most critical factors that affect error-tolerant search is search performance in general, or search speed and used computing resources in particular. All these things depend on the algorithm used to calculate if some word matches the request or not.

One of the most common algorithms used to do that is Levenshtein distance. In general, it is a pretty good algorithm to calculate the difference between strings, but it consumes quite a lot of resources if strings are very different. To overcome this issue, you may accept only strings with a minimal difference, let's say one or two characters. This way, the search will be efficient and relevant enough, and it will not consume a lot of computing resources.

Now let us check how to do it in practice.

First, let us create an index for content articles with a simple structure. The first field is key, and it represents an article identifier(s), and we do not want to allow error-tolerant search in this field. So, we will use the whitespace analyzer to split it into tokens by spaces and do not apply any other optimizations. The second field description stores the article description. We will apply a standard English language analyzer and run an error-tolerant search against this field. Pay attention to the index structure's lack of error-tolerance as it happens in real-time during the full-text search. The last query adds three sample articles.

curl -X PUT "localhost:9200/fuzzy-search" curl -X PUT "localhost:9200/fuzzy-search/_mapping" -H 'Content-Type: application/json' -d' { "properties": { "key": { "type": "text", "analyzer": "whitespace" }, "description": { "type": "text", "analyzer": "english" } } } ' curl -X PUT "localhost:9200/fuzzy-search/_doc/1" -H 'Content-Type: application/json' -d' { "key": "apple 111", "description": "Apple An apple is an edible fruit produced by an apple tree (Malus domestica). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus. fruit sweet red yellow" } ' curl -X PUT "localhost:9200/fuzzy-search/_doc/2" -H 'Content-Type: application/json' -d' { "key": "orange 222", "description": "Orange The orange is the fruit of various citrus species in the family Rutaceae (see list of plants known as orange); it primarily refers to Citrus × sinensis,[1] which is also called sweet orange, to distinguish it from the related Citrus × aurantium, referred to as bitter orange. fruit sour orange" } ' curl -X PUT "localhost:9200/fuzzy-search/_doc/3" -H 'Content-Type: application/json' -d' { "key": "banana 333", "description": "Banana A banana is an elongated, edible fruit – botanically a berry[1][2] – produced by several kinds of large herbaceous flowering plants in the genus Musa.[3] In some countries, bananas used for cooking may be called \"plantains\", distinguishing them from dessert bananas. fruit sweet yellow" } '

Now let us build a query to this index. We will use a standard match query against the key field and then apply fuzziness to the description field to implement an error-tolerant search. The value in the fuzziness field defines the Levenshtein distance, i.e., how many characters may differ in our error-tolerant search. Finally, both these queries are wrapped into the bool query to implement OR condition.

curl -X POST "localhost:9200/fuzzy-search/_search?pretty" -H 'Content-Type: application/json' -d' { "query": { "bool" : { "should" : [ { "match" : { "key" : { "query" : "plnt" } } }, { "match" : { "description" : { "query" : "plnt", "fuzziness" : 1 } } } ] } } } '

Please pay attention to the plnt query phrase; it is a word plant with a typo. So, an error-tolerant search should find documents with word plant, which is exactly what it does.

{ "took" : 9, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 2, "relation" : "eq" }, "max_score" : 0.33749554, "hits" : [ { "_index" : "fuzzy-search", "_type" : "_doc", "_id" : "2", "_score" : 0.33749554, "_source" : { "key" : "orange 222", "description" : "Orange The orange is the fruit of various citrus species in the family Rutaceae (see list of plants known as orange); it primarily refers to Citrus × sinensis,[1] which is also called sweet orange, to distinguish it from the related Citrus × aurantium, referred to as bitter orange. fruit sour orange" } }, { "_index" : "fuzzy-search", "_type" : "_doc", "_id" : "3", "_score" : 0.33323938, "_source" : { "key" : "banana 333", "description" : "Banana A banana is an elongated, edible fruit – botanically a berry[1][2] – produced by several kinds of large herbaceous flowering plants in the genus Musa.[3] In some countries, bananas used for cooking may be called \"plantains\", distinguishing them from dessert bananas. fruit sweet yellow" } } ] } }

Previous lesson
<<
Multi-language Search

Next lesson
Search Autocomplete >>