Search Synonym Management

Customers who handle large volumes of data sooner or later want to manage word synonyms. Synonyms solve several common search problems. Let us look at these cases and then see how to implement search with synonyms.

Solution

There are many situations when a customer may want to use synonyms. Let us go through them:

  • true synonyms are the first and most obvious case: you add words with the same meaning as words commonly used in the search index, e.g., huge > big;

  • typos are common in any application, so you may want to define a misspelled word as a synonym for the correctly spelled one, e.g., firt > first;

  • abbreviations - many customers prefer to use abbreviations in the search while the actual document may contain a complete description or vice versa; a search engine can perform both transformations using synonyms, e.g., approx > approximately;

  • euphemisms are another use case: people may search with an indirect word for what they are looking for, and synonyms can map it to the term used in the documents.

Synonym management can be done either manually or automatically. With manual management, a person writes all synonyms and sends them to the search engine. With automatic management, the list of synonyms is periodically pulled from an external source (database, service, API). To make applying synonyms convenient, it is worth implementing the import of synonyms from an Excel or CSV file.
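A file import of this kind can be sketched in a few lines of shell. This is only a minimal sketch under assumptions that are not part of the original text: the CSV has two columns (term, synonym), the helper name csv_to_rules and the file name synonyms.csv are hypothetical, and the output uses the Solr rule format that the Elasticsearch synonym filter accepts.

```shell
# Hypothetical helper: turn a two-column CSV (term,synonym) into
# Solr-format synonym rules, one "term, synonym" rule per line.
csv_to_rules() {
  awk -F',' '
    { sub(/\r$/, "") }          # drop the CR that Excel exports leave behind
    NF < 2 || /^#/ { next }     # skip blank, incomplete, and comment lines
    { print $1 ", " $2 }
  ' "$1"
}

# Example input; synonyms.csv is an assumed file name.
printf 'huge,big\napprox,approximately\n' > synonyms.csv
csv_to_rules synonyms.csv
```

The resulting lines can be placed in the synonyms array of a synonym filter or, alternatively, saved to a file that the filter references through its synonyms_path setting.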

Synonyms can be unidirectional or bidirectional. Unidirectional means that when you define a pair of synonyms A > B, a user can find B by searching for A but cannot find A by searching for B. Bidirectional synonyms always work in both directions.
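In the Solr rule format used by the Elasticsearch synonym filter, the two kinds can be written side by side: an arrow (=>) defines an explicit one-way mapping, while a comma-separated list defines equivalent words. The request below is a sketch only; the index name synonym-directions and the filter name direction_filter are made up for illustration. Applied at search time, the first rule rewrites a query for firt into first, and the second lets huge and big match each other.

```shell
curl -X PUT "localhost:9200/synonym-directions" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "direction_filter": {
            "type": "synonym",
            "synonyms": [
              "firt => first",
              "huge, big"
            ]
          }
        }
      }
    }
  }
}
'
```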

Implementation

Let us try to create a test index and see how to make synonyms work using built-in features of Elasticsearch.

First, we will create a simple index that defines a custom synonym_filter and a custom synonym_analyzer. The filter is based on the built-in Elasticsearch synonym token filter and holds the list of synonyms. In our case, there is only one pair of synonyms: family, group. The analyzer combines the built-in whitespace tokenizer with the custom filter to expand synonyms where possible.

curl -X PUT "localhost:9200/synonym-management" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "whitespace",
            "filter": [ "synonym_filter" ]
          }
        },
        "filter": {
          "synonym_filter": {
            "type": "synonym",
            "synonyms": [ "family, group" ]
          }
        }
      }
    }
  }
}
'
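Before adding any documents, it can be useful to check what the analyzer actually produces. A possible way to do that, assuming the index above was created successfully on a local node, is the _analyze API. For the text group, the response should list tokens for both group and family.

```shell
curl -X POST "localhost:9200/synonym-management/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "synonym_analyzer",
  "text": "group"
}
'
```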

The next part is the mapping. We will define only one field, data, but with different analyzers for indexing and search. Indexing will use the built-in whitespace analyzer, while search will use our custom synonym_analyzer. The point is that if you need to add or change synonyms, you will not need to reindex anything, because synonym expansion happens at search time, not at index time. We will also index two documents for testing.

curl -X PUT "localhost:9200/synonym-management/_mapping" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "data": {
      "type": "text",
      "analyzer": "whitespace",
      "search_analyzer": "synonym_analyzer"
    }
  }
}
'
curl -X PUT "localhost:9200/synonym-management/_doc/1" -H 'Content-Type: application/json' -d'
{
  "data": "Apple An apple is an edible sweet fruit produced by an apple tree (Malus domestica). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus."
}
'
curl -X PUT "localhost:9200/synonym-management/_doc/2" -H 'Content-Type: application/json' -d'
{
  "data": "Orange The orange is the fruit of various citrus species in the family Rutaceae (see list of plants known as orange); it primarily refers to Citrus × sinensis,[1] which is also called sweet orange, to distinguish it from the related Citrus × aurantium, referred to as bitter orange."
}
'
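Because synonyms are applied only at search time, changing them later does not touch the indexed documents. One possible way to do it with inline synonyms is sketched below: close the index, update the filter definition, and reopen the index. The extra pair huge, big is only an illustrative assumption, and the index is unavailable for searches while it is closed.

```shell
# Analysis settings can only be changed while the index is closed.
curl -X POST "localhost:9200/synonym-management/_close"

# Redefine the filter with the full, updated list of synonym rules.
curl -X PUT "localhost:9200/synonym-management/_settings" -H 'Content-Type: application/json' -d'
{
  "analysis": {
    "filter": {
      "synonym_filter": {
        "type": "synonym",
        "synonyms": [ "family, group", "huge, big" ]
      }
    }
  }
}
'

# Reopen the index; existing documents are searchable without reindexing.
curl -X POST "localhost:9200/synonym-management/_open"
```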

Now let us see how synonyms work, using a standard match query. Note that the query contains the word group, which is not present in the original documents and exists only in the synonym configuration.

curl -X POST "localhost:9200/synonym-management/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match" : { 
      "data" : "group" 
    }
  },
  "_source": false
}
'

As you can see, the result contains the second document, which includes the word family, defined as a synonym for group.

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6370649,
    "hits" : [
      {
        "_index" : "synonym-management",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6370649
      }
    ]
  }
}
