Vova Bilyachat

Melbourne, Australia

Search like Google with Elasticsearch. Autocomplete, "Did you mean" and search for items.

21 July 2015

I have been working with Elasticsearch for more than a year now. In this post I would like to share my knowledge of how to build full-text search over documents, offer users spelling corrections, and suggest what to type based on the index content. To work with Elasticsearch I prefer the head plugin over a curl library: it is easy to use and gives me extra tools such as browsing index documents, building queries, and so on.

Create and Configure Index

To begin with the index configuration, open http://localhost:9200/_plugin/head/ (you need the head plugin installed), go to the Any Request tab, and send the following request to create the index:

PUT http://localhost:9200/your_index_name/
{
    "settings": {
        "index": {
            "analysis": {
                "filter": {
                    "stemmer": {
                        "type": "stemmer",
                        "language": "english"
                    },
                    "autocompleteFilter": {
                        "max_shingle_size": "5",
                        "min_shingle_size": "2",
                        "type": "shingle"
                    },
                    "stopwords": {
                        "type": "stop",
                        "stopwords": ["_english_"]
                    }
                },
                "analyzer": {
                    "didYouMean": {
                        "filter": ["lowercase" ],
                        "char_filter": [ "html_strip" ],
                         "type": "custom",
                         "tokenizer": "standard"
                    },
                    "autocomplete": {
                        "filter": ["lowercase", "autocompleteFilter" ],
                        "char_filter": ["html_strip"],
                        "type": "custom",
                        "tokenizer": "standard"
                    },
                    "default": {
                        "filter": ["lowercase","stopwords", "stemmer"],
                        "char_filter": ["html_strip"],
                        "type": "custom",
                        "tokenizer": "standard"
                    }
                }
            }
        }
    }
}
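
If you prefer to use curl instead of the head plugin, the same request can be sent from the command line; a minimal sketch, assuming the settings body above has been saved to a file called settings.json:

curl -XPUT 'http://localhost:9200/your_index_name/' -d @settings.json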

Now let me explain what we did.

  • default - this analyzer will be used to search documents in our storage. It is built from the following filters:
  • html_strip - this char filter runs first; it strips HTML tags from the text if present and decodes HTML entities, which is very handy for non-English languages where some characters arrive encoded.
  • lowercase - Elasticsearch splits text into terms, and terms are case sensitive, so the text must be lower-cased; that way the search does not depend on how the user typed his input.
  • stopwords - this filter removes stop words such as is, a, all, an, etc. We do not want these very common words to influence scoring, since they occur often but should have no impact on the search.
  • stemmer - reduces the remaining tokens to their stems, finishing the clean-up of the text.
  • didYouMean - this analyzer is simple, since most of the work will be done by Elasticsearch's native suggest query; again it uses the lowercase filter and the html_strip char filter.
  • autocomplete - this analyzer will be used to suggest to the user what to type, based on his input and the index content. For instance, if the user types "ad", ES should suggest what else he can type, depending on commonly used phrases in the index; it could be "address" or "addon", whichever occurs more often. This analyzer adds the shingle filter, whose role is to split the text into overlapping word groups such as "this is example", "is example", "example". Later I will show how this analyzer works in real life.

We can test each analyzer with the analyze API to see which tokens Elasticsearch will store. To test the autocomplete analyzer, simply run

POST http://localhost:9200/your_index_name/_analyze
{
  "analyzer" : "autocomplete",
  "text" : "Recommend questions get too fulfilled. He fact in we case miss sake. Entrance be throwing he do blessing"
}

and Elasticsearch will return the generated tokens, both single words and shingles:

{
  "tokens": [
    {
      "token": "recommend",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "recommend questions",
      "start_offset": 0,
      "end_offset": 19,
      "type": "shingle",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "recommend questions get",
      "start_offset": 0,
      "end_offset": 23,
      "type": "shingle",
      "position": 0,
      "positionLength": 3
    },
    {
      "token": "recommend questions get too",
      "start_offset": 0,
      "end_offset": 27,
      "type": "shingle",
      "position": 0,
      "positionLength": 4
    },
    {
      "token": "recommend questions get too fulfilled",
      "start_offset": 0,
      "end_offset": 37,
      "type": "shingle",
      "position": 0,
      "positionLength": 5
    },
    {
      "token": "questions",
      "start_offset": 10,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "questions get",
      "start_offset": 10,
      "end_offset": 23,
      "type": "shingle",
      "position": 1,
      "positionLength": 2
    },
    {
      "token": "questions get too",
      "start_offset": 10,
      "end_offset": 27,
      "type": "shingle",
      "position": 1,
      "positionLength": 3
    },
    {
      "token": "questions get too fulfilled",
      "start_offset": 10,
      "end_offset": 37,
      "type": "shingle",
      "position": 1,
      "positionLength": 4
    },
    {
      "token": "questions get too fulfilled he",
      "start_offset": 10,
      "end_offset": 41,
      "type": "shingle",
      "position": 1,
      "positionLength": 5
    },
    {
      "token": "get",
      "start_offset": 20,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "get too",
      "start_offset": 20,
      "end_offset": 27,
      "type": "shingle",
      "position": 2,
      "positionLength": 2
    },
    {
      "token": "get too fulfilled",
      "start_offset": 20,
      "end_offset": 37,
      "type": "shingle",
      "position": 2,
      "positionLength": 3
    },
    {
      "token": "get too fulfilled he",
      "start_offset": 20,
      "end_offset": 41,
      "type": "shingle",
      "position": 2,
      "positionLength": 4
    },
    {
      "token": "get too fulfilled he fact",
      "start_offset": 20,
      "end_offset": 46,
      "type": "shingle",
      "position": 2,
      "positionLength": 5
    },
    {
      "token": "too",
      "start_offset": 24,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "too fulfilled",
      "start_offset": 24,
      "end_offset": 37,
      "type": "shingle",
      "position": 3,
      "positionLength": 2
    },
    {
      "token": "too fulfilled he",
      "start_offset": 24,
      "end_offset": 41,
      "type": "shingle",
      "position": 3,
      "positionLength": 3
    },
    {
      "token": "too fulfilled he fact",
      "start_offset": 24,
      "end_offset": 46,
      "type": "shingle",
      "position": 3,
      "positionLength": 4
    },
    {
      "token": "too fulfilled he fact in",
      "start_offset": 24,
      "end_offset": 49,
      "type": "shingle",
      "position": 3,
      "positionLength": 5
    },
    {
      "token": "fulfilled",
      "start_offset": 28,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "fulfilled he",
      "start_offset": 28,
      "end_offset": 41,
      "type": "shingle",
      "position": 4,
      "positionLength": 2
    },
    {
      "token": "fulfilled he fact",
      "start_offset": 28,
      "end_offset": 46,
      "type": "shingle",
      "position": 4,
      "positionLength": 3
    },
    {
      "token": "fulfilled he fact in",
      "start_offset": 28,
      "end_offset": 49,
      "type": "shingle",
      "position": 4,
      "positionLength": 4
    },
    {
      "token": "fulfilled he fact in we",
      "start_offset": 28,
      "end_offset": 52,
      "type": "shingle",
      "position": 4,
      "positionLength": 5
    },
    {
      "token": "he",
      "start_offset": 39,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "he fact",
      "start_offset": 39,
      "end_offset": 46,
      "type": "shingle",
      "position": 5,
      "positionLength": 2
    },
    {
      "token": "he fact in",
      "start_offset": 39,
      "end_offset": 49,
      "type": "shingle",
      "position": 5,
      "positionLength": 3
    },
    {
      "token": "he fact in we",
      "start_offset": 39,
      "end_offset": 52,
      "type": "shingle",
      "position": 5,
      "positionLength": 4
    },
    {
      "token": "he fact in we case",
      "start_offset": 39,
      "end_offset": 57,
      "type": "shingle",
      "position": 5,
      "positionLength": 5
    },
    {
      "token": "fact",
      "start_offset": 42,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "fact in",
      "start_offset": 42,
      "end_offset": 49,
      "type": "shingle",
      "position": 6,
      "positionLength": 2
    },
    {
      "token": "fact in we",
      "start_offset": 42,
      "end_offset": 52,
      "type": "shingle",
      "position": 6,
      "positionLength": 3
    },
    {
      "token": "fact in we case",
      "start_offset": 42,
      "end_offset": 57,
      "type": "shingle",
      "position": 6,
      "positionLength": 4
    },
    {
      "token": "fact in we case miss",
      "start_offset": 42,
      "end_offset": 62,
      "type": "shingle",
      "position": 6,
      "positionLength": 5
    },
    {
      "token": "in",
      "start_offset": 47,
      "end_offset": 49,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "in we",
      "start_offset": 47,
      "end_offset": 52,
      "type": "shingle",
      "position": 7,
      "positionLength": 2
    },
    {
      "token": "in we case",
      "start_offset": 47,
      "end_offset": 57,
      "type": "shingle",
      "position": 7,
      "positionLength": 3
    },
    {
      "token": "in we case miss",
      "start_offset": 47,
      "end_offset": 62,
      "type": "shingle",
      "position": 7,
      "positionLength": 4
    },
    {
      "token": "in we case miss sake",
      "start_offset": 47,
      "end_offset": 67,
      "type": "shingle",
      "position": 7,
      "positionLength": 5
    },
    {
      "token": "we",
      "start_offset": 50,
      "end_offset": 52,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "we case",
      "start_offset": 50,
      "end_offset": 57,
      "type": "shingle",
      "position": 8,
      "positionLength": 2
    },
    {
      "token": "we case miss",
      "start_offset": 50,
      "end_offset": 62,
      "type": "shingle",
      "position": 8,
      "positionLength": 3
    },
    {
      "token": "we case miss sake",
      "start_offset": 50,
      "end_offset": 67,
      "type": "shingle",
      "position": 8,
      "positionLength": 4
    },
    {
      "token": "we case miss sake entrance",
      "start_offset": 50,
      "end_offset": 77,
      "type": "shingle",
      "position": 8,
      "positionLength": 5
    },
    {
      "token": "case",
      "start_offset": 53,
      "end_offset": 57,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "case miss",
      "start_offset": 53,
      "end_offset": 62,
      "type": "shingle",
      "position": 9,
      "positionLength": 2
    },
    {
      "token": "case miss sake",
      "start_offset": 53,
      "end_offset": 67,
      "type": "shingle",
      "position": 9,
      "positionLength": 3
    },
    {
      "token": "case miss sake entrance",
      "start_offset": 53,
      "end_offset": 77,
      "type": "shingle",
      "position": 9,
      "positionLength": 4
    },
    {
      "token": "case miss sake entrance be",
      "start_offset": 53,
      "end_offset": 80,
      "type": "shingle",
      "position": 9,
      "positionLength": 5
    },
    {
      "token": "miss",
      "start_offset": 58,
      "end_offset": 62,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "miss sake",
      "start_offset": 58,
      "end_offset": 67,
      "type": "shingle",
      "position": 10,
      "positionLength": 2
    },
    {
      "token": "miss sake entrance",
      "start_offset": 58,
      "end_offset": 77,
      "type": "shingle",
      "position": 10,
      "positionLength": 3
    },
    {
      "token": "miss sake entrance be",
      "start_offset": 58,
      "end_offset": 80,
      "type": "shingle",
      "position": 10,
      "positionLength": 4
    },
    {
      "token": "miss sake entrance be throwing",
      "start_offset": 58,
      "end_offset": 89,
      "type": "shingle",
      "position": 10,
      "positionLength": 5
    },
    {
      "token": "sake",
      "start_offset": 63,
      "end_offset": 67,
      "type": "<ALPHANUM>",
      "position": 11
    },
    {
      "token": "sake entrance",
      "start_offset": 63,
      "end_offset": 77,
      "type": "shingle",
      "position": 11,
      "positionLength": 2
    },
    {
      "token": "sake entrance be",
      "start_offset": 63,
      "end_offset": 80,
      "type": "shingle",
      "position": 11,
      "positionLength": 3
    },
    {
      "token": "sake entrance be throwing",
      "start_offset": 63,
      "end_offset": 89,
      "type": "shingle",
      "position": 11,
      "positionLength": 4
    },
    {
      "token": "sake entrance be throwing he",
      "start_offset": 63,
      "end_offset": 92,
      "type": "shingle",
      "position": 11,
      "positionLength": 5
    },
    {
      "token": "entrance",
      "start_offset": 69,
      "end_offset": 77,
      "type": "<ALPHANUM>",
      "position": 12
    },
    {
      "token": "entrance be",
      "start_offset": 69,
      "end_offset": 80,
      "type": "shingle",
      "position": 12,
      "positionLength": 2
    },
    {
      "token": "entrance be throwing",
      "start_offset": 69,
      "end_offset": 89,
      "type": "shingle",
      "position": 12,
      "positionLength": 3
    },
    {
      "token": "entrance be throwing he",
      "start_offset": 69,
      "end_offset": 92,
      "type": "shingle",
      "position": 12,
      "positionLength": 4
    },
    {
      "token": "entrance be throwing he do",
      "start_offset": 69,
      "end_offset": 95,
      "type": "shingle",
      "position": 12,
      "positionLength": 5
    },
    {
      "token": "be",
      "start_offset": 78,
      "end_offset": 80,
      "type": "<ALPHANUM>",
      "position": 13
    },
    {
      "token": "be throwing",
      "start_offset": 78,
      "end_offset": 89,
      "type": "shingle",
      "position": 13,
      "positionLength": 2
    },
    {
      "token": "be throwing he",
      "start_offset": 78,
      "end_offset": 92,
      "type": "shingle",
      "position": 13,
      "positionLength": 3
    },
    {
      "token": "be throwing he do",
      "start_offset": 78,
      "end_offset": 95,
      "type": "shingle",
      "position": 13,
      "positionLength": 4
    },
    {
      "token": "be throwing he do blessing",
      "start_offset": 78,
      "end_offset": 104,
      "type": "shingle",
      "position": 13,
      "positionLength": 5
    },
    {
      "token": "throwing",
      "start_offset": 81,
      "end_offset": 89,
      "type": "<ALPHANUM>",
      "position": 14
    },
    {
      "token": "throwing he",
      "start_offset": 81,
      "end_offset": 92,
      "type": "shingle",
      "position": 14,
      "positionLength": 2
    },
    {
      "token": "throwing he do",
      "start_offset": 81,
      "end_offset": 95,
      "type": "shingle",
      "position": 14,
      "positionLength": 3
    },
    {
      "token": "throwing he do blessing",
      "start_offset": 81,
      "end_offset": 104,
      "type": "shingle",
      "position": 14,
      "positionLength": 4
    },
    {
      "token": "he",
      "start_offset": 90,
      "end_offset": 92,
      "type": "<ALPHANUM>",
      "position": 15
    },
    {
      "token": "he do",
      "start_offset": 90,
      "end_offset": 95,
      "type": "shingle",
      "position": 15,
      "positionLength": 2
    },
    {
      "token": "he do blessing",
      "start_offset": 90,
      "end_offset": 104,
      "type": "shingle",
      "position": 15,
      "positionLength": 3
    },
    {
      "token": "do",
      "start_offset": 93,
      "end_offset": 95,
      "type": "<ALPHANUM>",
      "position": 16
    },
    {
      "token": "do blessing",
      "start_offset": 93,
      "end_offset": 104,
      "type": "shingle",
      "position": 16,
      "positionLength": 2
    },
    {
      "token": "blessing",
      "start_offset": 96,
      "end_offset": 104,
      "type": "<ALPHANUM>",
      "position": 17
    }
  ]
}
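
The default analyzer can be tested the same way. Since it adds the stopwords and stemmer filters, stop words such as "he", "in" and "we" should disappear from the output and words like "questions" should be reduced to their stems; a request sketch, assuming the same index:

POST http://localhost:9200/your_index_name/_analyze
{
  "analyzer" : "default",
  "text" : "Recommend questions get too fulfilled. He fact in we case miss sake."
}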

Mapping

PUT http://localhost:9200/your_index_name/_mapping/your_type/
{
    "your_type": {
        "properties": {
            "autocomplete": {
                "type": "string",
                "analyzer": "autocomplete"
            },
            "description": {
                "type": "string",
                "copy_to": ["did_you_mean", "autocomplete"]
            },
            "did_you_mean": {
                "type": "string",
                "analyzer": "didYouMean"
            },
            "title": {
                "type": "string",
                "copy_to": ["autocomplete","did_you_mean"]
            }
        }
    }
}

In my solution I have only the title and description fields, and both are copied into the autocomplete and did_you_mean fields.

In my original application I have no static mapping at all, because I cannot know in advance which fields a customer will have. Instead I use dynamic mapping, where every string field is copied into metadata fields such as did_you_mean and autocomplete. It would also be possible to use the multi_field type, but the problem is that suggest queries are limited to a single field.
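
Below is a rough sketch of how such a dynamic mapping could look; the template name all_strings and the exact field layout are examples, not the precise mapping from my application:

PUT http://localhost:9200/your_index_name/_mapping/your_type/
{
    "your_type": {
        "dynamic_templates": [
            {
                "all_strings": {
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "string",
                        "copy_to": ["did_you_mean", "autocomplete"]
                    }
                }
            }
        ],
        "properties": {
            "autocomplete": {
                "type": "string",
                "analyzer": "autocomplete"
            },
            "did_you_mean": {
                "type": "string",
                "analyzer": "didYouMean"
            }
        }
    }
}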

Search documents and use spell corrections

POST http://localhost:9200/your_index_name/your_type/
{
    "title": "Real sold my in call.",
    "description": "So by colonel hearted ferrars. Draw from upon here gone add one. He in sportsman household otherwise it perceived instantly. Is inquiry no he several excited am. Called though excuse length ye needed it he having. Whatever throwing we on resolved entrance together graceful. Mrs assured add private married removed believe did she. \r\n\r\nIs he staying arrival address earnest. To preference considered it themselves inquietude collecting estimating. View park for why gay knew face. Next than near to four so hand. Times so do he downs me would. Witty abode party her found quiet law. They door four bed fail now have. \r\n\r\nUnpacked reserved sir offering bed judgment may and quitting speaking. Is do be improved raptures offering required in replying raillery. Stairs ladies friend by in mutual an no. Mr hence chief he cause. Whole no doors on hoped. Mile tell if help they ye full name. "
}
POST http://localhost:9200/your_index_name/your_type/
{
    "title": "Use securing confined his shutters.",
    "description": "Delightful as he it acceptance an solicitude discretion reasonably. Carriage we husbands advanced an perceive greatest. Totally dearest expense on demesne ye he. Curiosity excellent commanded in me. Unpleasing impression themselves to at assistance acceptance my or. On consider laughter civility offended oh. \r\n\r\nAdd you viewing ten equally believe put. Separate families my on drawings do oh offended strictly elegance. Perceive jointure be mistress by jennings properly. An admiration at he discovered difficulty continuing. We in building removing possible suitable friendly on. Nay middleton him admitting consulted and behaviour son household. Recurred advanced he oh together entrance speedily suitable. Ready tried gay state fat could boy its among shall. "
}
POST http://localhost:9200/your_index_name/your_type/
{
    "title": "Sense child do state to defer mr of forty.",
    "description": " Become latter but nor abroad wisdom waited. Was delivered gentleman acuteness but daughters. In as of whole as match asked. Pleasure exertion put add entrance distance drawings. In equally matters showing greatly it as. Want name any wise are able park when. Saw vicinity judgment remember finished men throwing. \r\n\r\nFull he none no side. Uncommonly surrounded considered for him are its. It we is read good soon. My to considered delightful invitation announcing of no decisively boisterous. Did add dashwoods deficient man concluded additions resources. Or landlord packages overcame distance smallest in recurred. Wrong maids or be asked no on enjoy. Household few sometimes out attending described. Lain just fact four of am meet high. "
}
POST http://localhost:9200/your_index_name/your_type/
{
    "title": "His having within saw become ask passed misery giving. ",
    "description": "Recommend questions get too fulfilled. He fact in we case miss sake. Entrance be throwing he do blessing up. Hearts warmth in genius do garden advice mr it garret. Collected preserved are middleton dependent residence but him how. Handsome weddings yet mrs you has carriage packages. Preferred joy agreement put continual elsewhere delivered now. Mrs exercise felicity had men speaking met. Rich deal mrs part led pure will but. \r\n\r\nNeeded feebly dining oh talked wisdom oppose at. Applauded use attempted strangers now are middleton concluded had. It is tried no added purse shall no on truth. Pleased anxious or as in by viewing forbade minutes prevent. Too leave had those get being led weeks blind. Had men rose from down lady able. Its son him ferrars proceed six parlors. Her say projection age announcing decisively men. Few gay sir those green men timed downs widow chief. Prevailed remainder may propriety can and. "
}
POST http://localhost:9200/your_index_name/_search/
{
    "suggest": {
        "didYouMean": {
            "text": "test",
            "phrase": {
                "field": "did_you_mean"
            }
        }
    },
    "query": {
        "multi_match": {
            "query": "test",
            "fields": ["description", "title" ]
        }
    }
}

There is no word "test" anywhere in the index, so Elasticsearch finds nothing to return, but, as expected, the response contains a suggestion taken from the index:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  },
  "suggest": {
    "didYouMean": [
      {
        "text": "test",
        "offset": 0,
        "length": 4,
        "options": [
          {
            "text": "tell",
            "score": 0.08653918
          }
        ]
      }
    ]
  }
}
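
The application can then take the top suggestion, "tell" in this case, show it to the user as a "Did you mean" hint, or re-run the search with it; a sketch of the follow-up request:

POST http://localhost:9200/your_index_name/_search/
{
    "query": {
        "multi_match": {
            "query": "tell",
            "fields": ["description", "title"]
        }
    }
}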

Query autocomplete

Autocomplete in this solution is really a suggestion of what the user can search for, based on the index content. The idea is that Elasticsearch should "take" phrases which already exist in the index and suggest them to the user as he types.

POST http://localhost:9200/your_index_name/_search/
{
    "size": 0,
    "aggs": {
        "autocomplete": {
            "terms": {
                "field": "autocomplete",
                "order": {
                    "\_count": "desc"
                },
                "include": {
                    "pattern": "c.\*"
                }
            }
        }
    },
    "query": {
        "prefix": {
            "autocomplete": {
                "value": "c"
            }
        }
    }
}

This query works in the following way: the prefix query matches documents whose autocomplete field contains a term starting with "c", and the terms aggregation then counts how often each term matching the include pattern occurs. Because the autocomplete field is analyzed with the shingle filter, those terms are whole phrases as well as single words, and the most frequently used ones are what we suggest to the user. The response contains suggestions built from the index:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "autocomplete": {
      "doc_count_error_upper_bound": 1,
      "sum_other_doc_count": 112,
      "buckets": [
        {
          "key": "carriage",
          "doc_count": 2
        },
        {
          "key": "chief",
          "doc_count": 2
        },
        {
          "key": "concluded",
          "doc_count": 2
        },
        {
          "key": "call",
          "doc_count": 1
        },
        {
          "key": "called",
          "doc_count": 1
        },
        {
          "key": "called though",
          "doc_count": 1
        },
        {
          "key": "called though excuse",
          "doc_count": 1
        },
        {
          "key": "called though excuse length",
          "doc_count": 1
        },
        {
          "key": "called though excuse length ye",
          "doc_count": 1
        },
        {
          "key": "can",
          "doc_count": 1
        }
      ]
    }
  }
}
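
As the user keeps typing, the client only has to update the prefix value and the include pattern. For example, once the input grows to "ca", the request would look like this (same query, only the two values change):

POST http://localhost:9200/your_index_name/_search/
{
    "size": 0,
    "aggs": {
        "autocomplete": {
            "terms": {
                "field": "autocomplete",
                "order": {
                    "_count": "desc"
                },
                "include": {
                    "pattern": "ca.*"
                }
            }
        }
    },
    "query": {
        "prefix": {
            "autocomplete": {
                "value": "ca"
            }
        }
    }
}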

This solution is useful because, as the index grows, ES will provide smarter suggestions. Most importantly, searches built from these suggestions will actually return documents.

I took the idea of how to implement autocomplete from this blog.