ElasticSearch AutoComplete: A story worth telling 🔥
It all started when I tried to implement a website-wide search functionality for Spacejoy (🥱 Ahh.. an organization I work with). We could have built it on top of our main database, MongoDB, but for some good reasons we were inclined towards Elasticsearch/Solr. Let me explain why, and how we achieved it.
All we needed was an autocomplete or autosuggest service that kicks in whenever a user starts typing, just like it works with Google, Amazon, Bing, etc.
I could use MongoDB for auto-suggest. It is our main database, so this would be much simpler to build and release; it scales well with increasing datasets and is good enough for limited keyword search. But wait, am I missing something important here 🤔…
There are a few points on which I’m having second thoughts:
- What if data increases exponentially over time?
- Is MongoDB a good choice for providing app-wide search?
- What if we are not limited to keywords only?
- What if we need a Full-Text-Search tomorrow?
- What if more than 1000k users are using this?
Let’s look at something else, like Elasticsearch/Solr. They scale well, are built for effective search, offer Full-Text-Search as one of their key features, and are blazing fast.
We were more inclined towards Elasticsearch because we already had ELK in place, which we used for APM, log collection, and monitoring. Extending it to provide additional search functionality looked like an easy job (at least we thought so… until we started implementing).
Search With Elastic ☄️
There can be various approaches to build autocomplete functionality in Elasticsearch. We will discuss the following approaches.
- Prefix Query
- Edge Ngram
- Completion Suggester
The first approach, Prefix Query, involves using a prefix query against a custom field. The value for this field can be stored as a keyword so that multiple terms (words) are stored together as a single term. This can be accomplished by using a keyword tokenizer. This approach has some disadvantages:
- Since the matching is supported only at the beginning of the term, one cannot match the query in the middle of the text.
- This type of query is not optimized for large datasets and may result in increased latency.
- Since this is a query, duplicate results won’t be filtered out. One workaround is to use an aggregation query to group the results and then filter out the duplicates. This involves a bit of extra processing on the server side, though.
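To make the prefix approach concrete, here is a minimal Python sketch that only builds the request body and removes duplicates on the application side. It is not the setup described in this post: the helper names are made up for illustration, the field name `keywordName.keywordstring` is borrowed from the mapping discussed later, the lowercasing assumes the field was indexed in lowercase, and the sample hits are invented. The de-duplication here is a plain in-application pass, a simpler stand-in for the aggregation workaround mentioned above.

```python
import json


def build_prefix_query(field, typed, size=10):
    """Build a search body matching documents whose `field` starts with `typed`."""
    return {
        "size": size,
        # Prefix queries are term-level, so the typed text is lowercased here
        # on the assumption that the field was indexed in lowercase.
        "query": {"prefix": {field: {"value": typed.lower()}}},
    }


def dedupe_hits(hits, field):
    """Return the unique values of `field` from search hits, in first-seen order."""
    seen = set()
    unique = []
    for hit in hits:
        value = hit["_source"][field]
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique


body = build_prefix_query("keywordName.keywordstring", "Boho bed")
print(json.dumps(body, indent=2))

# Invented hits, shaped like the "hits" array of a search response.
sample_hits = [
    {"_source": {"keywordName": "boho bedroom wall decor"}},
    {"_source": {"keywordName": "boho bedroom ideas"}},
    {"_source": {"keywordName": "boho bedroom wall decor"}},
]
print(dedupe_hits(sample_hits, "keywordName"))
```

The body would be sent to the index's `_search` endpoint; the duplicate entry is dropped while the original ranking order is preserved.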
The second approach, Edge Ngram, involves using different analyzers at index and search time. When indexing the document, a custom analyzer with an edge n-gram filter can be applied. At search time, a standard analyzer can be applied, which prevents the query itself from being split into n-grams.
The edge n-gram tokenizer first breaks the text down into words on custom characters (space, special characters, etc.) and then keeps only the n-grams anchored at the start of each word.
This approach also works well for matching queries in the middle of the text, because every word contributes its own prefixes. It is generally fast at query time but may result in slower indexing and larger index storage.
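To see why edge n-grams allow matches in the middle of a phrase, the tokenization can be imitated in a few lines of Python. This is a simplified simulation, not Elasticsearch's implementation: the function name and the `min_gram`/`max_gram` defaults are chosen for illustration, and only ASCII letters and digits are treated as word characters.

```python
import re


def edge_ngrams(text, min_gram=2, max_gram=5):
    """Return the prefixes (edge n-grams) of every word in `text`."""
    tokens = []
    # Split into lowercase words on anything that is not a letter or digit.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        longest = min(max_gram, len(word))
        for n in range(min_gram, longest + 1):
            tokens.append(word[:n])  # always anchored at the start of the word
    return tokens


indexed = edge_ngrams("boho bedroom wall decor")
print(indexed)
# ['bo', 'boh', 'boho', 'be', 'bed', 'bedr', 'bedro',
#  'wa', 'wal', 'wall', 'de', 'dec', 'deco', 'decor']

print("bed" in indexed)   # True: "bedroom" is the second word, yet its prefix is indexed
print("room" in indexed)  # False: n-grams never start in the middle of a word
```

The long token list for a four-word phrase also hints at where the extra indexing time and storage come from.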
The third approach uses Completion Suggester, a solution that ships with Elasticsearch. It uses an in-memory data structure called a Finite State Transducer (FST). Elasticsearch stores the FST on a per-segment basis, which means suggestions scale horizontally as more nodes are added.
Some of the things to keep in mind when implementing Completion Suggester:
- The autosuggest items should have completion as their field type.
- An input field can have various canonical or alias names for a single term.
- Weights can be defined with each document to control their ranking.
- Storing all the terms in lowercase helps with case-insensitive matching.
- Context suggesters can be enabled to support filtering or boosting by certain criteria.
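The points above can be sketched in Python as plain request bodies. Everything named here is an assumption for illustration: the helper functions, the dedicated `suggest` field, the alias, the weight value, and the suggestion name `keyword-suggest` are not taken from the original setup. The `input`, `weight`, `prefix`, and `skip_duplicates` keys follow the Completion Suggester request format.

```python
import json


def make_suggest_entry(term, aliases=(), weight=1):
    """Build the value for a completion field: lowercase inputs plus a ranking weight."""
    inputs = [term.lower()] + [alias.lower() for alias in aliases]
    return {"input": inputs, "weight": weight}


def build_suggest_request(field, typed, size=5):
    """Build a suggest body asking for completions of the text typed so far."""
    return {
        "suggest": {
            "keyword-suggest": {  # arbitrary label for this suggestion
                "prefix": typed.lower(),
                "completion": {
                    "field": field,
                    "size": size,
                    "skip_duplicates": True,  # drop suggestions with identical text
                },
            }
        }
    }


# Hypothetical document with a dedicated completion field named "suggest".
doc = {
    "keywordName": "Boho Bedroom Wall Decor",
    "suggest": make_suggest_entry(
        "Boho Bedroom Wall Decor",
        aliases=["Bohemian Bedroom Wall Decor"],
        weight=10,
    ),
}
print(json.dumps(doc, indent=2))
print(json.dumps(build_suggest_request("suggest", "Boho b"), indent=2))
```

A higher weight pushes a suggestion up the list, and the alias lets the same document surface for a user who types "bohemian".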
This is the best approach to implement autocomplete with Elasticsearch.
We start by creating an index with some additional settings to enable suggestions. I wanted to create an index named seokeywoddesignmappings whose documents have the following fields:
"keywordName": "boho bedroom wall decor",
"keywordSlug": "boho-bedroom wall decor",
Let’s create the index with the following settings
If we look at the mapping, we will observe that keywordName contains several sub-fields, each analyzed in a different way.
- keywordName.keywordstring is analyzed using a Keyword tokenizer, hence it will be used for the Prefix Query approach.
- keywordName.edgengram is analyzed using an Edge Ngram tokenizer, hence it will be used for the Edge Ngram approach.
- keywordName.completion is stored as a completion type, hence it will be used for Completion Suggester.
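For reference, an index body with these three sub-fields could look roughly like the Python dictionary below. This is a sketch under assumptions, not the original settings: the analyzer and tokenizer names, the `min_gram`/`max_gram` values, and the `keyword` type for keywordSlug are made up, and the typeless `mappings` layout assumes Elasticsearch 7 or later. Only the index name and the field names come from the post.

```python
import json

INDEX_NAME = "seokeywoddesignmappings"

index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # Emits prefixes of each word; gram sizes are illustrative.
                "edge_ngram_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                # Keeps the whole phrase as a single lowercase term.
                "keyword_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "edge_ngram_tokenizer",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "keywordName": {
                "type": "text",
                "fields": {
                    "keywordstring": {"type": "text", "analyzer": "keyword_analyzer"},
                    "edgengram": {
                        "type": "text",
                        "analyzer": "edge_ngram_analyzer",
                        # The query text is not split into n-grams at search time.
                        "search_analyzer": "standard",
                    },
                    "completion": {"type": "completion"},
                },
            },
            "keywordSlug": {"type": "keyword"},
        }
    },
}

# This body would be sent with: PUT /seokeywoddesignmappings
print(json.dumps(index_body, indent=2))
```

Keeping all three variants as sub-fields of one source field means a document is indexed once and can be queried with any of the three approaches.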
Let’s test our implementation after feeding multiple data points into the index.
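One way to try this out without any client library is to build the HTTP request by hand, as in the sketch below. The base URL `http://localhost:9200` and the helper name are assumptions, and the query shown is a simple `match` against the edge n-gram sub-field; the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_search_request(base_url, index, body):
    """Build a POST request for the index's _search endpoint."""
    url = f"{base_url.rstrip('/')}/{index}/_search"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Match the typed text against the edge n-gram sub-field.
edge_ngram_query = {"query": {"match": {"keywordName.edgengram": "bed"}}}

request = build_search_request(
    "http://localhost:9200",  # assumed local cluster address
    "seokeywoddesignmappings",
    edge_ngram_query,
)
print(request.get_method(), request.full_url)
# POST http://localhost:9200/seokeywoddesignmappings/_search

# With a cluster running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```

Swapping in a prefix query or a suggest body only changes the dictionary passed as `body`; the endpoint stays the same.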