Elasticsearch Autocomplete: A story worth telling
--
It all started when I tried to implement website-wide search for Spacejoy (ahh, an organization I work with). We could have built it on top of MongoDB, our main database, but for some good enough reasons we were inclined towards Elasticsearch/Solr. Let me explain why, and how we achieved it.
Problem Statement
All we needed was an autocomplete or autosuggest service that kicks in whenever a user starts typing, just like it works with Google, Amazon, Bing, etc.
Approach
I could use MongoDB, which is our main database, for auto-suggest; that would be much simpler to build & release. It scales well with increasing datasets and is good enough for limited keyword search. But wait, am I missing something important here?
There are a few points on which I'm having second thoughts:
- What if data increases exponentially over time?
- Is MongoDB a good choice for providing app-wide search?
- What if we are not limited to keywords only?
- What if we need a Full-Text-Search tomorrow?
- What if more than 1000k users are using this?
Let's look at something else, like Elasticsearch/Solr. They are good for scaling, made for better & more effective search, Full-Text-Search is one of their key features, & they are blazing fast.
We were more inclined towards Elasticsearch as we already had ELK in place, which we used for APM, log collection & monitoring. Extending it to provide additional search functionality looked like an easy job (at least we thought so, until we started implementing).
Search With Elastic
There can be various approaches to build autocomplete functionality in Elasticsearch. We will discuss the following approaches.
- Prefix Query
- Edge Ngram
- Completion Suggester
Prefix Query
This approach involves using a prefix query against a custom field. The value for this field can be stored as a keyword so that multiple terms (words) are stored together as a single term. This can be accomplished by using a keyword tokenizer. This approach has some disadvantages:
- Since the matching is supported only at the beginning of the term, one cannot match the query in the middle of the text.
- This type of query is not optimized for large datasets and may result in increased latency.
- Since this is a query, duplicate results won't be filtered out. One workaround is to use an aggregation query to group results and then filter out the duplicates. This involves a bit of processing on the server side, though.
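To make the first and third drawbacks concrete, here is a small pure-Python sketch that mimics prefix matching against keyword-tokenized values. It does not talk to Elasticsearch at all; the sample phrases and the helper names prefix_match and dedupe are made up for illustration.

```python
def prefix_match(terms, prefix):
    """Return the terms that start with `prefix`, keeping duplicates.

    Each term stands for a whole field value indexed as one token,
    which is what a keyword tokenizer produces.
    """
    return [term for term in terms if term.startswith(prefix)]


def dedupe(results):
    """Drop repeated results while preserving their original order."""
    return list(dict.fromkeys(results))


# Hypothetical indexed values; note the repeated phrase.
indexed = [
    "boho bedroom wall decor",
    "boho living room",
    "boho bedroom wall decor",
    "modern bedroom",
]

print(prefix_match(indexed, "boho b"))        # matches from the start of the value
print(prefix_match(indexed, "bedroom"))       # nothing: "bedroom" is never at the start
print(dedupe(prefix_match(indexed, "boho")))  # duplicates removed by the caller
```

The second call comes back empty because the whole phrase is a single term, and the third shows the kind of extra de-duplication step the application has to do itself.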
Edge Ngrams
This approach involves using different analyzers at index and search time. When indexing the document, a custom analyzer with an edge n-gram filter can be applied. At search time, a standard analyzer can be applied, which prevents the query itself from being split into n-grams.
Edge N-gram tokenizer first breaks the text down into words on custom characters (spaces, special characters, etc.) and then keeps only the n-grams that start at the beginning of each word.
Because every word contributes its own prefixes, this approach also works well for matching queries in the middle of the text. It is generally fast for queries but may result in slower indexing and larger index storage.
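A rough way to picture what ends up in the index is to generate the edge n-grams by hand. The sketch below is plain Python, not Elasticsearch's analyzer; the function name edge_ngrams, the lowercasing step, and the gram sizes are assumptions chosen for illustration.

```python
import re


def edge_ngrams(text, min_gram=1, max_gram=10):
    """Split `text` into words and emit each word's leading n-grams."""
    tokens = []
    # Split on anything that is not a letter or digit, loosely mirroring a
    # tokenizer configured to keep only letters and digits. Lowercasing is
    # an added assumption (it would be a separate lowercase filter).
    for word in re.findall(r"[A-Za-z0-9]+", text.lower()):
        longest = min(len(word), max_gram)
        for size in range(min_gram, longest + 1):
            tokens.append(word[:size])
    return tokens


print(edge_ngrams("Boho bedroom", min_gram=2, max_gram=4))
# ['bo', 'boh', 'boho', 'be', 'bed', 'bedr']
```

A search for "bed" stays a single token at query time, and since "bed" is one of the indexed tokens, a phrase with "bedroom" in the middle still matches. The token list also hints at why the index grows: each word is stored several times over.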
Completion Suggester
Elasticsearch ships with an in-house solution called Completion Suggester. It uses an in-memory data structure called a Finite State Transducer (FST). Elasticsearch stores FSTs on a per-segment basis, which means suggestions scale horizontally as more nodes are added.
Some of the things to keep in mind when implementing Completion Suggester:
- The autosuggest items should have completion as their field type.
- An input field can have various canonical or alias names for a single term.
- Weights can be defined with each document to control their ranking.
- Storing all the terms in lowercase helps in the case-insensitive match.
- Context suggesters can be enabled to support filtering or boosting by certain criteria.
For our use case, this turned out to be the best approach to implement autocomplete with Elasticsearch.
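The points about inputs, weights and lowercasing can be sketched as a tiny helper that prepares a document for a completion field. This is only an illustration: the field name suggest, the helper build_suggestion_doc, and the alias phrases are hypothetical and not part of our actual index.

```python
def build_suggestion_doc(canonical, aliases=(), weight=1):
    """Build a document body for a completion-typed field named `suggest`."""
    # Lowercase every input so lookups behave case-insensitively, and drop
    # duplicates while keeping the canonical name first.
    inputs = list(dict.fromkeys(s.lower() for s in (canonical, *aliases)))
    return {
        "suggest": {
            "input": inputs,
            "weight": weight,  # a higher weight ranks earlier among matches
        }
    }


doc = build_suggestion_doc(
    "Boho Bedroom Wall Decor",
    aliases=["Bohemian bedroom wall decor", "boho bedroom wall decor"],
    weight=10,
)
print(doc)
```

Typing either "boho" or "bohemian" would then lead to the same document, and the weight decides where it lands among the other suggestions.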
Implementation
We start by creating an index with some additional settings to enable suggestions. I wanted to create an index, seokeywoddesignmappings, with the following _doc schema.
{
  "keywordCategory": "Bedroom",
  "keywordName": "boho bedroom wall decor",
  "keywordSlug": "boho-bedroom wall decor",
  "designId": "5f9b063e2dbaaf001ccf3162",
  "keywordId": "5ff85cfa98fc8d2e75256f83",
  "isActive": true
}
Let's create the index with the following settings.
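The settings block itself is not reproduced here, so what follows is a reconstruction sketch rather than the exact body we used. It is written as a Python dict so it can be printed as JSON. Only the three sub-field names (keywordstring, edgengram, completion) come from this post; the analyzer and tokenizer names, the gram sizes, the types of the other fields, and the typeless mapping layout (Elasticsearch 7 or later) are all assumptions.

```python
import json

# Assumed index body: custom analyzers plus a multi-field mapping.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_ngram_tokenizer": {  # hypothetical name
                    "type": "edge_ngram",
                    "min_gram": 1,   # assumed gram sizes
                    "max_gram": 20,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "keyword_analyzer": {  # hypothetical name
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
                "edge_ngram_analyzer": {  # hypothetical name
                    "type": "custom",
                    "tokenizer": "edge_ngram_tokenizer",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "keywordCategory": {"type": "keyword"},
            "keywordName": {
                "type": "text",
                "fields": {
                    # One sub-field per autocomplete approach.
                    "keywordstring": {
                        "type": "text",
                        "analyzer": "keyword_analyzer",
                    },
                    "edgengram": {
                        "type": "text",
                        "analyzer": "edge_ngram_analyzer",
                        "search_analyzer": "standard",
                    },
                    "completion": {"type": "completion"},
                },
            },
            "keywordSlug": {"type": "keyword"},
            "designId": {"type": "keyword"},
            "keywordId": {"type": "keyword"},
            "isActive": {"type": "boolean"},
        }
    },
}

print(json.dumps(INDEX_BODY, indent=2))
```

The printed JSON is the kind of body one would send when creating the seokeywoddesignmappings index.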
If we look at the mapping, we will observe that keywordName is a multi-field that contains several sub-fields, each analyzed in a different way.
- Field keywordName.keywordstring is analyzed using a Keyword tokenizer, hence it will be used for the Prefix Query approach.
- Field keywordName.edgengram is analyzed using an Edge Ngram tokenizer, hence it will be used for the Edge Ngram approach.
- Field keywordName.completion is stored as a completion type, hence it will be used for Completion Suggester.
Testing
Let's test our implementation after feeding multiple data points into the index.
Prefix Query
{
  "query": {
    "prefix": {
      "keywordName.keywordstring": "th"
    }
  }
}
Edge Ngram
{
  "query": {
    "match": {
      "keywordName.edgengram": "bo"
    }
  }
}
Completion Suggester
{
  "suggest": {
    "keyword-suggest-fuzzy": {
      "prefix": "boho",
      "completion": {
        "field": "keywordName.completion",
        "fuzzy": {
          "fuzziness": 1
        },
        "skip_duplicates": true
      }
    }
  }
}
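On the application side, the suggestions have to be pulled out of the response before they can be shown in a dropdown. Here is a minimal sketch of that step. The helper name extract_suggestions is hypothetical, and the sample response is hand-written in the general shape of a suggest response; it is not output from our cluster.

```python
def extract_suggestions(response, suggester_name):
    """Collect the suggested texts from a completion-suggester response."""
    texts = []
    # Each suggester returns a list of entries, each with its own options.
    for entry in response.get("suggest", {}).get(suggester_name, []):
        for option in entry.get("options", []):
            texts.append(option["text"])
    return texts


# Hand-written sample for illustration only; not real output.
sample_response = {
    "suggest": {
        "keyword-suggest-fuzzy": [
            {
                "text": "boho",
                "offset": 0,
                "length": 4,
                "options": [
                    {"text": "boho bedroom wall decor", "_score": 3.0},
                    {"text": "boho living room", "_score": 3.0},
                ],
            }
        ]
    }
}

print(extract_suggestions(sample_response, "keyword-suggest-fuzzy"))
```

The helper returns an empty list when the suggester is missing from the response, so the caller can show "no suggestions" without extra checks.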