Full-Text Search is an Input-Oriented Task
UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.
Full-text search is an interesting field for backend developers: it requires raw technical knowledge, knowledge about the input languages and lots of feedback gathering on the usability side. In this article, I'd like to give you a rough overview of the challenges in each area and inform you about basic solutions for them.
I’ve been implementing search solutions using Elasticsearch for the last 2 years. One of the things I start off with is gathering as much information about the data I’m going to work with as soon as possible.
Elasticsearch beginners often lack this focus on the input data. Very often, they manage to get all their data into Elasticsearch and then try to fix everything on the way out, at query time. Instead, they should have worked through the whole chain, from input to output. In this post, I want to give a few pointers on things that can be implemented in an afternoon but save you days of headaches.
There are two dangers lurking here. The first is that the real world is complex, and clearing up that complexity is your first and foremost task. The second is that search rarely crashes when you make an error; it degrades in quality instead. Those errors will pop up over time, when you cannot find a document for some reason. We’ll try to make sure these aren’t standard cases.
Elasticsearch is based on Lucene, which implements a so-called inverted index. In contrast to key-value stores and, to a lesser extent, SQL databases, where the main way to access stored data is a (usually non-natural) key, inverted indices use the data itself as the key into the database. Following that line of thinking, it is pretty clear that messy input leads to messy retrieval. This means we should take a close look at the data first.
The first thing you should do before setting up a search project is get hold of some data. Don’t fake it (for example, by randomly generating text from a word list); use the real data. If your project has not produced any data yet, find text that matches your domain. Alternatively, if you intend to gather data later, build that part first.
Let’s have a look at some real-world data issues.
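To make the “data is the key” idea concrete, here is a minimal sketch of an inverted index in plain Python. It is a toy illustration, not how Lucene is implemented; the function name and the sample documents are made up for this example. Because it deliberately skips any cleanup, each spelling of a word ends up under its own key.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis: split on whitespace only, no lowercasing or cleanup.
        for token in text.split():
            index[token].add(doc_id)
    return index


# Hypothetical sample documents with inconsistent spelling.
docs = {
    1: "Café allongé",
    2: "cafe allonge",
    3: "CAFÉ menu",
}
index = build_inverted_index(docs)

# The tokens themselves are the keys, so every spelling gets its own entry.
print(sorted(index))

# A lookup only finds documents stored under exactly this spelling.
print(index.get("cafe"))  # {2}
```

Searching for “cafe” misses documents 1 and 3 even though a human would consider all three a match. Nothing crashed; the results are just quietly worse.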
Encodings are the most technical of issues when talking about text. Sadly, the situation here can still be summed up like this:
On encodings: Americans: “I don't want to know”; Japanese: “How can you not know?!”; Europeans: “Unicode solves everything”
— Yehuda Katz (1) May 17, 2010
For all Western languages, Unicode works well, though. The issue is less the split between Western and Eastern languages (which are very different) than the tendency of Western programmers not to care much about these issues, especially if they think the language they are working in is free of any special cases.
Case in point: (American) English. English writing is generally considered free of all those “nasty” things like diacritics. This assumption is naïve, especially in a language using loan words heavily. Some people might also just use fancy words like allongé in book titles - and suddenly you have to take care of that stuff anyway. No amount of struggling will help you, unless you are the BBC and can just ban usage of any diacritics. Also consider that dropping diacritics from names can be seen as an insult.
The good news is: Elasticsearch is focused on handling text and as such comes with all necessary things to handle Unicode text. There is one issue though: they come as a plugin called analysis-icu. ICU is a vast library, implementing all interesting operations for Unicode. Today, we will just skim the surface, by using three standard components:
icu_tokenizer: this is a tokenizer that breaks strings at everything Unicode considers a word boundary.
icu_folding: folding is what people often refer to as “lowercasing”. The difference is that not all characters have a lowercase equivalent or the mapping might be non-trivial. Don’t think too hard about it, this is the operation you want.
icu_normalizer: How do I compare “ü” to “u”? Don’t think too hard about that question either. Unicode characters have a “normal” form, fit for comparison. icu_normalizer will give that to you. (Answer: for the sake of comparison, they are considered the same).
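To get a feeling for what normalization and folding do, here is a rough sketch using Python's standard `unicodedata` module. It is not ICU and does not reproduce the plugin's exact behaviour; the helper `strip_marks` is a made-up name that only approximates the idea of folding.

```python
import unicodedata

composed = "\u00fc"     # "ü" as a single code point
decomposed = "u\u0308"  # "u" followed by a combining diaeresis

# Both render as "ü", but the raw strings differ.
print(composed == decomposed)  # False

# After normalizing to the same form, they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True


def strip_marks(text):
    """Rough folding: decompose, drop combining marks, then case-fold."""
    decomposed_text = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed_text if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


print(strip_marks("Allongé"))                # allonge
print(strip_marks("ü") == strip_marks("u"))  # True
```

The first half shows why a “normal” form matters: two strings that look identical on screen can differ byte for byte. The second half shows the more aggressive step of treating “ü” and “u” as equal for comparison.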
All in all, don’t fear Unicode. If you are handling text, consider it the default. Install the ICU plugin first, start using it right away and don’t look back. It makes standard Unicode issues a no-brainer and will not be any worse than the standard implementations. If in doubt, use it.
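As a starting point, the three components can be wired into a custom analyzer. The following is a minimal sketch of index settings, written as a Python dictionary that serializes to the JSON body you would send when creating an index. It assumes the analysis-icu plugin is installed; the analyzer name `icu_text` is hypothetical, and the component names are the ones the plugin registers (`icu_tokenizer`, `icu_normalizer`, `icu_folding`).

```python
import json

# Assumption: the analysis-icu plugin is installed on the cluster.
# "icu_text" is a hypothetical analyzer name chosen for this example.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "icu_text": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["icu_normalizer", "icu_folding"],
                }
            }
        }
    }
}

# Print the JSON body for an index-creation request.
print(json.dumps(index_settings, indent=2))
```

Use the same analyzer at index time and at search time, so that documents and queries go through identical processing.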
Now, with those pesky encodings out of the way, let’s have a look at input languages. Even if we don’t have any problems encoding them, they all come with their own particularities. All of these have to be dealt with to create a search that people want to use. One of the problems here is that your indexed input is usually cleanly structured and thought out, while search input is rather ad hoc, and we still need to match the two. Let’s look at a few examples:
- French and English have elisions: “j’aime” (I love), “it’s”, people rarely enter them as search phrases
- Germans can use “ue” for input when they mean “ü” (e.g. because they use a foreign keyboard)
- Germans can combine words almost arbitrarily to form a new one.
- Other languages have that in a more constrained fashion, consider “football”. People might search for “ball” and expect to find it.
- English has two major spellings: American and British (“Consistent English spelling is a question of honor and labour”)
From a user’s perspective, it is reasonable to expect that your search function handles those things. As with all things to do with search, a 100% solution is extremely hard or even impossible. But once you have found such a problem statement, a pleasing solution is often easy to find.
- Elisions can be cleaned using the elision token filter
- The german2 stemmer variant treats “ue” as “ü” in most cases, at the cost of breaking a very small number of words.
- The elasticsearch-analysis-german plugin ships an algorithmic decompounder usable in German, turning words back into subwords. For easier languages, the compound token filter can help you.
- For a known set of spelling variants, try to get hold of a word list from a free source (e.g. Wikipedia) and feed it to a synonym filter
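The built-in parts of the list above could be configured roughly like this. This is a sketch under assumptions, not a tuned setup: all filter and analyzer names on the left-hand side are hypothetical, the article, synonym and word lists are tiny placeholders, and the German decompounder plugin is left out because its configuration is not covered here. The filter types used (`elision`, `stemmer` with `german2`, `synonym`, `dictionary_decompounder`) are standard Elasticsearch token filters.

```python
import json

# All names on the left-hand side are hypothetical; the lists are placeholders.
analysis = {
    "filter": {
        "french_elision": {
            "type": "elision",
            "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c"],
        },
        "german_stemmer": {"type": "stemmer", "language": "german2"},
        "spelling_variants": {
            "type": "synonym",
            "synonyms": ["honour, honor", "labour, labor"],
        },
        "compounds": {
            "type": "dictionary_decompounder",
            "word_list": ["foot", "ball"],
        },
    },
    "analyzer": {
        # Lowercase first, so the elision filter also matches capitalized forms.
        "french_text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "french_elision"],
        },
        "german_text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "german_stemmer"],
        },
        "english_text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "spelling_variants", "compounds"],
        },
    },
}

print(json.dumps({"settings": {"analysis": analysis}}, indent=2))
```

Each analyzer can then be checked against your nasty examples with the `_analyze` API before any document is indexed.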
Finding and dealing with these particularities will improve your search functionality. The interesting thing here is that a lot of these issues need understanding of the input language, especially when validating. Don’t hesitate to ask a native speaker for help! Also: start with the nasty examples first.
The Elasticsearch guide has a great article on these particularities.
“We want great search functionality, can Elasticsearch do that?”
Sure, it can, but only if you define what “great” is. And this is where things become interesting: there is no single definition of “relevant”. For example, if the search is for restaurants, I can match on names and dishes as much as I want; if the restaurant is on the other side of the country, it’s probably not relevant. Distance is an important criterion there. On the other hand, if we are talking about a specialised website for three-star restaurants, people will probably be ready to travel for that. France has 26 of them, so anything but the broadest location filter by name will not add a lot to the user experience. Shops need important products near the top. TV production company employees will search for the internal three-letter codes of their shows, while TV viewers will search for a popular abbreviation of the same shows. In some cases, a narrow set of results is wanted, while broad results are preferred elsewhere. Criteria might work against each other: putting too much weight on the recency of a document, for example, can lead to new documents pushing aside older documents whose text matches better.
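The tension between text match and recency can be shown with a toy scoring function. This is purely illustrative and not how Elasticsearch computes scores; the formula, the weights, the half-life and the two sample documents are all assumptions made up for this sketch.

```python
def combined_score(text_score, age_days, recency_weight, half_life_days=30.0):
    """Add a recency bonus that halves every half_life_days to a text score."""
    recency = 0.5 ** (age_days / half_life_days)
    return text_score + recency_weight * recency


# Hypothetical documents: one matches the text well, the other is just new.
docs = {
    "old but precise": {"text_score": 3.0, "age_days": 365},
    "new but vague": {"text_score": 1.0, "age_days": 1},
}


def ranking(recency_weight):
    """Return document names, best first, for a given recency weight."""
    return sorted(
        docs,
        key=lambda name: combined_score(
            docs[name]["text_score"], docs[name]["age_days"], recency_weight
        ),
        reverse=True,
    )


print(ranking(0.5))  # ['old but precise', 'new but vague']
print(ranking(5.0))  # ['new but vague', 'old but precise']
```

Neither ranking is wrong in a technical sense. Which one is right depends on what your users consider relevant, and that is not something the code can decide.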
The most important part of implementing search with relevant results is realizing that “relevancy” is not a technical term. It can be modelled using technology later, but it is a business and ultimately a management task to define what “relevant” is. This is highly dependent on what the data in question represents and what people are interested in. Receiving feedback from many channels is important. Finding strategies is as important as understanding the technical parts. Here are some suggestions:
- Keep a list of all queries that were problematic or that illustrate well what you want. Don’t just put them in tickets; keep them in one central place, such as a wiki page or a shared spreadsheet, and annotate them with your expectations. Be aggressive about adding things to that list.
- Consider building an interface that allows you to play with the parameters of your queries. Build it mirrored: the left hand side shows the current parameters, the right hand side allows you to play and compare. This allows you to get someone non-technical over to your desk and play for a few minutes. Use that regularly.
- And most importantly: Log all your queries. Find popular ones. Check their results. See whether they fit your expectations.
- Interview people about what they would like to know about your data.
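Finding popular queries does not need much tooling to begin with. Here is a minimal sketch that counts queries from a log, assuming you already have one query string per entry; the function name and the sample log are hypothetical.

```python
from collections import Counter


def popular_queries(log_lines, top_n=3):
    """Return the top_n most frequent queries as (query, count) pairs."""
    # Light normalization so "Pizza " and "pizza" count as the same query.
    normalized = (" ".join(line.casefold().split()) for line in log_lines)
    counts = Counter(query for query in normalized if query)
    return counts.most_common(top_n)


# Hypothetical log entries, one raw query string each.
log = ["pizza", "Pizza ", "sushi berlin", "pizza", "sushi  Berlin", "ramen"]

print(popular_queries(log, top_n=2))  # [('pizza', 3), ('sushi berlin', 2)]
```

Run the top entries against your search regularly and compare the results with the expectations on your central list.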
Your relevancy model is the core of your product. Sometimes, it’s simple. At other times, it’s a full-time job. Try to keep the feedback wheel turning.
All this shows why I love working in this field so much. It comes down to a very diverse and interesting set of tasks.
First is the die-hard technical component: how do I work with the input structure? What are the pitfalls (encoding, structural issues)? How does this all work in detail?
Then we have the required knowledge beyond the technical: what particularities do the input languages have? Are there multiple? What makes them special?
And finally we have the task of interacting with others to find out what they need and want to later put that into writing. Interviewing people or having them find new stuff is a very pleasant break from coding.
Elasticsearch supports you with all those tiny details.
And with that, I’d like to wish you a great holiday and a happy new year!