How to overcome some of the toughest site-search design challenges

Introduction

Designing an effective site-search for your eCommerce store is no mean feat. Creating a search engine that truly understands your customers and customises their results can lead to several technical challenges. But how should you go about overcoming them?

Challenges in natural language processing

Stemming, partial matching, inverse partial matching, and misspellings are four of the problems to consider in the field of natural language processing. Let's look at each.

1. Stemming

WHAT IT IS AND WHY IT'S IMPORTANT

Across languages, words have different variations depending on factors like tense, case, the number of items being referred to, and many other grammatical constructs.

However, most of these variations share a common word stem. The word 'print', for instance, is the stem of words like 'prints', 'printer', 'printing', 'printable', and so on. In many Indo-European languages, the stem comes first and the suffix that follows determines the meaning of the word. This is not always the case, however: sometimes it's a prefix attached to the stem that determines meaning.

Naturally, you can't predict exactly which words your customers will use to search your website. A user looking for a printer might search using any of the words mentioned above. This means your site-search needs to be smart enough to discern what the user is looking for – by going beyond just recognising the words they've typed.

This is where stemming comes in. In natural language processing (NLP), stemming is the act of identifying the stem of the words used in a search query and extrapolating from that which 'family' these words belong to.

WHY IS IT HARD TO SOLVE?

Although in most cases words are regular and share an identical stem, there are numerous exceptions.

Sometimes words are spelt differently when tenses change (run/ran), or a stem can feature prefixes and suffixes (friend/befriend/friendship). In some devious cases, attaching a common suffix to a stem word can dramatically change its meaning (head/heading).

All of these variations can easily confuse your site-search.

HOW TO SOLVE IT

The most commonly used approach to solving this issue is simplistic. It involves statistically compiling a list of common word suffixes, stripping these suffixes from all the words in which they're found, and indexing the resulting stripped-down versions of each word.

This solution works fairly well, especially if you only strip a suffix when what remains is a fairly common stem word. Most open-source search engines and commercial software use this method.

However, this strategy does not account for the exceptions mentioned above, and it means you need to keep a list of word stems for each language the search engine has to operate with.

Still, coupled with fuzzy and partial matching, this method allows your site-search to interpret most word variations.
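The suffix-stripping approach described above can be sketched roughly as follows. This is a minimal illustration, not how any particular search engine implements stemming: the suffix list, the set of known stems, and the function name `naive_stem` are all assumptions made up for the example.

```python
# Hypothetical suffix list and stem vocabulary, for illustration only.
# A real system would compile these statistically, per language.
SUFFIXES = ("ing", "able", "ers", "er", "s")
KNOWN_STEMS = {"print", "head", "friend"}


def naive_stem(word: str) -> str:
    """Strip the longest matching suffix, but only if a known stem remains."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in KNOWN_STEMS:
                return candidate
    # No suffix produced a known stem, so index the word unchanged.
    return word


print(naive_stem("printing"))  # print
print(naive_stem("ran"))       # ran (irregular form, not resolved to 'run')
print(naive_stem("heading"))   # head (conflated, although the meaning differs)
```

The last two calls show the exceptions mentioned earlier: an irregular form is left untouched, and 'heading' is folded into 'head' even though the two mean different things.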

2. Partial matching

WHAT IT IS AND WHY IT'S IMPORTANT

Many of the most commonly used words are compound words. Terms such as 'meatball' or 'makeup' work by fusing two words together as an effective, efficient, and easily-understood means of labelling new things. However, compound words can confuse search engines.

Partial matching will allow your site-search to consider products a match so long as part of the keyword attached to the item's index appears in the search query. It seems confusing, but it's actually rather simple. Partial matching just means that if a user searches for 'meat', they'll find 'meatballs' and 'meatloaf' in their search results too.

WHY IS IT HARD TO SOLVE?

Depending on the language and context, compound words can be constructed in different ways:
Words can be composed with or without spaces ('makeup' vs 'make up').
Descriptive words sometimes come first ('starfish' is a 'fish' shaped like a 'star'), and sometimes not ('flashback' is a 'flash' that goes 'back' to the past).
Nouns can be compounded with both nouns ('tablespoon') and verbs ('telltale').
Sometimes the compound word means something only vaguely related to its components (a 'skylight' is neither a sky nor a light).
Some words are compounds of more than two words ('counterclockwise').
Sometimes we omit some letters when we compose a word ('smoke' + 'fog' = 'smog', or 'breakfast' + 'lunch' = 'brunch').

Getting partial matching right is vital for delivering useful results.

HOW TO SOLVE IT

Because compound word composition varies so much between languages and context, it's very hard to create an algorithm that solves every case without introducing unwanted side-effects. What you can do is break compound words down into component parts, using statistical data to find which word parts are commonly used, and index these parts separately so your search engine can retrieve them. This is known as partial matching.

The main problem with this solution is that it can create some false positives in the indices. Not all words are compounds just because they contain other words: a 'catalogue' is not a 'cat' or a 'log', and a 'scapegoat' is not a 'cape' or a 'goat'.

However, partial matching, when fine-tuned and combined with stemming, is a decent solution for handling compound words in search queries.
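As a rough sketch of the indexing side of partial matching, the example below splits a compound into two parts only when both parts appear in a vocabulary of common word parts, and indexes those parts alongside the full word. The vocabulary `COMMON_PARTS` and the function names are hypothetical; in practice the parts would come from statistics over your own catalogue.

```python
# Hypothetical vocabulary of common word parts, for illustration only.
COMMON_PARTS = {"meat", "ball", "balls", "loaf", "tool", "box"}


def split_compound(word: str) -> list[str]:
    """Return [head, tail] if the word splits into two known parts, else []."""
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in COMMON_PARTS and tail in COMMON_PARTS:
            return [head, tail]
    return []


def index_terms(word: str) -> set[str]:
    """Terms to store in the index: the word itself plus any known parts."""
    return {word, *split_compound(word)}


print(sorted(index_terms("meatballs")))  # ['balls', 'meat', 'meatballs']
print(sorted(index_terms("catalogue")))  # ['catalogue']
```

Requiring every part to be a known word is one way to limit false positives such as 'catalogue'; a looser substring match would find more compounds but also more noise.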

3. Inverse partial matching

WHAT IT IS AND WHY IT'S IMPORTANT

Inverse partial matching is needed when a user types a word that is a compound of the term they're looking for. For instance, a user might search for 'toolbox', but the product is indexed as a 'work box' or 'box for tools' in the search index. In this case, the search engine needs to break apart the customer’s query and search for the component parts separately.

WHY IS IT HARD TO SOLVE?

This carries with it the same challenges as normal partial matching, only it's even trickier because the corpus (collection of words) is not controlled and the query might contain previously unseen words or combinations.

HOW TO SOLVE IT

You can apply mostly the same methods as discussed for partial matching, but you need to be careful not to overdo it. Many languages contain short, common words such as 'a' and 'do', and you don’t want to break up queries too severely. 'Doorknob' shouldn't be broken into 'do', 'or' and 'knob', for example.
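One way to avoid over-splitting is to enforce a minimum part length on the query side. The sketch below assumes a hypothetical `VOCABULARY` and an arbitrary cut-off of three characters; both would need tuning for a real catalogue and language.

```python
MIN_PART_LENGTH = 3  # assumed cut-off so short words like 'do' and 'or' are never split out
# Hypothetical vocabulary, for illustration only.
VOCABULARY = {"tool", "box", "door", "knob", "do", "or", "work"}


def split_query(term: str) -> list[str]:
    """Split a query term into two known parts, or leave it whole."""
    for i in range(MIN_PART_LENGTH, len(term) - MIN_PART_LENGTH + 1):
        head, tail = term[:i], term[i:]
        if head in VOCABULARY and tail in VOCABULARY:
            return [head, tail]
    return [term]


print(split_query("toolbox"))   # ['tool', 'box']
print(split_query("doorknob"))  # ['door', 'knob']
print(split_query("doormat"))   # ['doormat']
```

With the cut-off in place, 'doorknob' is split into 'door' and 'knob' rather than 'do', 'or', and 'knob', and unknown combinations are searched as typed.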

4. Misspellings (fuzzy matching)

WHAT IT IS AND WHY IT'S IMPORTANT

25% of site-search queries are misspelt, and most of the time users aren't even aware that they've made a mistake. But they will take notice when your site-search doesn't fix the problem – 73% of users will leave your site within 2 minutes if they can't find what they're looking for.

It's vital to ensure your site-search can handle misspelling to avoid losing customers.

WHY IS IT HARD TO SOLVE?

Whether it's just a slip of the hand or a genuine error, everyone makes spelling mistakes from time to time.

In some cases, there are multiple valid spellings for the same word which can easily trip up search engines ('jewelry' and 'jewellery'). There are also occasions when users spell a term correctly, but the word is misspelt in the product catalogue instead. Handling misspellings can be trickier than it first seems.

HOW TO SOLVE IT

Most search engines provide results for misspelt queries by implementing some kind of fuzzy text matching, often utilising a variation on the Levenshtein distance algorithm. This algorithm can be used to calculate how much the typed query differs from a given word or phrase in the catalogue, and thus order results by their proximity to the typed query. For instance, the words “cool” and “book” have a distance of 2, since you need to replace the “c” with “b”, and then replace the “l” with “k”.

Combined with other NLP methods, such as stemming and partial matching, this can be a very effective way of delivering relevant results from misspelt queries. However, it requires careful tuning of parameters, such as what kinds of modifications should increase the distance and where the distance should cut off when it's grown too large. It can also be quite CPU intensive if not implemented or configured correctly.

Even better results can be achieved if Levenshtein distance is augmented with information about statistically common misspellings or QWERTY-aware weighting. Of course, most of these would be language-specific and might not work for all demographics.
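For reference, a plain Levenshtein distance with a cut-off can be sketched as below. This is the textbook dynamic-programming version with every edit costing 1; the `rank_by_distance` helper and its `max_distance` default are assumptions for illustration, and the sketch does not include QWERTY-aware or statistical weighting.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def rank_by_distance(query: str, candidates: list[str], max_distance: int = 2) -> list[str]:
    """Order candidates by distance to the query, dropping those past the cut-off."""
    scored = sorted((levenshtein(query, c), c) for c in candidates)
    return [c for distance, c in scored if distance <= max_distance]


print(levenshtein("cool", "book"))  # 2
print(rank_by_distance("printr", ["printer", "print", "monitor"]))  # ['print', 'printer']
```

Comparing the query against every catalogue term like this is the CPU-intensive part; real engines typically restrict the candidate set before computing distances.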

Contextual challenges

Context is everything when it comes to determining the intention behind the language we use, and, for humans, this process is instinctual. However, search engines must be carefully tuned to understand contexts and present applicable results.

1. Synonymity

WHAT IT IS AND WHY IT'S IMPORTANT

It's tricky to predict exactly what words users will use to describe a particular product when searching your site. Some might search for a 'jumper' while others might enter 'sweatshirts'. Anyone who has worked with product inventory or ad keyword management knows that you need to carefully consider which words could be used to describe your products or services.

Conversely, most users won't think twice about what words they use to search – and if your site-search can't understand synonyms, your customers will go elsewhere.

WHY IS IT HARD TO SOLVE?

Synonyms are mostly not related to each other in any way except that they mean the same thing – the word 'cream' is very different from 'lotion' when compared letter for letter.

There's also the added complexity that comes from synonymity not being universal. Some words can mean the same thing in a given context, but not in another. ‘Lotion’ is not a good synonym for ‘cream’ when the context is food, for example. As with most challenges in natural language processing, synonyms are very context-specific.

HOW TO SOLVE IT

Traditionally, this is solved by entering all known synonyms manually, as searchable keywords on the product information entries themselves, in a purposely built synonym management system in the search engine – or in the product information management (PIM) system. However, this process is very labour-intensive and prone to human error.

The best way to decide which synonyms to add to this list is to pay attention to common null results in the search engine and manually check whether adding a synonym would help. If you keep at it, you'll be able to achieve good synonym coverage for a small percentage of all the searches made by users.

Many open-source search engines have dedicated systems for adding synonyms, but they can be rather hard to configure. For instance, you need to decide the 'directionality' of the synonym. Do you want 'car' to translate to 'automobile', or the other way around? Or both? Or are there many words that you want to collapse into a common 'base' word?

It can also be tricky to implement multi-word synonyms correctly, especially if you want any of the words in the phrase to have synonyms too. For instance, setting 'car' as a synonym for 'automobile' while also setting 'car mirror' as synonymous with 'rear-view mirror' can really mess with your site-search's logic.
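To make the directionality question concrete, here is a small sketch of query expansion with one-way and two-way rules. The rule tables and the `expand` function are hypothetical and do not mirror the configuration format of any specific search engine; multi-word synonyms are deliberately left out.

```python
# Hypothetical synonym rules, for illustration only.
# One-way: a search for the key also searches the listed terms, but not vice versa.
ONE_WAY = {"automobile": ["car"]}
# Two-way: every term in a group expands to all the others.
TWO_WAY = [{"jumper", "sweatshirt"}]


def expand(term: str) -> set[str]:
    """Return the set of terms to search for, given a single query term."""
    expanded = {term}
    expanded.update(ONE_WAY.get(term, []))
    for group in TWO_WAY:
        if term in group:
            expanded |= group
    return expanded


print(sorted(expand("automobile")))  # ['automobile', 'car']
print(sorted(expand("car")))         # ['car']
print(sorted(expand("jumper")))      # ['jumper', 'sweatshirt']
```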

A far more elegant solution to this problem is to utilise machine learning algorithms that learn synonyms automatically based on user behaviour, though very few search engines offer this solution.

2. Polysemy

WHAT IT IS AND WHY IT'S IMPORTANT

Polysemic words are the opposite of synonyms – they have multiple meanings. For example, the word 'squash' can be used to refer to a vegetable, a sport, a fruit drink, and so on, making it a highly polysemic word. Understanding polysemy can be an important factor in determining user intent.

It might not be much of a concern for smaller e-stores, but e-commerce sites with a wide catalogue of products will find that the same word can be used to describe a variety of different products. Successfully determining what particular meaning is intended by a user's search term can mean the difference between a sale and a lost opportunity.

WHY IS IT HARD TO SOLVE?

Most search engines work on a text, or 'token', basis. This means that words are the most atomic unit in the engine and all documents are defined in terms of the words they contain. Therefore, from the search engine’s point of view, two products that contain the same word are identical when it comes to calculating their relevance to that search term.

Meaning is not encoded in the word itself, which is the only information a search engine has at its disposal. In order to discern different meanings of the same word, search engines need to understand contexts, which is difficult to encode in a software model.

HOW TO SOLVE IT

Most search engines don't offer a fix for the problem of polysemic words; instead, they display all matching products. A small number of smarter site-search solutions are able to account for users' search behaviour to learn the most popular products for a given search query and display those products first in the results.

However, this assumes that all users have the same intent, which we know isn't always true.
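The behaviour-based ordering described above could look something like the following sketch, which counts clicks per query and sorts matching products by that count. The in-memory `clicks` store and the function names are assumptions for illustration; a production system would persist and decay this data.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory click log: query -> product id -> number of clicks.
clicks: defaultdict[str, Counter] = defaultdict(Counter)


def record_click(query: str, product_id: str) -> None:
    """Remember that a user clicked this product after this query."""
    clicks[query][product_id] += 1


def rank(query: str, matches: list[str]) -> list[str]:
    """Show the products most often clicked for this query first."""
    counts = clicks.get(query, Counter())
    # sorted() is stable, so products with equal counts keep their original order.
    return sorted(matches, key=lambda product_id: -counts[product_id])


record_click("squash", "racket")
record_click("squash", "racket")
record_click("squash", "pumpkin")
print(rank("squash", ["drink", "pumpkin", "racket"]))  # ['racket', 'pumpkin', 'drink']
```

Because the counts are shared by everyone, a shopper looking for the vegetable still sees rackets first, which is exactly the limitation noted above.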

Search engines with personalisation functionality might be able to discern what context is most relevant for each individual user, but that's only possible if they're a returning customer who regularly searches for the same thing. Predicting contextual intent for new users, or users with a short purchase history, is a difficult problem, commonly known as the cold-start problem.

3. Personalising results

WHAT IT IS AND WHY IT'S IMPORTANT

Personalisation in relation to site-search refers to the concept of tailoring search results based on individual users' unique characteristics, previous purchases, and search history. Even demographic data like age, sex, and geographical location may be considered to deliver search results as unique as each of your customers.

Since all your users will have different desires and needs, personalising search results can greatly improve sales and customer satisfaction.

WHY IS IT HARD TO SOLVE?

The biggest problem with personalisation is that most users don't visit the same store frequently enough to impart a strong data set. If all you know about a customer is that they once bought a pair of green pants from you, what can you do with that information? Just present them with loads of green products or a swathe of pants?

What’s more, in many cases, user preference changes drastically after purchase. A customer that has bought a pack of underwear is likely to want to buy the same type of underwear again, but a user that just bought a winter coat is unlikely to be interested in buying the same coat again – or indeed any winter coat for at least a year.

If you’re in the position of holding a lot of user data, that volume can become a problem in itself. The problem with big data is that, well, it's big. When handling data of that scale, issues like storage, indexing, efficient retrieval, security, and privacy become important factors to consider.

HOW TO SOLVE IT

Most open-source search solutions don't have personalisation features, and even when they do, they're rudimentary.

One approach to achieving great search result personalisation is to use machine learning to create clusters of common user behaviour, placing users in those clusters, and then recommending products that people in those clusters are likely to buy. This alleviates the lack of data problem since many users together define these clusters. However, this is limited for the same reason – it only works with a small number of customer personas.
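The cluster-based idea can be illustrated with a deliberately simplified sketch. Here the clusters are written by hand and a user is assigned by counting shared purchases; a real system would learn the clusters with a machine learning algorithm. `CLUSTERS`, `assign_cluster`, and `recommend` are all hypothetical names.

```python
from collections import Counter

# Hypothetical, hand-written clusters: each lists the purchase histories of its members.
# In practice these would be learned from behaviour data rather than defined manually.
CLUSTERS = {
    "outdoor": [{"tent", "boots"}, {"boots", "jacket"}],
    "office": [{"monitor", "keyboard"}, {"keyboard", "desk"}],
}


def assign_cluster(purchases: set[str]) -> str:
    """Pick the cluster whose members share the most products with this user."""
    def overlap(name: str) -> int:
        return sum(len(purchases & member) for member in CLUSTERS[name])
    return max(CLUSTERS, key=overlap)


def recommend(purchases: set[str]) -> list[str]:
    """Recommend what others in the same cluster bought, most common first."""
    members = CLUSTERS[assign_cluster(purchases)]
    counts = Counter(
        product for member in members for product in member if product not in purchases
    )
    return [product for product, _ in counts.most_common()]


print(assign_cluster({"boots"}))     # outdoor
print(sorted(recommend({"boots"})))  # ['jacket', 'tent']
```

A single purchase is enough to place the user, which is how clustering sidesteps the thin-data problem, but every user in a cluster gets the same recommendations.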

Alternatively, a machine learning algorithm that works with small data sets – like Loop54 – can build a unique persona for every customer, no matter how infrequently they visit. In this case, the algorithm can decipher in-session purchase intent by looking at what area of the catalogue the visitor is currently interacting with, and use that information to predict their next step. Additionally, it can use past purchases to build a preference profile of each visitor and use that information to show preferred products first upon their next visit.

4. Up-selling/cross-selling/merchandising

WHAT IT IS AND WHY IT'S IMPORTANT

It's important to implement some kind of business logic into the sorting of products. If a user is searching for a toaster, you want them to find toasters they're interested in while also maximising revenue.

This means that you want the user to buy the toaster with the highest margin (while still matching the user’s preferences, of course), and you'll also want to recommend accessories that may interest them. You may even want to promote a toaster that's on its way out of the catalogue. Sometimes, what's valuable for the customer is aligned with what's valuable for the vendor.

A happy balance must be found between satisfying customers and maximising revenue.

WHY IS IT HARD TO SOLVE?

It's a challenge to find the right balance between displaying products because they're relevant to the user, and displaying products because their promotion is valuable to the retailer.

It’s easy to bump all high-margin products to the top of the results list, but many of them will probably not be very relevant. Applying a small 'boost' to the relevance score is better – or a 'bury' for products that you want to push further down the results – but it's not easy to determine the level of boost or bury to implement.

For example, when searching for 'monitor', a customer might be presented with PC monitors first and audio monitors second, assuming the search engine is smart enough to use user behaviour to determine that PC monitors are more relevant.

But what if one of the audio monitors has a huge margin and you want to boost that product? You’d want it to be the first audio monitor to show up, but you don’t want it to show before the PC monitors. Even if you tweak the boost value just right to accomplish this, that same boost value is not going to give you the same effect on other queries. If the search engine does not understand this contextual border, it's difficult to tweak the business rules so they don’t override the customer’s intent.

Another challenge is finding all the products that are relevant in the first place. This goes along with the problem of finding synonyms above – you need to find all the relevant products before you can choose how to present them.

HOW TO SOLVE IT

The best way to solve this is to make sure your search engine separates matching and sorting. When a customer searches for a product, the engine needs to understand which products are relevant to the query, and, if there are multiple possibly relevant contexts, the engine needs to be able to tell them apart.

This way, you can add boost rules to products so that they always appear at the top of the results in their context, but if their context is not relevant, they won't jump to the top. In the example above, the high-margin audio monitor would be boosted so that it shows up as the first audio monitor, but the search engine understands that the PC monitor context is more important, so the PC monitors will still come out on top.

Products should only be boosted or buried within their context.
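Separating matching from sorting can be sketched with a two-level sort key: the context's rank comes first, and the boost only affects ordering inside a context. The `Product` fields, the `context_rank` mapping, and the scores below are invented for the example; how contexts are detected is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    context: str        # e.g. the product group the engine matched it to (assumed)
    relevance: float    # relevance score from the matching step (assumed)
    boost: float = 0.0  # merchandising boost (positive) or bury (negative)


def sort_results(products: list[Product], context_rank: dict[str, int]) -> list[Product]:
    """Order by context first, then by boosted relevance within each context."""
    return sorted(
        products,
        key=lambda p: (context_rank[p.context], -(p.relevance + p.boost)),
    )


results = sort_results(
    [
        Product("PC monitor A", "pc", 0.9),
        Product("PC monitor B", "pc", 0.8),
        Product("Audio monitor C", "audio", 0.7),
        Product("Audio monitor D", "audio", 0.6, boost=5.0),  # high-margin product
    ],
    context_rank={"pc": 0, "audio": 1},  # 0 = most relevant context for 'monitor'
)
print([p.name for p in results])
# ['PC monitor A', 'PC monitor B', 'Audio monitor D', 'Audio monitor C']
```

However large the boost, the high-margin audio monitor only moves to the top of the audio monitors, so the same rule behaves predictably across different queries.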

Engineering challenges

Here are a few engineering challenges you'll need to bear in mind when designing your site-search.

1. Data laundry

Duplicate removal – Is there only one index entry for each product?
Attribute normalisation – Does each category, brand, etc. have exactly one representation, rather than several versions that differ in casing, spelling, or artefacts such as spaces, dots, or other special characters?
Junk removal – Have all unused categories, mislabelled products, etc. been removed?
Repository merging – Is your product information spread between different systems, like editorial material in the PIM and prices in the accounting software?
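Two of the checks above, attribute normalisation and duplicate removal, can be sketched in a few lines. The normalisation rules (lower-casing, dropping punctuation, collapsing whitespace) and the assumption that each product record has an `id` field are illustrative choices, not a complete cleaning pipeline.

```python
import re


def normalise_attribute(value: str) -> str:
    """Collapse casing, punctuation and spacing differences into one form."""
    value = value.lower()
    value = re.sub(r"[^\w\s]", "", value)  # drop dots and other special characters
    return re.sub(r"\s+", " ", value).strip()


def deduplicate(products: list[dict]) -> list[dict]:
    """Keep the first entry seen for each product id (assumed field name 'id')."""
    seen: dict[str, dict] = {}
    for product in products:
        seen.setdefault(product["id"], product)
    return list(seen.values())


print(normalise_attribute("  H.P. "))            # hp
print(normalise_attribute("Hewlett   Packard"))  # hewlett packard
print(len(deduplicate([{"id": "1"}, {"id": "1"}, {"id": "2"}])))  # 2
```

Rules like these only catch mechanical differences; spelling variants such as 'HP' versus 'Hewlett Packard' still need a mapping table or manual review.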

2. Re-indexing

Delta detection – What products have changed, and what properties of those products have changed?
Atomicity – Are you modifying the data model while keeping a consistent state in the process?
Error handling – What happens if the re-indexing fails half-way? How does the client get information about this? Do we retry or fail fatally?
Transactions/rollback – If an error occurs, how do we keep the data in a consistent state?
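Delta detection is often approached by fingerprinting each product record and comparing against the fingerprints stored at the previous indexing run. The sketch below assumes products are JSON-serialisable dictionaries with an `id` field; the function names are hypothetical, and it detects which products changed, not which individual properties.

```python
import hashlib
import json


def fingerprint(product: dict) -> str:
    """Stable hash of a product record, independent of key order."""
    payload = json.dumps(product, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def detect_delta(old_fingerprints: dict[str, str], products: list[dict]):
    """Return (changed or new ids, removed ids, fingerprints to store for next time)."""
    current = {p["id"]: fingerprint(p) for p in products}
    changed = [pid for pid, fp in current.items() if old_fingerprints.get(pid) != fp]
    removed = [pid for pid in old_fingerprints if pid not in current]
    return changed, removed, current


old = {
    "1": fingerprint({"id": "1", "price": 10}),
    "2": fingerprint({"id": "2", "price": 5}),
}
changed, removed, _ = detect_delta(old, [{"id": "1", "price": 12}, {"id": "3", "price": 7}])
print(changed)  # ['1', '3']
print(removed)  # ['2']
```

Storing the new fingerprints only after the re-index has succeeded is one simple way to make a failed run safe to retry.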

3. Caching

Discriminator selection – What properties of the query do you use as a unique key to find cached results?
Invalidation – When do you clear the cache? There are many options - FIFO, TTL, LRU... Should we clear it when re-indexing? Do we need to clear the entire cache?
Memory/CPU balance – How much should you keep in the cache before the memory usage overshadows the CPU benefits? Or do we cache on disk?
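As one possible answer to the discriminator and invalidation questions, the sketch below builds a cache key from the normalised query text plus its filters and expires entries after a fixed time-to-live. The `TTLCache` class is a hypothetical minimal example; it has no size limit, so it does not address the memory/CPU balance.

```python
import time


class TTLCache:
    """Minimal time-to-live cache for search results (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict = {}

    @staticmethod
    def make_key(query: str, filters: dict) -> tuple:
        """Discriminator: normalised query text plus sorted filter pairs."""
        return (query.strip().lower(), tuple(sorted(filters.items())))

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # lazily drop expired entries on read
            return None
        return value

    def put(self, key, value) -> None:
        self._store[key] = (self.clock() + self.ttl, value)

    def clear(self) -> None:
        """Full invalidation, e.g. after a re-index."""
        self._store.clear()


cache = TTLCache(ttl_seconds=60)
key = TTLCache.make_key(" Toaster ", {"brand": "acme"})
cache.put(key, ["p1", "p2"])
print(cache.get(TTLCache.make_key("toaster", {"brand": "acme"})))  # ['p1', 'p2']
```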

4. Scaling/availability

Data synchronisation – When scaling across different nodes, how do we make sure the nodes have the same data?
Sharding – How do we split up a large data-set so it can be hosted on different nodes?
Load balancing – For high-traffic situations, how do you let multiple nodes share the traffic load?
Failover – What happens when one node fails? How do you define and detect a failure? How do you make sure that the clients switch to another node as fast as possible?
Geographic distribution – Can you place your nodes all over the world, and let clients automatically use the closest one? What about data privacy issues?
Backup – What do you do if all nodes fail, or the data becomes corrupted? Can you restore from a previous point? Or do you need to re-index all the data?
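For the sharding question, a common starting point is to hash a stable identifier and take the result modulo the number of shards. The sketch below assumes string product ids and a fixed shard count; the function name is hypothetical, and this simple scheme reshuffles most products whenever the shard count changes.

```python
import hashlib


def shard_for(product_id: str, shard_count: int) -> int:
    """Map a product id to a shard index in a stable, evenly spread way."""
    digest = hashlib.sha256(product_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# The same id always lands on the same shard, on every node.
assert shard_for("sku-1001", 4) == shard_for("sku-1001", 4)
print(0 <= shard_for("sku-1001", 4) < 4)  # True
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the latter is randomised per process for strings, so different nodes would disagree on where a product lives.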

Building an effective site-search for your e-commerce store is like running a technical marathon. There are many challenges to work around, so advanced technical know-how and determination are a must.

If you need assistance getting your site-search up to standard and ready to handle all possible search scenarios, talk to Loop54. We'll be happy to help!
