
Exploring the application of Information Retrieval systems to distribution and manufacturing company databases

Introduction 

Imagine a large manufacturing company with a global distribution network. This company produces thousands of products, or semi-finished products, from simple raw materials to complex components. These products are precisely described in the database, which often contains millions of records. Orders, on the other hand, can be extremely variable – a client can request a product using a vague description, a detailed specification, or anything in between. Each customer can, and often does, describe the same product from the supplier's catalog differently. How can we ensure that the customer receives what they are asking for, regardless of how they described the product? Currently, a common approach in the industry is to have a team of sales specialists who know what a customer is ordering based on context. This process, called information retrieval, can be automated using machine learning models called Sentence Transformers or Bi-Encoders. These models can learn the context of words, item groups, and even subtle nuances in business terminology, all in multiple languages.


Scenario - retrieve the item by a variety of descriptions 

Let’s look at a real-life scenario: A factory in Europe needs to replace a specific bolt used in an assembly line, but they only have a vague description of its characteristics. The traditional approach would involve manually searching through databases, cross-referencing catalogues, and possibly contacting multiple suppliers – a time-consuming and error-prone process. With a fine-tuned product retrieval engine, the query "m8 bolt steel" can yield precise results, identifying the right bolt, its specifications, part number, and similar or substitute products. This drastically reduces downtime and ensures the assembly line stays productive. 


Base models, trained on large amounts of general text, understand human language and can be used to determine the similarity between sentences and product descriptions. They can also distinguish synonyms, but they fail at domain-specific tasks like item retrieval from a manufacturer's database. For this reason, the model has to be further trained (fine-tuned) on the company's internal data. A complete, well-maintained database is good enough to fine-tune the model and achieve good retrieval results, but additional information about customer orders can potentially improve performance further. Such retrieval returns results that are semantically similar to the query, and they can be improved further by re-ranking the returned products based on additional context. This context can be anything – customer order history, geographical location, industry trends, etc. It's case-specific and thus won't be looked at in this article.

 

Machine Learning Models for Information Retrieval 

Let’s start with the basic building block of language models – embeddings. Embeddings are numerical representations of words or parts of words. These numerical representations – vectors – capture the semantic meaning of words, phrases, or sentences. Unlike traditional keyword-based retrieval, which relies on exact matches, embeddings allow comparing texts based on their meaning, even if different words are used. These embeddings preserve directional relationships between the meanings of words. Think of word analogies: if you take the embeddings of the words man and woman and subtract one from the other, you are left with a direction representing the difference between genders. Adding this resulting vector to the vector of king gives an outcome very close to the embedding vector of queen.
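The classic king/queen analogy is usually demonstrated with word vectors, but a Sentence Transformer can be used to sketch the same idea. The snippet below is a minimal illustration, assuming the multilingual base model mentioned later in the article; the analogy only holds approximately for sentence embeddings.

```python
from sentence_transformers import SentenceTransformer, util

# The multilingual base model used later in the article; any general-purpose
# embedding model works for this illustration.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

king, man, woman, queen = model.encode(["king", "man", "woman", "queen"])

# Moving from "man" towards "woman", starting at "king", should land near "queen".
analogy = king - man + woman
print(util.cos_sim(analogy, queen))  # expected to be close to 1 (approximately)
```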



Embeddings can be compared using similarity metrics like cosine similarity. The closer two vectors are in this multidimensional space, the more semantically similar the corresponding sentences are. Using UMAP to project this multidimensional space onto a 2D plane, we can visualize the concept. The figure below was created from the ERP items dataset; we can see that most item groups have all their items near each other.
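A projection like Figure 1 can be produced with the umap-learn package. This is a minimal sketch, assuming `item_embeddings` is an (n_items, 768) array of Bi-Encoder vectors already computed for the catalogue items; the plotting details of the original figure are not reproduced here.

```python
import umap
import matplotlib.pyplot as plt

# item_embeddings: (n_items, 768) Bi-Encoder vectors, e.g. model.encode(item_descriptions)
coords = umap.UMAP(n_components=2, metric="cosine", random_state=42).fit_transform(item_embeddings)

# Hexbin plot in the spirit of Figure 1: brightness reflects how many items fall into each bin.
plt.hexbin(coords[:, 0], coords[:, 1], gridsize=60)
plt.title("UMAP projection of item embeddings")
plt.show()
```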


Figure 1. UMAP projection of the Bi-Encoder embedding space to 2D. Each hexagon is a bin, with brightness corresponding to the number of items of a given item group in that bin. Purple means 0, and brighter bins contain more items.


Some groups are spread all over the space because they are very numerous and contain many varying subgroups. These subgroups are themselves numerous and further subdivided, but we can see that each subgroup occupies a narrower area than its parent group. This process can be repeated until all nested subgroups occupy their own distinctive regions.

 

Figure 2. UMAP projection of the Bi-Encoder embedding space for subgroups of a particular group.


It’s important to note that projecting 768 dimensions down to just 2 loses information, so items that look indistinguishable in the 2D projection can be perfectly distinguishable in the original space.

 

Semantic embedding-based retrieval scales well to very big databases by using an Approximate Nearest Neighbor search algorithm, such as Hierarchical Navigable Small World (HNSW). HNSW usually achieves near-exact results in a fraction of the time required by an exhaustive search, making it both scalable and accurate.
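As a concrete sketch of how such an index can be used, here is a minimal example with the hnswlib package; it assumes `item_embeddings` from above and a fine-tuned Sentence Transformer `model`, and the index parameters are illustrative rather than the article's actual settings.

```python
import numpy as np
import hnswlib

# Build the HNSW index over all catalogue item embeddings.
dim = item_embeddings.shape[1]
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(item_embeddings), ef_construction=200, M=16)
index.add_items(item_embeddings, np.arange(len(item_embeddings)))
index.set_ef(64)  # query-time trade-off between speed and recall

# Retrieve the 10 catalogue items closest to a free-text query.
query_vec = model.encode(["m8 bolt steel"])
item_ids, distances = index.knn_query(query_vec, k=10)
```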


A product retrieval engine's performance relies on high-quality data, so its source system, such as Infor M3, has to be clean, consistent, and well maintained. Conflicting, repeated, or missing information will decrease retrieval performance. For example, we noticed that field values with additional information in parentheses have a negative impact on the model's performance, at least for semantic comparison between two phrases.
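A cleaning step for such cases can be very simple. The sketch below shows one possible normalization – stripping parenthesised additions from a field value; whether the parenthesised content should be dropped or moved to a separate field depends on the data, and the example strings are invented.

```python
import re

def strip_parentheses(text: str) -> str:
    """Remove parenthesised additions, e.g. 'Hex bolt M8x40 (zinc plated)' -> 'Hex bolt M8x40'."""
    return re.sub(r"\s*\([^)]*\)", "", text).strip()

print(strip_parentheses("Hex bolt M8x40 (zinc plated)"))
```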


Setting up the system in a company can be broken down into two distinct steps – data preparation and model training. A Sentence Transformer requires pairs of sentences (pieces of text) together with a similarity score, or triplets of text – an anchor, a similar sentence, and a dissimilar sentence. When using a score, it's important to include examples with lower scores, not only highly similar ones. These examples can be created automatically from the database, but input from a company specialist is needed to decide which item attributes should be used to create the pairs.
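In the sentence-transformers library, both formats can be expressed as `InputExample` objects. This is a minimal sketch; the product strings are invented placeholders, not records from the article's database.

```python
from sentence_transformers import InputExample

# Pair + similarity score format (CosineSimilarityLoss expects labels in [0, 1]).
pairs = [
    InputExample(texts=["hex bolt M8x40 steel", "M8 bolt, 40 mm, steel"], label=0.9),
    InputExample(texts=["hex bolt M8x40 steel", "wood screw 4x30"], label=0.1),  # a lower-score example
]

# Triplet format: anchor, a similar item, and a dissimilar item.
triplets = [
    InputExample(texts=["hex bolt M8x40 steel",
                        "M8 bolt, 40 mm, steel",
                        "wood screw 4x30"]),
]
```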


The next step is to fine-tune the model on this data to fit the needs of specific business areas. Plentiful, high-quality data is the most important ingredient of a good model. There are also a few hyperparameters to tweak – most notably the learning rate, the number of epochs, and the batch size. The first two control how strongly the model fits the domain-specific data, while the latter impacts training time and stability. Let's explore their impact on retrieval using an example snapshot of a real Infor M3 database, with all product names and descriptions replaced by generic descriptions.
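A fine-tuning run with these hyperparameters exposed can look roughly like the sketch below, using the scored `pairs` from the previous step. The hyperparameter values shown are illustrative defaults, not the tuned values discussed later in the article.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

train_loader = DataLoader(pairs, shuffle=True, batch_size=64)  # batch size
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=4,                        # number of epochs
    optimizer_params={"lr": 2e-5},   # learning rate
    warmup_steps=100,
    output_path="item-retrieval-model",
)
```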


The retrieval engine will be evaluated using two metrics computed at a cutoff of 10 – Normalized Discounted Cumulative Gain (NDCG@10) and recall@10 – meaning we only look at the top 10 results.


NDCG is a measure used to evaluate the effectiveness of a search engine or recommendation system. It tells us how well the system ranks the most relevant items at the top of the results.
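For reference, both metrics are straightforward to compute per query. The sketch below assumes binary relevance labels for the top-10 retrieved items and is a simplified illustration rather than the exact evaluation code used in the experiments.

```python
import numpy as np

def dcg(relevances):
    # Gains discounted by log2 of the rank position (rank 1 -> log2(2)).
    return sum(rel / np.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances, k=10):
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

def recall_at_k(relevances, n_relevant, k=10):
    return sum(1 for r in relevances[:k] if r > 0) / n_relevant if n_relevant else 0.0

# Binary relevance of the top-10 results for one query, with 4 relevant items in total:
rels = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
print(ndcg_at_k(rels), recall_at_k(rels, n_relevant=4))
```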



sentence-transformers/paraphrase-multilingual-mpnet-base-v2 was used as the base model, as it has the highest performance (according to the official documentation) of the available multilingual Sentence Transformers models.


Evaluation data was generated using gpt-4o, based on a random sample of the database the model was trained on, provided to GPT as an example of the data structure. GPT was tasked with generating a natural-language query and the expected values of the Infor M3 item attribute columns implied by that query. Each query had to include information about a previously specified column – let's call it a pivot – while the inclusion of all other columns was left up to GPT. This is useful because we can now tell whether the model struggles to retrieve a particular parameter correctly, for example size or color.
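To make the pivot idea concrete, a generated evaluation record could be structured roughly as below. This is a hypothetical illustration – the field names mirror the table later in the article, but the query and values are invented, and the scoring helper is only a sketch.

```python
# Hypothetical structure of one generated evaluation case.
eval_case = {
    "query": "stainless hex bolt, 40 mm",
    "pivot": "Dimension field 1",
    "expected": {"Dimension field 1": "40", "Group field 1": "Bolts"},
}

def pivot_hit(retrieved_item: dict, case: dict) -> bool:
    """A retrieved item counts towards the pivot metric if its pivot attribute matches."""
    col = case["pivot"]
    return str(retrieved_item.get(col)) == str(case["expected"][col])
```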


Impact of training on domain data on retrieval performance 

Comparing NDCG@10 and recall@10 of the base and the fine-tuned models, we can see significant improvements on two datasets – a sample of food distribution items and a sample of industrial manufacturing items. Note that no hyperparameter search was done for the industrial product dataset and the data preparation step was minimal, as we only wanted to see whether the process can be applied to a completely different domain. Hyperparameter selection is an important step for maximizing performance, as the improvement in NDCG varies between +0.03 and +0.28; the recall improvement has a very similar range.

Figure 3. Food distribution company NDCG improvements. 

Figure 4. Food distribution company recall improvement. 


Since hyperparameters influence retrieval performance so much, let's explore the best parameters in more detail. Batch size doesn't have a big impact on retrieval performance but affects training time and hardware requirements – a larger batch size allows faster training but requires more memory; it was also shown to affect stability if it's either too small or too large. Both NDCG and recall increase with the learning rate and the number of training epochs. This means the system can be improved further, but there is a risk of catastrophic forgetting – the model can forget things that weren't part of the domain training data. In the case of a search engine for a particular product catalog this isn't catastrophic, as the model doesn't have to understand completely out-of-domain queries: there are no correct products to retrieve for them in the database anyway. In our experiments we haven't hit the point of catastrophic forgetting, so the model could be trained even more. Key factors we used to check whether we encountered catastrophic forgetting (a minimal probing sketch follows the list):

  • The base model is multilingual, and we only trained on English item names, so queries in other languages should still give reasonable results.

  • The model should handle minor typos gracefully. Interestingly, the longer the query context, i.e., the more product attributes we ask for, the more typos the model handles well while still giving good responses. Our research queries of 5–6 words could handle 2–4 individual typos.

  • Synonyms – we ran queries asking for synonyms of items and their attributes. An overfitted model would not give good results; our highest-learning-rate model still returns quality results.
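These checks amount to running a handful of probe queries against the fine-tuned index and inspecting the results. The sketch below assumes the `index`, `model`, and an `item_names` lookup from the earlier snippets; the probe queries are illustrative placeholders (the first is Polish for "M8 stainless steel bolt").

```python
# Probe queries for catastrophic forgetting: another language, typos, and synonyms.
probe_queries = [
    "śruba M8 ze stali nierdzewnej",  # same item asked for in another language
    "m8 blot stele",                  # typo-ridden version of "m8 bolt steel"
    "M8 fastener, stainless",         # synonym of "bolt"
]

for q in probe_queries:
    ids, _ = index.knn_query(model.encode([q]), k=10)
    print(q, "->", [item_names[i] for i in ids[0]])  # inspect whether results stay sensible
```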

 

Figure 5. NDCG as a function of hyperparameters 

Figure 6. Recall as a function of hyperparameters 

 

Since batch size doesn’t significantly affect performance, let's zoom in on the other two main hyperparameters – learning rate and number of epochs. Both roughly follow the usual logarithmic shape. This means diminishing returns – each additional unit of computational resources yields a smaller performance improvement.

 

Figure 7. NDCG as a function of training epochs and learning rate. Each dot represents one batch size. 


Figure 8. Recall as a function of training epochs and learning rate. Each dot represents one batch size.


When looking at performance metrics per pivot, we can clearly see the impact of data consistency, quality, and completeness. In our example Infor M3 database, the fine-tuned model achieves an NDCG@10 of 0.9 for Group field 1 and Dimension field 1, which are consistently maintained fields. Group field 1 is a natural-language field filled out in all but 3 rows, while Dimension field 1 is missing in 44% of records but is an extremely consistent numerical field, so the model can learn to retrieve an exact match. Fields with NDCG@10 below 0.8 all have over 74% missing values. Additionally, Business area is a set of business names that are not necessarily common words. Dimension field 2 is an extremely inconsistent numerical field that sometimes holds multiple values separated by a slash; it also has a lot of unique values, which is undesirable for a numerical field.

Our evaluation considered only exact matches when comparing numerical fields, so 10 vs. 11 is as much a mismatch as 10 vs. 200. While it's possible that incorporating distance information between numerical values would improve performance, it's by no means given, as transformer-based models notoriously struggle with numbers due to how tokenization works. The key takeaway from the pivot analysis is that data has to be consistent within a column, the columns have to be well maintained (that is, filled out), and text fields are easier for the model to grasp than numerical fields.

| pivot | NDCG@10 | recall@10 | NDCG@10_pivot | missing values fraction | unique value fraction |
| --- | --- | --- | --- | --- | --- |
| Dimension field 1 | 0.90 | 0.85 | 0.91 | 0.44 | 0.07 |
| Group field 1 | 0.90 | 0.82 | 0.96 | <0.01 | 0.05 |
| Group technology | 0.83 | 0.66 | 0.83 | <0.01 | <0.01 |
| Group field 2 | 0.80 | 0.72 | 0.90 | 0.31 | 0.09 |
| Business area | 0.77 | 0.80 | 0.81 | 0.87 | 0.10 |
| Group field 2 | 0.76 | 0.65 | 0.88 | 0.74 | 0.09 |
| Dimension field 2 | 0.44 | 0.57 | 0.45 | 0.91 | 0.19 |
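The data-quality columns of the table are cheap to reproduce for any item master before training. A minimal sketch with pandas, assuming an export of the relevant Infor M3 item columns (the file name is illustrative):

```python
import pandas as pd

# items: DataFrame with the Infor M3 item master columns used as pivots.
items = pd.read_csv("item_master_export.csv")

quality = pd.DataFrame({
    "missing values fraction": items.isna().mean(),
    "unique value fraction": items.nunique() / len(items),
}).round(2)
print(quality.sort_values("missing values fraction"))
```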

 

Conclusions 

In conclusion, semantic search fine-tuned on domain data is very accurate: it retrieves mostly correct products and most of the correct products related to a query. It can handle synonyms and even multiple languages – one can query for a particular product in a language different from the language of the database and still get correct results. Semantic search can also handle typos, and longer queries with more context can tolerate more significant typos. Thanks to approximate search, such retrieval systems are very fast even for large databases. The models can learn widely varying domains – we tested them on a food products database and an industrial products database alike. As with most deep learning models, performance grows logarithmically with the amount of computational resources used. To properly fine-tune the model, consistent, high-quality data is necessary.


Artur Stopa & Greg Zajaczkowski


 
 
 
