
Finding a Scalable Data Model for Search @ bol.com

Bol.com started in 1999 and has grown from an online bookstore to an online superstore with a wide variety of products, including books, Segways, shoes, saunas, swimming pools, and much more. Since 2010, bol.com has also been a platform where other sellers can sell their own products, or sell the same products bol.com offers but under different conditions.

Right now we have over 11 million products available, 6.2 million active customers, and about 230,000 active sellers a month.

What's the problem?

Like any e-commerce site, we want to help our customers find what they are looking for and part of that comes in the form of a search engine.

Currently we use Endeca for site search with a 'flat' document model for products. For each product we join all data relevant for search into one flat document, and from that product's set of offers we select the one offer (containing information like price, availability, and seller) that we think is most relevant based on some rules. This means price and availability are part of the product document and we can only use one offer:

productid  title                author        price  availability  review
1          Under the naked sun  Isaac Asimov  10     within 24h    4 stars

Functionally this limits what our customers can do. Let's say I have a product with an offer from bol.com for 20 Euros and an offer from Seller A for 18 Euros, and our offer selection says bol.com has the best offer. The price facet is based on the 'best buy' price, so if a customer selects a price between 15-19 Euros the product will disappear, because its 'best buy' price of 20 Euros falls outside the filter. This is bad for our customers because there actually is a relevant offer, and it’s also bad for sellers because their offer isn't shown.
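The price-facet problem can be reproduced in a few lines. This is only a sketch that mirrors the example above: the field names and the helper functions are assumptions made for illustration, not our actual Endeca setup.

```python
# Hypothetical data mirroring the example in the text: one product, two offers.
offers = [
    {"seller": "bol.com", "price": 20},
    {"seller": "Seller A", "price": 18},
]


def flat_document(product_id, offers, best_seller):
    """Flat model: only the selected 'best buy' offer is copied onto the product."""
    best = next(o for o in offers if o["seller"] == best_seller)
    return {"productid": product_id, "price": best["price"], "seller": best["seller"]}


def matches_flat(doc, low, high):
    """Price filter evaluated against the single price stored on the flat document."""
    return low <= doc["price"] <= high


def matches_any_offer(offers, low, high):
    """Price filter evaluated against every offer of the product."""
    return any(low <= o["price"] <= high for o in offers)


doc = flat_document(1, offers, best_seller="bol.com")
flat_hit = matches_flat(doc, 15, 19)         # the product drops out of the results
offer_hit = matches_any_offer(offers, 15, 19)  # although Seller A's offer qualifies
print(flat_hit, offer_hit)
```

The flat document fails the 15-19 Euro filter while one of the underlying offers passes it, which is exactly the gap a model with all offers per product has to close.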

We want the customer to be in more control. One of the major things we need to solve is that we need a scalable way to model all offers for each product.

Other challenges

We've been working with Elasticsearch and the Elastic Stack for a while and recently also for an application for our professional sellers. Elasticsearch seems to be very scalable and flexible so we wanted to see if Elasticsearch could solve our modelling challenge and whether it comes close to the performance we get out of our current search engine. To give an indication, we had about 2,400 requests a second on our search engine during the holiday season, and expect a lot more this year.

Another crucial challenge is the number of updates caused by the number of products, offers, and sellers: about 500K offer updates (price or availability changes) and 200K content updates a day, mostly during office hours. Sometimes we have peaks of 1.5 million product updates and 20 million offer updates a day. On average this will increase in the coming years because we are growing, with more products, more sellers, more offers, and more updates and inserts.

model-p3.png

So choosing the model is not just about whether we can model product offers correctly. It's also about how we can set up Elasticsearch in such a way that it can handle all those updates in near real-time while processing the number of requests. For this article we'll stick mostly to modelling and performance from a request perspective.


Challenge recap

So to recap our challenges:

  • Hierarchical data model, multiple levels
  • High volume searches and data changes
  • Complex query requirements
    • Both product and offer fields in query
    • Aggregations on both levels

I'll get back to 'complex query requirements' in a moment but in general we have come up with a principle which states 'although we value performance at index time, we value performance at query time more'. We used this principle in the decision process.

Elasticsearch and data modelling

Elasticsearch has a few modelling options for our case:

  1. A single document for each offer with product information inline
  2. A single document for each product, where the best offer is added inline
  3. A single document for each product with all offers added inline
  4. A single document for each product with all offers added inline, but also marked as nested objects in the Elasticsearch mapping
  5. A separate document for each product and separate documents for each offer, using the parent-child functionality of Elasticsearch

bolcom-data-modelling.jpeg

Approach

Option 2 is the situation we have now, so we skipped that one. Option 3 was also quickly discarded because you lose the relation between the data of an offer if you have more than one offer. That left us with three models (1, 4 and 5). But how can you test which one is best? We first started by actually creating representative indices for all these options to test against.
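To illustrate what 'losing the relation' means: Elasticsearch indexes an array of plain (non-nested) objects as one multi-valued field per attribute, so the values of different offers get mixed. The sketch below only simulates that behaviour in memory. The field names and values are assumptions, and `flatten`, `match_flattened`, and `match_nested` are hypothetical helpers, not Elasticsearch calls.

```python
# Hypothetical product with two inline offers.
product = {
    "productid": 1,
    "offers": [
        {"seller": "bol.com", "price": 20, "deliverycode": 0},
        {"seller": "Seller A", "price": 18, "deliverycode": 2},
    ],
}


def flatten(doc, path="offers"):
    """Mimic inline objects without a nested mapping: one multi-valued field per attribute."""
    flat = {}
    for obj in doc[path]:
        for key, value in obj.items():
            flat.setdefault(f"{path}.{key}", []).append(value)
    return flat


def match_flattened(flat, price, deliverycode):
    """Each condition is checked against its own value list, so offers are mixed up."""
    return price in flat["offers.price"] and deliverycode in flat["offers.deliverycode"]


def match_nested(doc, price, deliverycode):
    """Nested semantics: both conditions must hold within the same offer."""
    return any(
        o["price"] == price and o["deliverycode"] == deliverycode
        for o in doc["offers"]
    )


flat = flatten(product)
false_positive = match_flattened(flat, price=18, deliverycode=0)
nested_result = match_nested(product, price=18, deliverycode=0)
print(false_positive, nested_result)
```

No single offer costs 18 Euros with deliverycode 0, yet the flattened form matches, because the price comes from one offer and the deliverycode from the other. Marking the offers as nested objects (option 4) keeps each offer's fields together.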

We used a Node.js script for creating random data:

Product

  • Title: two random nouns from noun list
  • Category: pick one out of 26 nouns
  • Offers: half of the products have no offer, the other half have between 1 and 4

Offer

  • Random price between 1-20
  • Seller: pick one out of 10k unique sellers
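The generator rules above could look roughly like this. Our actual script was written in Node.js and is not shown here, so this Python version is only a sketch: the placeholder noun list, the seller naming, the integer prices, and the fixed seed are all assumptions.

```python
import random

# Placeholder vocabulary (assumption); the real script used a list of actual nouns.
NOUNS = [f"noun{i}" for i in range(200)]
CATEGORIES = NOUNS[:26]   # category: one out of 26 nouns
SELLER_COUNT = 10_000     # 10k unique sellers


def random_offer(rng):
    """One offer: a price between 1 and 20 (integers assumed) and one of 10k sellers."""
    return {
        "price": rng.randint(1, 20),
        "seller": f"seller-{rng.randrange(SELLER_COUNT)}",
    }


def random_product(rng, product_id):
    """One product: half of the products get no offer, the other half get 1 to 4."""
    offer_count = 0 if rng.random() < 0.5 else rng.randint(1, 4)
    return {
        "productid": product_id,
        "title": " ".join(rng.choice(NOUNS) for _ in range(2)),
        "category": rng.choice(CATEGORIES),
        "offers": [random_offer(rng) for _ in range(offer_count)],
    }


rng = random.Random(42)  # fixed seed so a run can be repeated
products = [random_product(rng, i) for i in range(1000)]
without_offers = sum(1 for p in products if not p["offers"])
print(len(products), without_offers)
```

From one generated product, each of the three models can be derived: one document per offer with the product fields copied in, one product document with nested offers, or a parent document plus child documents.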

We also wanted to test with different index sizes, so we flushed the test data out to disk in three flavors: 1M, 10M, and 100M products, each written in bulks of 100K. With three flavors for each of the three models, we ended up with nine indices to test with. Now that we've got the indices, let's talk about the queries.

Complex query requirements

For our test we came up with four use cases, each one more complex than the last but still representative of queries that customers perform every day. Below is an overview of the first three use cases.

Use case A is a regular query where we need to display the cheapest offer and aggregate on some offer attributes. For use case B we ‘selected’ the deliverycode with value ‘0’, which means that we need to filter use case A, select the correct cheapest offer, and aggregate on the correct offers. Use case C orders the selection by cheapest offer first. Here we need to display the cheapest offer and again aggregate on the correct offers.

Use Cases:

use-cases-1.png
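For the nested model, use cases A to C could be expressed roughly as the request body built below. This is a sketch, not the queries we actually ran: the field names (`title`, `offers.price`, `offers.seller`, `offers.deliverycode`) and the `build_query` helper are assumptions, and the `nested` form of the sort clause is the syntax of more recent Elasticsearch versions, so older versions need a different spelling.

```python
import json


def offer_filter(deliverycode=None):
    """Offer-level filter: everything for use case A, one deliverycode for B and C."""
    if deliverycode is None:
        return {"match_all": {}}
    return {"term": {"offers.deliverycode": deliverycode}}


def build_query(term, deliverycode=None, cheapest_first=False):
    """Build a search body for a product index with offers mapped as nested objects."""
    selected = offer_filter(deliverycode)
    body = {
        "query": {
            "bool": {
                "must": [{"match": {"title": term}}],
                "filter": [{
                    "nested": {
                        "path": "offers",
                        "query": selected,
                        # Return only the cheapest matching offer per product.
                        "inner_hits": {"size": 1, "sort": [{"offers.price": "asc"}]},
                    }
                }],
            }
        },
        "aggs": {
            "offers": {
                "nested": {"path": "offers"},
                "aggs": {
                    # Aggregate only over the offers that pass the selection.
                    "selected": {
                        "filter": selected,
                        "aggs": {"sellers": {"terms": {"field": "offers.seller"}}},
                    }
                },
            }
        },
    }
    if cheapest_first:
        # Use case C: order products by their cheapest matching offer.
        body["sort"] = [{
            "offers.price": {
                "order": "asc",
                "mode": "min",
                "nested": {"path": "offers", "filter": selected},
            }
        }]
    return body


use_case_a = build_query("ipad")
use_case_b = build_query("ipad", deliverycode=0)
use_case_c = build_query("ipad", deliverycode=0, cheapest_first=True)
print(json.dumps(use_case_c, indent=2))
```

Note that the nested filter in this sketch drops products without any offer, even for use case A. Whether that is desired depends on the requirements in the overview.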

Use case D is a special case and can best be explained with the following example:

rollup.PNG

What you see here is the ‘rollup’ functionality, which means that we want to display closely related products as one result. We call these product families: the products are the same except for one or two attributes. Here you can see the Apple iPad, all black but with different GB sizes.
This results in the following requirements for the fourth use case:

use-cases-4.png
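In plain Python, the rollup idea can be sketched as grouping products by family, picking one offer to display, and counting over all offers in the family. The `family` field, the sample data, and the rule 'best offer = cheapest offer' are assumptions made for this illustration. In Elasticsearch the same grouping has to happen inside the query.

```python
from collections import Counter, defaultdict

# Hypothetical sample: three iPads in one family plus one unrelated product.
products = [
    {"productid": 1, "family": "ipad-black", "title": "Apple iPad 16GB",
     "offers": [{"seller": "bol.com", "price": 399}, {"seller": "Seller A", "price": 389}]},
    {"productid": 2, "family": "ipad-black", "title": "Apple iPad 32GB",
     "offers": [{"seller": "bol.com", "price": 499}]},
    {"productid": 3, "family": "ipad-black", "title": "Apple iPad 64GB", "offers": []},
    {"productid": 4, "family": "e-reader", "title": "E-reader",
     "offers": [{"seller": "Seller B", "price": 79}]},
]


def rollup(products):
    """Return one result per family with its best (cheapest) offer and seller counts."""
    families = defaultdict(list)
    for product in products:
        families[product["family"]].append(product)

    results = {}
    for family, members in families.items():
        candidates = [(o["price"], p["productid"], o) for p in members for o in p["offers"]]
        best = min(candidates, key=lambda c: (c[0], c[1])) if candidates else None
        results[family] = {
            "product_count": len(members),
            "offer_count": len(candidates),
            "best_offer": best[2] if best else None,
            "best_productid": best[1] if best else None,
            "sellers": Counter(o["seller"] for _, _, o in candidates),
        }
    return results


rolled = rollup(products)
print(rolled["ipad-black"]["best_offer"], dict(rolled["ipad-black"]["sellers"]))
```

Four products become two results, and the displayed offer and the seller counts are both computed across the whole family instead of per product.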

The complexity here is doing the aggregation on the set of products within the family/rollup and selecting the best offer to display. Now that we’ve defined our use cases, let’s start testing.

Testing

Again we used a Node.js script to run the queries and good old Excel to generate some graphs. The Node.js script used an array of search terms. Most of them overlap with the test data, but a percentage of them return 0 results, just like on our website.

You could also specify how many queries to execute (we used 1,000 and 10,000), and the script wrote the response times out to a file. During the first run the response times were terrible. What was wrong? Well, rookie mistake: we had only given Elasticsearch 1 GB of heap space. After we set that to 32 GB (yes I know, more than 30 is overkill, but for testing it was fine), things were looking much better.
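The measuring loop itself is simple. The sketch below shows the idea in Python rather than Node.js. The `fake_search` stand-in, the term list, and the summary fields are assumptions; a real run would send each term to Elasticsearch and write the timings to a file.

```python
import statistics
import time


def run_benchmark(search, terms, count):
    """Execute `count` searches, cycling through the terms, and return timings in ms."""
    timings_ms = []
    for i in range(count):
        term = terms[i % len(terms)]
        start = time.perf_counter()
        search(term)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarize(timings_ms):
    """Reduce the raw timings to the figures used for graphing."""
    return {
        "count": len(timings_ms),
        "mean_ms": statistics.mean(timings_ms),
        "max_ms": max(timings_ms),
    }


def fake_search(term):
    """Stand-in for a real search request; 'zzz' plays the role of a 0-result term."""
    return {"hits": [] if term == "zzz" else [term]}


timings = run_benchmark(fake_search, ["ipad", "asimov", "zzz"], count=1000)
summary = summarize(timings)
print(summary["count"], round(summary["mean_ms"], 4))
```

Keeping the raw timings rather than only the average makes it possible to look at outliers later.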

With that set straight, the results of our tests can be seen in the graphs below. The vertical axis shows the average time in ms over all requests during the run, and the horizontal axis shows the four use cases.

result-graph-1.png

result-graph-2.png

result-graph-3.png

Some remarks about the graphs and outcome:

  • Regarding the 100M run: we did several tests but forgot to run the 'doc' queries for use cases C and D. We expect the same outcome
  • As you can also see, there are no parent-child results for query C. That's because we couldn't get the query to work for that use case. It doesn't mean that it's not possible, just not in the timeframe we had
  • The 'doc' model (one document per offer) is slow because we needed to aggregate on productid and the counts needed to be on product level. This was especially clear with the 100 million document index
  • The 'clear' winner is the nested document model, except for use case D, so we need to figure out a way of doing that with a faster query or in a completely different way

The downside of nested documents is the update behaviour: if only one offer changes for a doc with five offers, the whole document needs to be reindexed. But remember our principle? It's not a big issue if the data is behind for, let's say, 15-20 minutes. Not to say it's not a challenge...
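A rough way to picture that cost: each nested object is stored as its own hidden Lucene document next to the parent, and the whole block is rewritten on any change, while a child document in the parent-child model can be reindexed on its own. The function below is a deliberately simplified counting sketch under that assumption. It ignores segment merges, refreshes, and everything else that influences real indexing load.

```python
def lucene_docs_rewritten(model, offers_on_product, offers_changed=1):
    """Count the Lucene documents rewritten when offers of one product change."""
    if model == "nested":
        # The product plus all of its nested offers are reindexed as one block.
        return 1 + offers_on_product
    if model == "parent-child":
        # Only the changed child (offer) documents are reindexed.
        return offers_changed
    raise ValueError(f"unknown model: {model}")


nested_cost = lucene_docs_rewritten("nested", offers_on_product=5)
child_cost = lucene_docs_rewritten("parent-child", offers_on_product=5)
print(nested_cost, child_cost)
```

For the example of one changed offer on a product with five offers, the nested model rewrites six documents where parent-child rewrites one. That is the index-time price we accept for faster queries.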

Conclusions

Based on the outcome of the test we draw the following conclusions:

  1. Parent/Child has limitations when combining cross-level queries with aggregations in one go
  2. Doc (with the offer being the main entity) was not as fast as we’d expected, but that's because we needed a top_hits aggregation
  3. Nested was the 'winner' although 'rollup' was still slow
  4. Elasticsearch scales predictably

Based on the outcome, we decided to move forward and we are now in the process of building part of the search and browse of our webshop on Elasticsearch.

The modelling and testing was done in cooperation with Anne Veling (beyondtrees.com).



Maarten Roosendaal is an IT Architect at bol.com and his main focus is on scalable search and SEO solutions. He has a background in Java/J2EE and previously worked for Sogyo and Capgemini. When not spending time with family or working, you can find him either in the gym, watching NBA/NCAA basketball, or at a movie theater.