Insights into chess game trends: A detailed look at Lichess data


This is the second blog post in our chess series (don’t forget to check out the first post)! Chess is a fascinating game. What can happen on those 64 squares? 

Lichess is a platform for playing chess online, and it publishes all rated games as archives going back to 2013. In total, that is roughly 4 billion games. Yes, 4 billion.

Analyzing the data

We first want to look at a general overview and get an idea of the data we gathered.

First, take a look at Discover. Don’t forget to set the time picker accordingly. The histogram at the top gives a quick impression of how the data is distributed. We immediately notice that Lichess had a slow start and picked up the pace in early 2020. The top value of 4,095,987,675 is the number of documents that fit the time picker range. Since each game is stored as a single document, this equals the number of games played.

Let’s use the `Field Statistics` button next to the Documents view. It runs a sampler over a subset of the documents and calculates statistics for each field, which gives a quick indication of the values stored. Here, the field statistics used roughly 620,000 documents, and the values for user.black.elo and user.white.elo range from 600 to ~2,210. The Opening ECO (Encyclopaedia of Chess Openings) encodes chess openings in a letter and number system. D00–D69 represents the Double Queen Pawn Game (the first moves being d4 d5). In the sampled data, we have a total of 484 distinct ECO values. Opening names, such as Philidor Defense: Exchange Variation, have 3,053 distinct values.

We can use the Lens icon in the Actions column to jump automatically into a visualization we can save to a new dashboard.

The first dashboard

How are the games distributed over time? January 2023 alone had more games than the years 2015 and 2016 combined.

Are games predominantly won by white, black, or drawn? 

I wouldn’t say predominantly, but white comes out ahead: white wins 2.04 billion games, black 1.9 billion, and only 162.29 million games are drawn.
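To put those counts in proportion, here is a minimal Python sketch that turns them into percentages. It uses the rounded figures quoted in this post rather than the exact document counts, so the shares are approximate.

```python
# Rounded result counts quoted in the post (number of games).
results = {
    "white": 2_040_000_000,
    "black": 1_900_000_000,
    "draw": 162_290_000,
}

total = sum(results.values())

# Share of each outcome as a percentage of all counted games.
shares = {outcome: 100 * count / total for outcome, count in results.items()}

for outcome, share in shares.items():
    print(f"{outcome}: {share:.1f}%")
```

White ends up just under half of all games, with black a few percentage points behind.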

What types of games are played? 

Blitz is the most popular, with nearly 2 billion games played. Rapid didn’t exist until around 2016 and has taken off since then. Hardly anybody plays Correspondence games: only 9 million since 2013.

How are games terminated? 

Either Normal or Time forfeit. Normal accounts for approximately 67% of games.

What are the most used time settings? 

A time control such as 60+0 has two parts. The first number, 60, is the starting time in seconds. The + sign separates it from the increment, the number of seconds added for each move. 60+0 therefore means one minute per player without any increment, so the entire game is over after at most 2 minutes.
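The format is easy to take apart by hand. The following Python sketch is only an illustration: the names `TimeControl`, `parse_time_control`, and `clock_time_seconds` are made up for this example and are not part of the Lichess data or the Elastic Stack.

```python
from typing import NamedTuple, Optional


class TimeControl(NamedTuple):
    base_seconds: int
    increment_seconds: int


def parse_time_control(value: str) -> Optional[TimeControl]:
    """Split 'base+increment' into two integers; return None if it does not fit."""
    base, sep, increment = value.partition("+")
    if not sep or not base.isdecimal() or not increment.isdecimal():
        return None
    return TimeControl(int(base), int(increment))


def clock_time_seconds(tc: TimeControl, moves_per_player: int) -> int:
    """Upper bound on the combined clock time of both players."""
    return 2 * (tc.base_seconds + tc.increment_seconds * moves_per_player)


bullet = parse_time_control("60+0")
print(bullet, clock_time_seconds(bullet, moves_per_player=40))
```

For 60+0 the upper bound is 120 seconds no matter how many moves are played, which matches the two minutes mentioned above. With an increment, the bound grows with the number of moves.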

Now it would be interesting to go deeper and see what games we have with different increments. Is there a popular increment? There are multiple ways to find out. The first is a wildcard search on the field `timecontrol`. Entering the KQL query `NOT timecontrol: *+0` in the search bar quickly filters down to all time controls that use an increment. The most popular time control with an increment is 180 seconds + 2 seconds per move.
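To make the wildcard logic concrete, here is a small Python sketch that mimics the filter on a handful of values. The sample list is invented for illustration and does not reflect real Lichess counts, and `fnmatchcase` only stands in for the KQL wildcard.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical sample of timecontrol values, not real Lichess data.
sample = ["60+0", "180+2", "300+0", "180+2", "600+5", "180+0", "300+3", "180+2"]

# Stand-in for NOT timecontrol: *+0, keeping values that do not end in "+0".
with_increment = [tc for tc in sample if not fnmatchcase(tc, "*+0")]

# The most frequent remaining value is the most popular increment game.
most_common = Counter(with_increment).most_common(1)
print(with_increment)
print(most_common)
```

Note that a value such as 60+10 is kept, because it ends in 10 rather than in +0.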

The second option is to use runtime fields: dissect the `timecontrol` value and expose its two parts as the fields timecontrol_base and timecontrol_increment. Don’t forget to set the type to long.

timecontrol_base:

// Capture the part before the + sign; the part after it is discarded.
String custom = dissect('%{custom}+%{}').extract($('timecontrol', ''))?.custom;
if (custom != null) {
    emit(Long.parseLong(custom));
}


timecontrol_increment:

// Capture the part after the + sign; the part before it is discarded.
String custom = dissect('%{}+%{custom}').extract($('timecontrol', ''))?.custom;
if (custom != null) {
    emit(Long.parseLong(custom));
}


Don’t forget to use the formatter at the bottom and select duration, then seconds. This creates a human-readable interpretation!
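If you would rather define the two runtime fields in the index mapping than in the data view, the scripts can be wrapped in a `runtime` section. The Python sketch below only builds and prints such a request body. The helper name `runtime_field` is made up for this example, and the body is meant to be sent with a `PUT <your-index>/_mapping` request, where the index name is yours to fill in since this post does not state it.

```python
import json


def runtime_field(pattern: str) -> dict:
    """Build a long-typed runtime field that dissects timecontrol with the given pattern."""
    source = (
        "String custom = dissect('" + pattern + "')"
        ".extract($('timecontrol', ''))?.custom; "
        "if (custom != null) { emit(Long.parseLong(custom)); }"
    )
    return {"type": "long", "script": {"source": source}}


# Request body for the mapping update; only the dissect pattern differs per field.
mapping_body = {
    "runtime": {
        "timecontrol_base": runtime_field("%{custom}+%{}"),
        "timecontrol_increment": runtime_field("%{}+%{custom}"),
    }
}

print(json.dumps(mapping_body, indent=2))
```

Treat this as a starting point rather than a tested configuration. The duration formatter mentioned above is a Kibana data view setting and is not part of this mapping.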

The alternative is to use the include and exclude options within Lens itself, which support regular expressions. I want to exclude all +0 games and only see increments greater than 9 seconds. Include values: `\d*\+\d{2,}` and exclude values: `\d*\+0`. In a regex, `\d` represents any digit, and `*` means zero or more of the preceding token. `\+` matches a literal + sign; the backslash is needed because + is a reserved character. Finally, `\d{2,}` matches at least two digits.
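You can try both patterns outside Kibana with a few lines of Python. This is a sketch under two assumptions: the patterns have to match the whole value, which is why it uses `fullmatch`, and Python's `re` module behaves like the Lens regex engine for these simple patterns. The sample values are invented.

```python
import re

INCLUDE = re.compile(r"\d*\+\d{2,}")  # increment of at least two digits
EXCLUDE = re.compile(r"\d*\+0")       # no increment at all


def keep(value: str) -> bool:
    """Keep a value if it matches the include pattern and not the exclude pattern."""
    return INCLUDE.fullmatch(value) is not None and EXCLUDE.fullmatch(value) is None


# Hypothetical timecontrol values for illustration.
sample = ["60+0", "180+2", "600+10", "900+15", "1800+30", "-"]
kept = [value for value in sample if keep(value)]
print(kept)  # ['600+10', '900+15', '1800+30']
```

In this sketch, the include pattern alone already drops the +0 values, because a single 0 does not satisfy the two-digit requirement.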

Summary

In this blog post, we started our first analyses of the Lichess data and discovered some interesting facts about the game type. We recognized a trend in the number of games played, looked into the different time control values, and learned the different ways to filter the data.

Ready to get started? Begin a free 14-day trial of Elastic Cloud. Or download the self-managed version of the Elastic Stack for free.