Elo Win Probability Calculator

Step 1

Enter player ratings or pick two players from a list. Alternatively, enter an Elo difference or an expected score (and a draw probability for Chess). Dependent variables will automatically adjust.

Choose a rating source:
  - (players rated 2400 or better, source)
  - (source)
  - (source)
  - (players rated 4d or better, source)1,3
  - (players rated 4d or better, source)2,3
  - Go (pros + EGF + AGA)4
  - (source)5
  - (source)5

Rating 1:
Rating 2:
First to move (only has an effect for Chess):
Elo difference:
Elo formula:
Player 1 expected score: (for one game, or one set in Tennis)6
Draw probability: (see note 7)

Step 2

Enter match details to obtain the probabilities of winning a match.

Competition format: (with an optional "by a margin of two" setting, which requires the draw probability to be zero)

One-click settings:

Current score:

Result

Outcome | Probability
player 1 win
player 2 win
draw

Notes

  1. EGF ratings are not Elo ratings, but they can be converted to an Elo scale (logistic distribution).
  2. AGA ratings are not Elo ratings, but they can be converted to an Elo scale (normal distribution).
  3. For EGF and AGA you can type "rank 4d" as the name for a generic 4d, "rank 10k" for a generic 10k, etc.
  4. Every Go source is converted to the EGF scale and merged. For pros, both goratings and mamumamu are taken into account.
  5. Tennis ratings from Tennis Abstract are Elo ratings (logistic distribution), but for a best-of-3 match. The rating difference is converted to an Elo difference for a set in Step 1, so that we get back probabilities for a match in Step 2.
  6. The expected score is the win probability plus half of the draw probability.
  7. For Chess, the draw probability is estimated from Rating 1 and Rating 2 and the assumption that draw odds advantage is worth 0.6 pawns. The resulting draw probabilities agree quite well with the data on this page.

About the Elo scale

The starting point of the Elo rating system is a curve mapping rating differences to expected scores (which are the same as win probabilities when there are no draws). Unfortunately, there is disagreement over which curve should be used. As an example, in a game with no draws, if player A has an 80% chance of beating B, and B has an 80% chance of beating C, then what is the probability of A beating C? If you think that the answer is approximately 94.1%, then you're in the "logistic distribution" camp. If you think that the answer is approximately 95.4%, then you're in the "normal distribution" camp.

[Figure: the logistic and normal Elo curves]

Both are justified mathematically: In our example, the chances of winning are in a ratio of 4:1 between A and B and in a ratio of 4:1 between B and C, so it's reasonable to expect a ratio of 16:1 between A and C, giving a probability of A beating C of 16/17 ≈ 94.1%, the logistic distribution value. Another approach is to assume that, for small ε, if A has a (50+ε)% chance of beating B, and if B has a (50+ε)% chance of beating C, then A has a (50+2ε)% chance of beating C. Now, whatever game A, B and C are playing, best-of-n is a meta-game that the players can also play. If we find n such that (50+ε)% becomes 80% in a best-of-n, then (50+2ε)% becomes approximately 95.4% in a best-of-n, the normal distribution value.
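
Both camps' numbers are easy to verify numerically. Here is a minimal Python sketch (the rating-scale constant cancels out in this example, so the standard 400-point logistic scale and a standard normal distribution suffice):

from statistics import NormalDist
import math

N = NormalDist()  # standard normal distribution

p_ab = 0.80  # P(A beats B) = P(B beats C)

# Logistic camp: odds multiply, 4:1 times 4:1 = 16:1.
d = 400 * math.log10(p_ab / (1 - p_ab))   # Elo difference giving 80% (logistic)
print(1 / (1 + 10 ** (-2 * d / 400)))     # 0.9411... = 16/17 ≈ 94.1%

# Normal camp: rating differences add on the z-score scale.
z = N.inv_cdf(p_ab)                       # z-score giving 80% (normal)
print(N.cdf(2 * z))                       # 0.9538... ≈ 95.4%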

The normal distribution was apparently Arpad Elo's original suggestion, but it is quite harsh on the underdog and many people claim that the logistic distribution works better in practice, so what should we use? I don't know. This page automatically selects the distribution recommended by the rating source, but it lets you override it.


Elo per pawn in chess

In chess, material odds are a form of handicap where some of the stronger player's pieces are removed from the starting position. A chess player is "one pawn stronger" than another if he can give a pawn and still have an expected score of 0.5. A ranking system based on material odds is perfectly possible: the strength of a player could be described by the material odds needed against a chess master, and in fact that was how chess strength was described in the mid-19th century, according to the history section of Handicap (chess).

How do we translate between Elo and material odds? The first thing to realize is that the number of Elo points equivalent to one pawn varies with the strength of the players. Some estimates by GM Larry Kaufman are given in the "Rating equivalent" section of the article linked above. More data points are given by matches played with material odds or draw odds between two human players or between a human and a computer, some of which are discussed at the end of the "History" section. I fitted a curve as best I could; below is a graph of the result. The calculator applies this curve to display material odds when given two Elo ratings.

[Figure: fitted curve of Elo points per pawn as a function of rating]

Draws in chess

Note 7 above says "For Chess, the draw probability is estimated from Rating 1 and Rating 2 and the assumption that draw odds advantage is worth 0.6 pawns". In this section, I elaborate on what this means.

Prerequisite functions

Let's give names to the two functions corresponding to the two plots above. The first function is from FIDE; the second is my own.

eloNormal(eloDiff) = erfc(-eloDiff / ((2000.0/7) * sqrt(2))) / 2
eloPerPawnAtElo(elo) = exp(elo/1020) * 26.59

Calculation

The inputs are rating1 and rating2, where rating1 ≤ rating2.

diff = rating1 - rating2
ave = (rating1 + rating2) / 2
expected_score = eloNormal(diff)

We compute what an advantage of 0.6 pawns corresponds to in Elo:

eloPerPawn = eloPerPawnAtElo(ave)
eloShift = eloPerPawn * 0.6

We consider the chess variant where a draw counts as a loss for player1 by penalizing rating1 by eloShift:

player1_win_probability = eloNormal(diff - eloShift)

Remember that the expected score is the probability of winning plus half of the probability of drawing, so:

draw_probability = (expected_score - player1_win_probability)*2
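
Putting the pieces together, here is a runnable Python transcription of this calculation (the calculator itself is more elaborate; see the Remarks below):

import math

def elo_normal(elo_diff):
    # Expected score for an Elo difference (normal distribution, FIDE's curve).
    return math.erfc(-elo_diff / ((2000.0 / 7) * math.sqrt(2))) / 2

def elo_per_pawn_at_elo(elo):
    # Fitted curve: Elo points per pawn at a given rating level.
    return math.exp(elo / 1020) * 26.59

def chess_probabilities(rating1, rating2):
    assert rating1 <= rating2  # required ordering, see Remark 2 below
    diff = rating1 - rating2
    ave = (rating1 + rating2) / 2
    expected_score = elo_normal(diff)
    # The draw odds advantage is assumed to be worth 0.6 pawns.
    elo_shift = elo_per_pawn_at_elo(ave) * 0.6
    # Variant where a draw counts as a loss for player 1.
    player1_win_probability = elo_normal(diff - elo_shift)
    draw_probability = (expected_score - player1_win_probability) * 2
    return player1_win_probability, draw_probability

print(chess_probabilities(2000, 2400))  # (0.029872..., 0.101770...), as in the Example below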

Example

rating1 = 2000
rating2 = 2400

diff = -400
ave = 2200
expected_score = eloNormal(-400) = 0.080757

eloPerPawn = eloPerPawnAtElo(2200) = 229.843
eloShift = 229.843 * 0.6 = 137.906

player1_win_probability = eloNormal(-400 - 137.906) = 0.029872
draw_probability = (0.080757 - 0.029872) * 2 = 0.101770

Remarks

  1. The numbers from the example don't exactly correspond to the output of the calculator because the calculator is more complicated: it also models White's first-move advantage, and it averages the cases firstMove=player1 and firstMove=player2.
  2. The formula requires that ratings are ordered so that rating1 ≤ rating2. If you try rating1 = 2400 and rating2 = 2000, then you'll get a wrong result because the Elo penalty -eloShift must have the same sign as the Elo difference diff.

Elo per stone in Go

Handicap stones in Go are the rough equivalent of material odds in Chess, since both are modifications of the starting position that give both players equal chances of winning. The ranking system based on handicap stones is the dan/kyu system, which is the preferred way to communicate Go strength for amateurs. Pandanet provides a table giving the correct (fair) handicap for each difference in rank. To make the scale absolute, we traditionally define 7 dan to be the (amateur) rank of a newly accepted professional player. (This is enforced by the EGF, but other rankings don't enforce it, so ranks can differ slightly.)

A rating system that reports a dan/kyu rank must internally map a rank difference in stones to a win probability (or equivalently to an Elo difference). Unfortunately, I believe that the mappings used by both the AGA and the EGF are very unrealistic, which is a real shame because other parts of their rating systems are very polished. The worst offender is the AGA, which postulates a constant ~270 Elo per stone; this is clearly too much for weak players. The EGF system is better because it attempts to make the value rank-dependent, but the curve used is not ambitious enough and is not a good fit to the EGF's own winning statistics. To fix this, I created the source "Go (pros + EGF + AGA)", which I recommend over the individual EGF or AGA sources for your Go probability needs. In this source I also added ratings for AlphaGo and DeepZenGo based on their recent results, although in the case of AlphaGo we don't really know its true strength.

Below is a graph plotting the AGA's constant, the EGF's curve, and my proposed homemade curve against actual EGF game statistics from 2006-2015. The actual statistics behave strangely below about 12 kyu, as noted by Geoff Kaniuk when also attempting to fit a model. I think this happens because the minimum EGF rank is artificially set to 20 kyu, which distorts the winning statistics at the lower end, so we can safely ignore this range. Also, the goal shouldn't be to blindly fit the experimental data: in practice there is some error between a player's rating and their true rank, and such noise in the ratings pushes the experimental curves lower than the curves we would obtain if the players' ratings had time to converge to their true values.

[Figure: Elo per stone: the AGA constant, the EGF curve, my curve, and actual EGF statistics (2006-2015)]

Below I've collected other estimates of the number of Elo points per stone that I could find or derive. I took some of these estimates into account when designing my curve. It's hard to say how reliable each of these is.

Elo per stone | Source
~50 | Pandanet IGS Ranking System. Points awarded imply values in the range 44-56.
226 for 2d+; 148 for 30k-5k | KGS Rating Math.
~160 (figure axis) or 230 (caption) | AlphaGo Nature paper, Figure 4a, citing KGS.
~187 for 6-10d KGS; 153 for 2-6d KGS | AlphaGo Nature paper, Extended Data Table 6. From the Elo differences of CrazyStone, Zen and Pachi with 4 handicap stones and komi. Assuming that White is getting the komi, this is the correct handicap for a 3-stone difference in strength, so the raw Elo differences were divided by 3.
191 | Opening Keynote: "The Story of Alpha Go". "One stone stronger is about 75%". This converts to invEloLogistic(0.75) ≈ 191 (function available on the JavaScript console).
302 for pros | Komi statistics. "A change of 1 point in the value of komi would produce a change of 3.1% in the percentage of games won by black." Assuming that one stone is worth 14 points, some bold extrapolation gives invEloLogistic(0.5 + 0.031) * 14 ≈ 302 (function available on the JavaScript console).
~infinite for ~13d | Hand of God. "Most estimates place God three ranks above top professionals." The EGF scale sets 1p = 7d and 9p = 9.4d, so top pros might be close to 10d.
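
The invEloLogistic function referenced twice in the table inverts the logistic Elo formula (win probability to Elo difference). A Python equivalent of that helper, assuming the standard 400-point logistic scale, reproduces both conversions:

import math

def inv_elo_logistic(p):
    # Elo difference corresponding to win probability p (logistic distribution).
    return -400 * math.log10(1 / p - 1)

print(inv_elo_logistic(0.75))              # 190.848... ≈ 191
print(inv_elo_logistic(0.5 + 0.031) * 14)  # 301.98... ≈ 302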

Sadly, the AlphaGo video claims that a single constant value works across all levels. My guess is that the AlphaGo team only deals with professional-level strength, so they never encountered any problem with using a constant. The IGS value may seem out of place, but the median player strength on IGS is about 4 kyu, and 50 Elo per stone is a good match for that level according to my curve.


Effect of best-of-n on Elo

Let's take tennis as our setting. If the probability of winning a set is p, then the probability of winning a best-of-n match can be obtained by summing the tail of a binomial distribution. The result is a function mapping a probability to another probability. We will call this function best_of_n, where n is an odd number.

best_of_n(p) = sum over k from (n+1)/2 to n of C(n,k) * p^k * (1-p)^(n-k)

Remember that the Elo formula (the first graph of the page) converts an Elo difference to a probability. By sandwiching the best_of_n function between the Elo formula and its inverse, we can observe the effect of a best-of-n match in units of Elo.

new_elo_diff = elo_formula⁻¹(best_of_n(elo_formula(elo_diff)))

To explain what this means, suppose we run a rating system "A" based on the results of individual sets, and another rating system "B" based on the results of best-of-n matches. If the Elo difference between two tennis players is elo_diff in rating system A, then we expect their Elo difference to be new_elo_diff in rating system B.
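
A minimal Python sketch of this sandwich, using the logistic Elo formula:

import math

def elo_logistic(d):
    # Logistic Elo formula: Elo difference -> expected score.
    return 1 / (1 + 10 ** (-d / 400))

def inv_elo_logistic(p):
    # Inverse of the logistic Elo formula.
    return -400 * math.log10(1 / p - 1)

def best_of_n(p, n):
    # Tail of a binomial distribution: win at least (n+1)/2 of n sets.
    wins_needed = (n + 1) // 2
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(wins_needed, n + 1))

def new_elo_diff(d, n):
    return inv_elo_logistic(best_of_n(elo_logistic(d), n))

print(new_elo_diff(1, 3))    # 1.4995..., close to the 3/2 ratio derived below
print(new_elo_diff(400, 3))  # about 649, not 600: the effect is not a pure scaling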

The effect of best-of-n is close to multiplying elo_diff by a constant depending on n, but not quite. Below, we plot the ratio new_elo_diff / elo_diff for two best-of values (3 and 5) and for two Elo formulas (logistic and normal).

[Plot: the ratio new_elo_diff / elo_diff for best-of-3 and best-of-5, logistic and normal Elo formulas]

For elo_diff close to zero, the ratio is 3/2 for best-of-3, and 15/8 for best-of-5. These values can be derived by looking at the linear term of the Taylor expansion of the function best_of_n(p) at p=1/2.

best_of_ratio(n) = best_of_n'(1/2) = (sum over k from (n+1)/2 to n of (2k - n) * C(n,k)) / 2^(n-1)

This simplifies to the following expression, where the numerator is the integer A002457((n-1)/2).

best_of_ratio(n) = n * C(n-1, (n-1)/2) / 2^(n-1) = A002457((n-1)/2) / 2^(n-1)
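
A quick numeric check of this expression in Python:

import math

def best_of_ratio(n):
    # Slope of best_of_n(p) at p = 1/2.
    return n * math.comb(n - 1, (n - 1) // 2) / 2 ** (n - 1)

print(best_of_ratio(3))  # 1.5   = 3/2
print(best_of_ratio(5))  # 1.875 = 15/8
# The numerators n * C(n-1, (n-1)/2) are A002457((n-1)/2): 1, 6, 30, 140, ...
print([n * math.comb(n - 1, (n - 1) // 2) for n in (1, 3, 5, 7)])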

It is unfortunate that the effect of best-of-n on Elo is not simply a scaling. When the calculator transforms Elo for tennis, it uses the formula for new_elo_diff given above. This works because only 2 players are involved. If there are 3 players or more with different Elo ratings, then it is impossible to map these Elo ratings to new ratings such that the formula giving new_elo_diff holds for all pairs of players, but a good approximation is to scale Elo by best_of_ratio(n). Another reasonable option is to estimate the average absolute Elo difference between players, to look up the corresponding multiplier in the graph above, and to scale Elo by this multiplier.

In the graph, we note that the curves using the normal distribution are flatter than those using the logistic distribution. This indicates that Elo with the normal distribution is a better model for games which are best-of-n of independent sub-games.

For comparison, another match format is to play games until a player wins by a specified point-margin. This match format has the disadvantage that there is no bound on the number of games that will be played. The probability of winning by an n-point margin is equivalent to the gambler's ruin problem with an unfair coin and is given by the equation below. This match format is a perfect fit for the logistic Elo formula, as shown by the plot below.

margin_n(p) = p^n / (p^n + (1-p)^n)

[Plot: the ratio new_elo_diff / elo_diff for the win-by-margin format, logistic and normal Elo formulas]
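
This exact fit is easy to check: composing margin_n with the logistic Elo formula multiplies the Elo difference by exactly n. A short Python sketch, using a margin of two as in the calculator's format option:

import math

def elo_logistic(d):
    return 1 / (1 + 10 ** (-d / 400))

def inv_elo_logistic(p):
    return -400 * math.log10(1 / p - 1)

def margin_n(p, n):
    # Gambler's ruin: probability of getting n games ahead first,
    # when each game is won independently with probability p.
    return p**n / (p**n + (1 - p) ** n)

for d in (10, 100, 400):
    print(inv_elo_logistic(margin_n(elo_logistic(d), 2)) / d)  # 2.0 every time (up to rounding)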

To illustrate what the normal Elo formula is really optimizing for, below is a similar graph for the function best_of_237 ∘ (best_of_105)⁻¹. The strange numbers are simply large numbers such that best_of_ratio(237) / best_of_ratio(105) ≈ 1.500388, which is close to 1.5. In other words, if the game we're playing is already a best-of-105 of something more basic, then instead of doing a best-of-3 of best-of-105, how about changing the game to a best-of-237 of basic elements? This is intuitively a more accurate procedure, so it makes sense that the normal Elo formula is optimized for that. This explains why the normal Elo formula is not a perfect fit for best-of-n when n is small.

[Plot: the ratio new_elo_diff / elo_diff for best_of_237 ∘ (best_of_105)⁻¹]
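
Using exact rational arithmetic, the best_of_ratio formula derived earlier confirms the constant:

import math
from fractions import Fraction

def best_of_ratio(n):
    # Exact rational slope of best_of_n(p) at p = 1/2.
    return Fraction(n * math.comb(n - 1, (n - 1) // 2), 2 ** (n - 1))

print(float(best_of_ratio(237) / best_of_ratio(105)))  # 1.500388...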

Page created: September 29, 2016
Page updated: January 24, 2021 (fixed a broken link, added a new section "Draws in chess")
Page updated: March 28, 2021 (added a new section "Effect of best-of-n on Elo")
Page updated: April 2, 2021 (for Tennis, adjusted calculator to the fact that source ratings are best-of-3, and not a mix)

back to François Labelle's homepage