Bayesian table tennis: Building a hierarchical table tennis ranking model

Here on the development team, we are big fans of iterative development for our software, for our ML models and especially for our office table tennis ranking system! In the past, our team wrote about Building A Table Tennis Ranking Model using the Bradley-Terry model and Google sheets. The model gives a single rating for each player (not unlike the Elo rating system in Chess). One drawback of this model is that it has no measure of “confidence”. For example, a new player could have a single lucky game against the reigning champion and suddenly soar to the top of the rankings, but should we be that confident in their ranking? Enter Bayesian Hierarchical Modelling.

Hierarchical Bayesian models

Many of the parametric models you know and love (such as linear regression or MLPs), use a maximum likelihood estimate (MLE) or maximum a posterior (MAP) to estimate the parameters $\theta$ of a model with observations $x$ :

θ ̂ M L E θ ̂ M A P = arg max θ \in Θ { (θ, x)} = arg max θ \in Θ { (θ, x) + g (θ)} (1)

$\begin{align} \hat{\theta}_{MLE} &= \underset{\theta \in \Theta}{\arg\max}\{\mathcal{L}(\theta, x)\} \\ \hat{\theta}_{MAP} &= \underset{\theta \in \Theta}{\arg\max}\{\mathcal{L}(\theta, x) + g(\theta)\}\\ \tag{1} \end{align}$

where $\mathcal{L}(\theta, x)$ is the log-likelihood function and regularizer (or prior) $g(\theta)$ on the parameters. In the case of our Bradley-Terry model, the parameters $\hat{\theta}$ are the estimates for our players’ ratings. The nice thing about this formulation is that we can apply your favorite optimization technique (e.g. gradient descent) to find the parameters that maximize Equation 1. The downside is that we only get a point estimate of the parameters, there’s no concept of how “confident” we are in the estimates. Or another way to put it: there’s no measure of uncertainty.

Bayesian hierarchical modelling takes a slightly different philosophical approach. Every parameter of interest in the model is a random variable with its own distribution. This means we can ask our usual questions of a random variable such as: what’s the mean? median? standard deviation? $P(a < \theta < b)$ ? With a distribution for each parameter, we can now quantify the uncertainty of our estimate in any way we wish. Cool right?

Bayesian modelling always starts off with a statement of Bayes theorem describing the relationship between the parameters of interest and the data:

P (θ | x) = P ( x | θ ) P ( θ ) P ( x ) = likelihood \cdot prior marginal likelihood (2)

$\begin{align} P(\theta|x) = \frac{P(x|\theta)P(\theta)}{P(x)} = \frac{\text{likelihood}\cdot\text{prior}}{\text{marginal likelihood}}\tag{2} \end{align}$

The likelihood and prior (aka regularizer) from our MAP estimate above are included there along with a constant factor in the denominator. The MAP estimate attempts to find a point-estimate that maximizes the RHS of Equation 2, while a full Bayesian analysis attempts to find the distribution specified by $\theta$ . The biggest downside of doing a fully Bayesian analysis is that it’s hard to compute. In general, there is no closed form solution and we must rely on approximation methods such as variational methods or MCMC methods that generate samples from the distribution.

Once we have a distribution for each parameter, we can generate the credible interval (or region for multi-variate distributions) which involves finding an interval $(a,b)$ such that $P(a<\theta<b)=1-\alpha$ . For example, we might want to find a 50% credible interval for each player’s rating.

A Bayesian ranking model

To define our Bayesian hierarchical model, we need to specify the likelihood and prior functions from Equation 2 (the marginal likelihood is a constant so we don’t need to specify it). We’re going to follow the Bradley-Terry model, where we assume that the probability of player $i$ beating player $j$ is:

P (i > j) = p i p i + p j (3)

$P(i > j) = \frac{p_i}{p_i + p_j} \tag{3}$

where $p_i$ is the rating for player $i$ . With this idea, we can define our likelihood function by modelling with player $i$ having $w_{i,j}$ wins against player $j$ in $N_{i,j}$ games as a binomial distribution. Additionally, we’re going to take the easy route and assume a flat prior with support $[0,1]$ . Putting it all together with observations $x$ (all the wins and loses):

P (x | p 1, \dots, p K) P (p 1, \dots, p K) = \sum i = 1 N \sum j = 1 i (N i , j w i , j) (p i p i + p j) w i, j (1 - p i p i + p j) N i, j - w i, j = \sum i = 1 N const (4)

$\begin{align} P(x|p_1,\ldots,p_K) &= \sum_{i=1}^N  \sum_{j=1}^i \binom{N_{i,j}}{w_{i,j}}\big(\frac{p_i}{p_i + p_j}\big)^{w_{i,j}}\big(1-\frac{p_i}{p_i + p_j}\big)^{N_{i,j}-w_{i,j}} \ \\ P(p_1,\ldots,p_K) &= \sum_{i=1}^N \text{const} \ \tag{4} \end{align}$

In this case, we’re assuming that each player’s rating has equal probability of being any possible rating in $[0,1]$ . In certain cases, this might cause some issues fitting the model (especially if we’re trying to compute a MAP estimate) but we’ll be using the data regularizing scheme from our original Rubikloud post Building A Table Tennis Ranking Model. The added benefit here is that the MAP estimate for the models are identical.

In recent years, Bayesian statistical packages have come a long way. There are simple interfaces with heavily optimized engines to fit Bayesian models. The most popular Bayesian statistics package is Stan, but we ended up using PyMC3 because we like the native Python interface. For a simple model like this, there’s not much difference in performance so usability is the main concern.

The nice thing about these packages is that you can take Equation 4 and basically translate it directly, here’s a code snippet of our model in PyMC3:

The official Rubikloud/Kinaxis table tennis rankings V2

We're happy to announce that we've since moved on to V2 of our table tennis rankings. To simplify the UI, instead of plotting each individual’s distribution, we simply show each person’s 50% credible interval centered on the MAP estimate:

Table Tennis Rankings Chart

The scores are put on a logarithmic scale and scaled from 1 to 1000. Visually it looks like this:

Table Tennis Whisker Chart

As you can see from this past example, some players had a very wide range while others had a very tight range. This shows the “confidence” the model has in the rating. Since this time, the changes were well-received, with players consistently showing more interest in trying to move up the rankings.

Editor's note: This blog is part of a series originally published on Rubikloud's blog, "kernel." Kinaxis acquired Rubikloud in 2020.

Bayesian table tennis: Building a hierarchical table tennis ranking model

Hierarchical Bayesian models

A Bayesian ranking model

The official Rubikloud/Kinaxis table tennis rankings V2

Leave a Reply