
Simple Python implementation of a Bayesian multi-armed bandit algorithm


Multi-armed bandit

In the multi-armed bandit problem we try to maximize our gain over time by "gambling on slot-machines" that have different (but unknown) expected outcomes.

The concept is typically used as an alternative to A/B-testing in marketing research or website optimization. The benefit of framing website optimization as a multi-armed bandit problem instead of an A/B-test is that no pre-defined sample sizes are needed and the algorithm starts optimizing the outcome (e.g. click rate) from the beginning, while an A/B-test needs to collect all of its predefined samples before a conclusion can be drawn.

The following post will illustrate how to implement and solve a very simple multi-armed bandit problem with a probabilistic algorithm.

In [1]:

%matplotlib inline

import sys
import numpy as np
import scipy.stats as stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')
np.random.seed(42)
# Print versions used
print('Python: {}.{}.{}'.format(*sys.version_info[:3]))
print('numpy: {}'.format(np.__version__))
print('matplotlib: {}'.format(matplotlib.__version__))
print('seaborn: {}'.format(sns.__version__))
#
Python: 3.6.6
numpy: 1.14.5
matplotlib: 2.2.3
seaborn: 0.9.0

Defining the bandit experiment

We will simulate 3 bandits, each with its own underlying probability of winning, stored in p_bandits. The pull(i) function "pulls the arm" of bandit $i$ and randomly decides, based on that underlying probability, whether the pull was a win (1) or not (0).

In [2]:

# Define the multi-armed bandits
nb_bandits = 3  # Number of bandits
# True probability of winning for each bandit
p_bandits = [0.45, 0.55, 0.60]


def pull(i):
    """Pull arm of bandit with index `i` and return 1 if win, 
    else return 0."""
    if np.random.rand() < p_bandits[i]:
        return 1
    else:
        return 0
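
As a quick, purely illustrative sanity check of pull (not part of the original notebook), we could estimate each bandit's win rate by pulling every arm many times and comparing the estimates to p_bandits; the number of check pulls below is an arbitrary choice.

# Illustrative sanity check: estimate each bandit's win rate by repeated pulls.
n_checks = 10000  # arbitrary number of pulls per bandit (assumption)
for i in range(nb_bandits):
    estimate = sum(pull(i) for _ in range(n_checks)) / n_checks
    print(f'Bandit {i}: true p={p_bandits[i]:.2f}, estimated p={estimate:.2f}')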

Bayesian interpretation

Each pull of a specific bandit results in a win with a certain probability. The higher this probability, the more likely pulling that bandit's arm is to result in a win. However, we don't know this probability, so we have to model it based on our observations of whether pulls of a given bandit resulted in a win or not. We model the unknown probability with a parameter $\theta$. After observing a single outcome $x$ we can describe our belief about $\theta$ as the distribution $P(\theta \mid x)$. Using Bayes' rule we can write this as:

$$ P(\theta \mid x) = \frac{P(x \mid \theta)P(\theta)}{P(x)} $$

$P(\theta \mid x)$ is called the posterior distribution of $\theta$ after observing $x$. We can compute it via the conditional distribution (likelihood) $P(x \mid \theta)$ and the prior $P(\theta)$. In this case the conditional distribution $P(x \mid \theta)$ follows the Bernoulli distribution.

A good choice for the prior $P(\theta)$ is the Beta distribution, since it is a conjugate prior to the Bernoulli distribution. This means that if we choose the Beta distribution as a prior, the posterior will also be a Beta distribution. More specifically, if the prior is $Beta(\alpha, \beta)$ then after observing a win ($x=1$) the posterior becomes $Beta(\alpha+1, \beta)$, and after observing a loss ($x=0$) it becomes $Beta(\alpha, \beta+1)$.

This means that after every observation we can use the posterior as the prior for the next time we pull the bandit's arm. A more in-depth illustration can be found here.
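
As a minimal sketch of this conjugate update (using the stats import from above; the outcome sequence is just a made-up example), we increment $\alpha$ on a win and $\beta$ on a loss:

# Illustrative Beta-Bernoulli conjugate update, starting from a uniform prior.
alpha, beta = 1, 1            # Beta(1, 1) = uniform prior
for x in [1, 0, 1, 1]:        # hypothetical sequence of wins/losses
    alpha += x                # win  (x=1): alpha -> alpha + 1
    beta += 1 - x             # loss (x=0): beta  -> beta + 1
posterior = stats.beta(a=alpha, b=beta)
print(f'Posterior: Beta({alpha}, {beta}), mean = {posterior.mean():.2f}')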

Maximize reward / Minimize regret

The goal of the multi-armed bandit problem is to maximize reward (minimize regret). There is an exploitation-exploration tradeoff we have to make here: the more we pull the arm of our perceived best bandit, the more certain we become of that bandit's probability of winning. But other bandits that we haven't pulled as often might have a lower expected probability with higher uncertainty, and there is a chance that they are actually better than our perceived best bandit.

We will use Thompson sampling to deal with this tradeoff. In Thompson sampling we sample, for each bandit, a probability $\theta$ from its current prior, pull the bandit with the highest sampled probability, and repeat this step until finished.

We will start with the prior $Beta(\alpha=1, \beta=1)$, which corresponds to a uniform prior between 0 and 1. The run is simulated for 1000 steps and the results at certain steps are plotted below.
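
To make the sampling step explicit before the full simulation below, here is a minimal sketch of a single Thompson-sampling step; the thompson_step helper is illustrative and not part of the original notebook.

def thompson_step(trials, wins):
    """Illustrative sketch: sample theta from each bandit's Beta posterior
    and pull the bandit with the highest sample."""
    theta_samples = [np.random.beta(1 + w, 1 + t - w)
                     for t, w in zip(trials, wins)]
    i = int(np.argmax(theta_samples))
    return i, pull(i)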

In [3]:

# Define plotting functions
# Iterations to plot
plots = [1, 2, 5, 10, 25, 50, 100, 200, 500, 1000]


def plot(priors, step, ax):
    """Plot the priors for the current step."""
    plot_x = np.linspace(0.001, .999, 100)
    for prior in priors:
        y = prior.pdf(plot_x)
        p = ax.plot(plot_x, y)
        ax.fill_between(plot_x, y, 0, alpha=0.2)
    ax.set_xlim([0, 1])
    ax.set_ylim(ymin=0)
    ax.set_title(f'Priors at step {step:d}')
#

In [4]:

# Simulate multi-armed bandit process and update posteriors

# Setup plot
fig, axs = plt.subplots(5, 2, figsize=(12.0, 11))
axs = axs.flat

# The number of trials and wins will represent the prior for each
#  bandit with the help of the Beta distribution.
trials = [0, 0, 0]  # Number of times we tried each bandit
wins = [0, 0, 0]  # Number of wins for each bandit

n = 1000
# Run the trial for `n` steps
for step in range(1, n+1):
    # Define the prior based on current observations
    bandit_priors = [
        stats.beta(a=1+w, b=1+t-w) for t, w in zip(trials, wins)]
    # plot prior 
    if step in plots:
        plot(bandit_priors, step, next(axs))
    # Sample a probability theta for each bandit
    theta_samples = [
        d.rvs(1) for d in bandit_priors
    ]
    # choose a bandit
    chosen_bandit = np.argmax(theta_samples)
    # Pull the bandit
    x = pull(chosen_bandit)
    # Update trials and wins (defines the posterior)
    trials[chosen_bandit] += 1
    wins[chosen_bandit] += x

plt.tight_layout()
plt.show()

[Figure: posterior Beta distributions of the three bandits at steps 1, 2, 5, 10, 25, 50, 100, 200, 500, and 1000]

In [5]:

# Print the final outcome and the number of trials needed per bandit
empirical_p_bandits = [(1+w) / (1+t) for t, w in zip(trials, wins)]
for i in range(nb_bandits):
    print((f'True prob={p_bandits[i]:.2f};  '
           f'Empirical prob={empirical_p_bandits[i]:.2f};  '
           f'Trials={trials[i]:d}'))
#
True prob=0.45;  Empirical prob=0.43;  Trials=43
True prob=0.55;  Empirical prob=0.47;  Trials=82
True prob=0.60;  Empirical prob=0.60;  Trials=875

Note above that the algorithm quickly converges to selecting the bandit with the highest probability of winning. After 1000 iterations it has pulled the winning bandit (875 trials) more than 10 times as often as either of the other bandits (82 and 43 trials).

This post at peterroelants.github.io is generated from a Python notebook file. Link to the full IPython notebook file.
