

source link: https://chanind.github.io/ai/2022/03/14/inverse-propensity-score-contextual-bandits.html

Understanding Inverse Propensity Score for Contextual Bandits
Mar 14, 2022
One of the hardest concepts to grasp about contextual bandits is understanding how to evaluate a bandit policy without actually deploying it and seeing how it performs with users. Intuitively it seems impossible to know how a new policy will perform looking only at past data because in a bandit problem you can only observe the rewards for an action that was taken. You don’t have any data about the rewards for counterfactual actions that weren’t tried. Inverse Propensity Scoring (IPS) is a simple technique which can solve this problem by giving an unbiased estimate of how a new bandit policy would perform if it were deployed, using only data recorded by a previous, potentially different bandit policy. If none of these terms make sense, don’t worry, we’ll do a quick overview of the contextual bandit problem, and then go in depth on how to use IPS to evaluate policies.
WTF is a “Contextual Bandit”?
Contextual Bandits are probably one of the most bizarrely named concepts in reinforcement learning. The idea is an extension of “multi-armed bandits”, which comes from “one-armed bandit”, an old name for slot machines. Imagine you’re at a casino faced with 4 slot machines. Each machine has a different chance of giving a payout, but you don’t know the chance of winning for each machine. How should you play so as to maximize your winnings? You can spend time testing out each slot machine to get a better sense of the reward for each machine, but then you might be missing out on any potential rewards from just sticking with the machine that seems like it’s the best from what you’ve seen so far.
How should you play each machine to maximize your reward?
For multi-armed bandits, an action corresponds to a possible choice you can make. In the example of 4 slot machines above, there are 4 possible actions, each referring to pulling the lever of one of the 4 slot machines. Furthermore, a policy is an algorithm which determines how you play. Typically a policy is probabilistic, so a policy expresses a probability distribution over the actions it can take. For instance, in the case above you could try a completely random policy and just pull an arm uniformly at random. Or you could try a policy that pulls lever 1 70% of the time and each of the other 3 levers 10% of the time. Or the probabilities can change over time, starting out more random and becoming more deterministic as the policy learns.
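To make this concrete, here’s a minimal sketch in Python of a probabilistic policy over the 4 slot machines (the probabilities are made up purely for illustration):

import random

# A probabilistic policy is just a probability distribution over the 4 arms.
# These numbers are arbitrary, purely for illustration.
policy = {"lever_1": 0.7, "lever_2": 0.1, "lever_3": 0.1, "lever_4": 0.1}

def sample_action(policy):
    # Draw one arm at random according to the policy's probabilities
    actions, weights = zip(*policy.items())
    return random.choices(actions, weights=weights, k=1)[0]

print(sample_action(policy))  # e.g. "lever_1"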
Contextual bandits are a type of multi-armed bandit problem where you have some extra information that might be useful in determining which action to take. For instance, if you have an online store and you want to recommend an item to a user who visits your website, the item you choose to recommend might depend on the age and location of the user. Contextual bandit problems are very common in digital products where you have a set of items that can be shown to a user and the goal is to choose the best item to show that will optimize some metric, for instance the chance the user will buy your product.
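As a rough sketch of the difference, a contextual policy looks at the user’s features before assigning probabilities to each action (the features, items, and numbers here are hypothetical):

def contextual_policy(context):
    # Hypothetical rule of thumb: younger users are more likely to want sneakers.
    # Returns a probability for each item we could recommend, given the context.
    if context["age"] < 30:
        return {"sneakers": 0.7, "dress_shoes": 0.1, "shirt": 0.2}
    return {"sneakers": 0.2, "dress_shoes": 0.5, "shirt": 0.3}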
The counterfactual problem
Let’s say you have an online store with several items, and currently you show those items to your users at random. You know this isn’t optimal though, because you have some extra information about each of your users, like what they last purchased, their age, and their location. If someone just bought a pair of shoes from you, it probably doesn’t make sense to try to show them that same pair of shoes immediately after. You have a log of data from your store for each time you showed an item to a user, and whether or not the user ended up purchasing the item.
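To make the setup concrete, a hypothetical version of that log might look something like this (the fields and values are invented for illustration):

data_log = [
    # (did the user purchase?, item we showed, what we knew about the user)
    (1, "shirt", {"age": 25, "location": "UK", "last_purchase": "shoes"}),
    (0, "shoes", {"age": 41, "location": "US", "last_purchase": "shirt"}),
    (0, "hat", {"age": 33, "location": "DE", "last_purchase": "hat"}),
]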
How can you use this data to try out different policies for showing items to users? For each user you only know what happened after you showed them 1 item; you don’t have any way to know what would have happened if they had seen a different item instead. Maybe user X didn’t buy when you showed them shoes, but maybe they would have if you had shown them a shirt. Or maybe they wouldn’t have purchased anything regardless. You can try to come up with a new policy for how to pick which item to show to users on your website, but how can you tell how that policy would have performed given the data that’s been logged so far?
It seems like this should be impossible, but IPS offers a way to estimate how well any new policy would have performed given the log of which items were shown and how users responded in the past, with some caveats as we’ll see below.
Inverse Propensity Score (IPS)
In order for IPS to work, the policy used to generate the log data must be probabilistic and have a non-zero probability of picking every action that the new policy we want to test might pick. In general, as long as the policy that generates the data never assigns a 0 probability to any action you should be fine. Contextual bandit libraries like Vowpal Wabbit do this automatically. In our data log, alongside every action taken and the reward generated, we also need to record the probability the generating policy assigned to the action it took.
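Concretely, that means that whenever the logging policy acts, we store the probability it assigned to the action it took alongside the reward and context. A minimal sketch of what that recording step could look like (the logging_policy and show_item_and_observe_purchase helpers are hypothetical):

import random

def log_interaction(logging_policy, context, data_log):
    # logging_policy is assumed to return {action: probability} for this context
    action_probs = logging_policy(context)
    actions, weights = zip(*action_probs.items())
    action = random.choices(actions, weights=weights, k=1)[0]
    # show_item_and_observe_purchase is a hypothetical stand-in for showing the
    # item to the user and recording 1 if they purchased it, else 0
    reward = show_item_and_observe_purchase(action)
    # Each entry stores the probability the logging policy assigned to its action,
    # in the (reward, action, probability, context) order the code below expects
    data_log.append((reward, action, action_probs[action], context))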
The idea behind IPS is to replay the log, and weight each reward that shows up inversely to how likely the generating policy was to pick that action. This helps correct for the fact that actions the generating policy selects more often will also show up more frequently in the logs than actions that aren’t selected often. The adjusted reward is then either kept as-is if the policy we’re testing selects the same action as the generating policy given the context, or set to 0 if it selects a different action. Finally, these results are averaged to give the expected reward of the policy we’re testing. In Python, this would look something like the following:
def ips_estimate_avg_reward(new_policy, data_log):
    total_reward = 0
    for (reward, action, probability, context) in data_log:
        # Ask the new policy what it would have done given this context
        new_action, new_probability = new_policy(context)
        if new_action == action:
            # Weight the observed reward inversely to how likely the
            # logging policy was to pick this action
            total_reward += reward / probability
    return total_reward / len(data_log)
Not bad for 7 lines of code! Note that the new_probability generated by our new policy isn’t needed for IPS, but if we were to deploy this policy we’d want to record it in the data log so we could continue running IPS estimates of policies in the future.
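As a quick illustration of how this could be used (the candidate policy and log entries are made up), we can estimate a candidate policy against a tiny toy log:

def always_show_shirt(context):
    # A deterministic candidate policy: always picks "shirt" with probability 1
    return "shirt", 1.0

toy_log = [
    # (reward, action, probability under the logging policy, context)
    (1, "shirt", 0.5, {"age": 25, "location": "UK"}),
    (0, "shoes", 0.5, {"age": 41, "location": "US"}),
]

# Only the first entry matches the candidate policy's action,
# so the estimate is (1 / 0.5 + 0) / 2 = 1.0
print(ips_estimate_avg_reward(always_show_shirt, toy_log))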
Improving on IPS
IPS isn’t perfect, however. While it is an unbiased estimator of the expected reward of a new policy, it can have high variance. This is especially true if the policy used to generate the data and the policy being tested are very different. IPS gives 0 reward for every case where the test policy and the generating policy select different actions, so if the policies have little overlap it will require a lot of data before IPS gives good estimates. There are other, more complicated estimators that handle this better than IPS, such as Doubly Robust estimation or Importance-Weighted Regression.
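As a rough sketch of the doubly robust idea (not a full implementation; reward_model here is a hypothetical regression model trained to predict reward from a context and action), the estimator falls back on the model’s prediction when the policies disagree, and applies an IPS-style correction only where they agree:

def dr_estimate_avg_reward(new_policy, data_log, reward_model):
    total = 0
    for (reward, action, probability, context) in data_log:
        new_action, _ = new_policy(context)
        # Baseline: the model's predicted reward for the new policy's action
        estimate = reward_model(context, new_action)
        if new_action == action:
            # IPS-style correction for the model's error where the actions match
            estimate += (reward - reward_model(context, action)) / probability
        total += estimate
    return total / len(data_log)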
All these techniques are implemented in the also bizarrely named but otherwise excellent Vowpal Wabbit library.