
Using Bayesian Games to Address the Exploration-Exploitation Dilemma in Deep Learning


Artificial intelligence (AI) agents often operate in environments with partial or incomplete information. In those settings, agents are forced to find a balance between exploring the environment and taking actions that yield an immediate reward. The exploration-exploitation dilemma is one of the fundamental frictions in modern AI systems, particularly in heterogeneous, multi-agent environments. Recently, researchers from Microsoft published a paper proposing a method based on Bayesian incentives to find an adequate exploration-exploitation balance in multi-agent AI systems.

The friction between exploration and exploitation is at the core of the design of heterogeneous AI systems. In AI theory, it is often explained using the famous multi-armed bandit problem, in which a fixed, limited set of resources must be allocated between competing (alternative) choices in a way that maximizes the expected gain, when each choice's properties are only partially known at the time of allocation and may become better understood as time passes or as resources are allocated to that choice. Here is a textbook example:

In a casino, a one-armed bandit is a slot machine. An n-armed bandit has n levers. A gambler can insert a coin, pull a lever and collect the winnings (if any). At each step, the gambler must decide which lever to play with the next coin: the one with the best known payoff, or one that has not been tried yet? The first strategy maximizes the reward based on the gambler's current knowledge of the environment, at the risk of never finding an optimal strategy; the second (exploring new levers) improves long-term knowledge of the environment but sacrifices immediate rewards.


In this problem, each machine provides a random reward drawn from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine with the highest expected payoff and "exploration" to learn more about the payoffs of the other machines.
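To make the tradeoff concrete, here is a minimal sketch (not from the Microsoft paper) of an epsilon-greedy gambler facing a multi-armed bandit: with probability epsilon it explores a random lever, otherwise it exploits the lever with the best estimated payoff. The payoff values and the epsilon setting below are illustrative assumptions.

```python
import random

def epsilon_greedy_bandit(true_means, n_pulls=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy gambler on a multi-armed bandit.

    true_means: hidden expected payoff of each lever (unknown to the gambler).
    epsilon: probability of exploring a random lever instead of exploiting.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms            # how many times each lever was pulled
    estimates = [0.0] * n_arms       # running average reward per lever
    total_reward = 0.0

    for _ in range(n_pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)                   # noisy payoff
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward, estimates

if __name__ == "__main__":
    # Three levers with hidden payoffs; the gambler must discover the best one.
    reward, est = epsilon_greedy_bandit([0.3, 0.5, 0.9])
    print(f"total reward: {reward:.1f}, estimated payoffs: {[round(e, 2) for e in est]}")
```

With epsilon set to zero, the gambler locks onto whichever lever looked best early on; with a small positive epsilon, it keeps sampling the alternatives and eventually identifies the best lever.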

The exploration-exploitation dilemma is directly related to the agents' incentives in a specific AI environment. Information exploiters seek out the best choice given the information available to date. Information explorers, on the other hand, are willing to try a lesser known, or even unknown, option for the sake of gathering more information. But an explorer does this at the real risk of a negative experience. Naturally, incentives in heterogeneous, multi-agent AI systems are skewed in favor of exploitation, as those outcomes and rewards are well known.

Another way of thinking about solving the exploration-exploitation dilemma is to explicitly incentivize exploration. However, exploration incentives come with their own set of challenges. In real-world AI environments, providing monetary incentives can be financially or technologically unfeasible, while relying on voluntary exploration can lead to selection biases. Think about an AI system with a million agents in which each agent is incentivized to rate a particular movie. If we use an economic incentive model, say paying each agent one cent per rating, we will quickly run out of resources. Alternatively, if we rely on agents voluntarily rating the movies, then the ratings are likely to come only from the subset of the population that really likes that particular movie.
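The selection-bias problem can be illustrated with a small simulation (the participation probabilities and rating scale are illustrative assumptions, not from the paper): if only agents who enjoyed the movie bother to rate it voluntarily, the observed average rating drifts well above the true population average.

```python
import random

def simulate_voluntary_ratings(n_agents=100_000, seed=1):
    """Compare the true average rating with the average of voluntary ratings.

    Assumption (illustrative): each agent rates the movie 1-5, and the
    probability of volunteering a rating grows with how much they liked it.
    """
    rng = random.Random(seed)
    true_ratings, observed = [], []
    for _ in range(n_agents):
        rating = rng.randint(1, 5)
        true_ratings.append(rating)
        # Fans are far more likely to volunteer: 5% for a 1-star, 85% for a 5-star.
        if rng.random() < 0.05 + 0.20 * (rating - 1):
            observed.append(rating)
    return sum(true_ratings) / len(true_ratings), sum(observed) / len(observed)

if __name__ == "__main__":
    true_avg, observed_avg = simulate_voluntary_ratings()
    print(f"true average: {true_avg:.2f}, voluntarily reported average: {observed_avg:.2f}")
```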

Information Asymmetry and Bayesian Exploration

In large multi-agent AI systems, we can neither rely on economic incentives nor let exploration evolve naturally. We need something in between, which is pretty much the definition of statistics. To tackle the exploration-exploitation dilemma, the Microsoft Research team relied on the concept of information asymmetry, which describes a setting in which the system always has more information than the individual agents. Using information asymmetry as a premise for the environment, the Microsoft model leverages a nascent technique known as Bayesian Exploration that looks to develop organic incentives while minimizing selection bias.

A Bayesian Exploration system is a game in which a recommendation system or "principal" interacts with a stream of self-interested "agents" arriving one by one. Each agent needs to make a decision: take an action from a given set of alternatives. The principal issues a recommendation and observes the outcome, but cannot direct the agent to take a particular action. The problem is to design a "recommendation policy" for the principal that learns over time to make good recommendations and ensures that the agents are incentivized to follow those recommendations.

To make their incentive model practical in real-world environments, the Microsoft team focused on Bayesian Exploration with agents that have heterogeneous preferences. For instance, suppose that the preferences of an agent are encapsulated in her type, e.g., vegan vs. meat lover. When an agent takes a particular action, the outcome depends on the action itself (e.g., the selection of restaurant), the "state" of the world (e.g., the qualities of the restaurants), and the type of the agent. The state is persistent (it does not change over time) but is initially unknown; a Bayesian prior on the state is common knowledge. In each round, the agent type is drawn independently from a fixed and known distribution. The principal strives to learn the best possible recommendation for each agent type.
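The round-by-round information flow can be sketched as follows. This is a heavily simplified illustration, not the paper's algorithm: it models the persistent hidden state, types drawn from a known distribution, and a principal that keeps per-(type, action) reward statistics, but it does not model the incentive-compatibility constraints that make Bayesian Exploration work. The restaurant names, type names, and payoff numbers are assumptions for the example.

```python
import random
from collections import defaultdict

ACTIONS = ["restaurant_a", "restaurant_b"]
TYPES = ["vegan", "meat_lover"]
TYPE_DISTRIBUTION = {"vegan": 0.4, "meat_lover": 0.6}   # known to the principal

# Hidden persistent state: expected reward of each action for each type.
HIDDEN_STATE = {
    ("vegan", "restaurant_a"): 0.8, ("vegan", "restaurant_b"): 0.3,
    ("meat_lover", "restaurant_a"): 0.4, ("meat_lover", "restaurant_b"): 0.9,
}

def run_rounds(n_rounds=5_000, seed=0):
    rng = random.Random(seed)
    counts = defaultdict(int)
    mean_reward = defaultdict(float)

    for _ in range(n_rounds):
        # An agent arrives; its type is drawn from the known distribution.
        agent_type = rng.choices(TYPES, weights=[TYPE_DISTRIBUTION[t] for t in TYPES])[0]
        # The principal recommends the action with the best estimate for that type,
        # occasionally recommending another action to keep learning.
        if rng.random() < 0.1:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: mean_reward[(agent_type, a)])
        # The outcome depends on the action, the hidden state, and the agent's type.
        reward = rng.gauss(HIDDEN_STATE[(agent_type, action)], 0.2)
        # The principal observes the outcome and updates its beliefs.
        key = (agent_type, action)
        counts[key] += 1
        mean_reward[key] += (reward - mean_reward[key]) / counts[key]

    return dict(mean_reward)

if __name__ == "__main__":
    for (agent_type, action), est in sorted(run_rounds().items()):
        print(f"{agent_type:>10} / {action}: estimated reward {est:.2f}")
```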

One of the creative innovations of the Microsoft team is that they segmented the agents in an environment into three main types, depending on whether and when the agent's type is revealed to the principal:

  • Public Types: The type is revealed immediately after the agent arrives.
  • Reported Types: The type is revealed only after the principal issues a recommendation.
  • Private Types: The type is never revealed.

The main contribution of the Microsoft model is a near-optimal recommendation policy for each of these settings. The policy learning algorithm proceeds in phases. In each phase, the model explores all actions for each agent type that can be explored using the information available at the start of the phase. Agents of different types learn separately, in per-type "threads"; the threads exchange information after each phase.
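The sketch below illustrates only this phased, per-type thread structure; the rule used to decide which actions are explorable is a placeholder assumption, not the paper's construction, and the action names and payoffs are invented for the example.

```python
import random
from collections import defaultdict

class TypeThread:
    """Learning state for a single agent type."""
    def __init__(self, actions):
        self.observations = {a: [] for a in actions}   # rewards seen per action

    def explorable_actions(self, shared_knowledge):
        # Placeholder rule (assumption): actions any thread has already observed
        # are explorable, plus one untried action to keep exploration moving.
        tried = [a for a in self.observations if shared_knowledge.get(a)]
        untried = [a for a in self.observations if not shared_knowledge.get(a)]
        return tried + untried[:1]

def run_phases(actions, types, n_phases, sample_reward):
    threads = {t: TypeThread(actions) for t in types}
    shared = {}                                     # knowledge exchanged between threads
    for _ in range(n_phases):
        snapshot = dict(shared)                     # information available at phase start
        for t, thread in threads.items():
            for a in thread.explorable_actions(snapshot):
                thread.observations[a].append(sample_reward(t, a))
        shared = defaultdict(list)                  # threads exchange info after the phase
        for thread in threads.values():
            for a, rewards in thread.observations.items():
                shared[a].extend(rewards)
    return threads

if __name__ == "__main__":
    rng = random.Random(0)
    payoff = {("vegan", "a"): 0.8, ("vegan", "b"): 0.3,
              ("meat_lover", "a"): 0.4, ("meat_lover", "b"): 0.9}
    threads = run_phases(["a", "b"], ["vegan", "meat_lover"], n_phases=3,
                         sample_reward=lambda t, a: rng.gauss(payoff[(t, a)], 0.1))
    for t, th in threads.items():
        print(t, {a: round(sum(r) / len(r), 2) for a, r in th.observations.items() if r})
```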

For public types, the policy matches the benchmark of the "best explorable action," i.e., a policy that explores all explorable type-action pairs. Exploration needs to proceed gradually, whereby exploring one action can enable the policy to explore another. For reported types, the policy matches the public-type policy in the long run. This makes sense if we consider that reported types are only relevant in multiple-round games, and the Microsoft model tries to reduce exploration costs by approximating its policy to the public-type one. One of the main contributions of the Microsoft model is a policy for private types. In this setting, recommending one particular action to the current agent is not very meaningful because the agent's type is not known to the principal. Instead, the principal can recommend a menu: a mapping from agent types to actions. Analogous to the case of public types, the model focuses on explorable menus and gradually explores all such menus, eventually matching the Bayesian-expected reward of the best explorable menu.
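A menu is easy to picture as a data structure: a dictionary from agent types to actions that the agent, who knows its own private type, uses to self-select an action. The sketch below is illustrative only (the "best current estimate" rule and all names are assumptions); it does not reproduce the paper's explorable-menu construction.

```python
# Minimal sketch of the "menu" idea for private types: the principal never
# learns the agent's type, so it publishes a type -> action mapping.

def recommend_menu(estimated_reward, types, actions):
    """Build a menu: for each type, pick the action with the best current estimate."""
    return {t: max(actions, key=lambda a: estimated_reward.get((t, a), 0.0)) for t in types}

def agent_follows_menu(menu, private_type):
    """The agent knows its own (private) type and looks up its action in the menu."""
    return menu[private_type]

if __name__ == "__main__":
    est = {("vegan", "restaurant_a"): 0.8, ("vegan", "restaurant_b"): 0.3,
           ("meat_lover", "restaurant_a"): 0.4, ("meat_lover", "restaurant_b"): 0.9}
    menu = recommend_menu(est, ["vegan", "meat_lover"], ["restaurant_a", "restaurant_b"])
    print(menu)                                 # {'vegan': 'restaurant_a', ...}
    print(agent_follows_menu(menu, "vegan"))    # the principal never observes this type
```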

The segmentation of policies by agent type and the use of Bayesian Exploration games are important contributions of the Microsoft model toward addressing the exploration-exploitation dilemma in multi-agent AI systems. While the friction between exploration and exploitation is likely to remain a core challenge of AI disciplines such as reinforcement learning, many of the ideas in the Microsoft paper might help design better policies for agents that can adapt to heterogeneous environments.

