
Multi-Agent RL: Nash Equilibria and Friend or Foe Q-Learning

source link: https://towardsdatascience.com/multi-agent-rl-nash-equilibria-and-friend-or-foe-q-learning-4a0b9aae3a1e?gi=de6780552750

Making robots tip the scales


Photo by Toa Heftiba on Unsplash

For whatever reason, humans innately possess the ability to collaborate. It’s become so commonplace that its nuances slip right under our noses. How do we just know how to coordinate when moving a heavy couch? How do we reason about splitting up in a grocery store to minimize time? How are we able to observe others’ actions and understand how best to respond?

Here’s an interpretation: we reach a balance. An equilibrium. Each person takes actions that not only best complement the others’ but together achieve the task at hand most efficiently. This notion of equilibria comes up often in game theory and extends to multi-agent RL (MARL). In this article, we explore two algorithms, Nash Q-Learning and Friend or Foe Q-Learning, both of which attempt to find multi-agent policies fulfilling this idea of “balance.” We assume basic knowledge of single-agent formulations and Q-learning.


Photo by Erik Mclean on Unsplash

What Makes an Optimal Policy…Optimal?

Multi-agent learning environments are typically represented by Stochastic Games. Each agent aims to find a policy that maximizes their own expected discounted reward. Together, the overall goal is to find a joint policy that gathers the most reward for each agent . This joint reward is defined below in the form of a value function:
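The original equation did not survive extraction. A standard formulation for agent *i* in an *n*-agent stochastic game (the symbols here, π¹…πⁿ for the agents’ policies, γ for the discount factor, and rᵗᵢ for agent *i*’s reward at step *t*, are assumptions filling in for the lost figure) would be:

```latex
v^i(s, \pi^1, \ldots, \pi^n) =
  \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^t \, r_t^i
  \;\middle|\; s_0 = s,\; \pi^1, \ldots, \pi^n \right]
```

That is, each agent’s value is its expected discounted return starting from state *s*, given that every agent follows its part of the joint policy.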

This goal applies to both competitive and collaborative situations. Agents can find policies that best counter or complement others. We call this optimal joint policy a Nash Equilibrium. More formally, it is a joint policy with the following property:
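The defining inequality was lost with the original figure; a standard statement (using the same assumed notation, with starred policies denoting the equilibrium) would be:

```latex
v^i(s, \pi^{1*}, \ldots, \pi^{i*}, \ldots, \pi^{n*}) \;\geq\;
v^i(s, \pi^{1*}, \ldots, \pi^{i}, \ldots, \pi^{n*})
\qquad \text{for every state } s,\ \text{every agent } i,\ \text{and every alternative policy } \pi^i
```

In words: holding everyone else’s equilibrium policy fixed, no single agent can switch to a different policy and do better.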

At first, it seems like we’re beating a dead horse. The best policy gathers the most reward, so what?

Underneath all the fancy Greek letters and notation, the Nash Equilibrium tells us a bit more. It says that each agent’s policy in a Nash Equilibrium is the best response to the other agents’ equilibrium policies. No agent is incentivized to change its policy, because any unilateral tweak yields less reward. In other words, all of the agents are at a standstill. Landlocked. Trapped, in a sense.


Photo by NeONBRAND on Unsplash

To give an example, imagine a competitive game between two small robots: C3PO and Wall-E. During each round, they each choose a number one through ten, and whoever selects the higher number wins. As expected, both pick the number ten every time as neither robot wants to risk losing. If C3PO were to choose any other number, he would risk losing against Wall-E’s optimal policy of always choosing ten and vice versa. In other words, the two are at an equilibrium.
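The robots’ standstill can be checked directly. Below is a minimal sketch of the number game (the payoff values +1/−1/0 for win/loss/tie are an assumption; the article only says the higher number wins) that verifies no robot benefits from unilaterally deviating:

```python
def payoff(a, b):
    """Reward to the first player for choosing a against b.

    Assumed payoffs: +1 for picking the higher number, -1 for the
    lower, 0 for a tie (the article only states 'higher number wins').
    """
    if a > b:
        return 1
    if a < b:
        return -1
    return 0

def is_nash_equilibrium(a, b, actions=range(1, 11)):
    """A pure-strategy Nash equilibrium: neither player can improve
    its own payoff by deviating while the other stays put."""
    a_is_best = all(payoff(a, b) >= payoff(alt, b) for alt in actions)
    b_is_best = all(payoff(b, a) >= payoff(alt, a) for alt in actions)
    return a_is_best and b_is_best

print(is_nash_equilibrium(10, 10))  # True: neither robot gains by switching
print(is_nash_equilibrium(9, 10))   # False: the 9-player should switch to 10
```

The check mirrors the formal definition: fix the opponent’s choice, scan every alternative action, and confirm the current action is at least as good.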

