
Differentiable Inter Agent Learning to Solve the Prisoners-Switch Riddle


Backpropagating gradients across agents to learn messaging protocols


Reinforcement learning is a popular research area, mainly because it aims to model systems that otherwise seem intractable. We have come a long way since DeepMind's famous Atari paper. The following post is based on what I have learned from various papers, mainly these two: https://arxiv.org/pdf/1511.08779.pdf and https://arxiv.org/pdf/1605.06676.pdf

An interesting avenue of study in reinforcement learning is that of communicating agents: a setup where agents can send messages to each other in order to cooperate. A good case where communication is essential is an environment that is only partially observable to each agent, while more information is required for the agents to complete the task cooperatively. These agents have to pass meaningful messages to each other, where “meaningful” means having some correlation with useful information about their local environments or the actions they are about to take.

Such an environment where messages are passed has some important requirements. Consider an agent A sending a message m to another agent B. The protocol that they develop to communicate, as they learn, must have certain properties:

  1. Given the architecture of the system, the messaging protocol should be optimal. A system where agents spit out random vectors at each other is obviously useless; the exchanged vectors ( m ) have to be rich in information.
  2. The message sent by A has to be understood by B.

Considering that we have multiple agents, we can go with either of the following options (or even something in between):

  1. Parameter Sharing: Use a common set of parameters for all agents. That is, use a single network for all of them, but use it separately during execution, with each agent plugging its own observation and received messages into the network. This method is generally more stable and easier to train: once useful features are learned, they are shared by every agent instead of having to be learned separately. (A minimal sketch of this option follows the list.)
  2. Without Parameter Sharing: Use different weights for each network and let the agents learn features independently. This is less stable than the previous option.
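
Here is a rough sketch (mine, not code from the papers) of what option 1 looks like in TensorFlow 2: a single network object reused by every agent, with each agent feeding in its own observation and received message.

```python
import tensorflow as tf

# One shared network: every agent uses the same weights, but plugs in its
# own observation and its own received message during execution.
shared_net = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4),                 # e.g. Q-values over the agent's actions
])

def act(observation, incoming_message):
    # Concatenate the agent's private observation with the message it received.
    return shared_net(tf.concat([observation, incoming_message], axis=-1))

# Each agent calls the same network with its own inputs.
q_a = act(tf.random.normal([1, 4]), tf.random.normal([1, 1]))
q_b = act(tf.random.normal([1, 4]), tf.random.normal([1, 1]))
```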

One early algorithm devised for this was RIAL (Reinforced Inter-Agent Learning). The idea behind RIAL is simple: just add the messages as extra action choices in each agent's action selector. The selected message is then passed to the other agents. That is it. Each agent is trained separately. This framework follows decentralized execution as well as independent parameter updates: gradients are computed within each agent independently and are not passed through multiple agents end-to-end.

Not passing gradients across agents is limiting in several ways. Because messages are treated like just another action, the sending agent gets no direct feedback about how its message affected the receiver; it only sees the eventual reward, which makes it difficult to land on an optimal policy.

Then came DIAL (Differentiable Inter-Agent Learning). Let us introduce a dedicated message output in each agent, and an input for the message sent by the other agent. With this framework, during training we connect all the networks together at each step and let gradients flow through all the agents end to end while minimizing the loss function (the usual Q-learning loss that we get from the Bellman equation). What we end up with is an optimal protocol for exchanging messages: while training, each agent learns both to construct meaningful messages and to understand the messages sent by the other agent. This setup follows centralized learning and decentralized execution. Each agent acts individually in the environment while executing, but they learn together during the training phase.
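
To see why end-to-end differentiation matters, here is a minimal, hypothetical sketch (not the paper's code): agent A produces a real-valued message, agent B consumes it, and because the message stays inside one computation graph, the loss on B's output also yields gradients for A's weights.

```python
import tensorflow as tf

# Two tiny agent networks with separate weights.
agent_a = tf.keras.layers.Dense(1)    # A's "message head"
agent_b = tf.keras.layers.Dense(2)    # B's Q-value head, fed with obs + message
agent_a.build((None, 3))
agent_b.build((None, 4))

obs_a = tf.random.normal([1, 3])
obs_b = tf.random.normal([1, 3])

with tf.GradientTape() as tape:
    message = agent_a(obs_a)                          # A "speaks"
    q_b = agent_b(tf.concat([obs_b, message], -1))    # B "listens"
    loss = tf.reduce_mean(tf.square(q_b))             # stand-in for the TD loss

# Gradients flow back through the message, so A gets a learning signal about
# what to say even though only B's output appears in the loss.
grads = tape.gradient(loss, agent_a.trainable_variables + agent_b.trainable_variables)
```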

[Figure: the RIAL and DIAL setups. Source: https://arxiv.org/pdf/1605.06676.pdf]

One important thing to note here is that each message can be discrete or continuous. The problem definition may restrict agents to discrete messages, such as one-bit messages. How can we discretize messages? Let us introduce a unit called a DRU (Discretize/Regularize Unit), defined as DRU(m) = Logistic(N(m, σ)) during (centralized) training and DRU(m) = 𝟙{m > 0} during (decentralized) execution, where N(m, σ) adds Gaussian noise with standard deviation σ to the message m.

Instead of plugging the message from one agent directly into another, we pass it through this unit. The noise is what makes discretization possible: to prevent the noise from flipping the message across the decision boundary, the network is pushed to produce messages ( m ) that sit far to the positive or negative side of it, so that adding noise does not change which side the message is on.

Let us try implementing this. Consider the hundred prisoners riddle. I quote the riddle statement directly from the paper:

One hundred prisoners have been newly ushered into prison. The warden tells them that starting tomorrow, each of them will be placed in an isolated cell, unable to communicate amongst each other. Each day, the warden will choose one of the prisoners uniformly at random with replacement, and place him in a central interrogation room containing only a light bulb with a toggle switch. The prisoner will be able to observe the current state of the light bulb. If he wishes, he can toggle the light bulb. He also has the option of announcing that he believes all prisoners have visited the interrogation room at some point in time. If this announcement is true, then all prisoners are set free, but if it is false, all prisoners are executed. The warden leaves and the prisoners huddle together to discuss their fate. Can they agree on a protocol that will guarantee their freedom?

[Image: the prisoners' switch riddle. Source: https://github.com/iassael/learning-to-communicate]

Let us now solve the riddle. We will be using TensorFlow, specifically 2.0. We will have to solve it for a small number of prisoners, such as 3 or 4, since the policy space grows exponentially with the number of agents. The problem can be modeled as follows: each day, a prisoner is chosen at random from their cell and allowed to pass a 1-bit message to the next chosen prisoner. This one-bit message is the state of the light bulb. Hence, the prisoner in the room receives 1 bit from the previous prisoner (a random bit if they are the first) and sends 1 bit to the next one. Keeping a count of the days and with only that single bit to pass on, the agents have to solve the problem.

For this article, refer to my implementation.

The Environment

Refer to the class Jail, which implements the environment we will be using. A random prisoner is chosen at each step, and actions are taken by calling the step() method. When a “Declare” action is taken, a reward of 0.5 or -0.5 is given to every prisoner, depending on whether the declaration is correct or not.
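
The exact class is in the linked implementation; what follows is only a rough sketch of what such an environment could look like, with hypothetical method signatures that match the description above.

```python
import random

class Jail:
    """Rough sketch of the riddle environment (not the exact class from the implementation)."""

    def __init__(self, n_prisoners=3, max_days=50):
        self.n_prisoners = n_prisoners
        self.max_days = max_days

    def reset(self):
        self.bulb = random.randint(0, 1)          # the first prisoner sees a random bit
        self.visited = set()
        self.day = 0
        self.current = random.randrange(self.n_prisoners)
        self.visited.add(self.current)
        return self.current, self.bulb

    def step(self, new_bulb, declare):
        """new_bulb: the bit the current prisoner leaves behind; declare: announce that everyone has visited."""
        if declare:
            correct = len(self.visited) == self.n_prisoners
            return None, (0.5 if correct else -0.5), True   # same reward for every prisoner
        self.bulb = new_bulb
        self.day += 1
        self.current = random.randrange(self.n_prisoners)
        self.visited.add(self.current)
        return (self.current, self.bulb), 0.0, self.day >= self.max_days
```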

Defining the DRU

The following code snippet shows the definition of the DRU which will be used only during the training phase (I have directly discretized for the testing):
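
(The exact snippet is in the implementation; below is a rough equivalent in TensorFlow 2, following the definition above. The noise scale sigma is treated here as a hyperparameter, not a value taken from the original code.)

```python
import tensorflow as tf

def dru(message, sigma=2.0, training=True):
    """Discretize/Regularize Unit (sketch).

    During training: logistic of the message plus Gaussian noise.
    During execution: hard threshold at zero (a true one-bit message).
    """
    if training:
        noise = tf.random.normal(tf.shape(message), stddev=sigma)
        return tf.sigmoid(message + noise)
    return tf.cast(message > 0, tf.float32)
```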

The Agent Network
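
The exact architecture is in the implementation. Roughly, each agent is a small recurrent network: a GRU cell keeps track of time across days, with one head for the action Q-values and one for the outgoing message (the layer sizes below are my own assumptions).

```python
import tensorflow as tf

class AgentNet(tf.keras.Model):
    """Per-agent network (sketch): GRU cell + Q-value head + message head."""

    def __init__(self, n_actions=2, hidden_size=64):
        super().__init__()
        self.gru = tf.keras.layers.GRUCell(hidden_size)
        self.q_head = tf.keras.layers.Dense(n_actions)   # Q-values for the environment actions
        self.msg_head = tf.keras.layers.Dense(1)         # real-valued message, fed into the DRU

    def call(self, incoming_message, hidden_state):
        # incoming_message: the bulb state, or a DUMMY_MESSAGE, shape [batch, 1]
        out, new_hidden = self.gru(incoming_message, hidden_state)
        return self.q_head(out), self.msg_head(out), new_hidden

# Usage: one step for a single agent.
agent = AgentNet()
hidden = tf.zeros([1, 64])
q_values, message, hidden = agent(tf.constant([[1.0]]), hidden)
```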

Playing an Episode

Now for the core part. First, some important variables and initializations:
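
(Placeholder names below, not the exact variables from the implementation, assuming the Jail and AgentNet sketches above.)

```python
import tensorflow as tf

N_AGENTS, HIDDEN_SIZE = 3, 64
DUMMY_MESSAGE = tf.constant([[-1.0]])            # "you were not in the room today"

env = Jail(n_prisoners=N_AGENTS)
agents = [AgentNet(n_actions=2, hidden_size=HIDDEN_SIZE) for _ in range(N_AGENTS)]
hidden_states = [tf.zeros([1, HIDDEN_SIZE]) for _ in range(N_AGENTS)]

chosen_q_values, rewards = [], []                # logged each step to build the loss later
prisoner, bulb = env.reset()
done = False
```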

Using these, let us run a loop while the episode is not done:
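
(Again a rough sketch with the same placeholder names and the DRU sketch above; ε-greedy exploration and some bookkeeping are omitted. DECLARE is an assumed action index.)

```python
DECLARE = 1                                      # assumed index of the "declare" action

while not done:
    for i in range(N_AGENTS):
        if i == prisoner:
            # The agent in the room observes the current bulb state...
            inp = tf.constant([[float(bulb)]])
            q_values, message, hidden_states[i] = agents[i](inp, hidden_states[i])
        else:
            # ...everyone else only learns that another day has passed.
            _, _, hidden_states[i] = agents[i](DUMMY_MESSAGE, hidden_states[i])

    action = int(tf.argmax(q_values[0]))         # greedy here; training uses ε-greedy
    new_bulb = dru(message, training=True)       # the outgoing bit goes through the DRU
    state, reward, done = env.step(int(float(new_bulb) > 0.5), action == DECLARE)

    chosen_q_values.append(q_values[0, action])  # logged to build the loss afterwards
    rewards.append(reward)
    if state is not None:
        prisoner, bulb = state
```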

This first part runs the messages through the agents. If an agent is in the interrogation room, the current bulb state is given to it; otherwise it receives a DUMMY_MESSAGE, which indicates that the agent stayed in its cell and that a day has passed. Without knowing that a day has passed, agents would not be able to keep track of time, which gets encoded in the GRU's hidden state when we push in this DUMMY_MESSAGE. We log all useful variables into lists so that we can later compute the loss by simply iterating through them. The target is computed using the same, familiar, famous Bellman equation, giving us one term in the loss function:

y_t = r_{t+1} + γ · max_{a′} Q(s_{t+1}, a′; θ⁻)

with the corresponding loss term (y_t − Q(s_t, a_t; θ))², where θ⁻ denotes the target network's parameters.

Note that the protocols used by the main Q-networks and the target Q-networks will be slightly different, so we have to pass a separate set of messages through the target networks as well in order to use this update rule. The Q-values given by the target networks only make sense if we let them follow the protocol they have devised so far.
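
Concretely, one keeps a second set of hidden states for the target networks and feeds them their own messages; the per-step loss term then looks roughly like this (placeholder names, assumed discount factor):

```python
import tensorflow as tf

GAMMA = 0.99   # assumed discount factor

def td_loss_term(reward, chosen_q, target_q_next, done):
    """One term of the loss: squared error against the Bellman target.

    target_q_next are the Q-values produced by the target networks at the next
    step, computed from the target networks' own message stream as described above.
    """
    target = reward + (0.0 if done else GAMMA * tf.reduce_max(target_q_next))
    return tf.square(target - chosen_q)
```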

The Main Training Loop

Run episodes for as long as you like. Inside a tf.GradientTape() context, call the play_episode method; then simply compute gradients and update the weights, as is the usual practice.
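
A bare-bones version of that loop might look like this. The optimizer, learning rate and episode count are my own assumptions; play_episode stands for the episode routine sketched above (returning the summed loss), and target_agents for a second, periodically refreshed copy of the agent networks.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-3)       # assumed optimizer and learning rate

for episode in range(10000):
    with tf.GradientTape() as tape:
        loss = play_episode(agents, target_agents, env)   # sum of the per-step loss terms
    variables = [v for agent in agents for v in agent.trainable_variables]
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))

    if episode % 100 == 0:
        # Refresh the target networks with the main networks' weights
        # (the interval recommended in the paper).
        for main, target in zip(agents, target_agents):
            target.set_weights(main.get_weights())
```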

Note that I have had to remove some details from these snippets to show the bare idea; you can find all of them in the implementation. One important detail is that, within the above loop, we have to run test episodes that use the discretized DRU and no exploration (no random action selection); these reflect how well the system is actually doing. We also have to update the target networks periodically, for which the recommended interval is 100 episodes, as followed in the paper. My implementation does not use parameter sharing, and it reaches the optimal policy in about 7k episodes.

