Source: https://deepmind.com/blog/article/generally-capable-agents-emerge-from-open-ended-play

Blog post

Research

27 Jul 2021

Generally capable agents emerge from open-ended play

In recent years, artificial intelligence agents have succeeded in a range of complex game environments. For instance, AlphaZero beat world-champion programs in chess, shogi, and Go after starting out knowing no more than the basic rules of each game. Through reinforcement learning (RL), this single system learnt by playing round after round of games through a repetitive process of trial and error. But AlphaZero still trained separately on each game, unable to take on another game or task without repeating the RL process from scratch. The same is true for other successes of RL, such as Atari, Capture the Flag, StarCraft II, Dota 2, and Hide-and-Seek. DeepMind’s mission of solving intelligence to advance science and humanity led us to explore how we could overcome this limitation to create AI agents with more general and adaptive behaviour. Instead of learning one game at a time, these agents would be able to react to completely new conditions and play a whole universe of games and tasks, including ones never seen before.

Today, we published "Open-Ended Learning Leads to Generally Capable Agents," a preprint detailing our first steps to train an agent capable of playing many different games without needing human interaction data. We created a vast game environment we call XLand, which includes many multiplayer games within consistent, human-relatable 3D worlds. This environment makes it possible to formulate new learning algorithms, which dynamically control how an agent trains and the games on which it trains. The agent’s capabilities improve iteratively as a response to the challenges that arise in training, with the learning process continually refining the training tasks so the agent never stops learning. The result is an agent with the ability to succeed at a wide spectrum of tasks — from simple object-finding problems to complex games like hide and seek and capture the flag, which were not encountered during training. We find the agent exhibits general, heuristic behaviours such as experimentation, behaviours that are widely applicable to many tasks rather than specialised to an individual task. This new approach marks an important step toward creating more general agents with the flexibility to adapt rapidly within constantly changing environments.


Open-Ended Learning Leads to Generally Capable Agents

The agent playing a variety of test tasks. The agent was trained across a vast variety of games and as a result is able to generalise to test games never seen before in training.

A universe of training tasks

A lack of training data, where the “data” points are different tasks, has been one of the major factors preventing the behaviour of RL-trained agents from being general enough to apply across games. Without being able to train agents on a vast enough set of tasks, agents trained with RL have been unable to adapt their learnt behaviours to new tasks. But by designing a simulated space that allows for procedurally generated tasks, our team created a way to train on, and generate experience from, tasks that are created programmatically. This enables us to include billions of tasks in XLand, across varied games, worlds, and players.

Our AI agents inhabit 3D first-person avatars in a multiplayer environment meant to simulate the physical world. The players sense their surroundings by observing RGB images and receive a text description of their goal, and they train on a range of games. The simplest are cooperative games about finding objects and navigating worlds, where the goal for a player could be “be near the purple cube.” More complex games are based on choosing from multiple rewarding options, such as “be near the purple cube or put the yellow sphere on the red floor,” and more competitive games involve playing against co-players, such as symmetric hide and seek where each player has the goal “see the opponent and make the opponent not see me.” Each game defines the rewards for the players, and each player’s ultimate objective is to maximise the rewards.
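
To make the structure of these goals concrete, here is a minimal Python sketch, entirely our own illustration rather than the actual XLand implementation: a goal is a set of options, each option a conjunction of (possibly negated) atomic predicates over relations between players, objects, and floors, and a player is rewarded on any timestep where at least one option holds.

```python
# Illustrative sketch only: the predicate vocabulary, world-state format, and
# reward convention are assumptions, not the real XLand representation.
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    relation: str        # e.g. "near", "on", "see"
    subject: str         # e.g. "player", "yellow sphere", "opponent"
    target: str          # e.g. "purple cube", "red floor", "player"
    negated: bool = False


def predicate_holds(p: Predicate, true_relations: set) -> bool:
    holds = (p.relation, p.subject, p.target) in true_relations
    return not holds if p.negated else holds


def reward(goal: list, true_relations: set) -> float:
    # A goal is a list of options; an option is a list of predicates that must
    # all hold at once. One step of reward whenever any option is satisfied.
    return 1.0 if any(
        all(predicate_holds(p, true_relations) for p in option) for option in goal
    ) else 0.0


# "Be near the purple cube or put the yellow sphere on the red floor."
either_or = [
    [Predicate("near", "player", "purple cube")],
    [Predicate("on", "yellow sphere", "red floor")],
]
# "See the opponent and make the opponent not see me."
hide_and_seek = [
    [Predicate("see", "player", "opponent"),
     Predicate("see", "opponent", "player", negated=True)],
]
state = {("on", "yellow sphere", "red floor"), ("see", "player", "opponent")}
assert reward(either_or, state) == 1.0
assert reward(hide_and_seek, state) == 1.0
```

Competitive games arise naturally in this kind of scheme because one player's goal can reference, and negate, relations involving another player.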

Because XLand can be programmatically specified, the game space allows for data to be generated in an automated and algorithmic fashion. And because the tasks in XLand involve multiple players, the behaviour of co-players greatly influences the challenges faced by the AI agent. These complex, non-linear interactions create an ideal source of data to train on, since sometimes even small changes in the components of the environment can result in large changes in the challenges for the agents.

XLand consists of a galaxy of games (seen here as points embedded in 2D, coloured and sized based on their properties), with each game able to be played in many different simulated worlds whose topology and characteristics vary smoothly. An instance of an XLand task combines a game with a world and co-players.

Training methods

Central to our research is the role of deep RL in training the neural networks of our agents. The neural network architecture we use provides an attention mechanism over the agent’s internal recurrent state — helping guide the agent’s attention with estimates of subgoals unique to the game the agent is playing. We’ve found this goal-attentive agent (GOAT) learns more generally capable policies.
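
As a rough intuition only, the sketch below implements a single-head, goal-conditioned attention step over the slots of a recurrent state in plain NumPy; the shapes, random weights, and dot-product form are illustrative assumptions and should not be read as the GOAT architecture itself.

```python
# Simplified sketch: a goal embedding forms the query and attends over slots of
# an internal recurrent state. Everything here is a toy stand-in, not GOAT.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def goal_attention(recurrent_state, goal_embedding, w_q, w_k, w_v):
    """recurrent_state: (num_slots, d_state); goal_embedding: (d_goal,)."""
    query = goal_embedding @ w_q                     # (d_attn,)
    keys = recurrent_state @ w_k                     # (num_slots, d_attn)
    values = recurrent_state @ w_v                   # (num_slots, d_attn)
    scores = keys @ query / np.sqrt(query.shape[0])  # relevance of each slot
    weights = softmax(scores)                        # attention over state slots
    return weights @ values                          # goal-conditioned summary


rng = np.random.default_rng(0)
d_state, d_goal, d_attn, num_slots = 16, 8, 16, 4
summary = goal_attention(
    recurrent_state=rng.normal(size=(num_slots, d_state)),
    goal_embedding=rng.normal(size=d_goal),
    w_q=rng.normal(size=(d_goal, d_attn)),
    w_k=rng.normal(size=(d_state, d_attn)),
    w_v=rng.normal(size=(d_state, d_attn)),
)
assert summary.shape == (d_attn,)
```

In the full agent this kind of goal conditioning sits inside a trained recurrent policy; the point of the toy is only that the goal, not the observation alone, determines which parts of the internal state get read out.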

We also explored the question of what distribution of training tasks will produce the best possible agent, especially in such a vast environment. The dynamic task generation we use allows for continual changes to the distribution of the agent’s training tasks: every task is generated to be neither too hard nor too easy, but just right for training. We then use population based training (PBT) to adjust the parameters of the dynamic task generation based on a fitness that aims to improve agents’ general capability. Finally, we chain together multiple training runs so each generation of agents can bootstrap off the previous generation.

This leads to a final training process with deep RL at the core, updating the neural networks of agents with every step of experience:

  • the steps of experience come from training tasks that are dynamically generated in response to agents’ behaviour,
  • agents’ task-generating functions mutate in response to agents’ relative performance and robustness,
  • at the outermost loop, the generations of agents bootstrap from each other, provide ever richer co-players to the multiplayer environment, and redefine the measurement of progression itself.

The training process starts from scratch and iteratively builds complexity, constantly changing the learning problem to keep the agent learning. The iterative nature of the combined learning system, which does not optimise a bounded performance metric but rather the iteratively defined spectrum of general capability, leads to a potentially open-ended learning process for agents, limited only by the expressivity of the environment space and agent neural network.
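
The toy below sketches two of these timescales in isolation: a one-dimensional “difficulty” stands in for a task, a scalar “skill” stands in for the agent’s network, and a hypothetical acceptance band of success probabilities stands in for dynamic task generation, with a periodic population based training step mutating that band. It is a caricature of the ideas above, not the training system described in the paper, and it omits the outer generational loop entirely.

```python
# Toy sketch of dynamic task generation plus PBT. The difficulty model, the
# fitness definition, and all constants are illustrative assumptions.
import random


class Agent:
    def __init__(self):
        self.skill = 0.0
        # Task-generation parameters controlled by PBT: accept tasks whose
        # estimated success probability lies in [lo, hi], i.e. tasks that are
        # neither too hard nor too easy right now.
        self.lo, self.hi = 0.2, 0.8

    def success_prob(self, difficulty):
        return max(0.0, min(1.0, 1.0 - (difficulty - self.skill)))

    def sample_training_task(self):
        # Dynamic task generation: resample until the task falls inside the
        # agent's current "just right" band.
        while True:
            difficulty = random.uniform(0.0, self.skill + 5.0)
            if self.lo <= self.success_prob(difficulty) <= self.hi:
                return difficulty

    def rl_update(self, difficulty):
        # Stand-in for a deep RL update from one step of experience:
        # succeeding on harder tasks nudges the skill estimate upward.
        if random.random() < self.success_prob(difficulty):
            self.skill += 0.01 * difficulty


def fitness(agent, eval_tasks):
    # Stand-in for "general capability": mean success over held-out tasks.
    return sum(agent.success_prob(d) for d in eval_tasks) / len(eval_tasks)


population = [Agent() for _ in range(8)]
eval_tasks = [random.uniform(0.0, 20.0) for _ in range(200)]
for step in range(600):
    for agent in population:
        agent.rl_update(agent.sample_training_task())
    if step % 200 == 199:
        # PBT step: the least fit agents copy a fitter agent and mutate
        # their task-generation band.
        population.sort(key=lambda a: fitness(a, eval_tasks), reverse=True)
        for weak, strong in zip(population[-2:], population[:2]):
            weak.skill = strong.skill
            weak.lo = min(0.9, max(0.0, strong.lo + random.gauss(0.0, 0.05)))
            weak.hi = min(1.0, max(weak.lo + 0.05, strong.hi + random.gauss(0.0, 0.05)))

print(round(max(fitness(a, eval_tasks) for a in population), 3))
```

In the real system the inner update is deep RL on agent experience, the acceptance band is replaced by richer task-generating functions, and whole generations of such runs are chained so that later agents train against, and are measured against, earlier ones.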

The learning process of an agent consists of dynamics at multiple timescales.

Measuring progress

To measure how agents perform within this vast universe, we create a set of evaluation tasks using games and worlds that remain separate from the data used for training. These “held-out” tasks include specifically human-designed tasks like hide and seek and capture the flag.

Because of the size of XLand, understanding and characterising the performance of our agents can be a challenge. Each task involves different levels of complexity, different scales of achievable rewards, and different capabilities of the agent, so merely averaging the reward over held-out tasks would hide the actual differences in complexity and rewards, and would effectively treat all tasks as equally interesting, which isn’t necessarily true of procedurally generated environments.

To overcome these limitations, we take a different approach. Firstly, we normalise scores per task using the Nash equilibrium value computed using our current set of trained players. Secondly, rather than looking at average normalised scores, we consider the entire distribution of normalised scores by looking at its different percentiles, as well as the percentage of tasks in which the agent scores at least one step of reward, which we call participation. This means an agent is considered better than another agent only if it achieves higher normalised scores at every percentile. This approach to measurement gives us a meaningful way to assess our agents’ performance and robustness.
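
The sketch below spells out this comparison in code; the per-task Nash baseline values are taken as given, and the particular percentiles and helper names are assumptions for illustration, not the exact procedure from the paper.

```python
# Illustrative sketch of percentile-based evaluation. Nash baselines are
# assumed to be precomputed per task; calculating them is out of scope here.
import numpy as np


def normalised_scores(raw_scores, nash_values):
    # Normalise each task's score by that task's Nash equilibrium value.
    return np.asarray(raw_scores, dtype=float) / np.asarray(nash_values, dtype=float)


def summarise(norm_scores, percentiles=(10, 20, 30, 40, 50)):
    participation = float(np.mean(np.asarray(norm_scores) > 0.0))  # any reward at all
    return participation, np.percentile(norm_scores, percentiles)


def better_than(summary_a, summary_b):
    # "Better" means at least equal participation and a strictly higher
    # normalised score at every percentile, as described above.
    part_a, pcts_a = summary_a
    part_b, pcts_b = summary_b
    return part_a >= part_b and bool(np.all(pcts_a > pcts_b))


rng = np.random.default_rng(1)
nash = rng.uniform(0.5, 2.0, size=1000)  # hypothetical per-task baselines
agent_a = summarise(normalised_scores(rng.uniform(0.0, 2.0, size=1000), nash))
agent_b = summarise(normalised_scores(rng.uniform(0.0, 1.0, size=1000), nash))
print(better_than(agent_a, agent_b), better_than(agent_b, agent_a))
```

Comparing whole percentile profiles in this way means one agent only counts as an improvement if it is better across easy and hard tasks alike, rather than trading a few large wins for many small losses.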

More generally capable agents

After training our agents for five generations, we saw consistent improvements in learning and performance across our held-out evaluation space. Playing roughly 700,000 unique games in 4,000 unique worlds within XLand, each agent in the final generation experienced 200 billion training steps as a result of 3.4 million unique tasks. At this time, our agents have been able to participate in every procedurally generated evaluation task except for a handful that were impossible even for a human. And the results we’re seeing clearly exhibit general, zero-shot behaviour across the task space — with the frontier of normalised score percentiles continually improving.

The learning progress of the final generation of our agents shows how our test metrics progress through time, translating to zero-shot performance on hand-authored held-out test tasks as well.

Looking qualitatively at our agents, we often see general, heuristic behaviours emerge, rather than highly optimised behaviours specific to individual tasks. Instead of agents knowing exactly the “best thing” to do in a new situation, we see evidence of agents experimenting and changing the state of the world until they’ve achieved a rewarding state. We also see agents use objects as tools, for example to occlude visibility, to create ramps, and to retrieve other objects. Because the environment is multiplayer, we can examine the progression of agent behaviours while training on held-out social dilemmas, such as in a game of “chicken”. As training progresses, our agents appear to exhibit more cooperative behaviour when playing with a copy of themselves. Given the nature of the environment, it is difficult to pinpoint intentionality: the behaviours we see often appear to be accidental, but still we see them occur consistently.

Above: What types of behaviour emerge? (1) Agents exhibit the ability to switch which option they go for as the tactical situation unfolds. (2) Agents show glimpses of tool use, such as creating ramps. (3) Agents learn a generic trial-and-error experimentation behaviour, stopping when they recognise the correct state has been found. Below: Multiple ways in which the same agents manage to use the objects to reach the goal purple pyramid in this hand-authored probe task.

Analysing the agent’s internal representations, we can say that by taking this approach to reinforcement learning in a vast task space, our agents are aware of the basics of their bodies and the passage of time and that they understand the high-level structure of the games they encounter. Perhaps even more interestingly, they clearly recognise the reward states of their environment. This generality and diversity of behaviour in new tasks hints toward the potential to fine-tune these agents on downstream tasks. For instance, we show in the technical paper that with just 30 minutes of focused training on a newly presented complex task, the agents can quickly adapt, whereas agents trained with RL from scratch cannot learn these tasks at all.

By developing an environment like XLand and new training algorithms that support the open-ended creation of complexity, we’ve seen clear signs of zero-shot generalisation from RL agents. Whilst these agents are starting to be generally capable within this task space, we look forward to continuing our research and development to further improve their performance and create ever more adaptive agents.

We hope the preprint of our technical paper — and videos of the results we’ve seen — help other researchers likewise see a new path toward creating more adaptive, generally capable AI agents. And if you’re excited by these advances, consider joining our team.


For more details, see our preprint “Open-Ended Learning Leads to Generally Capable Agents.”

This blog post is based on joint work by the Open-Ended Learning Team (listed alphabetically by first name): Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, Nat McAleese, Nathalie Bradley-Schmieg, Nathaniel Wong, Nicolas Porcel, Roberta Raileanu, Steph Hughes-Fitt, Valentin Dalibard, Wojciech Marian Czarnecki.
