
Reinforcement Learning for everyone

source link: https://mc.ai/reinforcement-learning-for-everyone/


If you work in technology, and even more so in AI/Machine Learning (ML), you are probably used to people not understanding what you do for a living. This has been my life, and I have to say I feel a bit proud of it: it means I have learnt many things that most people don’t know about, and I am deep enough into the topic to understand things that are very complex for most. However, it is also a bit frustrating whenever you want to share an accomplishment with the people you care about, because they won’t be able to understand what all the fuss is about.

This is why I’ve decided to write this article, in the hope that I can get my point across to anyone who’s interested in ML and in particular Reinforcement Learning (RL), no matter your background.

No, machine learning cannot build a Terminator like in the movies — yet [image from PublicDomainPictures.]

Pavlov’s Conditioning

Even though when you think of AI you think about the future, our story starts more than 100 years ago, in the 1890s, in the laboratory of Ivan Pavlov. He was studying salivation in dogs and, to do this, he was measuring how much the dogs salivated when they saw food, before eating it. He was already excited about these experiments when he realised something unexpected: the dogs would salivate even before seeing any food. They would start salivating when they noticed Pavlov’s assistant walking towards them. After noticing this, he tested what happened if, before feeding the dogs, he rang a bell (or actually a metronome, according to Wikipedia), and you probably guessed it: they started salivating too, because they had learnt that after the bell, food would come. This bell is called a conditional stimulus, because the dog does not salivate because the bell is ringing, but because it knows food will follow the bell.

In the end, Pavlov was conditioned by the dogs too [image from Flickr.]

Nice! But I came here for RL…

Well, as it turns out, RL is based on this basic principle of psychology. In RL, an agent learns how to behave based on a “conditioning stimulus” called a reward. The setting of RL is the following:

Framework of RL with its elements [homemade.]

We have an agent that is situated in and interacts with an environment, in which it can execute actions, while observing the state of the environment and receiving a reward for its actions. In addition, we solve our problem in discrete time, so our timeline is structured in steps; in each step, the agent observes the state of the environment, executes an action that changes the environment’s state, and receives a reward for that action.

To make it simpler, let’s think about an example: our agent will be a robot and we want it to learn to walk to a goal area, so when it arrives at this area we will give it a nice reward. To get to this goal area, the robot can use different actions, such as: turn right, turn left, move forward and move back. The robot will start by trying random combinations of actions in each step, until it arrives at the location we want it to reach. Once this has happened, we change its location and start again; it’s like a “game over”, because the robot has achieved its goal and no other actions are possible. This period, which starts with the robot in a random position and ends when it arrives at the goal area, is called an episode, and we will repeat these episodes until the robot learns what it needs to do to get a good reward. After this, the robot will always do the same: navigate to the goal area, because it knows that by doing so, it will get a good reward.
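To make this concrete, here is a rough sketch of that setting in code. Everything below (the grid size, the reward value, the class and method names) is my own illustrative assumption, not something taken from a specific library:

```python
import random

# A toy grid world: the "robot" starts in a random cell and must reach the
# goal cell in the top-right corner. Names and reward values are illustrative.
class GridWorld:
    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.actions = ["up", "down", "left", "right"]
        self.reset()

    def reset(self):
        # Start each episode from a random non-goal cell.
        while True:
            self.pos = (random.randrange(self.size), random.randrange(self.size))
            if self.pos != self.goal:
                return self.pos

    def step(self, action):
        x, y = self.pos
        if action == "up":    y = min(y + 1, self.size - 1)
        if action == "down":  y = max(y - 1, 0)
        if action == "right": x = min(x + 1, self.size - 1)
        if action == "left":  x = max(x - 1, 0)
        self.pos = (x, y)
        done = self.pos == self.goal       # "game over": the goal area is reached
        reward = 1.0 if done else 0.0      # nice reward only at the goal
        return self.pos, reward, done

# One episode with a robot that acts completely at random.
env = GridWorld()
state, done, steps = env.reset(), False, 0
while not done:
    action = random.choice(env.actions)    # random action in each step
    state, reward, done = env.step(action)
    steps += 1
print(f"Reached the goal after {steps} random steps")
```

Each run of this loop, from a random starting position until the goal is reached, is one episode.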

But, how does the agent learn?

You might be thinking in the right direction already: math! The agent’s behavior is defined by its policy, which can be represented in different ways according to the method we use: a table, a function or even neural networks.

In the most basic case of RL, called Tabular Q-learning, the agent keeps a table with one row per state and one column per action, like in the figure. This table tells the agent what the expected result of performing an action in a given state is, so when the state of the environment changes, the agent checks the row corresponding to that state and can choose the action that in the past returned the highest reward. The values for each state and action are called Q-values.

A policy when using tabular Q-learning.
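As an (assumed, purely illustrative) picture of what such a table looks like in code, it can be stored as a 2-D array with one row per state and one column per action; choosing the best-known action is then just a row lookup:

```python
import numpy as np

n_states, n_actions = 16, 4          # e.g. 16 grid cells, 4 movement actions
Q = np.zeros((n_states, n_actions))  # one row per state, one column per action

def greedy_action(Q, state):
    # Look up the row for this state and pick the action with the
    # highest Q-value, i.e. the best expected result seen so far.
    return int(np.argmax(Q[state]))

# Example: after some learning, state 5 might prefer action 2 ("right").
Q[5] = [0.0, 0.1, 0.7, 0.2]
print(greedy_action(Q, 5))  # -> 2
```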

This table is the policy in Tabular Q-learning. Each Q-value is initialised to 0 and then updated after each step with an update rule that is based on the reward received after taking an action and on “how good” the new state is. I skip the math in this article to avoid technicalities that might not interest many of you, but if you’d like to see the math and all the gritty details of how these values are updated, you can see them here.
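For readers who do want a peek at the skipped math, here is a minimal sketch of the standard tabular Q-learning update rule; the learning rate and discount values below are assumptions chosen for illustration:

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # Standard tabular Q-learning update:
    #   Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    # alpha is the learning rate; gamma discounts future rewards and captures
    # "how good" the new state is, in the article's words.
    best_next = max(Q[next_state])              # value of the best action in the new state
    target = reward + gamma * best_next         # reward plus discounted future value
    Q[state][action] += alpha * (target - Q[state][action])

# Example: one update after reaching the goal from state 5 with action 2.
Q = np.zeros((16, 4))
q_update(Q, state=5, action=2, reward=1.0, next_state=15)
print(Q[5])  # the Q-value for (state 5, action 2) has moved towards the reward
```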

The agent will repeat the episode several times, updating its policy at each step with its new experience (state, action, new state, reward). After some time, the agent will have learnt a policy that yields a good reward over the episode, just as a person would learn how to play a video game and obtain a good score.
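Putting the pieces together, a training loop over episodes could look like the sketch below. It reuses the illustrative GridWorld environment and q_update function from the earlier sketches, and adds an assumed exploration trick (acting at random 20% of the time) so the robot keeps trying new things:

```python
import random
import numpy as np

env = GridWorld(size=4)                    # from the sketch above
Q = np.zeros((env.size * env.size, 4))     # one row per cell, one column per action

def state_index(pos, size):
    x, y = pos
    return y * size + x                    # flatten (x, y) into a single row index

for episode in range(500):
    state, done = env.reset(), False
    while not done:
        s = state_index(state, env.size)
        # Explore sometimes, otherwise use the best-known action so far.
        a = random.randrange(4) if random.random() < 0.2 else int(np.argmax(Q[s]))
        next_state, reward, done = env.step(env.actions[a])
        s_next = state_index(next_state, env.size)
        # Update the policy with the new experience (state, action, reward, new state).
        q_update(Q, s, a, reward, s_next)
        state = next_state

# After training, np.argmax(Q[s]) gives the learned best action in each cell.
```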

In fact, video games are very good environments in which to try RL agents, and this is why they are one of the most common use cases for RL. However, the state of a video game is usually defined as each frame of the game, so we are dealing with state spaces that are too big to be managed with Tabular Q-learning; this is where neural networks are used instead of a Q-table. This is what we call Deep RL or Deep Q-learning, because deep neural networks are used. In the next video, from Two Minute Papers, you can see Google DeepMind’s Deep Q-learning agent playing Atari games.
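To give a flavour of what “replacing the table with a network” means, here is a minimal sketch of a Q-network in PyTorch; the layer sizes and the flattened-frame input are my own assumptions, and a real Deep Q-learning agent needs additional machinery (convolutional layers, experience replay, a target network) that is omitted here:

```python
import torch
import torch.nn as nn

# A small Q-network: instead of looking up a row in a table, we feed the
# state (e.g. a flattened game frame) through the network and it outputs
# one Q-value per action. Sizes below are illustrative only.
class QNetwork(nn.Module):
    def __init__(self, state_dim=84 * 84, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, state):
        return self.net(state)              # one Q-value per possible action

q_net = QNetwork()
frame = torch.rand(1, 84 * 84)              # a fake flattened 84x84 frame
q_values = q_net(frame)
best_action = q_values.argmax(dim=1)        # act greedily w.r.t. the network's Q-values
```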

DeepMind’s deep Q-learning

And that’s it!

It wasn’t too awful, right? I hope you now have a (very) broad idea of what RL is and how it works. If you’re still interested and want to go deeper in RL, there are amazing materials online that can help you to understand it properly and implement it in code! These are some of my favorite resources:

Thank you for reading!

