
Cracking Blackjack — Part 3


Outline for this Article

In this article, I will be explaining the key building blocks that are used in the Reinforcement Learning algorithm we will use to maximize Blackjack returns. These building blocks are also used in many other Reinforcement Learning algorithms, so it is worthwhile to understand them in a context we know and love: Blackjack!

Always keep this diagram of the Reinforcement Learning Cycle from Part 2 in your head as you read this article!

Image made by Author

The Building Blocks of Our RL Algorithm

In short, the only job of our Reinforcement Learning algorithm is to define what the agent does with the state → action → reward tuples (explained in Part 2) it collects after each episode. The building blocks described below facilitate the updates our agent makes during the learning process so that it ends up with the optimal policy for playing Blackjack.
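To make this concrete, here is a minimal sketch of what one episode’s tuples could look like. The state format (player hand value, dealer up-card, usable ace) and the action encoding (0 = stand, 1 = hit) are illustrative assumptions; the exact encoding comes from the environment built in Part 2.

```python
# One hypothetical round of Blackjack recorded as (state, action, reward) tuples.
# Assumed state format: (player_hand_value, dealer_upcard, usable_ace).
# Assumed action encoding: 0 = stand, 1 = hit.
episode = [
    ((13, 10, False), 1, 0.0),  # hit on 13 vs. a dealer 10 -> no reward yet
    ((19, 10, False), 0, 1.0),  # stand on 19 -> round ends, agent wins (+1)
]
```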

Key Data Structures our Algorithm will Use and Update

  • Q Table: A table that keeps track of the value (or Q-value) of choosing an action in a given state. The Q-table is built by taking the Cartesian product (cross-product) of the observation_space and action_space defined in our Blackjack environment from Part 2. The initial Q-value for every state/action pair is 0.
  • Prob Table: A table built the same way as the Q-table: a Cartesian product of the observation_space and action_space. This table holds the probability that the agent will choose an action in a given state, which reinforces the stochastic approach to policies described in Part 1. The initial probabilities for every state are 50% hit / 50% stand.
  • Together, the Q-table and Prob table define a living, breathing, stochastic policy that our agent will constantly use to make decisions and update after getting rewards back from the environment (a rough sketch of these tables follows this list).
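As a rough illustration, the two tables could be built as plain Python dictionaries like this. The state ranges and action encoding below are placeholders standing in for the observation_space and action_space from Part 2, not the exact values used there.

```python
from itertools import product

# Placeholder spaces standing in for the environment's observation_space and
# action_space from Part 2 (the exact ranges and encodings are assumptions).
ACTIONS = [0, 1]  # assumed: 0 = stand, 1 = hit
STATES = [
    (player_total, dealer_upcard, usable_ace)
    for player_total in range(4, 22)   # player hand value
    for dealer_upcard in range(1, 11)  # dealer up-card (ace counted as 1)
    for usable_ace in (False, True)
]

# Q table: every (state, action) pair starts with a Q-value of 0.
q_table = {pair: 0.0 for pair in product(STATES, ACTIONS)}

# Prob table: every state starts at 50% hit / 50% stand.
prob_table = {state: {0: 0.5, 1: 0.5} for state in STATES}
```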

Important Variables that Impact the Agent’s Learning Process

  • Alpha (α): This can be thought of as the learning rate. After our agent gets a reward from the environment for an action in some state, it updates the Q-value of the corresponding state-action pair in our Q-table. α is the weight (or coefficient) given to that change in Q-value. α must be > 0 and ≤ 1. A lower α means that each round of Blackjack has a smaller impact on the policy, which facilitates more accurate learning over a larger number of episodes.
Image Made by Author
  • Epsilon (ε): This can be thought of as an analogous “learning rate” for the probabilities in the Prob table. When our agent gets a reward for some state + action, it also tweaks the probability of taking that same action in the future. ε applies a similar weight/coefficient as α to each of these changes. ε must be ≥ 0 and ≤ 1. A higher ε yields a smaller change in the probability of taking an action, keeping the policy closer to random and encouraging exploration.
Image Made by Author
  • Epsilon Decay (ε-decay): This is the rate at which ε decays after each episode. At the beginning of the agent’s learning process, we would like ε to start high so that each episode makes only small changes to the Prob table, because we want our agent to explore new actions. This helps ensure the final policy isn’t skewed heavily by randomness early in the learning process. For example, we don’t want a few successful “hit” actions for player-hand-value = 18 early in the learning process to make our agent decide that hitting is correct in this position in the long run. We reduce ε as the learning process goes on, using ε-decay, because we want the agent to exploit the accurate insights it gained in its earlier exploration phase.
Image Made by Author
  • Epsilon Minimum (ε-min): The explore vs. exploit dynamic is very delicate; the transition from exploring to exploiting can be very sudden if you are not careful. ε-min sets a floor that ε never decays below. Because a lower ε allows larger probability updates, this floor limits how much any one episode can alter the probability of an action for some state in the Prob table.
  • Gamma (γ): In a given episode (or round) of Blackjack, the AI agent will sometimes make more than one decision. Let’s say our AI agent hits when the player hand value = 4, and then makes 2 more decisions after that. The agent gets a reward only at the very end of this episode. How much is the initial “hit” action responsible for the final reward? γ answers this question: we use it as a discount rate on the final reward of the episode to approximate the reward attributable to that initial “hit” action. γ must be > 0 and ≤ 1. A sketch of how all of these levers could fit together follows this list.
Image Made by Author
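To show how these levers could interact, here is a minimal end-of-episode update sketch. It assumes the q_table and prob_table built earlier, an ε-soft style probability update, and illustrative hyperparameter values; it is not the exact algorithm or tuned configuration we will arrive at later.

```python
ALPHA = 0.01            # learning rate for Q-value updates (illustrative value)
EPSILON_DECAY = 0.9999  # per-episode decay applied to epsilon (illustrative value)
EPSILON_MIN = 0.05      # floor that epsilon never decays below (illustrative value)
GAMMA = 1.0             # discount rate on the episode's final reward (illustrative value)

def update_tables(episode, q_table, prob_table, epsilon):
    """Update both tables from one episode's (state, action, reward) tuples."""
    # Walk the episode backwards so earlier actions receive a gamma-discounted
    # share of the final reward.
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + GAMMA * G

        # Q update: nudge the old estimate toward the discounted return G,
        # weighted by the learning rate alpha.
        q_table[(state, action)] += ALPHA * (G - q_table[(state, action)])

        # Prob update (epsilon-soft style): the action with the higher Q-value
        # gets probability 1 - epsilon/2, the other gets epsilon/2. A higher
        # epsilon keeps the policy closer to 50/50, i.e. more exploration.
        best = max((0, 1), key=lambda a: q_table[(state, a)])
        for a in (0, 1):
            prob_table[state][a] = 1 - epsilon / 2 if a == best else epsilon / 2

    # Decay epsilon after each episode, but never below the epsilon-min floor.
    return max(epsilon * EPSILON_DECAY, EPSILON_MIN)
```

A training loop would call something like this once per simulated round, feeding the returned ε back in for the next episode.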

The variables above should be thought of as levers: they can be increased or decreased to experiment with the agent’s learning process. Later, we will go over which combination of these levers yields the best policy and highest returns in Blackjack.
