
Cracking Blackjack — Part 3


Outline for this Article

In this article, I will be explaining the key building blocks that are used in the Reinforcement Learning algorithm we will use to maximize Blackjack returns. These building blocks are also used in many other Reinforcement Learning algorithms, so it is worthwhile to understand them in a context we know and love: Blackjack!

Always keep this diagram of the Reinforcement Learning Cycle from Part 2 in your head as you read this article!

Image made by Author

The Building Blocks of Our RL Algorithm

In short, the only job of our Reinforcement Learning algorithm is to define what the agent does with the state → action → reward tuples (explained in Part 2) it collects after each episode. The building blocks described below facilitate the updates our agent makes during the learning process so that it ends up with the optimal policy for playing Blackjack.
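To make this concrete, here is a minimal sketch of what one episode’s tuples could look like. The state format (player hand value, dealer up-card, usable ace) and the action encoding (0 = stand, 1 = hit) are illustrative assumptions; the exact encoding comes from the environment built in Part 2.

```python
# One hypothetical round of Blackjack recorded as (state, action, reward) tuples.
# Assumed state format: (player_hand_value, dealer_upcard, usable_ace).
# Assumed action encoding: 0 = stand, 1 = hit.
episode = [
    ((13, 10, False), 1, 0.0),  # hit on 13 vs. a dealer 10 -> no reward yet
    ((19, 10, False), 0, 1.0),  # stand on 19 -> round ends, agent wins (+1)
]
```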

Key Data Structures our Algorithm will Use and Update

  • Q Table: A table that keeps track of the value (or Q-value) of choosing an action in a given state. The Q-table is built by taking the Cartesian product (cross-product) of the observation_space and action_space defined in our Blackjack environment from Part 2. The initial Q-value for every state/action pair is 0.
  • Prob Table: A table built the same way as the Q-table: a Cartesian product of the observation_space and action_space. This table holds the probability that the agent will choose an action in a given state, which reinforces the stochastic approach to policies described in Part 1. The initial probabilities for every state are 50% hit / 50% stand.
  • Together, the Q-table and Prob table define a living, breathing, stochastic policy that our agent will constantly use to make decisions and update after getting rewards back from the environment (a rough sketch of these tables follows this list).
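As a rough illustration, the two tables could be built as plain Python dictionaries like this. The state ranges and action encoding below are placeholders standing in for the observation_space and action_space from Part 2, not the exact values used there.

```python
from itertools import product

# Placeholder spaces standing in for the environment's observation_space and
# action_space from Part 2 (the exact ranges and encodings are assumptions).
ACTIONS = [0, 1]  # assumed: 0 = stand, 1 = hit
STATES = [
    (player_total, dealer_upcard, usable_ace)
    for player_total in range(4, 22)   # player hand value
    for dealer_upcard in range(1, 11)  # dealer up-card (ace counted as 1)
    for usable_ace in (False, True)
]

# Q table: every (state, action) pair starts with a Q-value of 0.
q_table = {pair: 0.0 for pair in product(STATES, ACTIONS)}

# Prob table: every state starts at 50% hit / 50% stand.
prob_table = {state: {0: 0.5, 1: 0.5} for state in STATES}
```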

Important Variables that Impact the Agent’s Learning Process

  • Alpha (α): This can be thought of as the learning rate. After our agent gets a reward from the environment for an action in some state, it updates the Q-value of the corresponding state-action pair in our Q-table. α is the weight (or coefficient) given to that change in Q-value. α must be > 0 and ≤ 1. A lower α means that each round of Blackjack has a smaller impact on the policy, which facilitates more accurate learning over a larger number of episodes.
Image Made by Author
  • Epsilon (ε): This can be thought of as an analogous “learning rate” for the probabilities in the Prob table. When our agent gets a reward for some state + action, it also tweaks the probability of taking that same action in the future. ε applies a similar weight/coefficient as α to each of these changes. ε must be ≥ 0 and ≤ 1. A higher ε yields a smaller change in the probability of taking an action, keeping the policy closer to random and encouraging exploration.
Image Made by Author
  • Epsilon Decay (ε-decay): This is the rate at which ε decays after each episode. At the beginning of the agent’s learning process, we would like ε to start high so that each episode makes only small changes to the Prob table, because we want our agent to explore new actions. This helps ensure the final policy isn’t skewed heavily by randomness early in the learning process. For example, we don’t want a few successful “hit” actions for player-hand-value = 18 early in the learning process to make our agent decide that hitting is correct in this position in the long run. We reduce ε as the learning process goes on, using ε-decay, because we want the agent to exploit the accurate insights it gained in its earlier exploration phase.
Image Made by Author
  • Epsilon Minimum (ε-min): The explore vs. exploit dynamic is very delicate; the transition from exploring to exploiting can be very sudden if you are not careful. ε-min sets a floor that ε never decays below. Because a lower ε allows larger probability updates, this floor limits how much any one episode can alter the probability of an action for some state in the Prob table.
  • Gamma (γ): In a given episode (or round) of Blackjack, the AI agent will sometimes make more than one decision. Let’s say our AI agent hits when the player hand value = 4, and then makes 2 more decisions after that. The agent gets a reward only at the very end of this episode. How much is the initial “hit” action responsible for the final reward? γ answers this question: we use it as a discount rate on the final reward of the episode to approximate the reward attributable to that initial “hit” action. γ must be > 0 and ≤ 1. A sketch of how all of these levers could fit together follows this list.
Image Made by Author
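To show how these levers could interact, here is a minimal end-of-episode update sketch. It assumes the q_table and prob_table built earlier, an ε-soft style probability update, and illustrative hyperparameter values; it is not the exact algorithm or tuned configuration we will arrive at later.

```python
ALPHA = 0.01            # learning rate for Q-value updates (illustrative value)
EPSILON_DECAY = 0.9999  # per-episode decay applied to epsilon (illustrative value)
EPSILON_MIN = 0.05      # floor that epsilon never decays below (illustrative value)
GAMMA = 1.0             # discount rate on the episode's final reward (illustrative value)

def update_tables(episode, q_table, prob_table, epsilon):
    """Update both tables from one episode's (state, action, reward) tuples."""
    # Walk the episode backwards so earlier actions receive a gamma-discounted
    # share of the final reward.
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + GAMMA * G

        # Q update: nudge the old estimate toward the discounted return G,
        # weighted by the learning rate alpha.
        q_table[(state, action)] += ALPHA * (G - q_table[(state, action)])

        # Prob update (epsilon-soft style): the action with the higher Q-value
        # gets probability 1 - epsilon/2, the other gets epsilon/2. A higher
        # epsilon keeps the policy closer to 50/50, i.e. more exploration.
        best = max((0, 1), key=lambda a: q_table[(state, a)])
        for a in (0, 1):
            prob_table[state][a] = 1 - epsilon / 2 if a == best else epsilon / 2

    # Decay epsilon after each episode, but never below the epsilon-min floor.
    return max(epsilon * EPSILON_DECAY, EPSILON_MIN)
```

A training loop would call something like this once per simulated round, feeding the returned ε back in for the next episode.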

The variables above should be thought of as levers: they can be increased or decreased to experiment with the agent’s learning process. Later, we will go over which combination of these levers yields the best policy and highest returns in Blackjack.
