Grokking Deep Reinforcement Learning
Miguel Morales
Foreword by Charles Isbell
  • October 2020
  • ISBN 9781617295454
  • 472 pages
  • printed in black & white
ePub + Kindle available Nov 3, 2020

This book is very well put together. It explains in technical but clear language what machine learning is, what deep learning is, and what reinforcement learning is.

From the Foreword by Charles Isbell


We all learn through trial and error. We avoid the things that cause us to experience pain and failure. We embrace and build on the things that give us reward and success. This common pattern is the foundation of deep reinforcement learning: building machine learning systems that explore and learn based on the responses of the environment. Grokking Deep Reinforcement Learning introduces this powerful machine learning approach, using examples, illustrations, exercises, and crystal-clear teaching. You'll love the perfectly paced teaching and the clever, engaging writing style as you dig into this awesome exploration of reinforcement learning fundamentals, effective deep learning techniques, and practical applications in this emerging field.

About the Technology

We learn by interacting with our environment, and the rewards or punishments we experience guide our future behavior. Deep reinforcement learning brings that same natural process to artificial intelligence, analyzing results to uncover the most efficient ways forward. DRL agents can improve marketing campaigns, predict stock performance, and beat grandmasters at Go and chess.
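To make that interaction cycle concrete, here is a minimal sketch of an agent-environment loop in Python. It is not code from the book: it assumes the gymnasium package (the maintained successor to OpenAI Gym), the CartPole-v1 environment, and a placeholder random policy standing in for a learned agent.

```python
# Minimal agent-environment interaction loop (illustrative sketch only).
# Assumes the `gymnasium` package is installed; the environment name and the
# random policy are assumptions for demonstration, not the book's own code.
import gymnasium as gym

env = gym.make("CartPole-v1")           # the environment: everything the agent doesn't control
state, _ = env.reset(seed=0)            # initial observation of the environment's state

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder policy: act at random
    state, reward, terminated, truncated, _ = env.step(action)  # environment responds
    total_reward += reward              # the reward signal that guides learning
    done = terminated or truncated

print(f"Episode return: {total_reward}")
env.close()
```

A learning agent replaces the random action choice with a policy it improves from the rewards it collects over many such episodes.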

About the book

Grokking Deep Reinforcement Learning uses engaging exercises to teach you how to build deep reinforcement learning systems. This book combines annotated Python code with intuitive explanations to explore DRL techniques. You'll see how algorithms function and learn to develop your own DRL agents using evaluative feedback.
Table of Contents

1 Introduction to deep reinforcement learning

1.1 What is deep reinforcement learning?

1.1.1 Deep reinforcement learning is a machine learning approach to artificial intelligence

1.1.2 Deep reinforcement learning is concerned with creating computer programs

1.1.3 Deep reinforcement learning agents can solve problems that require intelligence

1.1.4 Deep reinforcement learning agents improve their behavior through trial-and-error learning

1.1.5 Deep reinforcement learning agents learn from sequential feedback

1.1.6 Deep reinforcement learning agents learn from evaluative feedback

1.1.7 Deep reinforcement learning agents learn from sampled feedback

1.1.8 Deep reinforcement learning agents utilize powerful non-linear function approximation

1.2 The past, present, and future of deep reinforcement learning

1.2.1 Recent history of artificial intelligence and deep reinforcement learning

1.2.2 Artificial intelligence winters

1.2.3 The current state of artificial intelligence

1.2.4 Progress in deep reinforcement learning

1.2.5 Opportunities ahead

1.3 The suitability of deep reinforcement learning

1.3.1 What are the pros and cons?

1.3.2 Deep reinforcement learning’s strengths

1.3.3 Deep reinforcement learning’s weaknesses

1.4 Setting clear two-way expectations

1.4.1 What to expect from the book?

1.4.2 How to get the most out of this book?

1.4.3 Deep reinforcement learning development environment

1.5 Summary

2 Mathematical foundations of reinforcement learning

2.1 Components of reinforcement learning

2.1.1 Examples of problems, agents, and environments

2.1.2 The agent: The decision-maker

2.1.3 The environment: Everything else

2.1.4 Agent-environment interaction cycle

2.2 MDPs: The engine of the environment

2.2.1 States: Specific configurations of the environment

2.2.2 Actions: A mechanism to influence the environment

2.2.3 Transition function: Consequences of agent actions

2.2.4 Reward signal: Carrots and sticks

2.2.5 Horizon: Time changes what’s optimal

2.2.6 Discount: The future is uncertain, value it less

2.2.7 Extensions to MDPs

2.2.8 Putting it all together

2.3 Summary

3 Balancing immediate and long-term goals

3.1 The objective of a decision-making agent

3.1.1 Policies: Per-state action prescriptions

3.1.2 State-value function: What to expect from here?

3.1.3 Action-value function: What to expect from here if I do this?

3.1.4 Action-advantage function: How much better if I do that?

3.1.5 Optimality

3.2 Planning optimal sequences of actions

3.2.1 Policy Evaluation: Rating policies

3.2.2 Policy Improvement: Using ratings to get better

3.2.3 Policy Iteration: Improving upon improved behaviors

3.2.4 Value Iteration: Improving behaviors early

3.3 Summary

4 Balancing the gathering and utilization of information

4.1 The challenge of interpreting evaluative feedback

4.1.1 Bandits: Single state decision problems

4.1.2 Regret: The cost of exploration

4.1.3 Approaches to solving MAB environments

4.1.4 Greedy: Always exploit

4.1.5 Random: Always explore

4.1.6 Epsilon-Greedy: Almost always greedy and sometimes random

4.1.7 Decaying Epsilon-Greedy: First maximize exploration, then exploitation

4.1.8 Optimistic Initialization: Start off believing it’s a wonderful world

4.2 Strategic exploration

4.2.1 SoftMax: Select actions randomly in proportion to their estimates

4.2.2 UCB: It’s not just about optimism; it’s about realistic optimism

4.2.3 Thompson Sampling: Balancing reward and risk

4.3 Summary

5 Evaluating agents' behaviors

5.1 Learning to estimate the value of policies

5.1.1 First-visit Monte-Carlo: Improving estimates after each episode

5.1.2 Every-visit Monte-Carlo: A different way of handling state visits

5.1.3 Temporal-Difference Learning: Improving estimates after each step

5.2 Learning to estimate from multiple steps

5.2.1 N-step TD Learning: Improving estimates after a couple of steps

5.2.2 Forward-view TD(λ): Improving estimates of all visited states

5.2.3 TD(λ): Improving estimates of all visited states after each step

5.3 Summary

6 Improving agents' behaviors

6.1 The anatomy of reinforcement learning agents

6.1.1 Most agents gather experience samples

6.1.2 Most agents estimate something

6.1.3 Most agents improve a policy

6.1.4 Generalized Policy Iteration

6.2 Learning to improve policies of behavior

6.2.1 Monte-Carlo Control: Improving policies after each episode

6.2.2 Sarsa: Improving policies after each step

6.3 Decoupling behavior from learning

6.3.1 Q-Learning: Learning to act optimally, even if we choose not to

6.3.2 Double Q-Learning: a max of estimates for an estimate of a max

6.4 Summary

7 Achieving goals more effectively and efficiently

7.1 Learning to improve policies using robust targets

7.1.1 Sarsa(λ): Improving policies after each step based on multi-step estimates

7.1.2 Watkins’s Q(λ): Decoupling behavior from learning, again

7.2 Agents that interact, learn, and plan

7.2.1 Dyna-Q: Learning sample models

7.2.2 Trajectory Sampling: Making plans for the immediate future

7.3 Summary

8 Introduction to value-based deep reinforcement learning

8.1 The kind of feedback deep reinforcement learning agents use

8.2 Deep reinforcement learning agents deal with sequential feedback

8.2.1 But, if it is not sequential, what is it?

8.2.2 Deep reinforcement learning agents deal with evaluative feedback

8.2.3 But, if it is not evaluative, what is it?

8.2.4 Deep reinforcement learning agents deal with sampled feedback

8.2.5 But, if it is not sampled, what is it?

8.3 Introduction to function approximation for reinforcement learning

8.3.1 Reinforcement learning problems can have high-dimensional state and action spaces

8.3.2 Reinforcement learning problems can have continuous state and action spaces

8.3.3 There are advantages when using function approximation

8.4 NFQ: The first attempt at value-based deep reinforcement learning

8.4.1 First decision point: Selecting a value function to approximate

8.4.2 Second decision point: Selecting a neural network architecture

8.4.3 Third decision point: Selecting what to optimize

8.4.4 Fourth decision point: Selecting the targets for policy evaluation

8.4.5 Fifth decision point: Selecting an exploration strategy

8.4.6 Sixth decision point: Selecting a loss function

8.4.7 Seventh decision point: Selecting an optimization method

8.4.8 Things that could (and do) go wrong

8.5 Summary

9 More stable value-based methods

9.1 DQN: Making reinforcement learning more like supervised learning

9.1.1 Common problems in value-based deep reinforcement learning

9.1.2 Using target networks

9.1.3 Using larger networks

9.1.4 Using experience replay

9.1.5 Using other exploration strategies

9.2 Double DQN: Mitigating the overestimation of action-value functions

9.2.1 The problem of overestimation, take two

9.2.2 Separating action selection and action evaluation

9.2.3 A solution

9.2.4 A more practical solution

9.2.5 A more forgiving loss function

9.2.6 Things we can still improve on

9.3 Summary

10 Sample-efficient value-based methods

10.1 Dueling DDQN: A reinforcement-learning-aware neural network architecture

10.1.1 Reinforcement learning is not a supervised learning problem

10.1.2 Nuances of value-based deep reinforcement learning methods

10.1.3 Advantage of using advantages

10.1.4 A reinforcement-learning-aware architecture

10.1.5 Building a dueling network

10.1.6 Reconstructing the action-value function

10.1.7 Continuously updating the target network

10.1.8 What does the dueling network bring to the table?

10.2 PER: Prioritizing the replay of meaningful experiences

10.2.1 A smarter way to replay experiences

10.2.2 Then, what is a good measure of "important" experiences?

10.2.3 Greedy prioritization by TD error

10.2.4 Sampling prioritized experiences stochastically

10.2.5 Proportional prioritization

10.2.6 Rank-based prioritization

10.2.7 Prioritization bias

10.3 Summary

11 Policy-gradient and actor-critic methods

11.1 REINFORCE: Outcome-based policy learning

11.1.1 Introduction to policy-gradient methods

11.1.2 Advantages of policy-gradient methods

11.1.3 Learning policies directly

11.1.4 Reducing the variance of the policy gradient

11.2 VPG: Learning a value function

11.2.1 Further reducing the variance of the policy gradient

11.2.2 Learning a value function

11.2.3 Encouraging exploration

11.3 A3C: Parallel policy updates

11.3.1 Using actor-workers

11.3.2 Using n-step estimates

11.3.3 Non-blocking model updates

11.4 GAE: Robust advantage estimation

11.4.1 Generalized advantage estimation

11.5 A2C: Synchronous policy updates

11.5.1 Weight-sharing model

11.5.2 Restoring order in policy updates

11.6 Summary

12 Advanced actor-critic methods

12.1 DDPG: Approximating a deterministic policy

12.1.1 DDPG uses lots of tricks from DQN

12.1.2 Learning a deterministic policy

12.1.3 Exploration with deterministic policies

12.2 TD3: State-of-the-art improvements over DDPG

12.2.1 Double learning in DDPG

12.2.2 Smoothing the targets used for policy updates

12.2.3 Delaying updates

12.3 SAC: Maximizing the expected return and entropy

12.3.1 Adding the entropy to the Bellman equations

12.3.2 Learning the action-value function

12.3.3 Learning the policy

12.3.4 Automatically tuning the entropy coefficient

12.4 PPO: Restricting optimization steps

12.4.1 Using the same actor-critic architecture as A2C

12.4.2 Batching experiences

12.4.3 Clipping the policy updates

12.4.4 Clipping the value function updates

12.5 Summary

13 Towards artificial general intelligence

13.1 What was covered, and what notably wasn’t?

13.1.1 Markov Decision Processes

13.1.2 Planning methods

13.1.3 Bandit methods

13.1.4 Tabular reinforcement learning

13.1.5 Value-based deep reinforcement learning

13.1.6 Policy-based and actor-critic deep reinforcement learning

13.1.7 Advanced actor-critic techniques

13.1.8 Model-based deep reinforcement learning

13.1.9 Derivative-free optimization methods

13.2 More advanced concepts towards AGI

13.2.1 What is AGI, again?

13.2.2 Advanced exploration strategies

13.2.3 Inverse reinforcement learning

13.2.4 Transfer learning

13.2.5 Multi-task learning

13.2.6 Curriculum learning

13.2.7 Meta learning

13.2.8 Hierarchical reinforcement learning

13.2.9 Multi-agent reinforcement learning

13.2.10 Explainable AI, Safety, Fairness, and Ethical Standards

13.3 What happens next?

13.3.1 How to use DRL to solve custom problems?

13.3.2 Going forward

13.3.3 Get yourself out there! Now!

13.4 Summary

What's inside

  • An introduction to reinforcement learning
  • DRL agents with human-like behaviors
  • Applying DRL to complex situations

About the reader

For developers with basic deep learning experience.

About the author

Miguel Morales works on reinforcement learning at Lockheed Martin and is an instructor for the Georgia Institute of Technology’s Reinforcement Learning and Decision Making course.
