Grokking Deep Reinforcement Learning
Miguel Morales
  • MEAP began May 2018
  • Publication in November 2020 (estimated)
  • ISBN 9781617295454
  • 472 pages (estimated)
  • printed in black & white

The must-have book for anyone who wants a profound understanding of deep reinforcement learning.

Julien Pohie

We all learn through trial and error. We avoid the things that cause us to experience pain and failure. We embrace and build on the things that give us reward and success. This common pattern is the foundation of deep reinforcement learning: building machine learning systems that explore and learn based on the responses of the environment.

Grokking Deep Reinforcement Learning introduces this powerful machine learning approach, using examples, illustrations, exercises, and crystal-clear teaching. You'll love the perfectly paced lessons and the clever, engaging writing style as you dig into this awesome exploration of reinforcement learning fundamentals, effective deep learning techniques, and practical applications in this emerging field.

About the Technology

Deep reinforcement learning is a form of machine learning in which AI agents learn optimal behavior on their own from raw sensory input. The system perceives the environment, interprets the results of its past decisions, and uses this information to optimize its behavior for maximum long-term return. It has been said that deep reinforcement learning, which combines deep learning and reinforcement learning techniques to solve decision-making problems, is the solution to the full artificial intelligence problem.
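The perceive-interpret-optimize cycle described above can be sketched in a few lines of Python. Everything below (the toy environment, the agent, the reward values, the epsilon-greedy rule) is a hypothetical illustration of the idea, not code from the book:

```python
import random

# A toy environment (hypothetical): the agent picks action 0 or 1;
# action 1 yields a higher reward on average.
class ToyEnv:
    def reset(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.1
        done = self.t >= 10          # episodes last 10 steps
        return reward, done

# A minimal trial-and-error agent: it keeps a running value estimate per
# action and mostly repeats whichever action has paid off best so far.
class Agent:
    def __init__(self, n_actions=2, epsilon=0.1, alpha=0.5):
        self.q = [0.0] * n_actions   # value estimate per action
        self.epsilon = epsilon       # exploration rate
        self.alpha = alpha           # learning rate

    def act(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.q))                  # explore
        return max(range(len(self.q)), key=lambda a: self.q[a])   # exploit

    def learn(self, action, reward):
        # nudge the estimate toward the reward the environment returned
        self.q[action] += self.alpha * (reward - self.q[action])

random.seed(0)
env, agent = ToyEnv(), Agent()
for episode in range(50):
    env.reset()
    done = False
    while not done:
        a = agent.act()              # perceive and act
        r, done = env.step(a)        # environment responds
        agent.learn(a, r)            # interpret the result

print(agent.q[1] > agent.q[0])       # the rewarding action scores higher
```

The loop is the whole story in miniature: the agent tries actions, the environment answers with rewards, and the agent's estimates drift toward the behavior that maximizes long-term return.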

Deep reinforcement learning famously contributed to the success of AlphaGo and its successors (AlphaGo Zero, AlphaZero, and others), which beat the world's best human players at Go, one of the world's most difficult board games. But that is not the only thing you can do with deep reinforcement learning. These are some of its most notable applications:

  • Learn to play ATARI games just by looking at the raw image.
  • Learn to trade and manage portfolios effectively.
  • Learn low-level control policies for a variety of real-world models.
  • Discover tactics and collaborative behavior for improved campaign performance.
From low-level control to high-level tactical actions, deep reinforcement learning can solve large, complex decision-making problems.

But deep reinforcement learning is an emerging approach, so the best ideas are still yours to discover. We can’t wait to see how you apply deep reinforcement learning to solve some of the most challenging problems in the world.

About the book

Grokking Deep Reinforcement Learning is a beautifully balanced approach to teaching, offering numerous large and small examples, annotated diagrams and code, engaging exercises, and skillfully crafted writing. You'll explore, discover, and learn as you master the ins and outs of reinforcement learning, neural networks, and AI agents. You'll progress from small grid-world environments and foundational algorithms to today's most challenging environments and the cutting-edge techniques used to solve them.

Exciting, fun, and maybe even a little dangerous. Let's get started!

Table of Contents

1 Introduction to deep reinforcement learning

1.1 What is deep reinforcement learning?

1.1.1 Deep reinforcement learning is a machine learning approach to artificial intelligence

1.1.2 Deep reinforcement learning is concerned with creating computer programs

1.1.3 Deep reinforcement learning agents can solve problems that require intelligence

1.1.4 Deep reinforcement learning agents improve their behavior through trial-and-error learning

1.1.5 Deep reinforcement learning agents learn from sequential feedback

1.1.6 Deep reinforcement learning agents learn from evaluative feedback

1.1.7 Deep reinforcement learning agents learn from sampled feedback

1.1.8 Deep reinforcement learning agents utilize powerful non-linear function approximation

1.2 The past, present, and future of deep reinforcement learning

1.2.1 Recent history of artificial intelligence and deep reinforcement learning

1.2.2 Artificial intelligence winters

1.2.3 The current state of artificial intelligence

1.2.4 Progress in deep reinforcement learning

1.2.5 Opportunities ahead

1.3 The suitability of deep reinforcement learning

1.3.1 What are the pros and cons?

1.3.2 Deep reinforcement learning’s strengths

1.3.3 Deep reinforcement learning’s weaknesses

1.4 Setting clear two-way expectations

1.4.1 What to expect from the book?

1.4.2 How to get the most out of this book?

1.4.3 Deep reinforcement learning development environment

1.5 Summary

2 Mathematical foundations of reinforcement learning

2.1 Components of reinforcement learning

2.1.1 Examples of problems, agents, and environments

2.1.2 The agent: The decision-maker

2.1.3 The environment: Everything else

2.1.4 Agent-environment interaction cycle

2.2 MDPs: The engine of the environment

2.2.1 States: Specific configurations of the environment

2.2.2 Actions: A mechanism to influence the environment

2.2.3 Transition function: Consequences of agent actions

2.2.4 Reward signal: Carrots and sticks

2.2.5 Horizon: Time changes what’s optimal

2.2.6 Discount: The future is uncertain, value it less

2.2.7 Extensions to MDPs

2.2.8 Putting it all together

2.3 Summary

3 Balancing immediate and long-term goals

3.1 The objective of a decision-making agent

3.1.1 Policies: Per-state action prescriptions

3.1.2 State-value function: What to expect from here?

3.1.3 Action-value function: What to expect from here if I do this?

3.1.4 Action-advantage function: How much better if I do that?

3.1.5 Optimality

3.2 Planning optimal sequences of actions

3.2.1 Policy Evaluation: Rating policies

3.2.2 Policy Improvement: Using ratings to get better

3.2.3 Policy Iteration: Improving upon improved behaviors

3.2.4 Value Iteration: Improving behaviors early

3.3 Summary

4 Balancing the gathering and utilization of information

4.1 The challenge of interpreting evaluative feedback

4.1.1 Bandits: Single-state decision problems

4.1.2 Regret: The cost of exploration

4.1.3 Approaches to solving MAB environments

4.1.4 Greedy: Always exploit

4.1.5 Random: Always explore

4.1.6 Epsilon-Greedy: Almost always greedy and sometimes random

4.1.7 Decaying Epsilon-Greedy: First maximize exploration, then exploitation

4.1.8 Optimistic Initialization: Start off believing it’s a wonderful world

4.2 Strategic exploration

4.2.1 SoftMax: Select actions randomly in proportion to their estimates

4.2.2 UCB: It’s not just about optimism; it’s about realistic optimism

4.2.3 Thompson Sampling: Balancing reward and risk

4.3 Summary

5 Evaluating agents' behaviors

5.1 Learning to estimate the value of policies

5.1.1 First-visit Monte-Carlo: Improving estimates after each episode

5.1.2 Every-visit Monte-Carlo: A different way of handling state visits

5.1.3 Temporal-Difference Learning: Improving estimates after each step

5.2 Learning to estimate from multiple steps

5.2.1 N-step TD Learning: Improving estimates after a couple of steps

5.2.2 Forward-view TD(λ): Improving estimates of all visited states

5.2.3 TD(λ): Improving estimates of all visited states after each step

5.3 Summary

6 Improving agents' behaviors

6.1 The anatomy of reinforcement learning agents

6.1.1 Most agents gather experience samples

6.1.2 Most agents estimate something

6.1.3 Most agents improve a policy

6.1.4 Generalized Policy Iteration

6.2 Learning to improve policies of behavior

6.2.1 Monte-Carlo Control: Improving policies after each episode

6.2.2 Sarsa: Improving policies after each step

6.3 Decoupling behavior from learning

6.3.1 Q-Learning: Learning to act optimally, even if we choose not to

6.3.2 Double Q-Learning: A max of estimates for an estimate of a max

6.4 Summary

7 Achieving goals more effectively and efficiently

7.1 Learning to improve policies using robust targets

7.1.1 Sarsa(λ): Improving policies after each step based on multi-step estimates

7.1.2 Watkins’s Q(λ): Decoupling behavior from learning, again

7.2 Agents that interact, learn and plan

7.2.1 Dyna-Q: Learning sample models

7.2.2 Trajectory Sampling: Making plans for the immediate future

7.3 Summary

8 Introduction to value-based deep reinforcement learning

8.1 The kind of feedback deep reinforcement learning agents use

8.2 Deep reinforcement learning agents deal with sequential feedback

8.2.1 But, if it is not sequential, what is it?

8.2.2 Deep reinforcement learning agents deal with evaluative feedback

8.2.3 But, if it is not evaluative, what is it?

8.2.4 Deep reinforcement learning agents deal with sampled feedback

8.2.5 But, if it is not sampled, what is it?

8.3 Introduction to function approximation for reinforcement learning

8.3.1 Reinforcement learning problems can have high-dimensional state and action spaces

8.3.2 Reinforcement learning problems can have continuous state and action spaces

8.3.3 There are advantages when using function approximation

8.4 NFQ: The first attempt at value-based deep reinforcement learning

8.4.1 First decision point: Selecting a value function to approximate

8.4.2 Second decision point: Selecting a neural network architecture

8.4.3 Third decision point: Selecting what to optimize

8.4.4 Fourth decision point: Selecting the targets for policy evaluation

8.4.5 Fifth decision point: Selecting an exploration strategy

8.4.6 Sixth decision point: Selecting a loss function

8.4.7 Seventh decision point: Selecting an optimization method

8.4.8 Things that could (and do) go wrong

8.5 Summary

9 More stable value-based methods

9.1 DQN: Making reinforcement learning more like supervised learning

9.1.1 Common problems in value-based deep reinforcement learning

9.1.2 Using target networks

9.1.3 Using larger networks

9.1.4 Using experience replay

9.1.5 Using other exploration strategies

9.2 Double DQN: Mitigating the overestimation of action-value functions

9.2.1 The problem of overestimation, take two

9.2.2 Separating action selection and action evaluation

9.2.3 A solution

9.2.4 A more practical solution

9.2.5 A more forgiving loss function

9.2.6 Things we can still improve on

9.3 Summary

10 Sample-efficient value-based methods

10.1 Dueling DDQN: A reinforcement-learning-aware neural network architecture

10.1.1 Reinforcement learning is not a supervised learning problem

10.1.2 Nuances of value-based deep reinforcement learning methods

10.1.3 Advantage of using advantages

10.1.4 A reinforcement-learning-aware architecture

10.1.5 Building a dueling network

10.1.6 Reconstructing the action-value function

10.1.7 Continuously updating the target network

10.1.8 What does the dueling network bring to the table?

10.2 PER: Prioritizing the replay of meaningful experiences

10.2.1 A smarter way to replay experiences

10.2.2 Then, what is a good measure of "important" experiences?

10.2.3 Greedy prioritization by TD error

10.2.4 Sampling prioritized experiences stochastically

10.2.5 Proportional prioritization

10.2.6 Rank-based prioritization

10.2.7 Prioritization bias

10.3 Summary

11 Policy-gradient and actor-critic methods

11.1 REINFORCE: Outcome-based policy learning

11.1.1 Introduction to policy-gradient methods

11.1.2 Advantages of policy-gradient methods

11.1.3 Learning policies directly

11.1.4 Reducing the variance of the policy gradient

11.2 VPG: Learning a value function

11.2.1 Further reducing the variance of the policy gradient

11.2.2 Learning a value function

11.2.3 Encouraging exploration

11.3 A3C: Parallel policy updates

11.3.1 Using actor-workers

11.3.2 Using n-step estimates

11.3.3 Non-blocking model updates

11.4 GAE: Robust advantage estimation

11.4.1 Generalized advantage estimation

11.5 A2C: Synchronous policy updates

11.5.1 Weight-sharing model

11.5.2 Restoring order in policy updates

11.6 Summary

12 Advanced actor-critic methods

12.1 DDPG: Approximating a deterministic policy

12.1.1 DDPG uses lots of tricks from DQN

12.1.2 Learning a deterministic policy

12.1.3 Exploration with deterministic policies

12.2 TD3: State-of-the-art improvements over DDPG

12.2.1 Double learning in DDPG

12.2.2 Smoothing the targets used for policy updates

12.2.3 Delaying updates

12.3 SAC: Maximizing the expected return and entropy

12.3.1 Adding the entropy to the Bellman equations

12.3.2 Learning the action-value function

12.3.3 Learning the policy

12.3.4 Automatically tuning the entropy coefficient

12.4 PPO: Restricting optimization steps

12.4.1 Using the same actor-critic architecture as A2C

12.4.2 Batching experiences

12.4.3 Clipping the policy updates

12.4.4 Clipping the value function updates

12.5 Summary

13 Towards artificial general intelligence

13.1 What was covered, and what notably wasn’t?

13.1.1 Markov Decision Processes

13.1.2 Planning methods

13.1.3 Bandit methods

13.1.4 Tabular reinforcement learning

13.1.5 Value-based deep reinforcement learning

13.1.6 Policy-based and actor-critic deep reinforcement learning

13.1.7 Advanced actor-critic techniques

13.1.8 Model-based deep reinforcement learning

13.1.9 Derivative-free optimization methods

13.2 More advanced concepts towards AGI

13.2.1 What is AGI, again?

13.2.2 Advanced exploration strategies

13.2.3 Inverse reinforcement learning

13.2.4 Transfer learning

13.2.5 Multi-task learning

13.2.6 Curriculum learning

13.2.7 Meta learning

13.2.8 Hierarchical reinforcement learning

13.2.9 Multi-agent reinforcement learning

13.2.10 Explainable AI, Safety, Fairness, and Ethical Standards

13.3 What happens next?

13.3.1 How to use DRL to solve custom problems?

13.3.2 Going forward

13.3.3 Get yourself out there! Now!

13.4 Summary

What's inside

  • Foundational reinforcement learning concepts and methods
  • The most popular deep reinforcement learning agents solving complex environments
  • Cutting-edge agents that emulate human-like behavior and techniques for artificial general intelligence

About the reader

Written for developers with some understanding of deep learning. Experience with reinforcement learning is not required. Perfect for readers of Deep Learning with Python or Grokking Deep Learning.

About the author

Miguel Morales works on reinforcement learning at Lockheed Martin, Missiles and Fire Control, Autonomous Systems, in Denver, CO. He is a part-time Instructional Associate at Georgia Institute of Technology for the course in Reinforcement Learning and Decision Making. Miguel has worked for Udacity as a Machine Learning Project Reviewer, a Self-driving Car Nanodegree Mentor, and a Deep Reinforcement Learning Nanodegree Content Developer. He graduated from Georgia Tech with a Master’s degree in Computer Science specializing in Interactive Intelligence.
