Overview

1 Reinforcement learning and business optimization: core concepts

Businesses operate under uncertainty and resource constraints, so the core managerial challenge is making good sequential decisions that balance today’s actions with tomorrow’s consequences. This chapter frames that challenge as business optimization and positions reinforcement learning (RL) as a natural fit because it learns how to act—not just predict—by interacting with an environment, receiving feedback, and improving policies over time. Contrasted with unsupervised and supervised learning, RL focuses on maximizing long‑term value through trial, error, and credit assignment, making it relevant to operational decisions such as pricing, promotions, inventory allocation, and routing in dynamic, competitive markets.

The chapter organizes business questions around external and internal factors and across time: descriptive and predictive analyses for understanding the past and forecasting the future of external variables, and explanatory and optimization analyses for understanding causes and prescribing actions on internal levers. It clarifies when optimization is most useful—typically operational, recurring, multi-entity, and quantifiable settings—and outlines a general modeling framework: inputs (external parameters and decision variables), objectives (often multi-objective), constraints (the hard part in practice), and outputs (metrics and recommended actions). Real-world examples illustrate this “sweet spot,” including retail replenishment, vehicle routing, production scheduling, workforce rostering, bike-sharing rebalancing, and dynamic pricing, while noting that solution approaches may be model-based, data-driven, or hybrid.

Turning to practicality, the chapter highlights evaluation criteria and trade-offs—like robustness, resilience, real-time responsiveness, adaptability, flexibility, generalizability, customizability, effort, lifecycle cost, and interpretability—framed by the classic bias–variance tension. It reviews classical methods—operations research (LP/MIP/NLP), stochastic simulation (queueing, Monte Carlo, discrete-event), system dynamics, and game theory—showing their strengths yet also limits when assumptions break in volatile environments. RL complements rather than replaces these tools: it adapts through experience, plans for long-term rewards, and handles changing conditions, but demands data or simulators, careful training, and attention to explainability. The takeaway is a pragmatic one: use the right tool for the right question, and leverage RL to extend business optimization where sequential decisions under uncertainty and the need for learning-by-doing dominate.

Figures

  • Reinforcement learning in the context of machine learning
  • Two types of questions and analytical approaches for analyzing external factors
  • Two types of questions and analytical approaches for analyzing internal factors
  • Framework for business optimization models
  • Variance and bias trade-off in business optimization models
  • Linear programming formulation of the bakery shop problem
  • Overview of the reinforcement learning framework

Summary

  • Businesses must make smart decisions under uncertainty with limited resources.
  • Understanding external (uncontrollable) and internal (controllable) factors is key to effective analysis.
  • Business analysis types include descriptive, predictive, explanatory, and optimization.
  • Optimization focuses on shaping internal factors to improve future outcomes.
  • Decisions in business problems vary by level (strategic/tactical/operational), frequency, scale, and measurability.
  • Optimization models include inputs (parameters and decision variables), objectives, constraints, and outputs (objective values and recommended decisions).
  • A major challenge in optimization is managing the bias–variance trade-off in operational settings.
  • Classical models like operations research, simulation, and system dynamics are powerful but often rigid and static.
  • Reinforcement learning extends classical models by enabling adaptive, sequential decision-making.
  • Reinforcement learning learns through trial-and-error, using feedback to improve policies over time.
  • A comparison shows reinforcement learning excels in adaptability, real-time learning, and dynamic environments.
  • Reinforcement learning downsides include training cost, data needs, and explainability—but it's improving rapidly.
  • Reinforcement learning is not a replacement but a powerful extension and complement of classical optimization models.

FAQ

What makes reinforcement learning different from supervised and unsupervised learning for business use?

Reinforcement learning (RL) learns how to act, not just how to predict or find patterns. An RL agent makes sequential decisions, receives rewards or penalties, and improves a policy to maximize long-term value. Supervised learning maps inputs to labeled outcomes (e.g., churn prediction), and unsupervised learning finds structure without labels (e.g., customer clustering). RL tackles credit assignment and delayed consequences, making it well suited to multi-step business decisions under uncertainty.
Which types of business questions map to which analytical approaches?

Divide factors into external (outside your control) and internal (within partial or full control). Typical question types:

  • External, past: “What happened?” → Descriptive analysis
  • External, future: “What will happen?” → Predictive/forecasting
  • Internal, past: “Why did it happen?” → Explanatory/causal analysis
  • Internal, future: “What should we do?” → Optimization

Because internal factors can be influenced by external ones, you often combine these analyses.
When is business optimization the right tool?

It’s most effective when decisions are:

  • Operational (vs. strategic)
  • Periodic/recurring
  • Multi-entity (many products, vehicles, employees, etc.)
  • Quantifiable (clear objectives/constraints)

At higher strategic levels, answers become more qualitative, and model-free frameworks may be preferable.
What are the essential components of a business optimization model?

Four core elements:

  • Inputs: external parameters (e.g., demand forecasts, lead times) and decision variables (actions to choose)
  • Objective(s): what to maximize or minimize (cost, service level, revenue, makespan, multi-objective trade-offs)
  • Constraints: real-world limits (capacity, regulations, SLAs, labor rules)
  • Outputs: metrics (objective values/KPIs) and recommended actions (values for the decision variables)
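The four components can be made concrete with a toy single-period replenishment decision; all the numbers, names, and the brute-force search below are illustrative assumptions, not the chapter’s model:

```python
# Sketch of the four model components for a toy replenishment decision.
# All numbers are invented for illustration.

# Inputs: external parameters (forecast, unit economics).
demand_forecast = 120          # units expected this period
unit_cost, holding_cost, stockout_cost = 2.0, 0.5, 4.0
capacity = 150                 # Constraint: warehouse limit on the order

def objective(order_qty):
    """Objective: total cost of ordering order_qty units."""
    leftover = max(order_qty - demand_forecast, 0)
    shortfall = max(demand_forecast - order_qty, 0)
    return (unit_cost * order_qty
            + holding_cost * leftover
            + stockout_cost * shortfall)

# Decision variable: order_qty, searched over the feasible range 0..capacity.
best = min(range(capacity + 1), key=objective)

# Outputs: the recommended action and its objective value (the KPI).
print(best, objective(best))   # orders exactly the forecast: 120 units, cost 240.0
```

For a problem this small, brute force over the feasible range is enough; the same structure (inputs, objective, constraints, outputs) carries over when a real solver replaces the search.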
What are representative business optimization problems covered in this chapter?

Examples include:

  • Inventory replenishment across stores (cost vs. service levels)
  • Vehicle routing and dispatching (distance/time/fuel)
  • Production scheduling (throughput, costs, deadlines)
  • Workforce shift scheduling (coverage, labor rules, cost)
  • Bike-sharing rebalancing (network balance, routing cost)
  • Dynamic pricing for perishables (revenue over time)
What challenges and evaluation criteria matter for real-world optimization models?

Key dimensions include robustness, resilience, real-time responsiveness, adaptability, flexibility, generalizability, customizability, effort to build, effort to operationalize, lifecycle cost, and interpretability. Managing the bias–variance trade-off across changing conditions is central to sustained performance.
How do classical approaches fit in: OR, stochastic simulation, system dynamics, and game theory?

  • Operations research (LP/MIP/NLP): precise formulations with objectives and constraints, solved by mature solvers; strong for well-specified problems
  • Stochastic simulation (queueing, Monte Carlo, discrete-event): explores variability and risk when uncertainty is prominent
  • System dynamics: models feedback loops and time delays for long-term, policy-level effects
  • Game theory: analyzes strategic interactions among multiple decision-makers (competition/cooperation)

Often, hybrids of these methods work best.
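The chapter’s bakery shop example is an instance of the first approach. A minimal sketch, assuming SciPy’s `linprog` is available; the profits and resource limits are invented for illustration, not taken from the chapter:

```python
from scipy.optimize import linprog

# Hypothetical bakery LP: choose daily quantities of bread (x1) and
# cake (x2) to maximize profit. All numbers are illustrative.
#
#   maximize  4*x1 + 5*x2               (profit per unit)
#   s.t.      0.5*x1 + 1.0*x2 <= 40     (kg of flour available)
#             1.0*x1 + 0.5*x2 <= 50     (oven hours available)
#             x1, x2 >= 0

c = [-4, -5]                            # linprog minimizes, so negate profits
A_ub = [[0.5, 1.0], [1.0, 0.5]]
b_ub = [40, 50]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)                  # optimal plan: 40 loaves, 20 cakes, profit 260
```

Once the objective and constraints are written down this precisely, a mature solver finds the optimum reliably; the rigidity the chapter notes comes from having to fix those numbers in advance.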
How does reinforcement learning address limits of classical models, and where does it still struggle?

RL adapts through experience, handling nonstationary environments and enabling fast inference once trained. It learns policies for sequential actions without needing a fully specified model up front. Challenges remain: it requires many interactions or good simulators, can be computationally intensive and unstable to train, and is often less interpretable without additional tooling.
What is the RL agent–environment loop, and why does long-term planning matter?

An agent observes a state, takes an action, receives a reward, and transitions to a new state, repeating until it learns a policy that maximizes cumulative reward. Long-term planning is crucial because business outcomes often involve delayed effects; RL optimizes sequences of actions, not one-off decisions.
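The observe–act–reward–update loop can be sketched in its simplest form, a one-state pricing bandit; the price levels, reward distributions, and learning constants below are invented for illustration:

```python
import random

# Minimal agent-environment loop (a sketch, not the chapter's code).
# Toy setting: each step the agent picks one of two hypothetical price
# levels; "low" sells more units at a thin margin, "high" earns more
# per unit with noisier revenue. Tabular action values are updated
# from the observed reward.

random.seed(0)
ACTIONS = ["low_price", "high_price"]
Q = {a: 0.0 for a in ACTIONS}          # learned value of each action
alpha, epsilon = 0.1, 0.1              # learning rate, exploration rate

def step(action):
    """Environment: return the reward (revenue) for the chosen price."""
    if action == "low_price":
        return 1.0 + random.gauss(0, 0.1)
    return 1.5 + random.gauss(0, 0.5)

for t in range(2000):
    # Agent observes its (single, trivial) state and acts epsilon-greedily...
    if random.random() < epsilon:
        a = random.choice(ACTIONS)     # explore
    else:
        a = max(Q, key=Q.get)          # exploit current estimates
    r = step(a)                        # ...receives a reward from the environment...
    Q[a] += alpha * (r - Q[a])         # ...and improves its policy estimate.

print(Q)                               # "high_price" ends with the larger learned value
```

A full RL problem adds state transitions, so today’s action changes tomorrow’s state and the agent must value whole sequences of actions; this single-state version isolates just the feedback loop.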
How should I choose between model-based, data-driven, RL, or hybrid approaches?

Consider:

  • Factor type and control (external vs. internal)
  • Data availability and feedback signals
  • Need for adaptability to change
  • Real-time decision requirements
  • Constraint complexity and audit needs (interpretability)
  • Build/operationalization effort and lifecycle costs

Use RL when sequential decisions, uncertainty, and adaptation are central; favor classical OR when the system is well specified; use simulation and system dynamics to explore uncertainty and long-term policies; and combine methods for practical hybrids.
