Applied Reinforcement Learning you own this product

Business optimization and LLM fine-tuning

Hadi Aghazadeh

MEAP began September 2025
Last updated October 2025
Publication in Summer 2026 (estimated)

ISBN 9781633434844
375 pages (estimated)

Included with a Manning Online subscription

printed in black & white

catalog / Data Science / Deep Learning / Deep Reinforcement Learning

resources: Source code Book forum Source code on GitHub

table of content

PART 1: BUILDING REINFORCEMENT LEARNING TOOLKITS FOR BUSINESS OPTIMIZATION

1 Reinforcement learning and business optimization: core concepts

1.1 What reinforcement learning really enables?

1.2 Different types of business analysis

1.3 Business optimization definition

1.4 Examples of business optimization problems

1.5 Challenges in business optimization problems

1.6 Classical business optimization models

1.6.1 Operations research

1.6.2 Stochastic simulation

1.6.3 System dynamics

1.6.4 Game theory

1.7 Reinforcement learning for business optimization

1.8 Limitations in classical models and reinforcement learning

1.9 Summary

2 Formulate business problems with Markov decision process

2.1 State: Anatomy of sequential decision making

2.2 Markov chain and Markov property

2.3 Markov decision process

2.4 Examples of Markov decision processes

2.5 Build a Markov Decision Process for Production Planning

2.6 Reward engineering and constraint handling strategies

2.6.1 Design rewards to be stepwise, whenever possible

2.6.2 Inject constraint information into the state

2.6.3 Handle soft constrains with stepwise penalties

2.6.4 Use action masking with penalties to handle hard constraints

2.6.5 Avoid mismatched scales with reward normalization / balancing

2.6.6 Avoid deceptive shortcuts in reward function

2.7 Summary

3 Design custom environments for reinforcement learning algorithms

3.1 Conceptual framework for designing business environment

3.2 Warehouse order picking environment

3.3 Perishable product dynamic pricing environment

3.4 Trailer loading and packing environment

3.5 Summary

PART 2: FUNDAMENTAL REINFORCEMENT LEARNING ALGORITHMS FOR BUSINESS OPTIMIZATION

4 Perfect knowledge, optimal policy: dynamic programming

4.1 Paradigms on solving Markov decision process

4.2 The domino decision rule: Bellman equations

4.3 Solving bellman equations: Generalized Policy Iteration

4.4 Hands-on code: solving a resource allocation problem

4.5 Limitations of dynamic programming

4.6 Summary

5 Bandit algorithms for personalized marketing

5.1 Bandits as lightweight reinforcement learning

5.2 Tradeoff between exploitation and exploration

5.3 Simulating an Ad campaign problem with bandit algorithms

5.4 Quantifying bandit algorithms performance with Regret

5.5 Dynamic personalized discounting with contextual bandits

5.6 Beyond stationary bandit problems

5.7 Summary

6 Scheduling with tabular reinforcement learning

6.1 Temporal difference learning

6.2 A concrete example: restaurant table scheduling

6.3 Off policy vs on policy learning

6.4 Tabular Reinforcement Learning: Q-learning and SARSA

6.4.1 SARSA: learning from what you actually do

6.4.2 Q-learning: learning from what you should do

6.5 TD(λ) and Eligibility traces

6.6 Gas station fuel purchase scheduling with tabular methods

6.7 Summary

7 Monte Carlo tree search for vehicle routing

PART 3: DEEP REINFORCEMENT LEARNING FOR BUSINESS OPTIMIZATION

8 Deep Q-networks for production line scheduling

9 Policy based reinforcement learning for large scale vehicle route planning

10 Actor-critic models for multi-echelon supply chain optimization

11 Deep determinstic poilicy gradient for dynamic pricing

PART 4: REINFORCEMENT LEARNING WITH HUMAN FEEDBACK FOR BUSINESS APPLICATIONS

12 Reinforcement learning with human feedback for building custom chatbot with fine tuned answers

Overview

2 Formulate business problems with Markov decision process

This chapter frames complex business optimization as a sequential decision-making problem and shows how Markov decision processes (MDPs) provide the structure to both decompose and reassemble it coherently. It begins by contrasting analysis with synthesis, arguing that breaking problems into parts only works if we can reconnect them under uncertainty and dynamics. The Markov property is introduced through Markov chains as the key simplification: if the present state captures all relevant information, the future depends only on now, not on the full past. That insight makes modeling tractable and elevates state design—what to include and how to summarize history—into a first-class concern for making real processes effectively Markovian.

Building on that foundation, the chapter formalizes the core MDP components—state, action, transition dynamics, and reward—and explains how policies map states to actions to optimize long-term outcomes rather than short-term gains. It illustrates the stakes of good state engineering with concrete domains such as dynamic pricing, call-center staffing, preventive maintenance, and customer retention, highlighting how omitting critical variables (e.g., inventory, fatigue, maintenance age, or engagement recency) breaks the Markov assumption and degrades decision quality. The message is consistent: durable performance comes from policies that look ahead, respect uncertainty and constraints, and balance competing objectives over time.

A hands-on production-planning case ties it all together: a factory chooses daily whether to produce scooters, bikes, or idle under stochastic demand, with a reward that nets revenue against production, setup, holding, and backorder costs. Several intuitive policies (greedy, always-produce, never-switch, setup-aware) are compared, showing that strategies mindful of switching costs and future risk consistently outperform myopic rules. The chapter then codifies practical reward engineering and constraint-handling tactics: prefer stepwise feedback over sparse returns, expose constraints in the state, use graded penalties for soft limits, apply action masking for hard constraints (optionally with gentle penalties near boundaries), normalize and balance reward scales, and avoid proxy incentives that invite shortcuts. These practices prepare the ground for building realistic environments and, ultimately, for learning scalable, high-quality policies automatically in subsequent chapters.

Three different types on how a system can evolve moving from one state to another.

Conditional probability of dependency of next state to all previous states.

Software service transitions graph and its matrix representation.

Transition distributions in three different scenarios

Markov chain and Markov decision process.

Two important probability distributions widely used for modeling business processes.

Cumulative profit ($) of following each policy in the planning horizon.

Two sides of reward function and some examples for each part.

An example of reward normalization for a multi objective portfolio optimization

Summary

Complex business problems require more than decomposition — they demand structured reassembly, and Markov decision processes offer just that.
Markov decision processes break decision-making into states, actions, rewards, and transitions, enabling sequential decision making under uncertainty.
State design is the key: a good state captures all essential information to predict what happens next (Markov property). Poor state design breaks predictive power.
Markov chains model how systems evolve probabilistically; Markov decision processes extend this by adding control — letting agents make decisions that influence outcomes.
A policy maps each state to an action, not for short-term gains but for long-term success — balancing current choices against future consequences.
Markov decision process modeling requires clear thinking about business objectives and constraints, which together guide optimal decisions.
Most real-world business problems can be modeled as Markov decision processes if states are crafted wisely.
Reward functions in business must encode multiple objectives and penalties — they are not just feedback signals but behavioral steering tools.
Effective reward engineering means giving feedback at the right time (stepwise rewards), balancing objective and constraint terms, and avoiding scale mismatches.
Hard constraints should be enforced using action masking and penalty, while soft constraints can be handled just by penalties and dynamic trade-offs.
Designing rewards requires careful monitoring to avoid unintended shortcuts, where agents optimize proxy metrics instead of real goals.

FAQ

What problem does an MDP solve for business optimization?

An MDP gives a structured way to make sequential decisions under uncertainty. It breaks a complex business process into states (snapshots), actions (choices), transition dynamics (how the world evolves), and rewards (feedback). Unlike passive prediction with Markov chains, MDPs let you intervene to shape outcomes and optimize long-term performance across routing, pricing, inventory, staffing, retention, and more.

What is a “state” and how do I make it Markovian?

A state is the minimal snapshot that contains all information needed to choose an action and predict what comes next. To make it Markovian, embed all critical internal and external factors (e.g., inventory, setup status, capacity, demand context, time index). If past information changes the next-state probabilities after conditioning on your current state, your state is missing features. Use domain knowledge first; statistical tests can help confirm sufficiency.

How do systems evolve in this framework: deterministic vs. stochastic transitions?

Deterministic: Next state is fixed given the current state (rare in business).
Stochastic with known transitions: You know or estimate transition probabilities (e.g., from data).
Stochastic with unknown transitions: Most realistic; you learn/estimate dynamics as you go.

Markov chains model stochastic evolution compactly by assuming the next state depends only on the current state, not the full history.

What is the Markov property and how can I check it in practice?

The Markov property means the future depends only on the present state, not on past states. Practically, you can simulate or analyze data to compare next-state distributions conditioned on (a) the current state vs. (b) current + previous states. If they’re statistically the same, your state is sufficient. If not, add the missing variables (e.g., time-since-last-maintenance, backlog, recency) to the state.

What are the core components of an MDP in business terms?

State (S): Current situation (e.g., inventory, capacity, open tickets, budget, time).
Action (A): Decision you control (e.g., dispatch, restock level, price, schedule).
Transition (P): How uncertainty moves you to a next state given the action (e.g., demand variability, delays, failures).
Reward (R): Immediate outcome signal (e.g., profit, cost, penalties, retention).
Policy (π): Strategy mapping states to actions to maximize long-run return.

How do I model transitions—probability matrices or stochastic simulators?

Small problems can use transition matrices. Real business problems are large and dynamic, so you typically write transition functions and sample from distributions (e.g., Poisson for discrete event counts like orders; Normal for aggregated continuous variability). This keeps models tractable while capturing key uncertainties.

What is a policy and why is long-term thinking essential?

A policy is a rule that picks an action in each state to maximize future cumulative reward. Good policies trade off immediate gains against future costs/risks (e.g., setup costs, backorders, churn). Myopic “greedy” choices often look good now but underperform over a horizon. The chapter’s factory case shows setup-aware, forward-looking policies winning over simplistic ones.

How should I design rewards (reward engineering) for business MDPs?

Prefer stepwise (dense) rewards to guide learning each step, not just at the end.
Encode objectives and constraints: revenue minus production/holding/backorder costs; penalties for violations.
Normalize/balance components (e.g., to a common scale or [-1,1]) to avoid dominance by one term.
Avoid deceptive proxies (e.g., clicks without revenue) that the agent can game.
Advanced: Lagrangian methods and potential-based shaping to balance multiple goals.

How do I handle constraints: soft vs. hard?

Expose constraints in the state (capacity, availability, temperatures, budgets) so the agent can plan around them.
Soft constraints: use graded, stepwise penalties (e.g., lateness bands, mild energy overuse penalties).
Hard constraints: use action masking to hide illegal moves (e.g., overweight loads, unavailable staff) and optionally add gentle penalties near limits to encourage safer choices.

What were the key takeaways from the production planning case study?

State: (day, scooter inventory, bike inventory, current setup) captured essential information to be Markovian.
Transitions: stochastic demand via Poisson; functional transitions instead of massive matrices.
Reward: revenue minus production, setup, holding, and backorder costs to reflect real trade-offs.
Policies: setup-aware outperformed greedy/random/always-one-product by balancing short-term sales with future costs.
Process: define S/A/P/R, encode constraints, simulate, baseline with heuristics, then scale to learning optimal policies.

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$55.99 $33.59

you save $22.40 (40%)

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

$55.99 $33.59

you save $22.40 (40%)

eBook

pdf, ePub, online

$55.99 $33.59

you save $22.40 (40%)

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more