Overview

1 The Drug Discovery Process

This chapter introduces the modern drug discovery landscape and the role computational methods play in it. It frames discovery as a long, risky, and expensive journey from idea to market, with high attrition rates and timelines stretching a decade or more. Conceptually, the core challenge is a massive search across chemical and biological spaces: finding molecules with the right properties that bind the right targets while avoiding off-target effects. The chapter builds a shared vocabulary for targets, hits, leads, and key pharmacological concepts, and positions machine learning as a practical way to triage candidates early, reduce costly failures, and accelerate iteration.

The value proposition of machine learning and deep learning is illustrated through four high-impact areas: virtual screening and molecular property prediction, generative chemistry for de novo design of novel chemical entities, chemical reaction prediction and retrosynthesis for synthesizability and manufacturing, and protein structure prediction exemplified by advances in folding models. ML-based virtual screening scales faster than physics-heavy docking for early filtering; generative models invert the problem to propose structures that meet property goals, addressing novelty pressures highlighted by Eroom’s Law; retrosynthesis planning links design to makeability and process chemistry; and learned representations surpass hand-crafted features, enabling discoveries that escape human bias. Together, these methods expand coverage of chemical space, prioritize safer and more potent compounds, and tighten the loop between design and synthesis.

With that backdrop, the chapter surveys the end-to-end pipeline: target identification and validation; hit discovery; hit-to-lead and lead optimization guided by ADMET, PK/PD, efficacy, selectivity, and bioavailability; preclinical studies; and the clinical development phases, including pathways for expedited approval. It also lays the technical foundation for applying ML: supervised and unsupervised learning, generalization, molecular representations such as SMILES (canonical and isomeric), and practical tooling with RDKit to compute features, explore chemical space (e.g., PCA), and build simple classifiers (e.g., logistic regression) on real drug classes. The result is a cohesive map of where AI contributes most leverage across discovery, alongside terminology and data resources to begin building effective, ML-enabled workflows.
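
To make the tooling described above concrete, here is a minimal sketch, assuming RDKit and scikit-learn are installed; the SMILES strings and descriptor choices are illustrative placeholders, not the chapter's actual data set. It parses a few molecules, computes basic physicochemical descriptors, and projects them into a reduced space with PCA:

from rdkit import Chem
from rdkit.Chem import Descriptors
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative molecules; the chapter's data set of named drugs would be used instead.
smiles = [
    "CC(=O)Oc1ccccc1C(=O)O",         # aspirin
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O",    # ibuprofen
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # caffeine
    "c1ccccc1",                      # benzene
]

def featurize(smi):
    """Compute a small vector of physicochemical descriptors for one SMILES string."""
    mol = Chem.MolFromSmiles(smi)
    return [
        Descriptors.MolWt(mol),          # molecular weight
        Descriptors.MolLogP(mol),        # lipophilicity estimate
        Descriptors.NumHDonors(mol),     # hydrogen-bond donors
        Descriptors.NumHAcceptors(mol),  # hydrogen-bond acceptors
    ]

X = np.array([featurize(s) for s in smiles])

# Standardize the descriptors, then reduce them to two principal components
# that could be plotted to explore this (tiny) slice of chemical space.
X_reduced = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(X_reduced.shape)  # (4, 2)

The same descriptor matrix could feed a simple classifier such as logistic regression, which is the pattern the chapter builds on with real drug classes.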

Drug discovery can be thought of as a difficult search problem that exists at the intersection of the chemical search space of 10^63 drug-like compounds and the biological search space of 10^5 targets.
Using AI to guide early prediction and optimization of drug-like molecules, we can broaden the number of considered candidate molecules, identify failures earlier when they are relatively inexpensive, and accelerate delivery of novel therapeutics to the clinic for patient benefit.
In virtual screening, we start with a large, diverse library of compounds that we can filter using a predictive model that has learned to predict each compound's properties. Our predictive model has learned how to map the chemical space to the functional space. If a compound is predicted to have optimal properties, we carry it over for further experiments. In de novo design, we start with a defined set of property criteria that we can use along with a generative model to generate the structure of our ideal drug candidate. Our generative model knows how to map the functional space to the chemical space.
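
As a rough illustration of that filtering step, the sketch below trains a stand-in property predictor on fingerprints and carries forward only library compounds scored above a threshold. The SMILES, labels, model choice (logistic regression), and 0.5 cutoff are hypothetical placeholders, not the book's actual screening setup:

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.linear_model import LogisticRegression

def ecfp(smi, n_bits=1024):
    """Morgan (ECFP-like) bit fingerprint as a numpy array."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Toy training set: 1 = "has the desired property", 0 = "does not" (made-up labels).
train_smiles = ["CC(=O)Oc1ccccc1C(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "c1ccccc1", "CCO"]
train_labels = [1, 1, 0, 0]
model = LogisticRegression().fit(np.array([ecfp(s) for s in train_smiles]), train_labels)

# "Screen" a small library: keep only compounds the model scores above the threshold.
library = ["CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC", "CCCCCC"]
scores = model.predict_proba(np.array([ecfp(s) for s in library]))[:, 1]
hits = [smi for smi, score in zip(library, scores) if score > 0.5]
print(hits)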
The number of new drugs per billion USD of R&D spending reflects a downward trajectory. You may have heard of Moore’s Law, which is the observation that the number of transistors on an integrated circuit doubles approximately every two years. Moore’s Law implies that computing power doubles every couple of years while cost decreases. Eroom’s Law (Moore spelled backwards) is the observation that the inflation-adjusted R&D cost of developing new drugs doubles roughly every nine years. Eroom’s Law reflects diminishing returns in developing new drugs, including factors such as lower risk tolerance by regulatory agencies (the “cautious regulator” problem), the “throw money at it” tendency, and the need to show more than a modest incremental benefit over current successful drugs (the “better than the Beatles” problem). The plot was constructed with data from Scannell et al., which discusses the trend in greater detail [6].
If we know both the structure of our ligand (or compound) and the target, we can use structure-based design methods. If we only know the ligand structure, we are restricted to ligand-based design methods. Alternatively, if we only know the target structure, we can use de novo design to guide generation of a suitable drug candidate.
Artificial intelligence, ML, and deep learning are nested fields: ML is a subfield of AI, and deep learning is a subfield of ML.
Example pairs of isomeric SMILES.
Example drug molecules for each USAN stem classification within our data set.
Chemical space exploration in a reduced, 4-dimensional space.
Decision boundary of our logistic regressor for classifying “-cillin” (left) and “-olol” (right) USAN stems. For each plot, colored samples belong to the positive class and uncolored samples belong to the negative class.
We can break down drug design into target identification and validation, hit discovery, hit-to-lead (lead identification), lead optimization, and preclinical development. Once a drug candidate has progressed to the drug development stage, it will need to pass multiple phases of clinical trials testing safety and efficacy prior to submission to and review by the FDA and launch to market.
We can break down the ADMET properties into the following broad descriptions. Absorption refers to the process by which a drug enters the bloodstream from its administration site, such as the gastrointestinal tract for oral drugs or the respiratory system for inhalation drugs. Distribution pertains to the movement of a drug within the body once it has entered the bloodstream. Metabolism refers to the biochemical transformation of a drug within the body, primarily carried out by enzymes. Metabolic processes aim to convert drugs into more polar and water-soluble metabolites, facilitating their elimination from the body. Excretion involves the removal of drugs and their metabolites from the body. Toxicity assessment aims to evaluate the potential adverse effects of a drug candidate on various organs, tissues, or systems.
We can segment the early drug discovery pipeline into four main phases: target identification, hit discovery, hit-to-lead or lead identification, and lead optimization. Target identification designates a valid target whose activity is worth modulating to address some disease or disorder. Hit discovery uncovers chemical compounds with activity against the target. Lead identification selects the most promising hits and lead optimization improves their potency, selectivity, and ADMET properties to be suitable for preclinical study.
In virtual screening, we conduct our search across a chemical space consisting of an enormous set of molecules. In de novo design, we are still conducting an (informal) search, just not across the chemical space. We are now searching across the functional space of potential molecular properties. If our model “learns” which section of the functional space maps to molecules that have ideal binding affinity and safety, then perhaps it can reverse-engineer novel molecular structures in the chemical space that match our functional criteria.
Preclinical trials evaluate drug candidate safety and efficacy in model organisms. Phase I clinical trials evaluate drug candidate safety in its first exposure to humans. Phase II and Phase III clinical trials continue to collect data on safety while measuring drug candidate efficacy in larger groups of patients. The pass rate of our lead compounds decreases drastically as they progress beyond preclinical stages, while the time required to test them increases.

Summary

  • Developing therapeutics entails a long, arduous process. Traditional development from ideation to market is costly (on the order of billions of dollars), lengthy (10 to 15 years), and risky (attrition of over 90%). Through advances in AI, we can discover therapeutics with better safety profiles, address medical conditions and diseases with little existing coverage, and reach patients more quickly.
  • Drug discovery can be thought of as a difficult search problem that exists at the intersection of the chemical search space of 10^63 medicinal compounds and the biological search space of 10^5 targets.
  • Applications of AI to drug design include molecule property prediction for virtual screening, creation of compound libraries with de novo molecule generation, synthesis pathway prediction, and protein folding simulation.
  • ML is a subfield of AI that enables computers to learn from and make decisions based on data, automatically and without explicit programming or rules on how to behave. Example ML algorithms include logistic regression and random forests. Deep learning is a subfield of ML that uses deep neural networks to extract complex patterns and representations from data.
  • We can segment the early drug discovery pipeline into four main phases: target identification, hit discovery, hit-to-lead or lead identification, and lead optimization. Target identification designates a valid target whose activity is worth modulating to address some disease or disorder. Hit discovery uncovers chemical compounds with activity against the target. Lead identification selects the most promising hits and lead optimization improves their potency, selectivity, and ADMET properties to be suitable for preclinical study.
  • Popular, well-maintained chemical data repositories include ChEMBL, ChEBI, PubChem, Protein Data Bank (PDB), AlphaFoldDB, and ZINC. When using a new data source, learn how it was assembled and how quality is maintained. Garbage data in, garbage model out. See “Appendix B: Chemical Data Repositories” for more information.

FAQ

What’s the difference between drug discovery and drug development?
Drug discovery covers identifying and optimizing candidate molecules: target identification/validation, hit discovery, hit-to-lead, lead optimization, and preclinical testing. Drug development starts once a candidate enters human studies and includes clinical trials (Phases I–III), regulatory review (e.g., FDA), and market launch. Overall timelines are long (often 10–15 years) and costly, with high attrition.
Why is drug discovery often described as a “needle in a haystack” problem?
The search space is enormous: estimates suggest around 10^63 drug-like molecules and roughly 10^5 human protein targets. Experimental screening throughput (about 10^5–10^7 compounds/day) cannot cover this space, making exhaustive testing infeasible. Machine learning helps triage by prioritizing promising regions to explore.
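
A quick back-of-the-envelope calculation with these order-of-magnitude figures shows why exhaustive screening is infeasible:

# Order-of-magnitude estimates quoted above; the point is the scale, not the exact values.
chemical_space = 10 ** 63     # estimated number of drug-like molecules
per_day = 10 ** 7             # optimistic experimental screening throughput
years_needed = chemical_space / per_day / 365
print(f"{years_needed:.1e} years to screen everything")  # ~2.7e+53 years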
Where does AI/ML deliver the most value in this chapter’s context?
High-impact areas include molecular property prediction, virtual screening, de novo generative chemistry, chemical reaction prediction and retrosynthesis, and protein structure prediction. These methods reduce cost and time, broaden candidate funnels, flag failures earlier, and guide more targeted experiments.
What is virtual screening, and how do physics-based and ML-based approaches differ?
Virtual screening prioritizes compounds computationally before lab testing. Physics-based methods (e.g., docking, molecular dynamics) simulate interactions using force fields but can be slow for large libraries. ML-based approaches directly predict properties (like binding affinity or toxicity) from molecular structure, enabling much higher throughputs (up to ~10^9–10^12 compounds/day) to focus experiments on top candidates.
What is generative chemistry (de novo design), and why does novelty matter?
Generative chemistry produces new molecular structures that meet desired property criteria by mapping from “functional space” (target properties) back to “chemical space” (structures). Novelty matters because incremental “me-too” designs can limit therapeutic impact and patentability. Addressing declining R&D efficiency (Eroom’s Law) requires tools that propose innovative, high-quality candidates rather than small variations.
What ML fundamentals are most relevant here?
Supervised learning uses labeled data for classification (e.g., toxic vs non-toxic) or regression (e.g., solubility). Unsupervised learning explores structure in unlabeled data (e.g., clustering, dimensionality reduction). Good models generalize beyond training data; overfitting harms performance on novel chemistries. Data context matters: in vitro results may not translate directly in vivo.
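
The generalization point can be illustrated with a minimal supervised-learning sketch; the data here are synthetic stand-ins (scikit-learn's make_classification) rather than measured molecular properties:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data standing in for, e.g., toxic vs. non-toxic labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
# Held-out accuracy is the number that matters for generalizing to novel chemistries.
print("test accuracy:", model.score(X_test, y_test))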
What are chemical reaction prediction and retrosynthesis, and how can ML help?
Forward synthesis connects starting reactants to products; retrosynthesis works backward from a target molecule to simpler precursors. The combinatorial explosion of possible transformations makes manual planning difficult. Data-driven models can rank plausible steps and routes, supporting both discovery of new compounds and efficient, scalable process chemistry.
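
As a purely conceptual illustration of retrosynthesis as a search problem, the toy sketch below recurses over a hand-written (and chemically naive) table of disconnections until every precursor is in a hypothetical stock of building blocks; real planners learn and score such steps from large reaction datasets:

# Toy table of "product -> possible precursor sets"; labels are abstract stand-ins.
toy_routes = {
    "ester": [("acid", "alcohol")],
    "acid":  [("nitrile",)],
    "amide": [("acid", "amine")],
}
stock = {"alcohol", "nitrile", "amine"}  # hypothetical purchasable building blocks

def plan(target, depth=0, max_depth=5):
    """Return a nested route that ends in stock compounds, or None if none is found."""
    if target in stock:
        return target
    if depth >= max_depth or target not in toy_routes:
        return None
    for precursors in toy_routes[target]:
        sub = [plan(p, depth + 1, max_depth) for p in precursors]
        if all(s is not None for s in sub):
            return {target: sub}
    return None

print(plan("ester"))  # {'ester': [{'acid': ['nitrile']}, 'alcohol']}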
Why is protein structure prediction important, and what changed recently?
Protein 3D structure underpins biological function and druggability, but experimental determination is costly and slow. Deep learning systems (e.g., AlphaFold-class models) bridged the sequence-to-structure gap at scale, accelerating target understanding, binding-site analysis, and structure-based design.
How are molecules represented for ML, and what tools are used?
Common textual encodings include SMILES, with canonical SMILES providing a unique standardized form and isomeric SMILES capturing stereochemistry. Models often use descriptors or fingerprints (e.g., ECFP) as numerical features. RDKit is a widely used open-source toolkit for parsing structures, computing features, visualization, and integrating with ML libraries.
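
A brief RDKit sketch of these representation points, using benzoic acid and alanine as convenient examples (any valid structures would do):

from rdkit import Chem
from rdkit.Chem import AllChem

# Two different ways of writing benzoic acid...
m1 = Chem.MolFromSmiles("OC(=O)c1ccccc1")
m2 = Chem.MolFromSmiles("c1ccccc1C(O)=O")
# ...produce the same canonical SMILES, giving each structure a unique standardized form.
print(Chem.MolToSmiles(m1) == Chem.MolToSmiles(m2))  # True

# Isomeric SMILES retain stereochemistry; it can be dropped for the plain canonical form.
ala = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")        # alanine with a stereocenter
print(Chem.MolToSmiles(ala))                        # isomeric canonical SMILES
print(Chem.MolToSmiles(ala, isomericSmiles=False))  # stereochemistry removed

# Morgan fingerprint (radius 2), commonly used as an ECFP-style feature vector.
fp = AllChem.GetMorganFingerprintAsBitVect(ala, 2, nBits=2048)
print(fp.GetNumOnBits())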
What are the key stages and properties optimized in early discovery, and how do clinical trials proceed?
Early stages: target identification/validation, hit discovery, hit-to-lead, and lead optimization. Optimization focuses on potency, selectivity, safety, and ADMET (absorption, distribution, metabolism, excretion, toxicity). PK (what the body does to the drug) and PD (what the drug does to the body) guide these decisions. After successful preclinical studies, candidates enter clinical trials: Phase I (safety, dose in healthy volunteers), Phase II (safety and preliminary efficacy in patients), and Phase III (confirm efficacy, safety, and benefit–risk at scale), followed by regulatory review. Some drugs can qualify for expedited pathways under specific criteria.
