Overview

1 The Drug Discovery Process

This chapter introduces the landscape of drug discovery in contrast to drug development, framing the immense cost, time, and risk involved in bringing a therapeutic to market (often 10–15 years, billions of dollars, and high attrition). It motivates the need for computation by highlighting the dual “needle-in-a-haystack” challenge: a chemical search space of astronomic size and a large, complex biological target space. With targets defined as biomolecules whose activity we aim to modulate, the chapter positions machine learning and deep learning as tools to prioritize safe, effective candidates earlier, reduce experimental burden, and provide a common terminology and data foundation for modern, AI-enabled discovery.

The value proposition of deep learning is illustrated across key tasks: molecular property prediction and virtual screening (contrasting physics-based docking with fast ML predictors), generative chemistry for de novo design guided by desired properties, reaction prediction and automated retrosynthesis for feasible synthesis routes, and protein structure prediction that accelerates target understanding. The text discusses throughput advantages (orders of magnitude more candidates triaged virtually), the persistent need for novelty beyond “privileged” scaffolds, and how deep models learn task-specific, previously unknown features that reduce human bias. Case studies (e.g., antibiotic discovery) exemplify how these methods surface structurally novel, pre-optimized candidates and strengthen the rationale for AI-driven pipelines.

Finally, the chapter surveys the end-to-end discovery pipeline—target identification and validation; target-to-hit screening; hit-to-lead (lead identification); and lead optimization with attention to potency, selectivity, safety, and ADMET—before proceeding to preclinical evaluation and phased clinical development, including pathways for expedited approval. Alongside this process view, it lays foundational ML concepts (supervised vs. unsupervised learning, generalization), molecular representations (canonical and isomeric SMILES), and core tools (RDKit, fingerprints), illustrated through dimensionality reduction and simple classification over labeled drug families. Together, these elements show where AI can most effectively compress timelines, lower cost and risk, and improve decision quality throughout discovery.

Drug discovery can be thought of as a difficult search problem that exists at the intersection of the chemical search space of 10^63 drug-like compounds and the biological search space of 10^5 targets.
Using AI to guide early prediction and optimization of drug-like molecules, we can broaden the number of considered candidate molecules, identify failures earlier when they are relatively inexpensive, and accelerate delivery of novel therapeutics to the clinic for patient benefit.
In virtual screening, we start with a large, diverse library of compounds that we can filter using a predictive model that has learned to predict what properties each compound has. Our predictive model has learned how to map the chemical space to the functional space. If the compound is predicted to have optimal properties, we carry it over for further experiments. In de novo design, we start with a defined set of property criteria that we can use along with a generative model to generate the structure of our ideal drug candidate. Our generative model knows how to map the functional space to the chemical space.
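
The virtual screening half of this picture can be sketched in a few lines. This is a minimal illustration, not the chapter's implementation: the library entries, the `descriptor` values, `toy_predictor`, and the threshold are all hypothetical placeholders standing in for a real compound library and a trained model.

```python
from typing import Callable, Dict, List, Tuple

# Toy "library": SMILES strings paired with a made-up numeric descriptor.
# The descriptor values are illustrative placeholders, not measured data.
library: List[Dict] = [
    {"smiles": "CCO", "descriptor": 0.2},
    {"smiles": "c1ccccc1", "descriptor": 0.7},
    {"smiles": "CC(=O)O", "descriptor": 0.9},
    {"smiles": "CCN", "descriptor": 0.4},
]


def toy_predictor(compound: Dict) -> float:
    """Hypothetical stand-in for a trained model (chemical space -> property score)."""
    return compound["descriptor"]


def virtual_screen(
    compounds: List[Dict],
    predict: Callable[[Dict], float],
    threshold: float,
) -> List[Tuple[float, str]]:
    """Score every compound and keep those at or above the threshold, best first."""
    scored = [(predict(c), c["smiles"]) for c in compounds]
    hits = [(score, smiles) for score, smiles in scored if score >= threshold]
    return sorted(hits, reverse=True)


hits = virtual_screen(library, toy_predictor, threshold=0.5)
for score, smiles in hits:
    print(f"{smiles}: predicted score {score:.2f}")
```

Only the compounds that pass the filter would be carried forward to experiments. De novo design runs in the opposite direction: the property criteria are the input and candidate structures are the output.
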
The number of new drugs per billion USD of R&D spending has followed a downward trajectory. You may have heard of Moore’s Law, which is the observation that the number of transistors on an integrated circuit doubles approximately every two years. Moore’s Law implies that computing power doubles every couple of years while cost decreases. Eroom’s Law (Moore spelled backwards) is the observation that the inflation-adjusted R&D cost of developing new drugs doubles roughly every nine years. Eroom’s Law reflects diminishing returns in developing new drugs, driven by factors such as lower risk tolerance by regulatory agencies (the “cautious regulator” problem), the “throw money at it” tendency, and the need to show more than a modest incremental benefit over current successful drugs (the “better than the Beatles” problem). The plot was constructed with data from Scannell et al., which discusses the trend in greater detail [6].
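
To see how quickly the two doubling times diverge, here is a small arithmetic sketch. The doubling times (two years and nine years) come from the caption above; the function name and the 18-year horizon are chosen only for illustration.

```python
def growth_factor(years: float, doubling_time: float) -> float:
    """Multiplicative growth after `years` for a quantity that doubles every `doubling_time` years."""
    return 2 ** (years / doubling_time)


HORIZON_YEARS = 18  # illustrative horizon, not a figure from the text

# Moore's Law: transistor count doubles roughly every 2 years.
moore = growth_factor(HORIZON_YEARS, doubling_time=2)

# Eroom's Law: inflation-adjusted R&D cost per new drug doubles roughly every 9 years.
eroom = growth_factor(HORIZON_YEARS, doubling_time=9)

print(f"Transistor count after {HORIZON_YEARS} years: x{moore:.0f}")
print(f"R&D cost per new drug after {HORIZON_YEARS} years: x{eroom:.0f}")
print(f"New drugs per billion USD after {HORIZON_YEARS} years: x{1 / eroom:.2f}")
```

Under these assumptions, transistor counts grow 512-fold over 18 years while the cost per new drug quadruples, which is the same as saying that the number of new drugs per billion USD falls to a quarter.
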
If we know both the structure of our ligand or compound and the target, we can use structure-based design methods. If we only know the ligand structure, we are restricted to ligand-based design methods. Alternatively, if we only know the target structure, we can use de novo design to guide generation of a suitable drug candidate.
Artificial intelligence, ML, and deep learning are nested fields: deep learning is a subfield of ML, which is itself a subfield of AI.
Example pairs of isomeric SMILES.
Example drug molecules for each USAN stem classification within our data set.
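
One plausible way to derive a stem label from a drug name is simple suffix matching. The sketch below is an assumption about how such labels could be built, not a description of the chapter's data set; `USAN_STEMS` and `usan_stem` are hypothetical names, and only the two stems named in this chapter's captions are included.

```python
from typing import Optional

# The two stems named in this chapter's figure captions. A real stem list is
# much longer, and some stems are prefixes or infixes rather than suffixes.
USAN_STEMS = ("cillin", "olol")


def usan_stem(drug_name: str) -> Optional[str]:
    """Return the matching suffix stem (e.g. '-cillin'), or None if no stem matches."""
    name = drug_name.strip().lower()
    for stem in USAN_STEMS:
        if name.endswith(stem):
            return f"-{stem}"
    return None


for name in ["Amoxicillin", "propranolol", "aspirin"]:
    print(name, "->", usan_stem(name))
```

Names that match no stem return `None` and would fall outside the labeled families.
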
Chemical space exploration in a reduced, 4-dimensional space.
Decision boundary of our logistic regressor for classifying “-cillin” (left) and “-olol” (right) USAN stems. For each plot, colored samples belong to the positive class and uncolored samples belong to the negative class.
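
For readers who want to see where such a decision boundary comes from, the following is a minimal, dependency-free sketch of logistic regression trained by gradient descent. The 2-D points and labels are invented for illustration and stand in for molecules in a reduced feature space; the learning rate and epoch count are arbitrary choices, and a library implementation would normally be used instead.

```python
import math
from typing import List, Tuple

# Invented toy data: positive class (1) and negative class (0) in a 2-D space.
X: List[Tuple[float, float]] = [
    (2.0, 1.5), (1.5, 2.5), (2.5, 2.0),
    (-2.0, -1.0), (-1.5, -2.5), (-2.5, -1.5),
]
y: List[int] = [1, 1, 1, 0, 0, 0]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(points, labels, lr: float = 0.1, epochs: int = 500):
    """Batch gradient descent on the mean log-loss; returns weights and bias."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x1, x2), label in zip(points, labels):
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - label
            grad_w[0] += err * x1
            grad_w[1] += err * x2
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


def predict(w, b, point) -> int:
    """Class 1 if the point lies on the positive side of the boundary w.x + b = 0."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b >= 0 else 0


w, b = fit_logistic(X, y)
predictions = [predict(w, b, p) for p in X]
print("weights:", w, "bias:", b)
print("predictions:", predictions)
```

The decision boundary is the line where `w[0] * x1 + w[1] * x2 + b` equals zero, which is where the predicted probability is exactly 0.5.
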
We can break down drug design into target identification and validation, hit discovery, hit-to-lead (lead identification), lead optimization, and preclinical development. Once a drug candidate has progressed to the drug development stage, it will need to pass multiple phases of clinical trials testing safety and efficacy prior to submission to and review by the FDA and launch to market.
We can break down the ADMET properties into the following broad descriptions. Absorption refers to the process by which a drug enters the bloodstream from its administration site, such as the gastrointestinal tract for oral drugs or the respiratory system for inhalation drugs. Distribution pertains to the movement of a drug within the body once it has entered the bloodstream. Metabolism refers to the biochemical transformation of a drug within the body, primarily carried out by enzymes. Metabolic processes aim to convert drugs into more polar and water-soluble metabolites, facilitating their elimination from the body. Excretion involves the removal of drugs and their metabolites from the body. Toxicity assessment aims to evaluate the potential adverse effects of a drug candidate on various organs, tissues, or systems.
We can segment the early drug discovery pipeline into four main phases: target identification, hit discovery, hit-to-lead or lead identification, and lead optimization. Target identification designates a valid target whose activity is worth modulating to address some disease or disorder. Hit discovery uncovers chemical compounds with activity against the target. Lead identification selects the most promising hits and lead optimization improves their potency, selectivity, and ADMET properties to be suitable for preclinical study.
In virtual screening, we conducted our search across a chemical space consisting of an enormous set of molecules. In de novo design, we are still conducting an (informal) search, just not across the chemical space. We are now searching across the functional space of potential molecular properties. If our model “learns” which section of the functional space maps to molecules that have ideal binding affinity and safety, then perhaps it can reverse-engineer novel molecule structures in the chemical space that match our functional criteria.
Preclinical trials evaluate drug candidate safety and efficacy on model organisms. Phase I clinical trials evaluate drug candidate safety in its first exposure to humans. Phase II and Phase III clinical trials continue to collect data on safety while measuring drug candidate efficacy on larger groups of patients. The pass rate of our lead compounds decreases drastically as they progress beyond preclinical stages, along with an increase in the associated time to test them.

Summary

  • Developing therapeutics entails a long, arduous process. Traditional development from ideation to market is costly (on the order of billions of dollars), lengthy (10 to 15 years), and risky (attrition of over 90%). Through advances in AI, we can discover therapies that have better safety profiles, address medical conditions or diseases with low coverage, and reach patients more quickly.
  • Drug discovery can be thought of as a difficult search problem that exists at the intersection of the chemical search space of 10^63 medicinal compounds and the biological search space of 10^5 targets.
  • Applications of AI to drug design include molecule property prediction for virtual screening, creation of compound libraries with de novo molecule generation, synthesis pathway prediction, and protein folding simulation.
  • ML is a subfield of AI that enables computers to learn from and make decisions based on data, automatically and without explicit programming or rules on how to behave. Example ML algorithms include logistic regression and random forests. Deep learning is a subfield of ML that uses deep neural networks to extract complex patterns and representations from data.
  • We can segment the early drug discovery pipeline into four main phases: target identification, hit discovery, hit-to-lead or lead identification, and lead optimization. Target identification designates a valid target whose activity is worth modulating to address some disease or disorder. Hit discovery uncovers chemical compounds with activity against the target. Lead identification selects the most promising hits and lead optimization improves their potency, selectivity, and ADMET properties to be suitable for preclinical study.
  • Popular, well-maintained chemical data repositories include ChEMBL, ChEBI, PubChem, Protein Data Bank (PDB), AlphaFoldDB, and ZINC. When using a new data source, learn how it was assembled and how quality is maintained. Garbage data in, garbage model out. See “Appendix B: Chemical Data Repositories” for more information.

FAQ

What is the difference between drug discovery and drug development?
Drug discovery focuses on finding and optimizing candidate molecules: target identification and validation, hit discovery, hit-to-lead (lead identification), lead optimization, and preclinical testing. Drug development begins once a candidate enters human studies and covers clinical trials (Phases I–III), regulatory review (e.g., FDA), and market launch. End-to-end timelines are long (often 10–15 years), costly (roughly $1–3B), and risky (roughly 90% of candidates that reach clinical trials fail).

Why is drug discovery considered a “needle in a haystack” problem?
The chemical search space is enormous, on the order of 10^63 drug-like molecules, far beyond what can be experimentally screened. Even with high-throughput facilities testing ~10^5–10^7 compounds/day, exhaustive coverage is impossible. The biological search space is also vast (~10^5 human proteins and variants). Machine learning helps prioritize where to search, dramatically reducing the experimental burden.

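
A back-of-the-envelope calculation makes the scale concrete. The sketch below uses the figures quoted in this FAQ (10^63 molecules, up to 10^7 experimental tests per day, and up to 10^12 ML predictions per day); the function name is illustrative only.

```python
CHEMICAL_SPACE = 10**63   # estimated number of drug-like molecules
DAYS_PER_YEAR = 365


def years_to_screen(n_compounds: int, compounds_per_day: int) -> float:
    """Years needed to evaluate every compound once at a fixed daily throughput."""
    return n_compounds / compounds_per_day / DAYS_PER_YEAR


# Upper experimental high-throughput figure quoted above.
hts_years = years_to_screen(CHEMICAL_SPACE, 10**7)

# Upper ML-based virtual screening figure quoted elsewhere in this FAQ.
ml_years = years_to_screen(CHEMICAL_SPACE, 10**12)

print(f"Experimental screening: {hts_years:.1e} years")
print(f"ML-based screening:     {ml_years:.1e} years")
```

Even the faster ML throughput leaves exhaustive enumeration far out of reach, which is why the emphasis falls on prioritizing where to search rather than on covering the whole space.
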
What is virtual screening, and how do ML-based methods compare to docking and simulations?
Virtual screening (VS) computationally prioritizes compounds by predicting target interactions or other relevant properties before lab testing. Physics-based methods (e.g., molecular docking, molecular dynamics) can be accurate but are computationally expensive, as they explore conformations and binding energetics. ML-based VS learns directly from data (e.g., known binding affinities or toxicity) to predict properties from molecular structure, enabling much higher throughput (on the order of 10^9–10^12 compounds/day) and earlier triage.

What is de novo (generative) chemistry, and how does it complement virtual screening?
Generative chemistry uses AI to design novel chemical entities that satisfy desired property criteria (e.g., potency, solubility, low toxicity). Conceptually, VS maps from chemical space to functional space (properties), while de novo design maps from functional space back to chemical structures. Benefits include novelty (addressing “me-too” trends and Eroom’s Law) and focused exploration; challenges include ensuring synthesizability and realistic ADMET profiles.

How does AI help with chemical reaction prediction and retrosynthesis?
Retrosynthesis plans feasible routes from a target molecule back to simpler precursors, but the branching factor can exceed ~10^4 transformations at each step. Deep learning models can propose reaction outcomes, rank disconnections, and guide search efficiently. This accelerates making novel AI-designed molecules and also improves process chemistry for existing drugs (cost, safety, scalability).

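
The effect of the branching factor can be illustrated with simple arithmetic. In the sketch below, the branching factor of 10^4 is the figure quoted above; the route depth of 5 steps and the choice to keep only the top 10 model-ranked disconnections per step are assumptions made purely for illustration.

```python
def candidate_routes(options_per_step: int, depth: int) -> int:
    """Number of distinct step sequences when each step offers `options_per_step` choices."""
    return options_per_step ** depth


DEPTH = 5        # assumed number of retrosynthetic steps
TOP_K = 10       # assumed number of model-ranked disconnections kept per step

exhaustive = candidate_routes(10**4, DEPTH)
guided = candidate_routes(TOP_K, DEPTH)

print(f"Exhaustive search: {exhaustive:.0e} candidate routes")
print(f"Model-guided search: {guided:.0e} candidate routes")
```

Under these assumptions, ranking disconnections with a model shrinks the search from 10^20 to 10^5 candidate routes, which is the sense in which learned models guide the search efficiently.
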
What role do protein folding predictions and simulations play in drug discovery?
Protein 3D structure informs function, binding sites, and mechanism, all of which are key to rational drug design. Deep learning models (e.g., AlphaFold2) provide high-accuracy structures at scale, narrowing targets, informing docking and ligand design, and complementing experimental methods. Structure-aware pipelines can better prioritize compounds for targets with limited structural data.

What are hits, leads, and the main stages of early discovery?
  • Target identification and validation: choose and confirm a disease-relevant biomolecular target.
  • Hit discovery (target-to-hit): find molecules that bind/affect the target (e.g., VS, high-throughput screening).
  • Hit-to-lead (lead identification): confirm activity, begin ADMET screens, and pick promising hits (“leads”).
  • Lead optimization: modify leads to improve potency, selectivity, and ADMET while keeping core scaffolds.
  • Preclinical: evaluate safety/efficacy in vitro and in vivo and finalize formulation and dosing hypotheses.

What are ADMET, PK, and PD, and which properties are optimized during lead optimization?
  • ADMET: Absorption, Distribution, Metabolism, Excretion, Toxicity; together these determine exposure and safety.
  • PK (pharmacokinetics): what the body does to the drug (e.g., absorption, clearance, half-life).
  • PD (pharmacodynamics): what the drug does to the body (e.g., efficacy, mechanism).
Key optimization targets include efficacy, potency (achieve effect at low dose), selectivity/safety (minimize off-targets and adverse events), and bioavailability (sufficient exposure at the site of action).

What are the phases of clinical trials, and how can timelines be expedited?
  • Preclinical: safety/efficacy in model systems.
  • Phase I: first-in-human safety, dose range (typically 20–100 healthy volunteers).
  • Phase II: safety and preliminary efficacy in patients (~100–500).
  • Phase III: large, multi-site efficacy and safety (often 1,000–5,000).
Expedited pathways exist for serious or rare conditions and transformative therapies (e.g., first-in-class, orphan, breakthrough, accelerated approvals), sometimes granting conditional market access while confirmatory studies continue.

How are molecules represented for ML, and what tools and data sources are commonly used?
  • Representations: SMILES strings (including canonical and isomeric forms) encode 2D structure; learned or engineered features (e.g., ECFP/Morgan fingerprints) and graph-based representations feed ML models.
  • Tools: RDKit parses SMILES/SDF, computes descriptors/fingerprints, visualizes structures, and integrates with ML libraries.
  • Public data sources: PubChem, ChEMBL, DrugBank, ZINC (virtual libraries), PDB (protein structures), UniProt (protein sequences), and assay repositories, all useful for training, benchmarking, and VS/generative workflows.

How do agonists, antagonists, and inhibitors differ, and why does selectivity matter?
  • Agonist: binds a receptor and triggers a response (activates signaling).
  • Antagonist: binds without activating and blocks agonist binding (inhibits signaling).
  • Inhibitor: often targets enzymes, reducing catalytic activity.
Selectivity ensures the compound acts on the intended target while minimizing off-target interactions that can reduce efficacy or cause side effects, which makes it central to safety and clinical success.
