Overview

1 The data science process

Data science is presented as a practical, cross‑disciplinary craft aimed at improving decisions in business and science through data engineering, statistics, machine learning, and analytics. Success depends less on exotic tools and more on clear, quantitative goals, sound methodology, and a repeatable workflow carried out by a collaborative team. Key roles include a project sponsor who defines success and provides sign‑off, a client who represents day‑to‑day users and domain context, a data scientist who drives strategy and analysis, and supporting data architecture and operations partners to secure data and deploy results. A motivating example follows a bank seeking to reduce losses from bad loans, illustrating how problem framing and stakeholder alignment shape the entire effort.

The chapter outlines an iterative lifecycle: defining a measurable goal, collecting and managing data, modeling, evaluation and critique, presentation and documentation, and deployment and maintenance. Goals must be specific and testable to bound scope and enable acceptance criteria. Data work typically dominates effort and can reveal issues such as sample bias that force reframing of objectives or features. Modeling tasks span classification, scoring, ranking, clustering, relation finding, and characterization; method choice is influenced by assumptions, data representation, and user needs for interpretability and confidence. Evaluation goes beyond overall accuracy to consider baselines and operationally relevant metrics (for example, recall, precision, and false positive rate), with frequent loops back to earlier stages to refine data, features, or goals.

Deliverables must match audiences: sponsors need business impact framed in their metrics; end users need guidance on interpreting outputs, confidence scores, and when to override; operations needs clarity on performance, constraints, and maintenance. Deployment often starts with a pilot to surface unanticipated issues, and models require monitoring and updates as conditions change. The chapter closes by stressing expectation setting: quantify what “good enough” means, establish lower bounds via a null model or existing process, and ensure improvements are statistically meaningful and aligned to business priorities. Throughout, the process emphasizes transparent communication, measurable targets, trade‑offs between competing metrics, and a plan for ongoing stewardship of the model in production.

Figure 1.1. The lifecycle of a data science project: loops within loops
Defining the goal
Data collection and management
Figure 1.2. The fraction of defaulting loans by credit history category. The dark region of each bar represents the fraction of loans in that category that defaulted.
Modeling
Figure 1.3. A decision tree model for finding bad loan applications, with confidence scores
Model evaluation and critique
Presentation and documentation
Figure 1.4. Example slide from an executive presentation
Model deployment and maintenance

Summary

The data science process involves a lot of back-and-forth—between the data scientist and other project stakeholders, and between the different stages of the process. Along the way, you’ll encounter surprises and stumbling blocks; this book will teach you procedures for overcoming some of these hurdles. It’s important to keep all the stakeholders informed and involved; when the project ends, no one connected with it should be surprised by the final results.

In the next chapters, we’ll look at the stages that follow project design: loading, exploring, and managing the data. Chapter 2 covers a few basic ways to load the data into R, in a format that’s convenient for analysis.
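
As a small preview of that workflow, here is a minimal sketch of loading a CSV file into a data frame; the file name and its columns are hypothetical placeholders, not the book's actual data.

```r
# Minimal sketch: load a hypothetical loan-applications CSV into a data frame.
# "loan_applications.csv" and its columns are placeholders.
loans <- read.csv("loan_applications.csv",
                  header = TRUE,
                  stringsAsFactors = FALSE)

# Quick sanity checks before any analysis
dim(loans)      # number of rows and columns
summary(loans)  # per-column summaries to spot missing or odd values
```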

In this chapter you have learned:

  • That a successful data science project involves more than just statistics. It also requires a variety of roles to represent business and client interests, as well as operational concerns.
  • That you should make sure you have a clear, verifiable, quantifiable goal.
  • That you should make sure you’ve set realistic expectations for all stakeholders.

FAQ

What does “data science” mean in this chapter’s context?
Data science is a cross‑disciplinary practice that applies data engineering, descriptive statistics, data mining, machine learning, and predictive analytics to make data‑driven decisions and manage their consequences. The chapter focuses on solving business and scientific problems using these techniques.

Which roles are involved in a data science project and what do they do?
  • Project sponsor: represents business interests and decides success or failure.
  • Client: represents end users and serves as domain expert.
  • Data scientist: sets analytic strategy, executes the work, and communicates with stakeholders.
  • Data architect: manages data and storage (often outside the core team).
  • Operations: manages infrastructure and deployment (often outside the core team).
In practice, these roles may overlap.

Why is the project sponsor so important, and how do you secure their sign-off?
The sponsor defines success and champions the project. Keep them informed with understandable plans, progress reports, and outcomes. Elicit precise, quantitative goals through interviews (for example, targets on recall and false positives). Meeting the agreed, measurable criteria then becomes the organizing objective for sign-off.

How should you define a measurable project goal?
Understand the context first: why the project is needed, the current approaches and their gaps, the available data and resources, and any deployment constraints. Then write specific, testable goals tied to business impact (for example, “reduce loan charge-offs by at least 10% via a default-prediction model”). Time-box exploratory work and convert its findings into concrete hypotheses.

What are the stages of a data science project and why are they iterative?
The stages are: defining the goal; data collection and management; modeling; model evaluation and critique; presentation and documentation; and deployment and maintenance. The boundaries are fluid: expect loops within and across stages, with stakeholder feedback and findings in the data prompting refinement of goals, data, and models.

What should you watch for during data collection and initial exploration?
Identify what data exists and whether it is relevant, sufficient, and of adequate quality. Prefer directly measured variables over proxies when possible. Be alert to sampling bias; for example, having data only on accepted loan applications can distort relationships and limit generalization. Use early findings to refine goals and data needs.

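As an illustration, here is a sketch of the kind of early check behind Figure 1.2, continuing the hypothetical loans data frame from the loading sketch above, with placeholder columns credit_history and defaulted (levels “yes”/“no”):

```r
# Fraction of defaulting loans within each credit history category.
# `loans`, `credit_history`, and `defaulted` are hypothetical names.
default_rate <- tapply(loans$defaulted == "yes",
                       loans$credit_history,
                       mean)
default_rate

# A counterintuitive pattern (say, "good" histories defaulting more often)
# can signal sampling bias, such as seeing only accepted applications.
barplot(default_rate, las = 2,
        ylab = "fraction of loans that defaulted")
```
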
What modeling tasks are common, and how do you choose among methods?
Common tasks are classification, scoring (numeric prediction), ranking, clustering, finding relations, and characterization (reporting). Choose methods based on the data’s properties and on stakeholder needs such as interpretability and confidence estimates. In the loan example, a decision tree was chosen for its transparent rules and confidence indications for end users.

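A hedged sketch of fitting such a tree with the rpart package; the formula and columns are hypothetical placeholders, and the book’s actual model will differ:

```r
library(rpart)

# Sketch: fit a classification tree to predict loan default.
# The predictor columns are hypothetical placeholders.
loans$defaulted <- as.factor(loans$defaulted)  # rpart wants a factor outcome
model <- rpart(defaulted ~ credit_history + loan_amount + duration,
               data = loans,
               method = "class")

# Per-class probabilities act as the confidence scores shown to end users.
head(predict(model, newdata = loans, type = "prob"))
```
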
How do you evaluate a classification model effectively?
Assess whether it meets the business and technical goals, generalizes well, and beats the obvious guess. Use a confusion matrix to compute accuracy, precision, recall, and false positive rate, and decide acceptable trade-offs with stakeholders. If results fall short, iterate on the modeling, or adjust the goals, data, or resources.

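A minimal sketch of these computations in base R, continuing the hypothetical tree model and assuming the outcome levels are “no” and “yes”:

```r
# Hard class predictions, then a confusion matrix and derived metrics.
predicted <- predict(model, newdata = loans, type = "class")
actual    <- loans$defaulted

conf <- table(actual, predicted)  # rows: truth, columns: prediction

TP <- conf["yes", "yes"]; FN <- conf["yes", "no"]
FP <- conf["no",  "yes"]; TN <- conf["no",  "no"]

accuracy  <- (TP + TN) / sum(conf)
precision <- TP / (TP + FP)  # of loans flagged bad, how many truly defaulted
recall    <- TP / (TP + FN)  # of true defaults, how many were caught
fpr       <- FP / (FP + TN)  # good loans wrongly flagged
```
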
What is a null model and how does it inform performance expectations?
The null model is the baseline you must outperform: either the current process or the simplest plausible model (for example, always predicting the majority class). It sets a lower bound, the base error rate. Aim for a statistically significant improvement, and prioritize the metrics that matter, often recall, precision, and false positive rate rather than accuracy alone.

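A sketch of the majority-class null model on the same hypothetical data:

```r
# The simplest null model always predicts the majority class.
majority <- names(which.max(table(loans$defaulted)))
null_accuracy <- mean(loans$defaulted == majority)
null_accuracy  # the base rate any useful model must beat

# If most loans are good, this accuracy looks high while recall on
# defaults is zero; that is why accuracy alone is a poor yardstick.
```
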
How should you present results and plan for deployment and maintenance?
Tailor communications to the audience: executives want business impact (for example, the potential reduction in losses); end users need to know how to interpret outputs and confidence scores and when to override them; operations needs resource, latency, and integration details. Deploy via a pilot when possible, monitor the model’s behavior (including overrides), and plan for updates as data and conditions change.
