Overview

1 The data science process

Data science is presented as a practical, cross‑disciplinary craft aimed at improving decisions in business and science through data engineering, statistics, machine learning, and analytics. Success depends less on exotic tools and more on clear, quantitative goals, sound methodology, and a repeatable workflow carried out by a collaborative team. Key roles include a project sponsor who defines success and provides sign‑off, a client who represents day‑to‑day users and domain context, a data scientist who drives strategy and analysis, and supporting data architecture and operations partners to secure data and deploy results. A motivating example follows a bank seeking to reduce losses from bad loans, illustrating how problem framing and stakeholder alignment shape the entire effort.

The chapter outlines an iterative lifecycle: defining a measurable goal, collecting and managing data, modeling, evaluation and critique, presentation and documentation, and deployment and maintenance. Goals must be specific and testable to bound scope and enable acceptance criteria. Data work typically dominates effort and can reveal issues such as sample bias that force reframing of objectives or features. Modeling tasks span classification, scoring, ranking, clustering, relation finding, and characterization; method choice is influenced by assumptions, data representation, and user needs for interpretability and confidence. Evaluation goes beyond overall accuracy to consider baselines and operationally relevant metrics (for example, recall, precision, and false positive rate), with frequent loops back to earlier stages to refine data, features, or goals.

Deliverables must match audiences: sponsors need business impact framed in their metrics; end users need guidance on interpreting outputs, confidence scores, and when to override; operations needs clarity on performance, constraints, and maintenance. Deployment often starts with a pilot to surface unanticipated issues, and models require monitoring and updates as conditions change. The chapter closes by stressing expectation setting: quantify what “good enough” means, establish lower bounds via a null model or existing process, and ensure improvements are statistically meaningful and aligned to business priorities. Throughout, the process emphasizes transparent communication, measurable targets, trade‑offs between competing metrics, and a plan for ongoing stewardship of the model in production.

Figure 1.1. The lifecycle of a data science project: loops within loops
Defining the goal
Data collection and management
Figure 1.2. The fraction of defaulting loans by credit history category. The dark region of each bar represents the fraction of loans in that category that defaulted.
Modeling
Figure 1.3. A decision tree model for finding bad loan applications, with confidence scores
Model evaluation and critique
Presentation and documentation
Figure 1.4. Example slide from an executive presentation
Model deployment and maintenance

Summary

The data science process involves a lot of back-and-forth—between the data scientist and other project stakeholders, and between the different stages of the process. Along the way, you’ll encounter surprises and stumbling blocks; this book will teach you procedures for overcoming some of these hurdles. It’s important to keep all the stakeholders informed and involved; when the project ends, no one connected with it should be surprised by the final results.

In the next chapters, we’ll look at the stages that follow project design: loading, exploring, and managing the data. Chapter 2 covers a few basic ways to load the data into R, in a format that’s convenient for analysis.
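
As a small preview of that workflow, here is a minimal sketch of loading a CSV file into a data frame; the file name and its columns are hypothetical placeholders, not the book's actual data.

```r
# Minimal sketch: load a hypothetical loan-applications CSV into a data frame.
# "loan_applications.csv" and its columns are placeholders.
loans <- read.csv("loan_applications.csv",
                  header = TRUE,
                  stringsAsFactors = FALSE)

# Quick sanity checks before any analysis
dim(loans)      # number of rows and columns
summary(loans)  # per-column summaries to spot missing or odd values
```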

In this chapter you have learned:

  • That a successful data science project involves more than just statistics. It also requires a variety of roles to represent business and client interests, as well as operational concerns.
  • That you should make sure you have a clear, verifiable, quantifiable goal.
  • That you should make sure you’ve set realistic expectations for all stakeholders.

FAQ

What does “data science” mean in this chapter’s context?
Data science is a cross‑disciplinary practice that applies data engineering, descriptive statistics, data mining, machine learning, and predictive analytics to make data‑driven decisions and manage their consequences. The chapter focuses on solving business and scientific problems using these techniques.

Which roles are involved in a data science project and what do they do?
  • Project sponsor: represents business interests and decides success or failure.
  • Client: represents end users and serves as domain expert.
  • Data scientist: sets analytic strategy, executes the work, and communicates with stakeholders.
  • Data architect: manages data and storage (often outside the core team).
  • Operations: manages infrastructure and deployment (often outside the core team).
In practice, these roles may overlap.

Why is the project sponsor so important, and how do you secure their sign-off?
The sponsor defines success and champions the project. Keep them informed with understandable plans, progress reports, and outcomes. Elicit precise, quantitative goals through interviews (for example, targets on recall and false positives). Meeting the agreed, measurable criteria then becomes the organizing objective for sign-off.

How should you define a measurable project goal?
Understand the context first: why the project is needed, the current approaches and their gaps, the available data and resources, and any deployment constraints. Then write specific, testable goals tied to business impact (for example, “reduce loan charge-offs by at least 10% via a default-prediction model”). Time-box exploratory work and convert its findings into concrete hypotheses.

What are the stages of a data science project and why are they iterative?
The stages are: defining the goal; data collection and management; modeling; model evaluation and critique; presentation and documentation; and deployment and maintenance. The boundaries are fluid: expect loops within and across stages, with stakeholder feedback and findings in the data prompting refinement of goals, data, and models.

What should you watch for during data collection and initial exploration?
Identify what data exists and whether it is relevant, sufficient, and of adequate quality. Prefer directly measured variables over proxies when possible. Be alert to sampling bias; for example, having data only on accepted loan applications can distort relationships and limit generalization. Use early findings to refine goals and data needs.

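As an illustration, here is a sketch of the kind of early check behind Figure 1.2, continuing the hypothetical loans data frame from the loading sketch above, with placeholder columns credit_history and defaulted (levels “yes”/“no”):

```r
# Fraction of defaulting loans within each credit history category.
# `loans`, `credit_history`, and `defaulted` are hypothetical names.
default_rate <- tapply(loans$defaulted == "yes",
                       loans$credit_history,
                       mean)
default_rate

# A counterintuitive pattern (say, "good" histories defaulting more often)
# can signal sampling bias, such as seeing only accepted applications.
barplot(default_rate, las = 2,
        ylab = "fraction of loans that defaulted")
```
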
What modeling tasks are common, and how do you choose among methods?
Common tasks are classification, scoring (numeric prediction), ranking, clustering, finding relations, and characterization (reporting). Choose methods based on the data’s properties and on stakeholder needs such as interpretability and confidence estimates. In the loan example, a decision tree was chosen for its transparent rules and confidence indications for end users.

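A hedged sketch of fitting such a tree with the rpart package; the formula and columns are hypothetical placeholders, and the book’s actual model will differ:

```r
library(rpart)

# Sketch: fit a classification tree to predict loan default.
# The predictor columns are hypothetical placeholders.
loans$defaulted <- as.factor(loans$defaulted)  # rpart wants a factor outcome
model <- rpart(defaulted ~ credit_history + loan_amount + duration,
               data = loans,
               method = "class")

# Per-class probabilities act as the confidence scores shown to end users.
head(predict(model, newdata = loans, type = "prob"))
```
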
How do you evaluate a classification model effectively?
Assess whether it meets the business and technical goals, generalizes well, and beats the obvious guess. Use a confusion matrix to compute accuracy, precision, recall, and false positive rate, and decide acceptable trade-offs with stakeholders. If results fall short, iterate on the modeling, or adjust the goals, data, or resources.

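A minimal sketch of these computations in base R, continuing the hypothetical tree model and assuming the outcome levels are “no” and “yes”:

```r
# Hard class predictions, then a confusion matrix and derived metrics.
predicted <- predict(model, newdata = loans, type = "class")
actual    <- loans$defaulted

conf <- table(actual, predicted)  # rows: truth, columns: prediction

TP <- conf["yes", "yes"]; FN <- conf["yes", "no"]
FP <- conf["no",  "yes"]; TN <- conf["no",  "no"]

accuracy  <- (TP + TN) / sum(conf)
precision <- TP / (TP + FP)  # of loans flagged bad, how many truly defaulted
recall    <- TP / (TP + FN)  # of true defaults, how many were caught
fpr       <- FP / (FP + TN)  # good loans wrongly flagged
```
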
What is a null model and how does it inform performance expectations?
The null model is the baseline you must outperform: either the current process or the simplest plausible model (for example, always predicting the majority class). It sets a lower bound, the base error rate. Aim for a statistically significant improvement, and prioritize the metrics that matter, often recall, precision, and false positive rate rather than accuracy alone.

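A sketch of the majority-class null model on the same hypothetical data:

```r
# The simplest null model always predicts the majority class.
majority <- names(which.max(table(loans$defaulted)))
null_accuracy <- mean(loans$defaulted == majority)
null_accuracy  # the base rate any useful model must beat

# If most loans are good, this accuracy looks high while recall on
# defaults is zero; that is why accuracy alone is a poor yardstick.
```
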
How should you present results and plan for deployment and maintenance?
Tailor communications to the audience: executives want business impact (for example, the potential reduction in losses); end users need to know how to interpret outputs and confidence scores and when to override them; operations needs resource, latency, and integration details. Deploy via a pilot when possible, monitor the model’s behavior (including overrides), and plan for updates as data and conditions change.
