Practical Data Science with R, Second Edition
Nina Zumel and John Mount
  • MEAP began August 2018
  • Publication in November 2019 (estimated)
  • ISBN 9781617295874
  • 586 pages (estimated)
  • printed in black & white
free previous edition included
An eBook copy of the previous edition of this book is included at no additional cost. It will be automatically added to your Manning Bookshelf within 24 hours of purchase.

Great start to the subject matter - kept me engaged and fascinated as I worked through the initial chapters.

Taylor Dolezal

Practical Data Science with R, Second Edition takes a practice-oriented approach to explaining basic principles in the ever-expanding field of data science. You’ll jump right to real-world use cases as you apply the R programming language and statistical analysis techniques to carefully explained examples based in marketing, business intelligence, and decision support.

Numerous updates in this brand-new edition include an introduction to the vtreat data preparation tool, a section on model explanation, and additional modeling techniques such as boosting and regularized regression!

Table of Contents

Part 1: Introduction to data science

1. The data science process

1.1. The roles in a data science project

1.1.1. Project roles

1.2. Stages of a data science project

1.2.1. Defining the goal

1.2.2. Data collection and management

1.2.3. Modeling

1.2.4. Model evaluation and critique

1.2.5. Presentation and documentation

1.2.6. Model deployment and maintenance

1.3. Setting expectations

1.3.1. Determining lower bounds on model performance

1.4. Summary

2. Starting with R and data

2.1. Starting with R

2.1.1. Installing R

2.1.2. R programming

2.2. Working with data from files

2.2.1. Working with well-structured data from files or URLs

2.2.2. Using R with less-structured data

2.3. Working with relational databases

2.3.1. A production-size example

2.4. Summary

3. Exploring data

3.1. Using summary statistics to spot problems

3.1.1. Typical problems revealed by data summaries

3.2. Spotting problems using graphics and visualization

3.2.1. Visually checking distributions for a single variable

3.2.2. Visually checking relationships between two variables

3.3. Summary

4. Managing data

4.1. Cleaning data

4.1.1. Domain-specific data cleaning

4.1.2. Treating missing values (NAs)

4.1.3. The vtreat package for automatically treating missing variables

4.2. Data transformations

4.2.1. Normalization

4.2.2. Centering and scaling

4.2.3. Log transformations for skewed and wide distributions

4.3. Sampling for modeling and validation

4.3.1. Test and training splits

4.3.2. Creating a sample group column

4.3.3. Record grouping

4.3.4. Data provenance

4.4. Summary

5. Data Engineering and Data Shaping

5.1. Data Selection

5.1.1. Subsetting Rows and Columns

5.1.2. Removing records with incomplete data

5.1.3. Ordering rows

5.2. Basic Data Transforms

5.2.1. Add new columns

5.2.2. Other simple operations

5.2.3. Parametric programming

5.3. Aggregating Transforms

5.3.1. Scenario

5.3.2. Combining many rows into summary rows

5.4. Multi-Table Data Transforms

5.4.1. Combining two or more ordered data.frames quickly

5.4.2. Principled methods to combine data from multiple tables

5.5. Reshaping Transforms

5.5.1. Moving data from wide to tall form

5.5.2. Moving data from tall to wide form

5.5.3. Data Coordinates

5.6. Summary

Part 2: Modeling methods

6. Choosing and evaluating models

6.1. Mapping problems to machine learning tasks

6.1.1. Classification problems

6.1.2. Scoring problems

6.1.3. Grouping: Working without known targets

6.1.4. Problem-to-method mapping

6.2. Evaluating models

6.2.1. Overfitting

6.2.2. Measures of model performance

6.2.3. Evaluating classification models

6.2.4. Evaluating scoring models

6.2.5. Evaluating probability models

6.3. Local Interpretable Model-Agnostic Explanations (LIME) for explaining model predictions

6.3.1. LIME: automated sanity checking

6.3.2. Walking through LIME: a small example

6.3.3. LIME for Text Classification

6.3.4. Train the text classifier

6.3.5. Explaining the classifier’s predictions

6.4. Summary

7. Linear and logistic regression

7.1. Using linear regression

7.1.1. Understanding linear regression

7.1.2. Building a linear regression model

7.1.3. Making predictions

7.1.4. Finding relations and extracting advice

7.1.5. Reading the model summary and characterizing coefficient quality

7.1.6. Linear regression takeaways

7.2. Using logistic regression

7.2.1. Understanding logistic regression

7.2.2. Building a logistic regression model

7.2.3. Making predictions

7.2.4. Finding relations and extracting advice from logistic models

7.2.5. Reading the model summary and characterizing coefficients

7.2.6. Logistic regression takeaways

7.3. Regularization

7.3.1. An Example of Quasi-separation

7.3.2. The types of regularized regression

7.3.3. Regularized regression with glmnet

7.4. Summary

8. Advanced Data Preparation

8.1. The purpose of the vtreat package

8.2. KDD and KDD Cup 2009

8.2.1. Getting started with KDD Cup 2009 data

8.2.2. The Bull in The China Shop Approach

8.3. Basic data preparation for classification

8.3.1. The variable score frame

8.3.2. Properly using the treatment plan

8.4. Advanced data preparation for classification

8.4.1. Using mkCrossFrameCExperiment()

8.4.2. Building a model

8.5. Preparing data for regression modeling

8.6. Mastering the vtreat package

8.6.1. The vtreat phases

8.6.2. Missing values

8.6.3. Indicator variables

8.6.4. Impact coding

8.6.5. The treatment plan

8.6.6. The cross-frame

8.7. Summary

9. Unsupervised methods

9.1. Cluster analysis

9.1.1. Distances

9.1.2. Preparing the data

9.1.3. Hierarchical clustering with hclust

9.1.4. The k-means algorithm

9.1.5. Assigning new points to clusters

9.1.6. Clustering takeaways

9.2. Association rules

9.2.1. Overview of association rules

9.2.2. The example problem

9.2.3. Mining association rules with the arules package

9.2.4. Association rule takeaways

9.3. Summary

10. Exploring advanced methods

10.1. Tree-based Methods

10.1.1. A basic decision tree

10.1.2. Using bagging to improve prediction

10.1.3. Using random forests to further improve prediction

10.1.4. Gradient-boosted trees

10.1.5. Tree-based model takeaways

10.2. Using generalized additive models (GAMs) to learn non-monotone relationships

10.2.1. Understanding GAMs

10.2.2. A one-dimensional regression example

10.2.3. Extracting the non-linear relationships

10.2.4. Using GAM on actual data

10.2.5. Using GAM for logistic regression

10.2.6. GAM takeaways

10.3. Solving "Inseparable" Problems Using Support Vector Machines

10.3.1. Using an SVM to solve a problem

10.3.2. Understanding support vector machines

10.3.3. Understanding kernel functions

10.3.4. Support vector machine and kernel methods takeaways

10.4. Summary

Part 3: Working in the real world

11. Documentation and deployment

11.1. Predicting Buzz

11.2. Using R markdown to produce milestone documentation

11.2.1. What is R markdown?

11.2.2. knitr technical details

11.2.3. Using knitr to document the Buzz data and produce the model

11.3. Using comments and version control for running documentation

11.3.1. Writing effective comments

11.3.2. Using version control to record history

11.3.3. Using version control to explore your project

11.3.4. Using version control to share work

11.4. Deploying models

11.4.1. Deploying demonstrations using Shiny

11.4.2. Deploying models as HTTP services

11.4.3. Deploying models by export

11.4.4. What to take away

11.5. Summary

12. Producing effective presentations

12.1. Presenting your results to the project sponsor

12.1.1. Summarizing the project’s goals

12.1.2. Stating the project’s results

12.1.3. Filling in the details

12.1.4. Making recommendations and discussing future work

12.1.5. Project sponsor presentation takeaways

12.2. Presenting your model to end users

12.2.1. Summarizing the project’s goals

12.2.2. Showing how the model fits the users’ workflow

12.2.3. Showing how to use the model

12.2.4. End user presentation takeaways

12.3. Presenting your work to other data scientists

12.3.1. Introducing the problem

12.3.3. Discussing your approach

12.3.4. Discussing results and future work

12.3.5. Peer presentation takeaways

12.4. Summary


Appendix A: Starting with R and other tools

A.1. Installing the tools

A.1.1. Installing Tools

A.1.2. The R package system

A.1.3. Installing Git

A.1.4. Installing RStudio

A.1.5. R resources

A.2. Starting with R

A.2.1. Primary features of R

A.2.2. Primary R data types

A.3. Using databases with R

A.3.1. Running database queries using a query generator

A.3.2. How to think relationally about data

A.4. The take away

Appendix B: Important statistical concepts

B.1. Distributions

B.1.1. Normal distribution

B.1.2. Summarizing R’s distribution naming conventions

B.1.3. Lognormal distribution

B.1.4. Binomial distribution

B.1.5. More R tools for distributions

B.2. Statistical theory

B.2.1. Statistical philosophy

B.2.2. A/B tests

B.2.3. Power of tests

B.2.4. Specialized statistical tests

B.3. Listings of the statistical view of data

B.3.1. Sampling bias

B.3.2. Omitted variable bias

B.4. The take away

Appendix C: Bibliography

About the Technology

Business analysts and developers are increasingly collecting, curating, analyzing, and reporting on crucial business data. The R language and its associated tools provide a straightforward way to tackle day-to-day data science and machine learning tasks.

About the book

This invaluable addition to any data scientist’s library shows you how to apply the R programming language and useful statistical techniques to everyday business situations as well as how to effectively present results to audiences of all levels. To answer the ever-increasing demand for machine learning and analysis, this new edition boasts additional R tools, modeling techniques, and more.

What's inside

  • Data science and statistical analysis for the business professional
  • Numerous instantly familiar real-world use cases
  • Keys to effective data presentations
  • Modeling and analysis techniques like boosting, regularized regression, and quadratic discriminant analysis
  • Additional R tools including data.table and vtreat
  • A new section on interpreting predictions of complicated models

About the reader

While some familiarity with basic statistics and R is assumed, this book is accessible to readers with or without a background in data science.

About the authors

Nina Zumel and John Mount are co-founders of Win-Vector LLC, a San Francisco-based data science consulting firm. Both hold PhDs from Carnegie Mellon and blog on statistics, probability, and computer science at win-vector.com.

Manning Early Access Program (MEAP)
Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it’s in bookstores.
MEAP combo $49.99 pBook + eBook + liveBook
includes previous edition
MEAP eBook $39.99 pdf + ePub + kindle + liveBook
includes previous edition