Machine Learning with R, tidyverse, and mlr
Hefin I. Rhys
  • MEAP began March 2019
  • Publication in April 2020 (estimated)
  • ISBN 9781617296574
  • 398 pages (estimated)
  • printed in black & white

A wonderfully written and intensely readable introduction to the world of machine learning, model building, and model evaluation for those looking to work in R.

Erik Sapper
Machine learning is a collection of programming techniques for discovering relationships in data. With ML algorithms, you can cluster and classify data for tasks like making recommendations or detecting fraud, and make predictions for sales trends, risk analysis, and other forecasts. Once the domain of academic data scientists, machine learning has become a mainstream business process, and tools like the easy-to-learn R programming language put high-quality data analysis in the hands of any programmer. Machine Learning with R, tidyverse, and mlr teaches you widely used ML techniques and how to apply them to your own datasets using the R programming language and its powerful ecosystem of tools. This book will get you started!
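To give a flavor of the workflow the book builds toward, here is a minimal, hedged sketch of the mlr pattern described above (define a task, define a learner, train, predict), using the iris dataset that ships with R. It assumes the mlr package is installed; the choice of k-nearest neighbors and k = 3 is purely illustrative.

```r
# A minimal sketch of the mlr workflow: task -> learner -> train -> predict.
# Assumes the mlr package is installed; iris ships with base R.
library(mlr)

# Define the task: what data to learn from and which column to predict
irisTask <- makeClassifTask(data = iris, target = "Species")

# Define the learner: which algorithm to use (k-NN with an illustrative k = 3)
knnLearner <- makeLearner("classif.knn", par.vals = list(k = 3))

# Train the model on the task
knnModel <- train(knnLearner, irisTask)

# Predict the species of a few rows
preds <- predict(knnModel, newdata = iris[1:5, ])
print(preds$data$response)
```

The same three-step pattern (task, learner, train) recurs throughout the book for classification, regression, and clustering alike, which is what makes mlr's interface so compact.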
Table of Contents

Part 1: Introduction

1 Introduction to machine learning

1.1 What is machine learning?

1.1.1 Artificial intelligence and machine learning

1.1.2 The difference between a model and an algorithm

1.2 Classes of machine learning algorithms

1.2.1 Differences between supervised, unsupervised, and semi-supervised learning

1.2.2 Classification, regression, dimension reduction, and clustering

1.2.3 A brief word on deep learning

1.3 Why use R for machine learning?

1.4 Which datasets will we use?

1.5 What will you learn in this book?

1.6 Summary

2 Tidying, manipulating, and plotting data with the tidyverse

2.1 What is the tidyverse and what is tidy data?

2.2 Loading the tidyverse

2.3 What the tibble package is and what it does

2.3.1 Creating tibbles

2.3.2 Converting existing data frames into tibbles

2.3.3 Differences between data frames and tibbles

2.4 What the dplyr package is and what it does

2.4.1 Manipulating the CO2 dataset with dplyr

2.4.2 Chaining dplyr functions together

2.5 What the ggplot2 package is and what it does

2.6 What the tidyr package is and what it does

2.7 What the purrr package is and what it does

2.7.1 Replacing for loops with map()

2.7.2 Returning an atomic vector instead of a list

2.7.3 Using anonymous functions inside the map() family

2.7.4 Using walk() to produce a function’s side effects

2.7.5 Iterating over multiple lists simultaneously

2.8 Summary

2.9 Solutions to exercises

Part 2: Classification

3 Classifying based on similar observations: the k-nearest neighbors algorithm

3.1 What is the k-nearest neighbors algorithm?

3.1.1 How does the k-nearest neighbors algorithm learn?

3.1.2 What happens if the vote is tied?

3.2 Building our first k-NN model

3.2.1 Loading and exploring the diabetes dataset

3.2.2 Using mlr to train your first k-NN model

3.2.3 Telling mlr what we’re trying to achieve: defining the task

3.2.4 Telling mlr which algorithm to use: defining the learner

3.2.5 Putting it all together: training the model

3.3 Balancing two sources of model error: the bias-variance trade-off

3.4 How to tell if you’re over/underfitting: cross-validation

3.5 Cross-validating our k-NN model

3.5.1 Hold-out cross-validation

3.5.2 k-fold cross-validation

3.5.3 Leave-one-out cross-validation

3.6 What algorithms can learn and what they must be told: parameters and hyperparameters

3.7 Tuning k to improve our model

3.7.1 Including hyperparameter tuning in our cross-validation

3.7.2 Using our model to make predictions

3.8 Strengths and weaknesses of k-NN

3.9 Summary

3.10 Solutions to exercises

4 Classifying based on odds: logistic regression

4.1 What is logistic regression?

4.1.1 How does logistic regression learn?

4.1.2 What if I have more than two classes?

4.2 Building our first logistic regression model

4.2.1 Loading and exploring the Titanic dataset

4.2.2 Making the most of the data: feature engineering and feature selection

4.2.3 Plotting the data

4.2.4 Training the model

4.2.5 Dealing with missing data

4.2.6 Training the model (take two)

4.3 Cross-validating our logistic regression model

4.3.1 Including missing value imputation in our cross-validation

4.3.2 Accuracy is the most important performance metric, right?

4.4 Interpreting the model: the odds ratio

4.4.1 Converting model parameters into odds ratios

4.4.2 When a one unit increase doesn’t make sense

4.5 Using our model to make predictions

4.6 Strengths and weaknesses of logistic regression

4.7 Summary

4.8 Solutions to exercises

5 Classifying by maximizing class separation: discriminant analysis

5.1 What is discriminant analysis?

5.1.1 How does discriminant analysis learn?

5.1.2 What if I have more than two classes?

5.1.3 Learning curves instead of straight lines: QDA

5.1.4 How do LDA and QDA make predictions?

5.2 Building our first linear and quadratic discriminant models

5.2.1 Loading and exploring the wine dataset

5.2.2 Plotting the data

5.2.3 Training the models

5.3 Strengths and weaknesses of LDA and QDA

5.4 Summary

5.5 Solutions to exercises

6 Classifying based on probabilities and hyperplanes: naive Bayes and support vector machines

6.1 What is the naive Bayes algorithm?

6.1.1 Using naive Bayes for classification

6.1.2 How is the likelihood calculated for categorical and continuous predictors?

6.2 Building our first naive Bayes model

6.2.1 Loading and exploring the HouseVotes84 dataset

6.2.2 Plotting the data

6.2.3 Training the model

6.3 Strengths and weaknesses of naive Bayes

6.4 What is the support vector machine (SVM) algorithm?

6.4.1 SVMs for linearly separable data

6.4.2 What if the classes aren’t fully separable?

6.4.3 SVMs for non-linearly separable data

6.4.4 Hyperparameters of the SVM algorithm

6.4.5 What if I have more than two classes?

6.5 Building our first SVM model

6.5.1 Loading and exploring the spam dataset

6.5.2 Tuning our hyperparameters

6.5.3 Training the model with the tuned hyperparameters

6.6 Cross-validating our SVM model

6.7 Strengths and weaknesses of the SVM algorithm

6.8 Summary

6.9 Solutions to exercises

7 Classifying with trees: decision trees

7.1 What is the recursive partitioning algorithm?

7.1.1 Using Gini gain to split the tree

7.1.2 What about continuous and multi-level categorical predictors?

7.1.3 Hyperparameters of the rpart algorithm

7.2 Building our first decision tree model

7.3 Loading and exploring the zoo dataset

7.4 Training the decision tree model

7.4.1 Training the model with the tuned hyperparameters

7.5 Cross-validating our decision tree model

7.6 Strengths and weaknesses of tree-based algorithms

7.7 Summary

8 Improving decision trees: random forests and gradient boosting

8.1 Ensemble techniques: bagging, boosting, and stacking

8.1.1 Training models on sampled data: bootstrap aggregating

8.1.2 Learning from the previous models' mistakes: boosting

8.1.3 Learning from predictions made by other models: stacking

8.2 Building our first random forest model

8.3 Building our first XGBoost model

8.4 Strengths and weaknesses of tree-based algorithms

8.5 Benchmarking algorithms against each other

8.6 Summary

Part 3: Regression

9 Regression with lines: linear regression

9.1 What is linear regression?

9.1.1 What if we have multiple predictors?

9.1.2 What if my predictors are categorical?

9.2 Building our first linear regression model

9.2.1 Loading and exploring the Ozone dataset

9.2.2 Imputing missing values

9.2.3 Automating feature selection

9.2.4 Including imputation and feature selection in our cross-validation

9.2.5 Interpreting the model

9.3 Strengths and weaknesses of linear regression

9.4 Summary

9.5 Solutions to exercises

10 When the relationships aren’t linear: generalized additive models

10.1 Making linear regression non-linear with polynomial terms

10.2 When we need even more flexibility: splines and generalized additive models

10.2.1 How do GAMs learn their smoothing functions?

10.2.2 How do GAMs handle categorical variables?

10.3 Building our first GAM

10.4 Strengths and weaknesses of GAMs

10.5 Summary

10.6 Solutions to exercises

11 Preventing overfitting: ridge regression, LASSO, and elastic net

11.1 What is regularization?

11.2 What is ridge regression?

11.3 What is the L2 norm and how does ridge regression use it?

11.4 What is the L1 norm and how does LASSO use it?

11.5 What is elastic net?

11.6 Building our first ridge, LASSO, and elastic net models

11.6.1 Loading and exploring the Iowa dataset

11.6.2 Training the ridge regression model

11.6.3 Training the LASSO model

11.6.4 Training the elastic net model

11.7 Benchmarking ridge, LASSO, elastic net, and OLS against each other

11.8 Strengths and weaknesses of ridge, LASSO, and elastic net

11.9 Summary

11.10 Solutions to exercises

12 Regression with distance and trees: k-nearest neighbors, random forest and XGBoost

12.1 Using k-nearest neighbors to predict a continuous variable

12.2 Using tree-based learners to predict a continuous variable

12.3 Building our first k-NN regression model

12.3.1 Loading and exploring the fuel dataset

12.3.2 Tuning the k hyperparameter

12.4 Building our first random forest regression model

12.5 Building our first XGBoost regression model

12.6 Benchmarking the k-NN, random forest, and XGBoost model-building processes

12.7 Strengths and weaknesses of k-NN, random forest, and XGBoost

12.8 Summary

12.9 Solutions to exercises

Part 4: Dimension reduction

13 Maximizing variance: principal component analysis

13.1 Why dimension reduction: visualization, the curse of dimensionality, and collinearity

13.1.1 Visualizing high-dimensional data

13.1.2 Consequences of the curse of dimensionality

13.1.3 Consequences of collinearity

13.1.4 Dimension reduction mitigates the curse of dimensionality and collinearity

13.2 What is principal component analysis?

13.3 Building our first principal components analysis model

13.3.1 Loading and exploring the banknote dataset

13.3.2 Performing PCA

13.3.3 Plotting the result of our PCA

13.3.4 Computing the component scores of new data

13.4 Strengths and weaknesses of PCA

13.5 Summary

13.6 Solutions to exercises

14 Maximizing similarity: t-SNE and UMAP

14.1 What is t-SNE?

14.2 Building our first t-SNE embedding

14.2.1 Performing t-SNE

14.2.2 Plotting the result of t-SNE

14.3 What is UMAP?

14.4 Building our first UMAP model

14.4.1 Performing UMAP

14.4.2 Plotting the result of UMAP

14.4.3 Computing the UMAP embeddings of new data

14.5 Strengths and weaknesses of t-SNE and UMAP

14.6 Summary

14.7 Solutions to exercises

15 Dimension reduction with networks and local structure: self-organizing maps and locally linear embedding

15.1 What are self-organizing maps?

15.1.1 Creating the grid of nodes

15.1.2 Randomly assigning weights, and placing cases in nodes

15.1.3 Updating the weights of the nodes to better match the cases inside them

15.2 Building our first SOM

15.2.1 Loading and exploring the flea dataset

15.2.2 Training the SOM

15.2.3 Plotting the SOM result

15.2.4 Mapping new data onto the SOM

15.3 What is locally linear embedding?

15.4 Building our first LLE

15.4.1 Loading and exploring the s curve dataset

15.4.2 Training the LLE

15.4.3 Plotting the LLE result

15.5 Building an LLE of our flea data

15.6 Strengths and weaknesses of SOMs and LLE

15.7 Summary

15.8 Solutions to exercises

Part 5: Clustering

16 Clustering by finding centers: k-means

16.1 What is k-means clustering?

16.1.1 How does Lloyd’s algorithm work?

16.1.2 How does MacQueen’s algorithm work?

16.1.3 How does the Hartigan-Wong algorithm work?

16.2 Building our first k-means model

16.2.1 Loading and exploring the GvHD dataset

16.2.2 Defining our task and learner

16.2.3 How do I choose the number of clusters?

16.2.4 Tuning k and the algorithm choice for our k-means model

16.2.5 Training the final, tuned k-means model

16.2.6 Using our model to predict clusters of new data

16.3 Strengths and weaknesses of k-means clustering

16.4 Summary

16.5 Solutions to exercises

17 Clustering by finding hierarchies: hierarchical clustering

17.1 What is hierarchical clustering?

17.1.1 Agglomerative hierarchical clustering

17.1.2 Divisive hierarchical clustering

17.2 Building our first agglomerative hierarchical clustering model

17.2.1 How do I choose the number of clusters?

17.2.2 Cutting the tree to select a flat set of clusters

17.3 How stable are my clusters?

17.4 Strengths and weaknesses of hierarchical clustering

17.5 Summary

17.6 Solutions to exercises

18 Clustering based on density: DBSCAN and OPTICS

18.1 What is density-based clustering?

18.1.1 How does the DBSCAN algorithm learn?

18.1.2 How does the OPTICS algorithm learn?

18.2 Building our first DBSCAN model

18.2.1 Loading and exploring the banknote dataset

18.2.2 Tuning the epsilon and minPts hyperparameters

18.3 Building our first OPTICS model

18.4 Strengths and weaknesses of density-based clustering

18.5 Summary

18.6 Solutions to exercises

19 Clustering based on the distribution of data: mixture model clustering

19.1 What is mixture model clustering?

19.1.1 The EM algorithm: calculating probabilities

19.1.2 The EM algorithm: expectation and maximization steps

19.1.3 What if we have more than one variable?

19.2 Building our first Gaussian mixture model for clustering

19.3 Strengths and weaknesses of mixture model clustering

19.4 Summary

19.5 Solutions to exercises

20 Final notes and further reading

20.1 A brief recap on machine learning concepts

20.1.1 Supervised, unsupervised, and semi-supervised learning

20.1.2 Balancing the bias-variance trade-off for model performance

20.1.3 Using model validation to identify over/underfitting

20.1.4 Hyperparameter tuning to maximize model performance

20.1.5 Using missing value imputation to deal with missing data

20.1.6 Feature engineering and feature selection help make the most of the available information

20.1.7 Ensemble techniques help improve model performance

20.1.8 Regularization prevents overfitting by penalizing the model parameters

20.2 Where can you go from here?

20.2.1 Deep learning

20.2.2 Reinforcement learning

20.2.3 General R data science and tidyverse

20.2.4 mlr tutorial and creating new learners/metrics

20.2.5 Generalized additive models (GAMs)

20.2.6 Ensemble methods

20.2.7 Support vector machines (SVMs)

20.2.8 Anomaly detection

20.2.9 Time series

20.2.10 Clustering

20.2.11 Generalized linear models

20.2.12 Semi-supervised learning

20.2.13 Modeling spectral data

20.3 The last word


Appendix A: A refresher on statistical concepts

About the Technology

Machine learning techniques accurately and efficiently identify patterns and relationships in data, then use the resulting models to make predictions about new data. ML techniques can work on even relatively small datasets, making these skills a powerful ally for nearly any data analysis task. The R programming language was designed with mathematical and statistical applications in mind. Small datasets are its sweet spot, and its modern data science tools, including the popular tidyverse collection of packages, make R a natural choice for ML.

About the book

Machine Learning with R, tidyverse, and mlr teaches you how to gain valuable insights from your data using the powerful R programming language. In his engaging and informal style, author and R expert Hefin Ioan Rhys lays a firm foundation of ML basics and introduces you to the tidyverse, a powerful set of R tools designed specifically for practical data science. Armed with the fundamentals, you’ll delve deeper into commonly used machine learning techniques including classification, regression, dimension reduction, and clustering algorithms, applying each to real data to make predictions on fun and interesting problems.

Using the tidyverse packages, you’ll transform, clean, and plot your data, onboarding data science best practices as you go. To simplify your learning process, you’ll also use R’s mlr package, an incredibly flexible interface for a variety of core algorithms that allows you to perform complicated ML tasks with minimal coding. You’ll explore essential concepts like overfitting, underfitting, validating model performance, and how to choose the best model for your task. Illuminating visuals provide clear explanations, cementing your new knowledge.
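As a small taste of the tidyverse style described above, here is a hedged sketch that transforms and plots the built-in CO2 dataset (the same dataset chapter 2 manipulates with dplyr). It assumes the dplyr and ggplot2 packages are installed; the particular summary chosen is purely illustrative.

```r
# A sketch of the tidyverse pipeline style: filter, group, summarise, plot.
# Assumes dplyr and ggplot2 are installed; CO2 ships with base R.
library(dplyr)
library(ggplot2)

CO2 %>%
  filter(Treatment == "chilled") %>%         # keep only chilled plants
  group_by(Type) %>%                         # one group per plant origin
  summarise(meanUptake = mean(uptake)) %>%   # average CO2 uptake per group
  ggplot(aes(x = Type, y = meanUptake)) +    # plot the summarised data
  geom_col()
```

Chaining verbs with the %>% pipe like this, so each step feeds the next, is the data-wrangling idiom the book uses before handing clean data off to mlr.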

Whether you’re tackling business problems, crunching research data, or just a data-minded developer, you’ll be building your own ML pipelines in no time with this hands-on tutorial!

What's inside

  • Commonly used ML techniques
  • Using the tidyverse packages to organize and plot your data
  • Validating model performance
  • Choosing the best ML model for your task
  • A variety of hands-on coding exercises
  • ML best practices

About the reader

For readers with basic programming skills in R, Python, or another standard programming language.

About the author

Hefin Ioan Rhys is a senior laboratory research scientist in the Flow Cytometry Shared Technology Platform at The Francis Crick Institute. He spent the final year of his PhD program teaching basic R skills at the university. A data science and machine learning enthusiast, he has his own YouTube channel featuring screencast tutorials in R and RStudio.

Manning Early Access Program (MEAP): Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it’s in bookstores.
MEAP combo $49.99 pBook + eBook + liveBook
MEAP eBook $39.99 pdf + ePub + kindle + liveBook
