Machine Learning with R, the tidyverse, and mlr
Hefin I. Rhys
  • March 2020
  • ISBN 9781617296574
  • 536 pages
  • printed in black & white

Easy language, clear explanations, good examples...I love this book!

Mario Giesel, Mediaplus
Machine learning (ML) is a collection of programming techniques for discovering relationships in data. With ML algorithms, you can cluster and classify data for tasks like making recommendations or detecting fraud, and make predictions for sales trends, risk analysis, and other forecasts. Once the domain of academic data scientists, machine learning has become a mainstream business process, and tools like the easy-to-learn R programming language put high-quality data analysis in the hands of any programmer. Machine Learning with R, the tidyverse, and mlr teaches you widely used ML techniques and how to apply them to your own datasets using the R programming language and its powerful ecosystem of tools. This book will get you started!

About the book

Machine Learning with R, the tidyverse, and mlr gets you started in machine learning using RStudio and the awesome mlr machine learning package. This practical guide simplifies theory and avoids needlessly complicated statistics or math. All core ML techniques are clearly explained through graphics and easy-to-grasp examples. In each engaging chapter, you’ll put a new algorithm into action to solve a quirky predictive analysis problem, including Titanic survival odds, spam email filtering, and poisoned wine investigation.

Table of Contents

Part 1: Introduction

1 Introduction to machine learning

1.1 What is machine learning?

1.1.1 Artificial intelligence and machine learning

1.1.2 The difference between a model and an algorithm

1.2 Classes of machine learning algorithms

1.2.1 Differences between supervised, unsupervised, and semi-supervised learning

1.2.2 Classification, regression, dimension reduction, and clustering

1.2.3 A brief word on deep learning

1.3 Why use R for machine learning?

1.4 Which datasets will we use?

1.5 What will you learn in this book?


2 Tidying, manipulating, and plotting data with the tidyverse

2.1 What is the tidyverse, and what is tidy data?

2.2 Loading the tidyverse

2.3 What the tibble package is and what it does

2.3.1 Creating tibbles

2.3.2 Converting existing data frames into tibbles

2.3.3 Differences between data frames and tibbles

2.4 What the dplyr package is and what it does

2.4.1 Manipulating the CO2 dataset with dplyr

2.4.2 Chaining dplyr functions together

2.5 What the ggplot2 package is and what it does

2.6 What the tidyr package is and what it does

2.7 What the purrr package is and what it does

2.7.1 Replacing for loops with map()

2.7.2 Returning an atomic vector instead of a list

2.7.3 Using anonymous functions inside the map() family

2.7.4 Using walk() to produce a function’s side effects

2.7.5 Iterating over multiple lists simultaneously


Solutions to exercises

Part 2: Classification

3 Classifying based on similarities with k-nearest neighbors

3.1 What is the k-nearest neighbors algorithm?

3.1.1 How does the k-nearest neighbors algorithm learn?

3.1.2 What happens if the vote is tied?

3.2 Building your first kNN model

3.2.1 Loading and exploring the diabetes dataset

3.2.2 Using mlr to train your first kNN model

3.2.3 Telling mlr what we’re trying to achieve: defining the task

3.2.4 Telling mlr which algorithm to use: defining the learner

3.2.5 Putting it all together: training the model

3.3 Balancing two sources of model error: The bias-variance trade-off

3.4 Using cross-validation to tell if we’re overfitting or underfitting

3.5 Cross-validating our kNN model

3.5.1 Holdout cross-validation

3.5.2 K-fold cross-validation

3.5.3 Leave-one-out cross-validation

3.6 What algorithms can learn, and what they must be told: Parameters and hyperparameters

3.7 Tuning k to improve the model

3.7.1 Including hyperparameter tuning in our cross-validation

3.7.2 Using our model to make predictions

3.8 Strengths and weaknesses of kNN


Solutions to exercises

4 Classifying based on odds with logistic regression

4.1 What is logistic regression?

4.1.1 How does logistic regression learn?

4.1.2 What if I have more than two classes?

4.2 Building your first logistic regression model

4.2.1 Loading and exploring the Titanic dataset

4.2.2 Making the most of the data: feature engineering and feature selection

4.2.3 Plotting the data

4.2.4 Training the model

4.2.5 Dealing with missing data

4.2.6 Training the model (take two)

4.3 Cross-validating the logistic regression model

4.3.1 Including missing value imputation in our cross-validation

4.3.2 Accuracy is the most important performance metric, right?

4.4 Interpreting the model: The odds ratio

4.4.1 Converting model parameters into odds ratios

4.4.2 When a one-unit increase doesn’t make sense

4.5 Using our model to make predictions

4.6 Strengths and weaknesses of logistic regression


Solutions to exercises

5 Classifying by maximizing separation with discriminant analysis

5.1 What is discriminant analysis?

5.1.1 How does discriminant analysis learn?

5.1.2 What if I have more than two classes?

5.1.3 Learning curves instead of straight lines: QDA

5.1.4 How do LDA and QDA make predictions?

5.2 Building your first linear and quadratic discriminant models

5.2.1 Loading and exploring the wine dataset

5.2.2 Plotting the data

5.2.3 Training the models

5.3 Strengths and weaknesses of LDA and QDA


Solutions to exercises

6 Classifying with naive Bayes and support vector machines

6.1 What is the naive Bayes algorithm?

6.1.1 Using naive Bayes for classification

6.1.2 How is the likelihood calculated for categorical and continuous predictors?

6.2 Building your first naive Bayes model

6.2.1 Loading and exploring the HouseVotes84 dataset

6.2.2 Plotting the data

6.2.3 Training the model

6.3 Strengths and weaknesses of naive Bayes

6.4 What is the support vector machine (SVM) algorithm?

6.4.1 SVMs for linearly-separable data

6.4.2 What if the classes aren’t fully separable?

6.4.3 SVMs for non-linearly separable data

6.4.4 Hyperparameters of the SVM algorithm

6.4.5 What if I have more than two classes?

6.5 Building your first SVM model

6.5.1 Loading and exploring the spam dataset

6.5.2 Tuning our hyperparameters

6.5.3 Training the model with the tuned hyperparameters

6.6 Cross-validating our SVM model

6.7 Strengths and weaknesses of the SVM algorithm


Solutions to exercises

7 Classifying with decision trees

7.1 What is the recursive partitioning algorithm?

7.1.1 Using Gini gain to split the tree

7.1.2 What about continuous and multi-level categorical predictors?

7.1.3 Hyperparameters of the rpart algorithm

7.2 Building your first decision tree model

7.3 Loading and exploring the zoo dataset

7.4 Training the decision tree model

7.4.1 Training the model with the tuned hyperparameters

7.5 Cross-validating our decision tree model

7.6 Strengths and weaknesses of tree-based algorithms


8 Improving decision trees with random forests and boosting

8.1 Ensemble techniques: Bagging, boosting, and stacking

8.1.1 Training models on sampled data: bootstrap aggregating

8.1.2 Learning from the previous models' mistakes: boosting

8.1.3 Learning from predictions made by other models: stacking

8.2 Building your first random forest model

8.3 Building your first XGBoost model

8.4 Strengths and weaknesses of tree-based algorithms

8.5 Benchmarking algorithms against each other


Part 3: Regression

9 Linear regression

9.1 What is linear regression?

9.1.1 What if we have multiple predictors?

9.1.2 What if my predictors are categorical?

9.2 Building your first linear regression model

9.2.1 Loading and exploring the Ozone dataset

9.2.2 Imputing missing values

9.2.3 Automating feature selection

9.2.4 Including imputation and feature selection in our cross-validation

9.2.5 Interpreting the model

9.3 Strengths and weaknesses of linear regression


Solutions to exercises

10 Nonlinear regression with generalized additive models

10.1 Making linear regression nonlinear with polynomial terms

10.2 More flexibility: Splines and generalized additive models

10.2.1 How do GAMs learn their smoothing functions?

10.2.2 How do GAMs handle categorical variables?

10.3 Building your first GAM

10.4 Strengths and weaknesses of GAMs


Solutions to exercises

11 Preventing overfitting with ridge regression, LASSO, and elastic net

11.1 What is regularization?

11.2 What is ridge regression?

11.3 What is the L2 norm, and how does ridge regression use it?

11.4 What is the L1 norm, and how does LASSO use it?

11.5 What is elastic net?

11.6 Building your first ridge, LASSO, and elastic net models

11.6.1 Loading and exploring the Iowa dataset

11.6.2 Training the ridge regression model

11.6.3 Training the LASSO model

11.6.4 Training the elastic net model

11.7 Benchmarking ridge, LASSO, elastic net, and OLS against each other

11.8 Strengths and weaknesses of ridge, LASSO, and elastic net


Solutions to exercises

12 Regression with kNN, random forest, and XGBoost

12.1 Using k-nearest neighbors to predict a continuous variable

12.2 Using tree-based learners to predict a continuous variable

12.3 Building your first kNN regression model

12.3.1 Loading and exploring the fuel dataset

12.3.2 Tuning the k hyperparameter

12.4 Building your first random forest regression model

12.5 Building your first XGBoost regression model

12.6 Benchmarking the kNN, random forest, and XGBoost model-building processes

12.7 Strengths and weaknesses of kNN, random forest, and XGBoost


Solutions to exercises

Part 4: Dimension reduction

13 Maximizing variance with principal component analysis

13.1 Why dimension reduction?

13.1.1 Visualizing high-dimensional data

13.1.2 Consequences of the curse of dimensionality

13.1.3 Consequences of colinearity

13.1.4 Dimension reduction mitigates the curse of dimensionality and colinearity

13.2 What is principal component analysis?

13.3 Building your first PCA model

13.3.1 Loading and exploring the banknote dataset

13.3.2 Performing PCA

13.3.3 Plotting the result of our PCA

13.3.4 Computing the component scores of new data

13.4 Strengths and weaknesses of PCA


Solutions to exercises

14 Maximizing similarity with t-SNE and UMAP

14.1 What is t-SNE?

14.2 Building your first t-SNE embedding

14.2.1 Performing t-SNE

14.2.2 Plotting the result of t-SNE

14.3 What is UMAP?

14.4 Building your first UMAP model

14.4.1 Performing UMAP

14.4.2 Plotting the result of UMAP

14.4.3 Computing the UMAP embeddings of new data

14.5 Strengths and weaknesses of t-SNE and UMAP


Solutions to exercises

15 Self-organizing maps and locally linear embedding

15.1 Prerequisites: Grids of nodes and manifolds

15.1.1 Creating the grid of nodes

15.1.2 Randomly assigning weights, and placing cases in nodes

15.1.3 Updating the weights of the nodes to better match the cases inside them

15.2 What are self-organizing maps?

15.2.1 Loading and exploring the flea dataset

15.2.2 Training the SOM

15.2.3 Plotting the SOM result

15.2.4 Mapping new data onto the SOM

15.3 Building your first SOM

15.4 What is locally linear embedding?

15.4.1 Loading and exploring the s curve dataset

15.4.2 Training the LLE

15.4.3 Plotting the LLE result

15.5 Building your first LLE

15.6 Building an LLE of our flea data

15.7 Strengths and weaknesses of SOMs and LLE


Solutions to exercises

Part 5: Clustering

16 Clustering by finding centers with k-means

16.1 What is k-means clustering?

16.1.1 How does Lloyd’s algorithm work?

16.1.2 How does MacQueen’s algorithm work?

16.1.3 Hartigan-Wong algorithm

16.2 Building your first k-means model

16.2.1 Loading and exploring the GvHD dataset

16.2.2 Defining our task and learner

16.2.3 How do I choose the number of clusters?

16.2.4 Tuning k and the algorithm choice for our k-means model

16.2.5 Training the final, tuned k-means model

16.2.6 Using our model to predict clusters of new data

16.3 Strengths and weaknesses of k-means clustering


Solutions to exercises

17 Hierarchical clustering

17.1 What is hierarchical clustering?

17.1.1 Agglomerative hierarchical clustering

17.1.2 Divisive hierarchical clustering

17.2 Building your first agglomerative hierarchical clustering model

17.2.1 Choosing the number of clusters

17.2.2 Cutting the tree to select a flat set of clusters

17.3 How stable are our clusters?

17.4 Strengths and weaknesses of hierarchical clustering


Solutions to exercises

18 Clustering based on density: DBSCAN and OPTICS

18.1 What is density-based clustering?

18.1.1 How does the DBSCAN algorithm learn?

18.1.2 How does the OPTICS algorithm learn?

18.2 Building your first DBSCAN model

18.2.1 Loading and exploring the banknote dataset

18.2.2 Tuning the epsilon and minPts hyperparameters

18.3 Building your first OPTICS model

18.4 Strengths and weaknesses of density-based clustering


Solutions to exercises

19 Clustering based on distributions with mixture modeling

19.1 What is mixture model clustering?

19.1.1 Calculating probabilities with the EM algorithm

19.1.2 EM algorithm expectation and maximization steps

19.1.3 What if we have more than one variable?

19.2 Building your first Gaussian mixture model for clustering

19.3 Strengths and weaknesses of mixture model clustering


Solutions to exercises

20 Final notes and further reading

20.1 A brief recap of machine learning concepts

20.1.1 Supervised, unsupervised, and semi-supervised learning

20.1.2 Balancing the bias-variance trade-off for model performance

20.1.3 Using model validation to identify over-/underfitting

20.1.4 Maximizing model performance with hyperparameter tuning

20.1.5 Using missing value imputation to deal with missing data

20.1.6 Feature engineering and feature selection

20.1.7 Improving model performance with ensemble techniques

20.1.8 Preventing overfitting with regularization

20.2 Where can you go from here?

20.2.1 Deep learning

20.2.2 Reinforcement learning

20.2.3 General R data science and the tidyverse

20.2.4 mlr tutorial and creating new learners/metrics

20.2.5 Generalized additive models

20.2.6 Ensemble methods

20.2.7 Support vector machines

20.2.8 Anomaly detection

20.2.9 Time series

20.2.10 Clustering

20.2.11 Generalized linear models

20.2.12 Semi-supervised learning

20.2.13 Modeling spectral data

20.3 The last word


Appendix A: A refresher on statistical concepts

A.1 Data vocabulary

A.1.1 Sample vs. population

A.1.2 Rows and columns

A.1.3 Variable types

A.2 Vectors

A.3 Distributions

A.4 Sigma notation

A.5 Central tendency

A.5.1 Arithmetic mean

A.5.2 Median

A.5.3 Mode

A.6 Measures of dispersion

A.6.1 Mean absolute deviation

A.6.2 Standard deviation

A.6.3 Variance

A.6.4 Interquartile range

A.7 Measures of the relationships between variables

A.7.1 Covariance

A.7.2 Pearson correlation coefficient

A.8 Logarithms

What's inside

  • Using the tidyverse packages to process and plot your data
  • Techniques for supervised and unsupervised learning
  • Classification, regression, dimension reduction, and clustering algorithms
  • Statistics primer to fill gaps in your knowledge

About the reader

For newcomers to machine learning with basic skills in R.

About the author

Hefin I. Rhys is a senior laboratory research scientist at the Francis Crick Institute. He runs his own YouTube channel of screencast tutorials for R and RStudio.

print book $34.99 $49.99 pBook + eBook + liveBook
Additional shipping charges may apply

eBook $27.99 $39.99 3 formats + liveBook
