Machine Learning with R, tidyverse, and mlr
Hefin I. Rhys
  • MEAP began March 2019
  • Publication in Spring 2020 (estimated)
  • ISBN 9781617296574
  • 300 pages (estimated)
  • printed in black & white

"A great combination of statistics and code."

Ron Lease
Machine learning is a collection of programming techniques for discovering relationships in data. With ML algorithms, you can cluster and classify data for tasks like making recommendations or detecting fraud, and you can make predictions for sales trends, risk analysis, and other forecasts. Once the domain of academic data scientists, machine learning has become a mainstream business process, and tools like the easy-to-learn R programming language put high-quality data analysis in the hands of any programmer. Machine Learning with R, tidyverse, and mlr teaches you widely used ML techniques and how to apply them to your own datasets using the R programming language and its powerful ecosystem of tools. This book will get you started!
Table of Contents

Part 1: Introduction

1 Introduction

1.1 What is machine learning?

1.1.1 Artificial intelligence and machine learning

1.1.2 The difference between a model and an algorithm

1.2 Classes of machine learning algorithms

1.2.1 Differences between supervised, unsupervised, and semi-supervised learning

1.2.2 Classification, regression, dimension reduction, and clustering

1.2.3 A brief word on deep learning

1.2.4 Why use R for machine learning?

1.3 Which datasets will we use?

1.4 What you will learn in this book

1.5 Summary

2 Tidying, manipulating and plotting data with the tidyverse

2.1 What is the tidyverse and what is tidy data?

2.2 Loading the tidyverse

2.3 What the tibble package is and what it does

2.3.1 Creating tibbles

2.3.2 Converting existing data frames into tibbles

2.3.3 Differences between data frames and tibbles

2.4 What the dplyr package is and what it does

2.4.1 Manipulating the CO2 dataset with dplyr

2.4.2 Chaining dplyr functions together

2.5 What the ggplot2 package is and what it does

2.6 What the tidyr package is and what it does

2.7 What the purrr package is and what it does

2.7.1 Replacing for loops with map()

2.7.2 Returning an atomic vector instead of a list

2.7.3 Using anonymous functions inside the map() family

2.7.4 Using walk() to produce a function’s side effects

2.7.5 Iterating over multiple lists simultaneously

2.8 Summary

2.9 Solutions to exercises

Part 2: Classification

3 Classifying based on similar observations: the k-nearest neighbors algorithm

3.1 What is the k-nearest neighbors algorithm?

3.1.1 How does the k-nearest neighbors algorithm learn?

3.1.2 What happens if the vote is tied?

3.2 Building our first k-NN model

3.2.1 Loading and exploring the diabetes dataset

3.2.2 Using mlr to train your first k-NN model

3.2.3 Telling mlr what we’re trying to achieve: defining the task

3.2.4 Telling mlr which algorithm to use: defining the learner

3.2.5 Putting it all together: training the model

3.3 Balancing two sources of model error: the bias-variance trade-off

3.4 How to tell if you’re over/underfitting: cross-validation

3.5 Cross-validating our k-NN model

3.5.1 Hold-out cross-validation

3.5.2 k-fold cross-validation

3.5.3 Leave-one-out cross-validation

3.6 What algorithms can learn and what they must be told: parameters and hyperparameters

3.7 Tuning k to improve our model

3.7.1 Including hyperparameter tuning in our cross-validation

3.7.2 Using our model to make predictions

3.8 Strengths and weaknesses of k-NN

3.9 Summary

3.10 Solutions to exercises

4 Classifying based on odds: logistic regression

4.1 What is logistic regression?

4.1.1 How does logistic regression learn?

4.1.2 What if I have more than two classes?

4.2 Building our first logistic regression model

4.2.1 Loading and exploring the Titanic dataset

4.2.2 Making the most of the data: feature engineering and feature selection

4.2.3 Plotting the data

4.2.4 Training the model

4.2.5 Dealing with missing data

4.2.6 Training the model (take two)

4.3 Cross-validating our logistic regression model

4.3.1 Including missing value imputation in our cross-validation

4.3.2 Accuracy is the most important performance metric, right?

4.4 Interpreting the model: the odds ratio

4.4.1 Converting model parameters into odds ratios

4.4.2 When a one unit increase doesn’t make sense

4.5 Using our model to make predictions

4.6 Strengths and weaknesses of logistic regression

4.7 Summary

4.8 Solutions to exercises

5 Classifying by maximizing class separation: discriminant analysis

5.1 What is discriminant analysis?

5.1.1 How does discriminant analysis learn?

5.1.2 What if I have more than two classes?

5.1.3 Learning curves instead of straight lines: QDA

5.1.4 How do LDA and QDA make predictions?

5.2 Building our first linear and quadratic discriminant models

5.2.1 Loading and exploring the wine dataset

5.2.2 Plotting the data

5.2.3 Training the models

5.3 Strengths and weaknesses of LDA and QDA

5.4 Summary

5.5 Solutions to exercises

6 Classifying based on probabilities and hyperplanes: naive Bayes and support vector machines

6.1 What is the naive Bayes algorithm?

6.1.1 Using naive Bayes for classification

6.1.2 How is the likelihood calculated for categorical and continuous predictors?

6.2 Building our first naive Bayes model

6.2.1 Loading and exploring the HouseVotes84 dataset

6.2.2 Plotting the data

6.2.3 Training the model

6.3 Strengths and weaknesses of naive Bayes

6.4 What is the support vector machine (SVM) algorithm?

6.4.1 SVMs for linearly-separable data

6.4.2 SVMs for non-linearly separable data

6.4.3 Hyperparameters of the SVM algorithm

6.4.4 What if I have more than two classes?

6.5 Building our first SVM model

6.5.1 Loading and exploring the spam dataset

6.5.2 Tuning our hyperparameters

6.5.3 Training the model with the tuned hyperparameters

6.6 Cross-validating our SVM model

6.7 Strengths and weaknesses of the SVM algorithm

6.8 Summary

6.9 Solutions to exercises

7 Classifying with trees: decision trees, random forests, and gradient boosting

7.1 What is the recursive partitioning algorithm?

7.1.1 Using Gini gain to split the tree

7.1.2 What about continuous and multi-level categorical predictors?

7.1.3 Hyperparameters of the rpart algorithm

7.2 Building our first decision tree model

7.3 Loading and exploring the zoo dataset

7.4 Training the decision tree model

7.4.1 Training the model with the tuned hyperparameters

7.5 Cross-validating our decision tree model

7.6 Ensemble techniques: bagging, boosting, and stacking

7.6.1 Training models on sampled data: bootstrap aggregating

7.6.2 Learning from the previous models' mistakes: boosting

7.6.3 Learning from predictions made by other models: stacking

7.7 Building our first random forest model

7.8 Building our first XGBoost model

7.9 Strengths and weaknesses of tree-based algorithms

7.10 Benchmarking algorithms against each other

7.11 Summary

Part 3: Regression

8 Regression with lines: linear regression and generalized additive models

8.1 What is linear regression?

8.1.1 What if we have multiple predictors?

8.1.2 What if my predictors are categorical?

8.2 When the relationship isn’t linear: polynomial terms

8.3 When we need even more flexibility: splines and generalized additive models

8.4 Building our first linear regression model

8.4.1 Loading and exploring the Ozone dataset

8.4.2 Imputing missing values

8.4.3 Automating feature selection

8.4.4 Including imputation and feature selection in our cross-validation

8.4.5 Interpreting the model

8.5 Building our first GAM

8.6 Strengths and weaknesses of linear regression and GAMs

8.7 Summary

8.8 Solutions to exercises

9 Preventing overfitting: ridge regression, LASSO, and elastic net

9.1 What is regularization?

9.2 What is ridge regression?

9.3 What is the L2 norm and how does ridge regression use it?

9.4 What is the L1 norm and how does LASSO use it?

9.5 What is elastic net?

9.6 Building our first ridge, LASSO, and elastic net models

9.6.1 Loading and exploring the Iowa dataset

9.6.2 Training the ridge regression model

9.6.3 Training the LASSO model

9.6.4 Training the elastic net model

9.7 Benchmarking ridge, LASSO, elastic net, and OLS against each other

9.8 Strengths and weaknesses of ridge, LASSO, and elastic net

9.9 Summary

9.10 Solutions to exercises

10 Regression with distance and trees: k-nearest neighbors, random forest, and XGBoost

10.1 Using k-nearest neighbors to predict a continuous variable

10.2 Using tree-based learners to predict a continuous variable

10.3 Building our first k-NN regression model

10.3.1 Loading and exploring the fuel dataset

10.3.2 Tuning the k hyperparameter

10.4 Building our first random forest regression model

10.5 Building our first XGBoost regression model

10.6 Benchmarking the k-NN, random forest, and XGBoost model-building processes

10.7 Strengths and weaknesses of k-NN, random forest, and XGBoost

10.8 Summary

10.9 Solutions to exercises

Part 4: Dimension reduction

11 Maximizing variance and similarity: principal component analysis, t-SNE, and UMAP

11.1 Why dimension reduction: visualization, the curse of dimensionality, and collinearity

11.1.1 Visualizing a dataset

11.1.2 Curse of dimensionality

11.1.3 Collinearity

11.1.4 Dimension reduction mitigates the curse of dimensionality and collinearity

11.2 What is principal component analysis?

11.3 Building our first principal component analysis model

11.3.1 Loading and exploring the banknote dataset

11.3.2 Performing PCA

11.3.3 Plotting the result of our PCA

11.3.4 Computing the component scores of new data

11.4 What is t-SNE?

11.5 Building our first t-SNE embedding

11.5.1 Performing t-SNE

11.5.2 Plotting the result of t-SNE

11.6 What is UMAP?

11.7 Building our first UMAP model

11.7.1 Performing UMAP

11.7.2 Plotting the result of UMAP

11.7.3 Computing the UMAP embeddings of new data

11.8 Strengths and weaknesses of PCA, t-SNE, and UMAP

11.9 Summary

11.10 Solutions to exercises

12 Dimension reduction with networks and local structure: self-organizing maps and locally linear embedding

12.1 What are self-organizing maps?

12.2 Building our first SOM

12.2.1 Loading and exploring the flea dataset

12.2.2 Training the SOM

12.2.3 Plotting the SOM result

12.2.4 Mapping new data onto the SOM

12.3 What is locally linear embedding?

12.4 Building our first LLE

12.4.1 Loading and exploring the s curve dataset

12.4.2 Training the LLE

12.4.3 Plotting the LLE result

12.5 Building an LLE of our flea data

12.6 Strengths and weaknesses of SOMs and LLE

12.7 Summary

12.8 Solutions to exercises

Part 5: Clustering

13 Clustering by finding centers and hierarchies: k-means and hierarchical clustering

14 Clustering based on the distribution of data: density and mixture model clustering

15 Final notes and further reading


Appendix A: A refresher on statistical concepts

About the Technology

Machine learning techniques accurately and efficiently identify patterns and relationships in data and use those patterns to make predictions about new data. ML techniques work on even relatively small datasets, making them a powerful ally for nearly any data analysis task. The R programming language was designed with mathematical and statistical applications in mind. Small datasets are its sweet spot, and its modern data science tools, including the popular tidyverse collection of packages, make R a natural choice for ML.

About the book

Machine Learning with R, tidyverse, and mlr teaches you how to gain valuable insights from your data using the powerful R programming language. In his engaging and informal style, author and R expert Hefin Ioan Rhys lays a firm foundation of ML basics and introduces you to the tidyverse, a powerful set of R tools designed specifically for practical data science. Armed with the fundamentals, you'll delve deeper into commonly used machine learning techniques, including classification, regression, dimension reduction, and clustering algorithms, applying each to real data to make predictions on fun and interesting problems.

Using the tidyverse packages, you’ll transform, clean, and plot your data, onboarding data science best practices as you go. To simplify your learning process, you’ll also use R’s mlr package, an incredibly flexible interface for a variety of core algorithms that allows you to perform complicated ML tasks with minimal coding. You’ll explore essential concepts like overfitting, underfitting, validating model performance, and how to choose the best model for your task. Illuminating visuals provide clear explanations, cementing your new knowledge.
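The mlr workflow the book builds on follows one consistent pattern: define a task (the data and the target you want to predict), define a learner (the algorithm and its hyperparameters), then train and evaluate. As a taste of how little code that takes, here is a minimal sketch; the built-in iris dataset and the "classif.knn" learner are chosen here purely for illustration, not taken from the book's own examples:

```r
library(mlr)

# Define the task: which data to use and what we're trying to predict
task <- makeClassifTask(data = iris, target = "Species")

# Define the learner: which algorithm to use, with its hyperparameters
learner <- makeLearner("classif.knn", k = 3)

# Train the model
model <- train(learner, task)

# Make predictions and measure accuracy
pred <- predict(model, newdata = iris)
performance(pred, measures = acc)
```

The chapters' model-building sections (defining the task, defining the learner, training the model) all follow this same three-step shape, swapping in different learners and datasets.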

Whether you’re tackling business problems, crunching research data, or just a data-minded developer, you’ll be building your own ML pipelines in no time with this hands-on tutorial!

What's inside

  • Commonly used ML techniques
  • Using the tidyverse packages to organize and plot your data
  • Validating model performance
  • Choosing the best ML model for your task
  • A variety of hands-on coding exercises
  • ML best practices

About the reader

For readers with basic programming skills in R, Python, or another standard programming language.

About the author

Hefin Ioan Rhys is a senior laboratory research scientist in the Flow Cytometry Shared Technology Platform at The Francis Crick Institute. He spent the final year of his PhD program teaching basic R skills at his university. A data science and machine learning enthusiast, he has his own YouTube channel featuring screencast tutorials in R and RStudio.

Manning Early Access Program (MEAP) Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.