Machine Learning with R, tidyverse, and mlr
Hefin I. Rhys
  • MEAP began March 2019
  • Publication in Early 2020 (estimated)
  • ISBN 9781617296574
  • 300 pages (estimated)
  • printed in black & white
Machine learning is a collection of programming techniques for discovering relationships in data. With ML algorithms, you can cluster and classify data for tasks like making recommendations or detecting fraud, and you can make predictions for sales trends, risk analysis, and other forecasts. Once the domain of academic data scientists, machine learning has become a mainstream business process, and tools like the easy-to-learn R programming language put high-quality data analysis in the hands of any programmer. Machine Learning with R, tidyverse, and mlr teaches you widely used ML techniques and how to apply them to your own datasets using the R programming language and its powerful ecosystem of tools. This book will get you started!
Table of Contents

Part 1: Introduction

1 Introduction to machine learning

1.1 What is machine learning?

1.1.1 Artificial intelligence and machine learning

1.1.2 The difference between a model and an algorithm

1.2 Classes of machine learning algorithms

1.2.1 Differences between supervised, unsupervised, and semi-supervised learning

1.2.2 Classification, regression, dimension reduction, and clustering

1.2.3 A brief word on deep learning

1.3 Why use R for machine learning?

1.4 Which datasets will we use?

1.5 What will you learn in this book?

1.6 Summary

2 Tidying, manipulating and plotting data with the tidyverse

2.1 What is the tidyverse and what is tidy data?

2.2 Loading the tidyverse

2.3 What the tibble package is and what it does

2.3.1 Creating tibbles

2.3.2 Converting existing data frames into tibbles

2.3.3 Differences between data frames and tibbles

2.4 What the dplyr package is and what it does

2.4.1 Manipulating the CO2 dataset with dplyr

2.4.2 Chaining dplyr functions together

2.5 What the ggplot2 package is and what it does

2.6 What the tidyr package is and what it does

2.7 Summary

2.8 Solutions to exercises

Part 2: Classification

3 Classifying based on similar observations: the k-nearest neighbors algorithm

3.1 What is the k-nearest neighbors algorithm?

3.1.1 How does the k-nearest neighbors algorithm learn?

3.1.2 What happens if the vote is tied?

3.2 Building our first k-NN model

3.2.1 Loading and exploring the diabetes dataset

3.2.2 Using mlr to train your first k-NN model

3.2.3 Telling mlr what we’re trying to achieve: defining the task

3.2.4 Telling mlr which algorithm to use: defining the learner

3.2.5 Putting it all together: training the model

3.3 Balancing two sources of model error: the bias-variance trade-off

3.4 How to tell if you’re over/underfitting: cross-validation

3.5 Cross-validating our k-NN model

3.5.1 Hold-out cross-validation

3.5.2 k-fold cross-validation

3.5.3 Leave-one-out cross-validation

3.6 What algorithms can learn and what they must be told: parameters and hyperparameters

3.7 Tuning k to improve our model

3.7.1 Including hyperparameter tuning in our cross-validation

3.7.2 Using our model to make predictions

3.8 Strengths and weaknesses of k-NN

3.9 Summary

3.10 Solutions to exercises

4 Classifying based on odds: logistic regression

4.1 What is logistic regression?

4.1.1 How does logistic regression learn?

4.1.2 What if I have more than two classes?

4.2 Building our first logistic regression model

4.2.1 Loading and exploring the Titanic dataset

4.2.2 Making the most of the data: feature engineering and feature selection

4.2.3 Plotting the data

4.2.4 Training the model

4.2.5 Dealing with missing data

4.2.6 Training the model (take two)

4.3 Cross-validating our logistic regression model

4.3.1 Including missing value imputation in our cross-validation

4.3.2 Accuracy is the most important performance metric, right?

4.4 Interpreting the model: the odds ratio

4.4.1 Converting model parameters into odds ratios

4.4.2 When a one unit increase doesn’t make sense

4.5 Using our model to make predictions

4.6 Strengths and weaknesses of logistic regression

4.7 Summary

4.8 Solutions to exercises

5 Classifying by maximizing class separation: discriminant analysis

6 Classifying based on probabilities and hyperplanes: naive Bayes and support vector machines

7 Classifying with trees: Decision trees, random forests and gradient boosting

Part 3: Regression

8 Regression with lines: Linear, polynomial and spline regression

9 Preventing overfitting in regression: Ridge regression, LASSO and elastic net

10 Regression with distance and trees: k-nearest neighbors, random forest and XGBoost

Part 4: Dimension reduction

11 Maximizing variance and similarity: Principal components analysis and t-SNE

12 Dimension reduction with networks and local structure: Self-organizing maps and locally linear embedding

Part 5: Clustering

13 Clustering by finding centers and hierarchies in data: k-means and hierarchical clustering

14 Clustering based on the distribution of data: Density and mixture model clustering

15 Final notes and further reading

About the Technology

Machine learning techniques accurately and efficiently identify patterns and relationships in data and use the resulting models to make predictions about new data. ML techniques can work on even relatively small datasets, making these skills a powerful ally for nearly any data analysis task. The R programming language was designed with mathematical and statistical applications in mind. Small datasets are its sweet spot, and its modern data science tools, including the popular tidyverse collection of packages, make R a natural choice for ML.

About the book

Machine Learning with R, tidyverse, and mlr teaches you how to gain valuable insights from your data using the powerful R programming language. In his engaging and informal style, author and R expert Hefin Ioan Rhys lays a firm foundation of ML basics and introduces you to the tidyverse, a powerful set of R tools designed specifically for practical data science. Armed with the fundamentals, you’ll delve deeper into commonly used machine learning techniques including classification, regression, dimension reduction, and clustering algorithms, applying each to real data to make predictions on fun and interesting problems.

Using the tidyverse packages, you’ll transform, clean, and plot your data, picking up data science best practices as you go. To simplify your learning process, you’ll also use R’s mlr package, an incredibly flexible interface for a variety of core algorithms that allows you to perform complicated ML tasks with minimal coding. You’ll explore essential concepts like overfitting, underfitting, validating model performance, and how to choose the best model for your task. Illuminating visuals provide clear explanations, cementing your new knowledge.

Whether you’re tackling business problems, crunching research data, or simply working as a data-minded developer, you’ll be building your own ML pipelines in no time with this hands-on tutorial!

What's inside

  • Commonly used ML techniques
  • Using the tidyverse packages to organize and plot your data
  • Validating model performance
  • Choosing the best ML model for your task
  • A variety of hands-on coding exercises
  • ML best practices

About the reader

For readers with basic programming skills in R, Python, or another standard programming language.

About the author

Hefin Ioan Rhys is a senior laboratory research scientist in the Flow Cytometry Shared Technology Platform at The Francis Crick Institute. He spent the final year of his PhD program teaching basic R skills at the university. A data science and machine learning enthusiast, he has his own YouTube channel featuring screencast tutorials in R and RStudio.

Manning Early Access Program (MEAP) Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.
MEAP combo $49.99 pBook + eBook + liveBook
MEAP eBook $39.99 pdf + ePub + kindle + liveBook
