Real-World Machine Learning
Henrik Brink, Joseph W. Richards and Mark Fetherolf
  • MEAP began December 2013
  • Publication in Early 2016 (estimated)
  • ISBN 9781617291920
  • 400 pages (estimated)
  • printed in black & white

Real-World Machine Learning is a practical guide designed to teach working developers the art of ML project execution. Without overdosing you on academic theory and complex mathematics, it introduces the day-to-day practice of machine learning, preparing you to successfully build and deploy powerful ML systems. Using the Python language and the R statistical package, you'll start with core concepts like data acquisition and modeling, classification, and regression. You'll then move through the most important ML tasks, like model validation, optimization, and feature engineering. By following numerous real-world examples, you'll learn how to anticipate and overcome common pitfalls. Along the way, you'll discover scalable and online algorithms for large and streaming data sets. Advanced readers will appreciate the in-depth discussion of enhancing ML systems through advanced data exploration and pre-processing methods.
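To give a flavor of the workflow the book describes (build a model, then validate it), here is a minimal Python sketch, not taken from the book itself, that fits a classifier and estimates its accuracy with cross-validation; it assumes scikit-learn and its bundled Iris dataset purely for illustration:

    # Minimal illustrative sketch (assumes scikit-learn and the Iris dataset,
    # neither of which is taken from the book's own examples).
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Data acquisition: feature matrix X and target vector y
    X, y = load_iris(return_X_y=True)

    # Modeling: a nonlinear classifier
    model = RandomForestClassifier(n_estimators=100, random_state=0)

    # Model validation: 5-fold cross-validation estimates accuracy on unseen data
    scores = cross_val_score(model, X, y, cv=5)
    print("Cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

The chapters on modeling (chapter 3) and on evaluation and optimization (chapter 4) develop this loop in full, with your own data, models, and metrics swapped in.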

Table of Contents

1. What is machine learning?

1.1. How Machines Learn

1.2. Using Data to Make Decisions

1.2.1. Traditional Approaches

1.2.2. The Machine Learning Approach

1.2.3. Five Advantages to Machine Learning

1.2.4. Challenges

1.3. The ML Workflow: From Data to Deployment

1.3.1. Learning a Model from Data

1.3.2. Evaluating Model Performance

1.3.3. Optimizing Model Performance

1.4. Boosting Model Performance with Advanced Techniques

1.4.1. Data Pre-processing and Feature Engineering

1.4.2. Improving models continuously with online methods

1.4.3. Scaling Models with Data Volume and Velocity

1.5. Summary

1.6. Terms from this chapter

2. Real-World Data

2.1. Getting Started: Data Collection

2.1.1. Which features should be included?

2.1.2. How much training data is required?

2.1.3. Is the training set representative enough?

2.2. Pre-processing the data for modeling

2.2.1. Categorical features

2.2.2. Dealing with missing data

2.2.3. Simple feature engineering

2.2.4. Data normalization

2.3. Data visualization

2.3.1. Mosaic Plots

2.3.2. Boxplots

2.3.3. Density Plots

2.3.4. Scatterplots

2.4. Summary

3. Modeling and Prediction

3.1. Basic machine learning modeling

3.1.1. Finding the relationship between input and target

3.1.2. The purpose of finding a good model

3.1.3. Types of modeling methods

3.1.4. Supervised versus unsupervised learning

3.2. Classification: predicting into buckets

3.2.1. Building a classifier and making predictions

3.2.2. Classification on complex, nonlinear data

3.2.3. Classification with multiple classes

3.3. Regression: prediction of numerical values

3.3.1. Building a regressor and making predictions

3.3.2. Regression on complex, nonlinear data

3.4. Terms from this chapter

3.5. Summary

4. Model Evaluation and Optimization

4.1. Model generalization: assessing predictive accuracy for new data

4.1.1. The Problem: Over-fitting and Model Optimism

4.1.2. The Solution: Cross-validation

4.1.3. Some things to look out for when using cross-validation

4.2. Evaluation of classification models

4.2.1. Class-wise accuracy and the confusion matrix

4.2.2. Accuracy trade-offs and ROC curves

4.2.3. Multi-class classification

4.3. Evaluation of regression models

4.3.1. Simple regression performance metrics

4.3.2. Examining residuals

4.4. Model Optimization through Parameter Tuning

4.4.1. ML Algorithms and their Tuning Parameters

4.5. Summary

5. Basic Feature Engineering

5.1. Motivation: Why is Feature Engineering Useful?

5.1.1. What is feature engineering?

5.1.2. Five reasons to use feature engineering

5.2. Basic feature engineering processes

5.2.1. Example: event recommendation

5.2.2. Handling date and time features

5.2.3. Simple text features

5.3. Feature selection

5.3.1. Some algorithms have built-in feature selection

5.3.2. Forward selection and backward elimination

5.3.3. Feature selection for data exploration

5.3.4. Real-World feature selection example

5.4. Summary

6. Example: NYC Taxi Data

6.1. Data: NYC taxi trip and fare information

6.1.1. Visualizing the data

6.1.2. Defining the problem and preparing the data

6.2. Modeling

6.2.1. Basic linear model

6.2.2. Non-linear classifier

6.2.3. Including categorical features

6.2.4. Including date-time features

6.2.5. Model insights

6.3. Summary

7. Advanced feature engineering

7.1. Advanced text features

7.1.1. Bag of words

7.1.2. Topic modeling

7.1.3. Content expansion

7.2. Image features

7.2.1. Simple image features

7.2.2. Extracting objects and shapes

7.3. Time-series features

7.3.1. Classical Time-Series Features

7.3.2. Feature Engineering for Event Streams

7.4. Summary

8. Scaling with Size and Speed

8.1. Linear scalability

8.1.1. MapReduce

8.1.2. Scalable linear methods

8.2. Parallelization of nonlinear algorithms

8.2.1. Complexity

8.2.2. Memory-bound algorithms

8.2.3. Scalable nonlinear methods

8.3. Improving training speed

8.4. Increasing prediction bandwidth

9. Scaling Machine Learning Workflows

9.1. Before scaling up

9.1.1. Identifying important dimensions

9.1.2. Sub-sampling training data in lieu of scaling?

9.1.3. Scalable data management systems

9.2. Scaling ML modeling pipelines

9.2.1. Scaling learning algorithms

9.3. Scaling predictions

9.3.1. Scaling prediction volume

9.3.2. Scaling prediction velocity

9.4. Summary

10. The Future of Machine Learning

10.1. The Internet of Things

10.1.1. Smart meters

10.1.2. Personal monitoring

10.2. Web-scale machine learning

10.2.1. Deep neural nets

10.3. Synthesizing the brain

Appendix A: Popular Machine Learning Algorithms

About the Technology

In a world where big data is the norm and near-real-time decisions are crucial, machine learning is a critical component of the data workflow. Machine learning systems can quickly crunch massive amounts of information to offer insight and make decisions in a way that matches or even surpasses human cognitive abilities. These systems use sophisticated computational and statistical tools to build models that can recognize and visualize patterns, predict outcomes, forecast values, and make recommendations. Gartner predicts that big data analytics will be a $25 billion market by 2017, and financial firms, marketing organizations, scientific facilities, and Silicon Valley startups are all demanding machine learning skills from their developers.

What's inside

  • Learn to build and maintain your own ML system
  • Explore real-world machine-learning problems
  • Detailed treatment of many real-world use cases
  • Understand the ML workflow, practical considerations and common pitfalls
  • Python and R code snippets to get you started
  • Advanced material: feature engineering, computational scalability, and real-time streaming ML
  • Beautiful visuals throughout

About the reader

Code examples are in Python and R. No prior machine learning experience required.

About the authors

Henrik Brink is a data scientist and software developer with extensive ML experience in industry and academia. Joseph Richards is a senior data scientist with expertise in applied statistics and predictive analytics. Henrik and Joseph are co-founders of wise.io, a leading developer of machine learning solutions for industry. Mark Fetherolf is founder and President of Numinary Data Science, a data management and predictive analytics company. He has worked as a statistician and analytics database developer in social science research, chemical engineering, information systems performance, capacity planning, cable television, and online advertising applications.

