Real-World Machine Learning
Henrik Brink, Joseph W. Richards, and Mark Fetherolf
  • MEAP began December 2013
  • Publication in July 2016 (estimated)
  • ISBN 9781617291920
  • 400 pages (estimated)
  • printed in black & white

Real-World Machine Learning is a practical guide designed to teach working developers the art of ML project execution. Without overdosing you on academic theory and complex mathematics, it introduces the day-to-day practice of machine learning, preparing you to successfully build and deploy powerful ML systems. Using the Python language and the R statistical package, you'll start with core concepts like data acquisition and modeling, classification, and regression. You'll then move through the most important ML tasks, like model validation, optimization, and feature engineering. By following numerous real-world examples, you'll learn how to anticipate and overcome common pitfalls. Along the way, you'll discover scalable and online algorithms for large and streaming data sets. Advanced readers will appreciate the in-depth discussion of enhanced ML systems through advanced data exploration and pre-processing methods.

Table of Contents


1. What is machine learning?

1.1. How Machines Learn

1.2. Using Data to Make Decisions

1.2.1. Traditional Approaches

1.2.2. The Machine Learning Approach

1.2.3. Five Advantages to Machine Learning

1.2.4. Challenges

1.3. The ML workflow: from Data to Deployment

1.3.1. Learning a Model from Data

1.3.2. Evaluating Model Performance

1.3.3. Optimizing Model Performance

1.4. Boosting Model Performance with Advanced Techniques

1.4.1. Data Pre-processing and Feature Engineering

1.4.2. Improving models continuously with online methods

1.4.3. Scaling Models with Data Volume and Velocity

1.5. Summary

1.6. Terms from this chapter

2. Real World Data

2.1. Getting Started: Data Collection

2.1.1. Which features should be included?

2.1.2. Obtaining ground-truth for the target variable

2.1.3. How much training data is required?

2.1.4. Is the training set representative enough?

2.2. Pre-processing the data for modeling

2.2.1. Categorical features

2.2.2. Dealing with missing data

2.2.3. Simple feature engineering

2.2.4. Data normalization

2.3. Data visualization

2.3.1. Mosaic Plots

2.3.2. Boxplots

2.3.3. Density Plots

2.3.4. Scatterplots

2.4. Summary

2.5. Terms from this chapter

3. Modeling and Prediction

3.1. Basic machine learning modeling

3.1.1. Finding the relationship between input and target

3.1.2. The purpose of finding a good model

3.1.3. Types of modeling methods

3.1.4. Supervised versus unsupervised learning

3.2. Classification: predicting into buckets

3.2.1. Building a classifier and making predictions

3.2.2. Classification on complex, nonlinear data

3.2.3. Classification with multiple classes

3.3. Regression: prediction of numerical values

3.3.1. Building a regressor and making predictions

3.3.2. Regression on complex, nonlinear data

3.4. Summary

3.5. Terms from this chapter

4. Model Evaluation and Optimization

4.1. Model generalization: assessing predictive accuracy for new data

4.1.1. The Problem: Over-fitting and Model Optimism

4.1.2. The Solution: Cross-validation

4.1.3. Some things to look out for when using cross-validation

4.2. Evaluation of classification models

4.2.1. Class-wise accuracy and the confusion matrix

4.2.2. Accuracy trade-offs and ROC curves

4.3. Evaluation of regression models

4.3.1. Simple regression performance metrics

4.3.2. Examining residuals

4.4. Model Optimization through Parameter Tuning

4.4.1. ML Algorithms and their Tuning Parameters

4.5. Summary

5. Basic Feature Engineering

5.1. Motivation: Why is Feature Engineering Useful?

5.2. Basic feature engineering processes

5.2.1. Example: Event recommendation

5.2.2. Handling date and time features

5.2.3. Simple text features

5.3. Feature selection

5.3.1. Forwards selection and backwards elimination

5.3.2. Feature selection for data exploration

5.3.3. Real-world feature selection example

5.4. Summary


6. Example: NYC Taxi Data

6.1. Data: NYC taxi trip and fare information

6.1.1. Visualizing the data

6.1.2. Defining the problem and preparing the data

6.2. Modeling

6.2.1. Basic linear model

6.2.2. Nonlinear classifier

6.2.3. Including categorical features

6.2.4. Including date-time features

6.2.5. Model insights

6.3. Summary

7. Advanced feature engineering

7.1. Advanced text features

7.1.1. Bag of words

7.1.2. Topic modeling

7.1.3. Content expansion

7.2. Image features

7.2.1. Simple image features

7.2.2. Extracting objects and shapes

7.3. Time-series features

7.3.1. Classical Time-Series Features

7.3.2. Feature Engineering for Event Streams

7.4. Summary

8. Advanced NLP Example: Movie Review Sentiment

8.1. Exploring the data and use case

8.1.1. A first glance at the dataset

8.1.2. Inspecting the dataset

8.1.3. So, what’s the use case?

8.2. Extracting basic NLP features and building the initial model

8.2.1. Bag-of-words features

8.2.2. Building the model with the Naive Bayes algorithm

8.2.3. Normalizing bag-of-words features with the tf-idf algorithm

8.2.4. Optimizing model parameters

8.3. Advanced algorithms and model deployment considerations

8.3.1. word2vec features

8.3.2. Random forest model

8.4. Summary

9. Scaling Machine Learning Workflows

9.1. Before scaling up

9.1.1. Identifying important dimensions

9.1.2. Sub-sampling training data in lieu of scaling?

9.1.3. Scalable data management systems

9.2. Scaling ML modeling pipelines

9.2.1. Scaling learning algorithms

9.3. Scaling predictions

9.3.1. Scaling prediction volume

9.3.2. Scaling prediction velocity

9.4. Summary

10. Example: Digital Display Advertising

10.1. Display Advertising

10.1.1. Digital Advertising Data

10.1.2. Feature Engineering and Modeling Strategy

10.1.3. Size and Shape of the Data

10.2. Singular Value Decomposition

10.3. Resource Estimation and Optimization

10.4. Modeling

10.4.1. K-nearest neighbors

10.4.2. Random forests

10.5. Other Real World Considerations

10.6. Summary

10.7. Recap and Conclusion

10.8. Further Reading

10.9. Terms from this chapter

Appendix A: Popular Machine Learning Algorithms

About the Technology

In a world where big data is the norm and near-real-time decisions are crucial, machine learning is a critical component of the data workflow. Machine learning systems can quickly crunch massive amounts of information to offer insight and make decisions in a way that matches or even surpasses human cognitive abilities. These systems use sophisticated computational and statistical tools to build models that can recognize and visualize patterns, predict outcomes, forecast values, and make recommendations. Gartner predicts that big data analytics will be a $25 billion market by 2017, and financial firms, marketing organizations, scientific facilities, and Silicon Valley startups are all demanding machine learning skills from their developers.

What's inside

  • Learn to build and maintain your own ML system
  • Explore real-world machine-learning problems
  • Detailed treatment of many real-world use cases
  • Understand the ML workflow, practical considerations and common pitfalls
  • Python and R code snippets to get you started
  • Advanced material: feature engineering, computational scalability, and real-time streaming ML
  • Beautiful visuals throughout

About the reader

Code examples are in Python and R. No prior machine learning experience required.

About the authors

Henrik Brink is a data scientist and software developer with extensive ML experience in industry and academia. Joseph Richards is a senior data scientist with expertise in applied statistics and predictive analytics. Henrik and Joseph are co-founders of a leading developer of machine learning solutions for industry. Mark Fetherolf is founder and president of Numinary Data Science, a data management and predictive analytics company. He has worked as a statistician and analytics database developer in social science research, chemical engineering, information systems performance, capacity planning, cable television, and online advertising applications.

Manning Early Access Program (MEAP): Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.