Data Science Bookcamp
Ten Python projects
Leonard Apeltsin
  • MEAP began September 2019
  • Publication in Spring 2020 (estimated)
  • ISBN 9781617296253
  • 900 pages (estimated)
  • printed in black & white

The book lives up to its promise of using code examples instead of going with heavy mathematics.

Bill Mitchell
Learn data science with Python by building 10 real-world projects! In Data Science Bookcamp you’ll test and build your knowledge of Python and learn to handle the kind of open-ended problems that professional data scientists work on daily. Downloadable data sets and thoroughly explained solutions help you lock in what you’ve learned, building your confidence and preparing you for an exciting new data science career.
Table of Contents

Case study 1: Finding the Winning Strategy in a Card Game

1 Computing Probabilities Using Python

1.1 Sample Space Analysis: An Equation-Free Approach for Measuring Uncertainty in Outcomes

1.1.1 Analyzing a Biased Coin

1.2 Computing Non-Trivial Probabilities

1.2.1 Problem 1: Analyzing a Family with 4 Children

1.2.2 Problem 2: Analyzing Multiple Dice Rolls

1.2.3 Problem 3: Computing Dice-Roll Probabilities using Weighted Sample Spaces

1.3 Computing Probabilities Over Interval Ranges

1.3.1 Evaluating Extremes Using Interval Analysis

1.4 Summary

2 Plotting Probabilities Using Matplotlib

2.1 Basic Matplotlib Plots

2.2 Plotting Coin-Flip Probabilities

2.2.1 Comparing Multiple Coin-Flip Probability Distributions

2.3 Summary

3 Running Random Simulations in NumPy

3.1 Simulating Random Coin-Flips and Dice-Rolls Using NumPy

3.1.1 Analyzing Biased Coin-Flips

3.2 Computing Confidence Intervals Using Histograms and NumPy Arrays

3.2.1 Binning Similar Points in Histogram Plots

3.2.2 Deriving Probabilities from Histograms

3.2.3 Shrinking the Range of a High Confidence Interval

3.2.4 Computing Histograms in NumPy

3.3 Leveraging Confidence Intervals to Analyze a Biased Deck of Cards

3.4 Using Permutations to Shuffle Cards

3.5 Summary

4 Case Study 1 Solution

4.1 Overview

4.2 Predicting Red Cards within a Shuffled Deck

4.2.1 Estimating the Probability of Strategy Success

4.3 Optimizing Strategies using the Sample Space for a 10-Card Deck

4.4 Key Takeaways

Case study 2: Assessing Online Ad-Clicks for Significance

5 Basic Probability and Statistical Analysis Using SciPy

5.1 Exploring the Relationships between Data and Probability Using SciPy

5.2 Mean as a Measure of Centrality

5.2.2 Finding the Mean of a Probability Distribution

5.3 Variance as a Measure of Dispersion

5.3.1 Finding the Variance of a Probability Distribution

5.4 Summary

6 Making Predictions Using the Central Limit Theorem and SciPy

6.1 Manipulating the Normal Distribution Using SciPy

6.1.1 Comparing Two Sampled Normal Curves

6.2 Determining Mean and Variance of a Population through Random Sampling

6.3 Making Predictions Using Mean and Variance

6.3.1 Computing the Area Beneath a Normal Curve

6.3.2 Interpreting the Computed Probability

6.4 Summary

7 Statistical Hypothesis Testing

7.1 Assessing the Divergence Between Sample Mean and Population Mean

7.2 Data Dredging: Coming to False Conclusions through Oversampling

7.3 Bootstrapping with Replacement: Testing a Hypothesis When the Population Variance is Unknown

7.4 Permutation Testing: Comparing Means of Samples when the Population Parameters are Unknown

7.5 Summary

8 Analyzing Tables Using Pandas

8.1 Storing Tables Using Basic Python

8.2 Exploring Tables Using Pandas

8.3 Retrieving Table Columns

8.4 Retrieving Table Rows

8.5 Modifying Table Rows and Columns

8.6 Saving and Loading Table Data

8.7 Visualizing Tables Using Seaborn

8.8 Summary

9 Case Study 2 Solution

9.1 Overview

9.2 Processing the Ad-Click Table in Pandas

9.3 Computing P-values from Differences in Means

9.4 Determining Statistical Significance

9.5 41 Shades of Blue: A Real-Life Cautionary Tale

9.6 Key Takeaways

Case Study 3: Tracking Disease Outbreaks Using News Headlines

10 Clustering Data into Groups

10.1 Using Centrality to Discover Clusters

10.2 K-Means: A Clustering Algorithm for Grouping Data into K Central Groups

10.2.1 K-means Clustering Using Scikit-learn

10.2.2 Selecting the Optimal K Using the Elbow Method

10.3 Using Density to Discover Clusters

10.4 DBSCAN: A Clustering Algorithm for Grouping Data Based on Spatial Density

10.4.1 Comparing DBSCAN and K-means

10.4.2 Clustering Based on Non-Euclidean Distance

10.4.3 Limitations of the DBSCAN Algorithm

10.5 Analyzing Clusters Using Pandas

10.6 Summary

11 Geographic Location Visualization and Analysis

11.1 The Great-Circle Distance: A Metric for Computing Distances Between 2 Global Points

11.2 Plotting Maps Using Basemap

11.3 Location Tracking Using GeoNamesCache

11.3.1 Accessing Country Information

11.3.2 Accessing City Information

11.3.3 Limitations of the GeoNamesCache Library

11.4 Matching Location Names in Text

11.5 Summary

12 Case Study 3 Solution

12.1 Overview

12.2 Extracting Locations from Headline Data

12.3 Visualizing and Clustering the Extracted Location Data

12.4 Extracting Insights from Location Clusters

12.5 Key Takeaways

Case Study 4: Predicting Scientific Trends from Paper Abstracts

13 Measuring text similarity

14 Dimensional reduction of text data

15 Linear and polynomial regression

16 Downloading and parsing Wikipedia pages

17 Case Study 4 solution

Case Study 5: Social Circle Detection in Facebook Data

18 Analyzing and visualizing networks

19 Logistic regression: Training a simple linear classifier

20 Case Study 5 solution

Case Study 6: Predicting Anomalies in Product Purchase Time Series Data

21 Comparing machine learning Models using Scikit-Learn

22 Time series analysis

23 Case Study 6 solution

Case Study 7: Optimizing Ad Purchases for a Planned Marketing Campaign

24 Bayesian statistics

25 Linear programming

26 Case Study 7 solution

Case Study 8: Discovering Conflicting Viewpoints in Product Tweets Using Sentiment Analysis

27 Naïve Bayes classification of text

28 Training classifiers from unbalanced training sets

29 Case Study 8 solution

Case Study 9: Creating a Balanced Sales Territory for a Sales Team

30 Polygon boundary analysis and visualization

31 Combinatorial optimization techniques

32 Case Study 9 solution

Case Study 10: Predicting Stock Price Movements from Quarterly Earnings Data

33 Analyzing temporal stock trends

34 Case Study 10 solution

About the Technology

In real-world practice, data scientists create innovative solutions to novel, open-ended problems. Easy to learn and use, Python has become the de facto language for data science among researchers, developers, and business users. But knowing a few basic algorithms is not enough to tackle a vague and thorny problem. It takes relentless practice at cracking difficult data tasks to achieve mastery in the field. That’s just what this book delivers.

About the book

Data Science Bookcamp is a comprehensive set of challenging projects carefully designed to grow your data science skills from novice to master. Veteran data scientist Leonard Apeltsin sets 10 increasingly difficult exercises that test your abilities against the kind of problems you’d encounter in the real world. As you solve each challenge, you’ll acquire and expand the data science and Python skills you’ll use as a professional data scientist. Ranging from text processing to machine learning, each project comes complete with a unique downloadable data set and a fully explained, step-by-step solution. Because these projects come from Dr. Apeltsin’s vast experience, each solution highlights the most likely failure points along with practical advice for getting past unexpected pitfalls. When you wrap up these 10 exercises, you’ll have a diverse, relevant skill set that’s transferable to working in industry.

What's inside

  • 10 in-depth Python exercises with full downloadable data sets
  • Web scraping for text and images
  • Organize data sets with clustering algorithms
  • Visualize complex multi-variable data sets
  • Train a decision tree machine learning algorithm

About the reader

For readers who know the basics of Python. No prior data science or machine learning skills required.

About the author

Leonard Apeltsin is a senior data scientist and engineering lead at Primer AI, a startup that specializes in using advanced Natural Language Processing techniques to extract insight from terabytes of unstructured text data. His PhD research focused on bioinformatics that required analyzing millions of sequenced DNA patterns to uncover genetic links in deadly diseases.

Manning Early Access Program (MEAP)

Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it’s in bookstores.
MEAP combo $25.00 $59.99 pBook + eBook + liveBook
MEAP eBook $20.00 $47.99 pdf + ePub + kindle + liveBook