Data Science Bookcamp
Five Python projects
Leonard Apeltsin
  • MEAP began September 2019
  • Publication in Spring 2021 (estimated)
  • ISBN 9781617296253
  • 600 pages (estimated)
  • printed in black & white

The book lives up to its promise of using code examples instead of going with heavy mathematics.

Bill Mitchell
Learn data science with Python by building five real-world projects! In Data Science Bookcamp you’ll test and build your knowledge of Python and learn to handle the kind of open-ended problems that professional data scientists work on daily. Downloadable data sets and thoroughly explained solutions help you lock in what you’ve learned, building your confidence and making you ready for an exciting new data science career.

About the Technology

In real-world practice, data scientists create innovative solutions to novel, open-ended problems. Easy to learn and use, Python has become the de facto language for data science among researchers, developers, and business users. But knowing a few basic algorithms is not enough to tackle a vague and thorny problem. It takes relentless practice at cracking difficult data tasks to achieve mastery in the field. That’s just what this book delivers.

About the book

Data Science Bookcamp is a comprehensive set of challenging projects carefully designed to grow your data science skills from novice to master. Veteran data scientist Leonard Apeltsin sets five increasingly difficult exercises that test your abilities against the kind of problems you’d encounter in the real world. As you solve each challenge, you’ll acquire and expand the data science and Python skills you’ll use as a professional data scientist. Ranging from text processing to machine learning, each project comes complete with a unique downloadable data set and a fully explained step-by-step solution. Because these projects come from Dr. Apeltsin’s vast experience, each solution highlights the most likely failure points along with practical advice for getting past unexpected pitfalls. When you wrap up these five exercises, you’ll have a diverse, relevant skill set that’s transferable to working in industry.
Table of Contents

Case study 1: Finding the Winning Strategy in a Card Game

1 Computing Probabilities Using Python

1.1 Sample Space Analysis: An Equation-Free Approach for Measuring Uncertainty in Outcomes

1.1.1 Analyzing a Biased Coin

1.2 Computing Non-Trivial Probabilities

1.2.1 Problem 1: Analyzing a Family with 4 Children

1.2.2 Problem 2: Analyzing Multiple Dice Rolls

1.2.3 Problem 3: Computing Dice-Roll Probabilities using Weighted Sample Spaces

1.3 Computing Probabilities Over Interval Ranges

1.3.1 Evaluating Extremes Using Interval Analysis

1.4 Summary

2 Plotting Probabilities Using Matplotlib

2.1 Basic Matplotlib Plots

2.2 Plotting Coin-Flip Probabilities

2.2.1 Comparing Multiple Coin-Flip Probability Distributions

2.3 Summary

3 Running Random Simulations in NumPy

3.1 Simulating Random Coin-Flips and Dice-Rolls Using NumPy

3.1.1 Analyzing Biased Coin-Flips

3.2 Computing Confidence Intervals Using Histograms and NumPy Arrays

3.2.1 Binning Similar Points in Histogram Plots

3.2.2 Deriving Probabilities from Histograms

3.2.3 Shrinking the Range of a High Confidence Interval

3.2.4 Computing Histograms in NumPy

3.3 Leveraging Confidence Intervals to Analyze a Biased Deck of Cards

3.4 Using Permutations to Shuffle Cards

3.5 Summary

4 Case Study 1 Solution

4.1 Overview

4.2 Predicting Red Cards within a Shuffled Deck

4.2.1 Estimating the Probability of Strategy Success

4.3 Optimizing Strategies using the Sample Space for a 10-Card Deck

4.4 Key Takeaways

Case study 2: Assessing Online Ad-Clicks for Significance

5 Basic Probability and Statistical Analysis Using SciPy

5.1 Exploring the Relationships between Data and Probability Using SciPy

5.2 Mean as a Measure of Centrality

5.2.1 Finding the Mean of a Probability Distribution

5.3 Variance as a Measure of Dispersion

5.3.1 Finding the Variance of a Probability Distribution

5.4 Summary

6 Making Predictions Using the Central Limit Theorem and SciPy

6.1 Manipulating the Normal Distribution Using SciPy

6.1.1 Comparing Two Sampled Normal Curves

6.2 Determining Mean and Variance of a Population through Random Sampling

6.3 Making Predictions Using Mean and Variance

6.3.1 Computing the Area Beneath a Normal Curve

6.3.2 Interpreting the Computed Probability

6.4 Summary

7 Statistical Hypothesis Testing

7.1 Assessing the Divergence Between Sample Mean and Population Mean

7.2 Data Dredging: Coming to False Conclusions through Oversampling

7.3 Bootstrapping with Replacement: Testing a Hypothesis When the Population Variance is Unknown

7.4 Permutation Testing: Comparing Means of Samples when the Population Parameters are Unknown

7.5 Summary

8 Analyzing Tables Using Pandas

8.1 Storing Tables Using Basic Python

8.2 Exploring Tables Using Pandas

8.3 Retrieving Table Columns

8.4 Retrieving Table Rows

8.5 Modifying Table Rows and Columns

8.6 Saving and Loading Table Data

8.7 Visualizing Tables Using Seaborn

8.8 Summary

9 Case Study 2 Solution

9.1 Overview

9.2 Processing the Ad-Click Table in Pandas

9.3 Computing P-values from Differences in Means

9.4 Determining Statistical Significance

9.5 41 Shades of Blue: A Real-Life Cautionary Tale

9.6 Key Takeaways

Case Study 3: Tracking Disease Outbreaks Using News Headlines

10 Clustering Data into Groups

10.1 Using Centrality to Discover Clusters

10.2 K-Means: A Clustering Algorithm for Grouping Data into K Central Groups

10.2.1 K-means Clustering Using Scikit-learn

10.2.2 Selecting the Optimal K Using the Elbow Method

10.3 Using Density to Discover Clusters

10.4 DBSCAN: A Clustering Algorithm for Grouping Data Based on Spatial Density

10.4.1 Comparing DBSCAN and K-means

10.4.2 Clustering Based on Non-Euclidean Distance

10.4.3 Limitations of the DBSCAN Algorithm

10.5 Analyzing Clusters Using Pandas

10.6 Summary

11 Geographic Location Visualization and Analysis

11.1 The Great-Circle Distance: A Metric for Computing Distances Between 2 Global Points

11.2 Plotting Maps Using Basemap

11.3 Location Tracking Using GeoNamesCache

11.3.1 Accessing Country Information

11.3.2 Accessing City Information

11.3.3 Limitations of the GeoNamesCache Library

11.4 Matching Location Names in Text

11.5 Summary

12 Case Study 3 Solution

12.1 Overview

12.2 Extracting Locations from Headline Data

12.3 Visualizing and Clustering the Extracted Location Data

12.4 Extracting Insights from Location Clusters

12.5 Key Takeaways

Case Study 4: Using Online Job Postings to Improve Your Data Science Resume

13 Measuring Text Similarities

13.1 Simple Text Comparison

13.1.1 Introduction to the Jaccard Similarity

13.1.2 Replacing Words with Numeric Values

13.2 Vectorizing Texts Using Word Counts

13.2.1 Using Normalization to Improve TF Vector Similarity

13.3 Matrix Multiplication for Efficient Similarity Calculation

13.3.1 Basic Matrix Operations

13.3.2 Computing All-By-All Matrix Similarities

13.4 Computational Limits of Matrix Multiplication

13.5 Summary

14 Dimension Reduction of Matrix Data

14.1 Clustering 2D Data in 1-Dimension

14.1.1 Reducing Dimensions Using Rotation

14.2 Dimension Reduction Using PCA and Scikit-Learn

14.3 Clustering 4D Data in 2-Dimensions

14.3.1 Limitations of PCA

14.4 Computing Principal Components Without Rotation

14.4.1 Extracting Eigenvectors Using Power Iteration

14.5 Efficient Dimension Reduction Using SVD and Scikit-Learn

14.6 Summary

15 NLP Analysis of Large Text Datasets

15.1 Loading Online Forum Discussions Using Scikit-Learn

15.2 Vectorizing Documents Using Scikit-Learn

15.3 Ranking Words by Both Post-Frequency and Count

15.3.1 Computing TFIDF Vectors with Scikit-Learn

15.4 Computing Similarities Across Large Document Datasets

15.5 Clustering Texts by Topic

15.5.1 Exploring a Single Text Cluster

15.6 Visualizing Text Clusters

15.6.1 Using Subplots to Display Multiple Word Clouds

15.7 Summary

16 Extracting Text from Web Pages

16.1 The Structure of HTML Documents

16.2 Parsing HTML using Beautiful Soup

16.3 Downloading and Parsing Online Data

16.4 Summary

17 Case Study 4 Solution

17.1 Overview

17.2 Extracting Skill Requirements from Job Posting Data

17.2.1 Exploring the HTML for Skill Descriptions

17.3 Filtering Jobs by Relevance

17.4 Clustering Skills in Relevant Job Postings

17.4.1 Grouping the Job Skills into 15 Clusters

17.4.2 Investigating the Technical Skill Clusters

17.4.3 Investigating the Soft-Skill Clusters

17.4.4 Exploring Clusters at Alternative Values of K

17.4.5 Analyzing the 700 Most-Relevant Postings

17.5 Conclusion

17.6 Key Takeaways

Case Study 5: Social Circle Detection in Facebook Data

18 Graph Theory and Network Analysis

19 Network-driven Supervised Machine Learning

20 Training Linear Classifiers with Logistic Regression

21 Training Non-Linear Classifiers with Decision Tree Techniques

22 Case Study 5 Solution

What's inside

  • Five in-depth Python exercises with full downloadable data sets
  • Web scraping for text and images
  • Organize data sets with clustering algorithms
  • Visualize complex multi-variable datasets
  • Train a decision tree machine learning algorithm

About the reader

For readers who know the basics of Python. No prior data science or machine learning skills required.

About the author

Leonard Apeltsin is a senior data scientist and engineering lead at Primer AI, a startup that specializes in using advanced Natural Language Processing techniques to extract insight from terabytes of unstructured text data. His PhD research focused on bioinformatics that required analyzing millions of sequenced DNA patterns to uncover genetic links in deadly diseases.

Manning Early Access Program (MEAP) Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.
print book $35.99 $59.99 pBook + eBook + liveBook

eBook $38.39 $47.99 3 formats + liveBook