Four-Project Series

DS Pipeline with Python

prerequisites
basic Python (Jupyter Notebook, NumPy, Matplotlib, NLTK, and RegEx) • intermediate pandas
skills learned
feature extraction with NumPy • text vectorization with TF-IDF and SVD • feature engineering with pandas • interactive data visualization with Matplotlib • data augmentation for ML with object-oriented programming (OOP) • statistical modeling with SciPy
Ruihao Qiu
4 weeks · 3-6 hours per week average · INTERMEDIATE

The tasks you’ll tackle in this series of liveProjects are typical of those a data scientist or data engineer would encounter at an online recruiting tech company, in a large organization’s HR department, or in similar environments. You’ll develop a data pipeline for processing, extracting, and transforming various types of data for different types of users, including machine learning engineers, data analysts, and product developers. You’ll build data processing tools with NumPy, use pandas for feature extraction and engineering, use Matplotlib to explore, visualize, and analyze processed data, and build data augmentation tools to support ML modeling. By the end, you’ll have already finished 80% of the work of a typical data science project. You’ll have acquired skills, experience, and confidence that will take you closer to a career in data science.

These projects are designed for learning purposes and are not complete, production-ready applications or solutions.

The techniques in the building of the data processing tool were very useful. I can use that in my work projects.

Hung Le, data engineer, Eco-Energy

here's what's included

Project 1 Data Processing Tools with NumPy

As a data engineer in an online recruiting tech company or a large organization’s HR department, you’ll build a series of practical tools to process and extract useful information from unstructured text data using NumPy. You’ll learn important methods (including the trie data structure, TF-IDF, and SVD), how to implement them, and how they are applied in the real world. When you’re finished, you’ll have the know-how to build data processing tools that meet the needs of machine learning engineers, data analysts, and product developers.
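To give a feel for the text vectorization mentioned above, here is a minimal sketch of TF-IDF followed by a truncated SVD, written with NumPy only. The toy corpus, the variable names, and the smoothed IDF formula are assumptions made for illustration; the project itself may use different data and a different TF-IDF variant.

```python
import numpy as np

# Hypothetical toy corpus; the project's real data is not shown here.
docs = [
    "python data engineer",
    "senior python developer",
    "hr data analyst",
]

# Build a vocabulary and a document-term count matrix.
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
index = {tok: j for j, tok in enumerate(vocab)}
counts = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(tokenized):
    for tok in doc:
        counts[i, index[tok]] += 1

# Term frequency: counts normalized by document length.
tf = counts / counts.sum(axis=1, keepdims=True)

# Inverse document frequency (a smoothed variant, chosen as an assumption).
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + len(docs)) / (1 + doc_freq)) + 1
tfidf = tf * idf

# Truncated SVD: keep the top-k singular directions as dense features.
k = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
doc_vectors = U[:, :k] * s[:k]
print(doc_vectors.shape)  # (3, 2)
```

Each row of doc_vectors is a compact representation of one document, which is the kind of output a downstream ML engineer or analyst could consume.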

Project 2 Pandas for Feature Extraction

Master the basic methods for handling most real-world scenarios as you play the role of a data scientist in an online recruiting tech company or a large organization’s HR department. Using pandas, you’ll process, extract, and transform numerical, categorical, time series, and text data into structured features that are ready for data analysis and ML model training. When you’re done, you’ll have hands-on experience working with most data types you’ll find in the real world, as well as useful skills for extracting and engineering features.
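As a rough illustration of turning the four data types above into structured features, the following sketch uses a small made-up table. The column names and values are hypothetical and are not taken from the project's dataset; the specific transformations (median fill, one-hot encoding, calendar parts, keyword flags) are only examples of common choices.

```python
import pandas as pd

# Hypothetical job-posting records; column names are illustrative only.
df = pd.DataFrame({
    "salary": [50000.0, None, 72000.0],
    "contract": ["full-time", "part-time", "full-time"],
    "posted": ["2021-01-04", "2021-02-15", "2021-03-01"],
    "title": ["Data Engineer", "HR Analyst", "Senior Data Scientist"],
})

features = pd.DataFrame(index=df.index)

# Numerical: fill the missing salary with the median of the observed values.
features["salary"] = df["salary"].fillna(df["salary"].median())

# Categorical: one-hot encode the contract type.
features = features.join(pd.get_dummies(df["contract"], prefix="contract"))

# Time series: parse the date strings and extract calendar parts.
posted = pd.to_datetime(df["posted"])
features["posted_month"] = posted.dt.month
features["posted_dayofweek"] = posted.dt.dayofweek

# Text: a word count and a simple keyword flag.
features["title_words"] = df["title"].str.split().str.len()
features["title_has_data"] = df["title"].str.lower().str.contains("data", regex=False)

print(features)
```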

Project 3 Data Visualization for Exploratory Analysis

Visualize this: you’re a data analyst in an online recruiting tech company or a large organization’s HR department. You’ll use Matplotlib to explore, visualize, and analyze processed data to identify missing data and outliers. You’ll build interactive plots for superior data presentation, analyze the correlation of different features using visualization methods, and create analytics dashboards for two types of users. By the end, you’ll be a better data analyst and have the skills to build storytelling tools that let you answer important business questions.
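One simple way to spot missing data and outliers before plotting is sketched below. The salary values are invented, and the 1.5 × IQR rule is just one common convention for flagging outliers; it is an assumption here, not necessarily the method used in the project.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical salaries (in thousands) with one missing entry and one extreme value.
salaries = np.array([48.0, 52.0, 50.0, np.nan, 55.0, 45.0, 120.0])

# Separate missing values from observed ones.
missing = np.isnan(salaries)
clean = salaries[~missing]

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = np.percentile(clean, [25, 75])
iqr = q3 - q1
outliers = clean[(clean < q1 - 1.5 * iqr) | (clean > q3 + 1.5 * iqr)]

# A box plot shows the same outlier as a point beyond the whiskers.
fig, ax = plt.subplots()
ax.boxplot(clean)
ax.set_ylabel("salary (thousands)")
ax.set_title(f"{missing.sum()} missing, {outliers.size} outlier(s)")
```

In a Jupyter Notebook the figure renders below the cell; the counts in the title give a quick summary of data quality.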

Project 4 Data Augmentation for ML

As a machine learning engineer in an online recruiting tech company or the HR department of a large organization, your task is to address a lack of data, a common problem in data science projects. To solve this, you’ll create multiple tools that augment processed data to increase its volume, and along the way you’ll learn the essentials of probability distributions, random sampling, and OOP. Completing this project will enhance your data analysis and visualization skills, taking you further down the path to a career in data science.
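The idea of combining probability distributions, random sampling, and OOP can be sketched as a small class that fits a distribution to observed values and draws synthetic ones. GaussianAugmenter is a hypothetical name, and the normal distribution is an assumption that may not suit real data.

```python
import numpy as np


class GaussianAugmenter:
    """Fit a normal distribution to numeric values and sample new ones.

    Hypothetical helper for illustration; the normal distribution is an
    assumption, not a claim about the project's data.
    """

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.mean = None
        self.std = None

    def fit(self, values):
        # Estimate the distribution parameters from the observed data.
        values = np.asarray(values, dtype=float)
        self.mean = values.mean()
        self.std = values.std(ddof=1)
        return self

    def sample(self, n):
        # Draw n synthetic values from the fitted distribution.
        if self.mean is None:
            raise ValueError("call fit() before sample()")
        return self.rng.normal(self.mean, self.std, size=n)


observed = [48.0, 52.0, 50.0, 55.0, 45.0]
augmenter = GaussianAugmenter(seed=42).fit(observed)
synthetic = augmenter.sample(100)
augmented = np.concatenate([observed, synthetic])
print(augmented.shape)  # (105,)
```

Keeping the fitted parameters and the random generator inside one object makes it easy to swap in another distribution later without changing the calling code.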

book resources

When you start each of the projects in this series, you'll get full access to select Manning books for 90 days.

I believe people will want to purchase this project because it is an interesting topic at a good price.

Casey Childers, software engineering manager, Mindbody

project author

Ruihao Qiu

Ruihao Qiu is a senior data scientist at a German tech company and has more than five years of experience in data science and machine learning. While earning his PhD in statistical physics, he developed statistical models to simulate and search for new nanomaterials. In his early days as a data science consultant, he helped clients from DAX30 multinational companies solve real-world data challenges. As a senior data scientist, he has designed and built data pipelines and ML recommender systems for online recruitment applications. He enjoys taking on different career roles and sharing ideas about his data science work in tech blogs and public presentations.

Prerequisites

These liveProjects are for Python beginners who are passionate about data and who would like to advance their careers as data analysts, data engineers, or data scientists. To begin these liveProjects you’ll need to be familiar with the following:

TOOLS
  • Basic Python
  • Basic Jupyter Notebook
  • Basic NumPy
  • Intermediate pandas
  • Basic Matplotlib
  • Basic NLTK
  • Basic RegEx
TECHNIQUES
  • Basic matrix operations
  • Basics of trie data structure
  • Basics of TF-IDF, SVD
  • Basics of tokenization and text cleaning
  • Basics of plot types
  • Basic statistics

you will learn

In this liveProject series, you’ll learn to build tools for data processing, data augmentation, and feature extraction and engineering, and to create interactive data analytics dashboards for storytelling.

  • Use built-in Python modules: string, re (regular expressions)
  • Use NumPy for different matrix operations
  • Use SciPy to compute cosine similarity
  • Use stats modules for probability distribution fitting
  • Use pandas for dataframe operations
  • Use Matplotlib plot types
  • Use ipywidgets for interactive widgets
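Two of the items above, cosine similarity and probability distribution fitting, can be sketched in a few lines. This assumes the "stats modules" refer to scipy.stats; the vectors and the simulated sample are made up for illustration.

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import cosine

# Cosine similarity: SciPy's cosine() returns a distance, so subtract from 1.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
similarity = 1.0 - cosine(a, b)
print(round(similarity, 3))  # 0.5

# Distribution fitting: estimate normal parameters from a simulated sample.
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=1000)
mu, sigma = stats.norm.fit(sample)
print(mu, sigma)  # estimates close to the true values 10 and 2
```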

features

Self-paced
You choose the schedule and decide how much time to invest as you build your project.
Project roadmap
Each project is divided into several achievable steps.
Get Help
While within the liveProject platform, get help from other participants and our expert mentors.
Compare with others
For each step, compare your deliverable to the solutions by the author and other participants.
book resources
Get full access to select books for 90 days. Permanent access to excerpts from Manning products is also included, as well as references to other resources.