
Data Processing Tools with NumPy

This project is part of the liveProject series DS Pipeline with Python
prerequisites
basic Python (Jupyter Notebook, NumPy, RegEx) • basic matrix operations • basic knowledge of trie data structure • basic knowledge of TF-IDF and SVD
skills learned
extract features with NumPy • vectorize text with TF-IDF and SVD • compute similar items with embedded vectors
Ruihao Qiu
1 week · 4-6 hours per week · INTERMEDIATE

As a data engineer in an online recruiting tech company or a large organization’s HR department, you’ll build a series of practical tools that use NumPy to process unstructured text data and extract useful information from it. You’ll learn important methods (including the trie data structure, TF-IDF, and SVD), how to implement them, and how they are applied in the real world. When you’re finished, you’ll have the know-how to build data processing tools that meet the needs of machine learning engineers, data analysts, and product developers.
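As a rough illustration of the trie idea mentioned above, here is a minimal sketch of a token-level trie that finds known phrases in free text. It is not the project's implementation: the class name `Trie`, the methods `insert` and `find_all`, and the sample phrases are all hypothetical, and whitespace tokenization is assumed for simplicity.

```python
_END = object()  # sentinel key marking the end of a stored phrase (hypothetical helper)


class Trie:
    """Minimal token-level prefix tree; an illustrative sketch only."""

    def __init__(self):
        self.root = {}

    def insert(self, phrase):
        # Walk (and create) one nested dict per token of the phrase.
        node = self.root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[_END] = phrase

    def find_all(self, text):
        # Try to match a stored phrase starting at every token position.
        tokens = text.lower().split()
        matches = []
        for start in range(len(tokens)):
            node = self.root
            for token in tokens[start:]:
                if token not in node:
                    break
                node = node[token]
                if _END in node:
                    matches.append(node[_END])
        return matches


skills = Trie()
for phrase in ["machine learning", "machine learning engineer", "data engineer"]:
    skills.insert(phrase)

found = skills.find_all("Senior machine learning engineer with data skills")
print(found)  # ['machine learning', 'machine learning engineer']
```

Because phrases that share a prefix share a path in the tree, a lookup only walks as deep as the longest matching phrase at each position, however many phrases are stored.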

This project is designed for learning purposes and is not a complete, production-ready application or solution.

book resources

When you start your liveProject, you get full access to select Manning books for 90 days.

project author

Ruihao Qiu

Ruihao Qiu is a senior data scientist at a German tech company and has more than five years of experience in data science and machine learning. While earning his PhD in statistical physics, he developed statistical models to simulate and search for new nanomaterials. In his early days as a data science consultant, he helped clients from DAX30 multinational companies solve real-world data challenges. As a senior data scientist, he has designed and built data pipelines and ML recommender systems for online recruitment applications. He enjoys taking on different career roles and sharing ideas about his data science work in tech blogs and public presentations.


prerequisites

This liveProject is for Python beginners who are interested in building data processing tools using NumPy. To begin this liveProject, you’ll need to be familiar with the following:

  • Basic Python
  • Basic Jupyter Notebook/JupyterLab
  • Basic NumPy and pandas
  • Basic matrix operations
  • Basic knowledge of tree data structure
  • Basic concepts of TF-IDF and SVD (what the names stand for and what they are used for)
  • Basic understanding of tokenization and cleaning of text data

you will learn

In this liveProject, you’ll learn commonly used methods for machine learning data preprocessing, various use cases for NumPy and sklearn, and how to build custom data processing tools.

  • Use built-in Python modules
  • Use NumPy for various matrix operations
  • Use SciPy to calculate cosine similarity
  • Use sklearn
  • Build a trie data structure
  • Use TF-IDF
  • Compute a correlation matrix
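
To make the TF-IDF, SVD, and similarity items above more concrete, the following is a minimal sketch in plain NumPy, under stated assumptions. The sample documents, the variable names, and the choice of one common smoothed IDF formula are illustrative and not taken from the project. Cosine similarity is computed directly with NumPy here instead of SciPy, and TF-IDF is built by hand instead of with sklearn, so that each step stays visible.

```python
import numpy as np

# Hypothetical toy corpus of job-title-like strings.
docs = [
    "python data engineer",
    "senior python developer",
    "registered nurse night shift",
]

# Build a vocabulary and a document-term count matrix.
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
index = {t: j for j, t in enumerate(vocab)}
counts = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(tokenized):
    for t in doc:
        counts[i, index[t]] += 1

# TF-IDF with one common smoothed IDF variant (an assumption):
# idf = ln((1 + n_docs) / (1 + df)) + 1
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1
tfidf = counts * idf

# Truncated SVD: keep the k largest singular directions as embeddings.
k = 2
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
embeddings = U[:, :k] * S[:k]

# Cosine similarity between all pairs of embedded documents.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = unit @ unit.T

print(np.round(similarity, 2))
```

In this toy corpus the first two documents share the token "python" and the third shares nothing with them, so `similarity[0, 1]` comes out higher than `similarity[0, 2]`. On real data, `k` would typically be far smaller than the vocabulary size and would need tuning.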


You choose the schedule and decide how much time to invest as you build your project.
Project roadmap
Each project is divided into several achievable steps.
Get Help
While within the liveProject platform, get help from other participants and our expert mentors.
Compare with others
For each step, compare your deliverable to the solutions by the author and other participants.
book resources
Get full access to select books for 90 days. Permanent access to excerpts from Manning products is also included, as well as references to other resources.
