In this liveProject, you’ll take on the role of a data scientist working for an online movie streaming service. Your bosses want a machine learning model that can analyze written customer reviews of your movies, but you discover that the data is biased towards negative reviews. Training a model on this imbalanced data would hurt its accuracy, and so your challenge is to create a balanced dataset for your model to learn from. You’ll start by simulating your company’s data by deliberately introducing imbalance to an IMDB (Internet Movie Database) review dataset. You’ll experiment with two different methods for balancing this dataset: using sampling techniques, and generating a new synthetic corpus with deep learning text generation. You’ll build and train a simple machine learning model on each dataset to compare the effectiveness of each approach.
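The first two steps described above (simulating the imbalance, then rebalancing by sampling) can be sketched in a few lines. This is a minimal illustration using only the Python standard library and a toy stand-in for the IMDB reviews; the function names, the 10% keep rate, and the label convention (1 = positive, 0 = negative) are assumptions made for the example, not details taken from the project.

```python
import random
from collections import Counter


def make_imbalanced(reviews, keep_positive=0.1, seed=0):
    """Drop most positive reviews to simulate a negatively skewed dataset.

    keep_positive is a hypothetical keep rate chosen for illustration.
    """
    rng = random.Random(seed)
    return [(text, label) for text, label in reviews
            if label == 0 or rng.random() < keep_positive]


def oversample_minority(reviews, seed=0):
    """Randomly duplicate minority-class rows until both classes match in size."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in reviews)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    pool = [row for row in reviews if row[1] == minority]
    # Sample with replacement from the minority class to close the gap.
    extra = [rng.choice(pool) for _ in range(counts[majority] - counts[minority])]
    balanced = reviews + extra
    rng.shuffle(balanced)
    return balanced


# Toy stand-in for the review data: 500 positive and 500 negative rows.
reviews = [(f"review {i}", i % 2) for i in range(1000)]
imbalanced = make_imbalanced(reviews)
balanced = oversample_minority(imbalanced)

print("imbalanced:", Counter(label for _, label in imbalanced))
print("balanced:  ", Counter(label for _, label in balanced))
```

Random oversampling only repeats existing reviews, so it adds no new wording for the model to learn from. In practice it is usually applied to the training split only, so that duplicated rows do not leak into the holdout data.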
KC Tung is an AI architect, machine learning engineer, and data scientist who specializes in delivering AI, deep learning, and NLP models across enterprise architectures. As an AI architect at Microsoft, he helps enterprise customers with use-case driven architecture, AI/ML model development/deployment in the cloud, and technology selection and integration best suited for their requirements. He is a Microsoft certified AI engineer and data engineer. He has a PhD in molecular biophysics from the University of Texas Southwestern Medical, and has spoken at the 2018 O'Reilly AI Conference in San Francisco and the 2019 O'Reilly TensorFlow World Conference in San Jose.
This liveProject is for intermediate-to-experienced Python programmers. To begin this liveProject, you will need to have hands-on experience with or be familiar with:
Basics of scikit-learn
Intermediate TensorFlow 2.0 and/or Keras
Fundamental statistics for classification
Basics of gradient descent and SGD
Basics of loss functions
Basics of back-propagation
Basics of overfitting and underfitting
Basics of kNN
Basics of Gradient Boosted Decision Trees/GBM
Basics of classification techniques such as Logistic Regression or SVM
Intermediate knowledge of neural networks such as RNNs, CNNs, and GRUs
Basics of comparing classifiers
Basics of clustering such as Affinity Propagation and Hierarchical Clustering
Intermediate knowledge of Natural Language Processing concepts, including embedding, tokenization at the word or character level, basic one-hot encoding, and basic handling of out-of-vocabulary tokens
Intermediate knowledge of activation functions for ANNs, such as softmax, sigmoid, and ReLU
Intermediate knowledge of dropout, max pooling, and regularization
Intermediate knowledge of multi-layer perceptrons
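As a quick self-check on the NLP prerequisites listed above, the following sketch shows word-level tokenization, a reserved out-of-vocabulary token, and one-hot encoding. It is a plain-Python illustration rather than the Keras tooling the project itself uses; the `<oov>` and `<pad>` markers, the vocabulary size, and all function names are assumptions made for the example.

```python
from collections import Counter

OOV, PAD = "<oov>", "<pad>"  # hypothetical reserved tokens


def build_vocab(texts, max_words=5):
    """Map the most frequent words to integer ids, reserving 0 and 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {PAD: 0, OOV: 1}
    for word, _ in counts.most_common(max_words):
        vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Word-level tokenization; unseen words fall back to the OOV id."""
    return [vocab.get(word, vocab[OOV]) for word in text.lower().split()]


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec


train = ["a great movie", "a dull movie", "great acting"]
vocab = build_vocab(train)

# "soundtrack" never appears in the training texts, so it maps to the OOV id.
ids = encode("a great soundtrack", vocab)
print(ids)
print(one_hot(ids[-1], len(vocab)))
```

If each step here reads as familiar, the tokenization and encoding parts of the project should pose no difficulty.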
You will learn
In this liveProject, you’ll develop natural language processing skills for machine learning models that can determine the sentiment and meaning of raw text. You’ll also learn useful and easily transferable ML techniques to help classify NLP patterns at scale.
Commonly used text processing/cleansing techniques
Recommended statistics for model performance and misclassification cost
Balancing data through sampling
Generating a new corpus with deep learning
Training and testing a deep learning model to classify text
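To illustrate why model statistics and misclassification cost appear in the list above, here is a small sketch of how accuracy can mislead on imbalanced data. The listing does not say which statistics the project recommends; precision, recall, F1, and a weighted error cost are common choices and are used here as assumptions. The cost weights are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Return precision, recall, and F1, using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def misclassification_cost(fp, fn, cost_fp=1.0, cost_fn=5.0):
    """Weighted error count; the weights are illustrative assumptions."""
    return fp * cost_fp + fn * cost_fn


# 90 negative and 10 positive reviews, scored by a model that always says "negative".
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
cost = misclassification_cost(fp, fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} cost={cost:.1f}")
```

The always-negative model looks strong on accuracy alone, yet it finds none of the positive reviews, which is why per-class statistics and an explicit cost are more informative when classes are skewed.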
You choose the schedule and decide how much time to invest as you build your project.
Each project is divided into several achievable steps.
Chat with other participants within the liveProject platform.
Compare with others
For each step, compare your deliverable to the solutions by the author and other participants.
Book and video resources
Excerpts from Manning books and videos are included, as well as references to other resources.
1. Data Extraction and Exploration
1.1. Data Extraction and Exploration
1.2. Recurrent Neural Net with Keras
1.3. Text Embeddings
1.4. A Gentle Introduction to Classification
1.5. Implementing a Recurrent Neural Network
1.6. Submit Your Work
2.1. Does Oversampling Text Work?
2.2. Oversample Positive Reviews
2.3. Build Classification Model
2.5. Submit Your Work
3. Generate New Corpus
3.1. Generate New Corpus
3.2. Build a Deep Learning Model for Text Generation
3.3. Train the Model
3.4. Generate the Corpus
3.5. Two-way Street
3.6. LSTM Model Structure
3.7. Regular Expressions
3.8. Submit Your Work
4. Training with Generated Corpus
4.1. Generated Corpus as Training Data
4.2. Merge the Generated Corpus
4.3. Build the Machine Learning Model
4.4. Evaluate Model Performance on Holdout Data
4.5. Submit Your Work