Building Domain-Specific Language Models

prerequisites
intermediate Python • intermediate NLTK • beginner PyTorch or TensorFlow • intermediate NLP • basics of deep learning
skills learned
data manipulation with NumPy and pandas • text preprocessing with NLTK • train an RNN with PyTorch • score and evaluate language models
Alexis Perrier and Rakshit Sakhuja
4 weeks · 8-10 hours per week · ADVANCED


In this liveProject, you’ll step into the role of a natural language processing data scientist working for Stack Exchange. Stack Exchange runs a network of question-and-answer sites on diverse topics ranging from programming to cooking. Your boss wants you to create language models that are tuned to the statistical, probabilistic, and technical jargon present in different Stack Exchange sites.


Language is domain-specific: an insurance company’s documents use very different terminology from a post on a social media site. Because of this, off-the-shelf NLP models trained on generic text can be inaccurate on specialized language such as healthcare, legal, clinical, and agricultural text. Your goal is to build a language model capable of query completion and longer text generation for Stack Exchange sites. At the end of this project, you will be able to build the foundations of any domain-specific NLP system by creating a robust and efficient language model using statistical and deep learning techniques.


Updated: March 2022

  • Fully updated to the latest version of AllenNLP
  • Improved GPU compatibility for training larger models
  • New help layers with detailed hints and guidance
  • New preprocessing steps for data preparation
  • Adjusted prerequisites and libraries
This project is designed for learning purposes and is not a complete, production-ready application or solution.

book resources

When you start your liveProject, you get full access to select books for 90 days.

project authors

Alexis Perrier
Alexis Perrier is a data science consultant specializing in predictive modeling and natural language processing. He holds a master’s in probability from Sorbonne Universités and a PhD in signal processing from Telecom Paris. He is the author of several books and online courses on data science.
Rakshit Sakhuja
Rakshit Sakhuja is a senior data scientist based in India, with most of his work in machine learning and NLP. He is also pursuing a master’s from the Indian Institute of Technology, Hyderabad. He has spent the last four years working in data-driven companies and is currently engaged in semantic search systems. His research focuses on few-shot object detection in the computer vision domain using transfer learning and meta-learning.

prerequisites

This course is for intermediate Python programmers who have experience with text-based deep learning. To begin this liveProject, you will need to be familiar with the following:


TOOLS
  • Intermediate Python
  • Basics of NumPy
  • Basics of pandas
  • Intermediate NLTK
  • Basics of creating neural networks with PyTorch or Keras
TECHNIQUES
  • Basics of deep learning
  • Basics of word embeddings
  • Intermediate seq2seq models, plus algebra and probability concepts such as matrix manipulation, the chain rule, and independence

you will learn

In this liveProject, you’ll learn to build a domain-focused language model using deep learning. You’ll develop skills in Python scripting, neural network creation, and language model training. At the end of this project, you will be able to build a foundation for any domain-specific NLP system by creating specialized, robust, and efficient language models.


  • Python scripting, including object-oriented programming
  • Data manipulation with NumPy and pandas
  • Text preprocessing such as pattern removal with regular expressions, text manipulation, and tokenization with NLTK
  • Creating a baseline language model using the NLTK lm API
  • Designing and training a language model using recurrent neural networks with Keras and AllenNLP
  • Scoring and evaluating language models for different tasks
  • Summarizing findings of a data science project effectively
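
To make the preprocessing, baseline model, and scoring steps above more concrete, here is a minimal sketch of a bigram language model. It is not the project's solution and does not use the NLTK lm API: it is a standard-library stand-in, and the names preprocess, BigramModel, and the three-sentence corpus are all hypothetical. It assumes add-one (Laplace) smoothing and uses perplexity as the score, where lower means the text looks more like the training domain.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer pattern: lowercase words and digits, optional apostrophe suffix.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def preprocess(text):
    """Remove URLs and HTML tags, lowercase, and split into word tokens."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)
    return TOKEN_RE.findall(text.lower())


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            self.unigrams.update(padded)
            for prev, word in zip(padded, padded[1:]):
                self.bigrams[prev][word] += 1
        # Every token type seen in training, including the padding symbols.
        self.vocab_size = len(self.unigrams)

    def prob(self, word, prev):
        """Smoothed estimate of P(word | prev)."""
        followers = self.bigrams.get(prev, {})
        total = sum(followers.values())
        return (followers.get(word, 0) + 1) / (total + self.vocab_size)

    def perplexity(self, tokens):
        """exp of the average negative log-probability per predicted token."""
        padded = ["<s>"] + tokens + ["</s>"]
        pairs = list(zip(padded, padded[1:]))
        log_prob = sum(math.log(self.prob(w, p)) for p, w in pairs)
        return math.exp(-log_prob / len(pairs))

    def complete(self, prefix_tokens, max_words=5):
        """Greedy query completion: repeatedly append the most frequent follower."""
        out = list(prefix_tokens)
        prev = out[-1] if out else "<s>"
        for _ in range(max_words):
            followers = self.bigrams.get(prev)
            if not followers:
                break
            word = followers.most_common(1)[0][0]
            if word == "</s>":
                break
            out.append(word)
            prev = word
        return out


# Tiny made-up corpus standing in for posts from a statistics Q&A site.
corpus = [
    "The p-value is the probability of the data under the null hypothesis.",
    "The null hypothesis is rejected when the p-value is small.",
    "A confidence interval is not the probability of the null hypothesis.",
]
model = BigramModel([preprocess(s) for s in corpus])

in_domain = model.perplexity(preprocess("The null hypothesis is rejected"))
out_of_domain = model.perplexity(preprocess("Add the butter to the pan"))
completion = model.complete(preprocess("the null"))
```

On this toy corpus the in-domain sentence should score a lower perplexity than the cooking sentence, which is the kind of comparison the scoring step relies on. Words never seen in training are only handled loosely here; a real baseline would reserve an unknown-word token and train on far more text.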

features

Self-paced
You choose the schedule and decide how much time to invest as you build your project.
Project roadmap
Each project is divided into several achievable steps.
Get help
While within the liveProject platform, get help from other participants and our expert mentors.
Compare with others
For each step, compare your deliverable to the solutions by the author and other participants.
book resources
Get full access to select books for 90 days. Permanent access to excerpts from Manning products is also included, as well as references to other resources.
