For years, the AI industry operated on a simple assumption: bigger models meant better results. But a quiet shift is underway. Across startups, enterprises, and research labs, teams are discovering that smaller, domain-specific models often outperform massive general-purpose systems while running faster and costing a fraction as much. This book bundle explains why the small language model (SLM) market is projected to grow from $6.5B in 2024 to $20.7B by 2030, and how practitioners are using domain-specific models to get better results at lower cost. While everyone else chases scale, leading teams are building right-sized models that excel at specific tasks, cutting inference costs by as much as 90% while improving accuracy. You'll learn when to choose small over large, how to architect task-specific models, and the optimization techniques that let small models punch above their weight.
By default, general-purpose LLMs are not optimized for specific domains and business goals. Using techniques such as specialized fine-tuning, pruning unneeded neural components, and knowledge distillation, you can rearchitect your models to cost less, run faster, and deliver more accurate results.
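To make one of those techniques concrete, here is a minimal sketch of a knowledge-distillation training loss in PyTorch, where a small student model learns to mimic a larger teacher. The function name, temperature, and weighting are illustrative choices, not code from the books:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft-target term: push the student toward the teacher's softened
    # output distribution (classic Hinton-style distillation).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # rescale to match the hard-label loss magnitude
    # Hard-target term: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

Blending the two terms lets the student inherit the teacher's knowledge of class similarities while still fitting the training labels.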
CUDA (Compute Unified Device Architecture) provides a powerful parallel programming model that AI engineers can use to tap the massive processing power of NVIDIA GPUs. CUDA delivers direct control, debugging power, and GPU-level acceleration that higher-level optimizations can’t match.
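For a taste of the programming model, here is a minimal vector-addition kernel written in Python with Numba's CUDA JIT, one of several front ends to CUDA; the kernel and variable names are illustrative rather than taken from the book:

```python
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    # Each GPU thread computes exactly one output element.
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

# Move inputs to device memory, launch the kernel, and copy the result back.
d_a, d_b = cuda.to_device(a), cuda.to_device(b)
d_out = cuda.device_array_like(a)
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](d_a, d_b, d_out)
out = d_out.copy_to_host()
```

The explicit grid/block launch configuration and host-to-device copies illustrate the kind of direct, GPU-level control described above.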
This practical book shows you how to adapt pretrained open source models to your domain using transfer learning and parameter-efficient fine-tuning. You’ll learn to minimize cost through optimization and quantization, develop secure APIs to serve your models, and deploy SLMs on commodity hardware, including small devices. The hands-on examples include integrating SLMs into RAG systems and agentic workflows.
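As an illustration of what parameter-efficient fine-tuning looks like in practice, here is a minimal LoRA setup using the Hugging Face peft library; the model name and hyperparameters are placeholders, not the book's own examples:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint: any small open-source causal LM with
# Llama-style attention projections works here.
base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to adapter outputs
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of base weights
```

Because only the small adapter matrices are trained, fine-tuning fits on commodity GPUs and the resulting adapter weights are typically only a few megabytes to ship.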
Building Reliable AI Systems is a comprehensive guide to creating LLM-based apps that are faster and more accurate. It takes you from training to production and beyond into the ongoing maintenance of an LLM. In each chapter, you’ll find in-depth code samples and hands-on projects, including building a RAG-powered chatbot and an agent created with LangChain. Deploying an LLM can be costly, so you’ll love the performance optimization techniques (prompt optimization, model compression, and quantization) that make your LLMs faster and more efficient. Throughout, real-world case studies from e-commerce, healthcare, and the legal field give concrete examples of how businesses have solved common LLM problems.
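To preview the core of a RAG-powered chatbot, here is a stripped-down retrieve-then-prompt sketch in plain Python; the embedding model and LLM call are left abstract, and every name here is illustrative rather than drawn from the book's LangChain code:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    # Rank stored documents by cosine similarity to the query embedding.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

def rag_prompt(question, retrieved_docs):
    # Ground the model's answer in the retrieved passages.
    context = "\n\n".join(retrieved_docs)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")
```

Feeding the output of rag_prompt to any chat model completes the loop; a production system adds a vector database, document chunking, and the ongoing maintenance the book covers.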