Overview

12 Object detection

Object detection locates and labels objects in images by predicting bounding boxes, enabling tasks like counting, tracking across video frames, and cropping regions of interest for downstream models. Although instance segmentation yields richer, pixel-level information (and can imply bounding boxes), it is more computationally demanding and far costlier to label. When pixel-precise masks aren’t required, object detection is typically preferred for its speed and data efficiency.

Modern detectors fall into two families. Two-stage models (R-CNN variants) first generate region proposals and then classify/refine them; they’re accurate but computationally heavy. Single-stage models (YOLO, SSD, RetinaNet) predict boxes and classes in one pass, offering much higher throughput with a small potential accuracy trade-off—ideal for real-time use where recent YOLO variants are especially popular. The chapter walks through training a simplified YOLO from scratch on COCO: normalizing boxes, mapping them to a prediction grid, and building a model with a ResNet backbone that outputs per-cell box coordinates, confidence, and class probabilities. A custom loss combines squared error for box parameters with an IoU-driven confidence target, and uses weighting tricks to balance empty cells and emphasize localization. With limited training, results are promising but underfit, and the chapter outlines straightforward avenues for improvement (more data/epochs, augmentation, multi-box per cell, larger grids).
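To make the setup concrete, here is a minimal sketch of such a model in Keras, assuming an illustrative 7×7 grid, 224×224 inputs, and COCO’s 80 classes; the head layers and sizes are assumptions rather than the chapter’s exact code:

```python
# A ResNet backbone with a small convolutional head that emits, for every
# cell of an S x S grid, box coordinates (x, y, w, h), a confidence score,
# and one score per class. Values below are illustrative assumptions.
import keras
from keras import layers

GRID_SIZE = 7      # S x S prediction grid (assumed)
NUM_CLASSES = 80   # COCO's 80 object classes

inputs = keras.Input(shape=(224, 224, 3))
backbone = keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_tensor=inputs
)
x = backbone.output                               # (7, 7, 2048) for 224x224 inputs
x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
# 5 numbers per cell (x, y, w, h, confidence) plus NUM_CLASSES class scores.
outputs = layers.Conv2D(5 + NUM_CLASSES, 1)(x)    # (7, 7, 85)
model = keras.Model(inputs, outputs)
```

The final 1×1 convolution maps each spatial cell of the backbone’s output to one prediction vector, so the grid size falls directly out of the backbone’s output resolution.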

It then demonstrates inference with a pretrained RetinaNet, highlighting its feature pyramid network: upsampling deep semantic features and fusing them with higher-resolution features via lateral connections to better detect both small and large objects. This multi-scale design, now adopted by many YOLO versions, helps models generalize even to out-of-distribution imagery (e.g., pointillist paintings). The overarching guidance is to choose detection when you need object locations rather than pixel masks, pick architectures that match your latency and compute constraints, and leverage pretrained single-stage detectors for strong, practical performance.
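As a rough illustration of the lateral-connection idea (not RetinaNet’s exact implementation), a single FPN merge step can be sketched as follows, assuming two backbone feature maps c4 and c5, with c5 one level deeper and half the resolution of c4:

```python
# One feature pyramid merge step: project both maps to a common channel
# depth, upsample the deeper map, and fuse the two by element-wise addition.
import keras
from keras import layers

def fpn_merge(c4, c5, channels=256):
    """Fuse a deep, semantic map (c5) with a shallower, sharper one (c4)."""
    p5 = layers.Conv2D(channels, 1)(c5)        # project deep features
    p5_up = layers.UpSampling2D(2)(p5)         # upsample 2x to match c4
    lateral = layers.Conv2D(channels, 1)(c4)   # lateral connection from c4
    p4 = layers.Add()([p5_up, lateral])        # fuse the two scales
    # A 3x3 conv smooths artifacts introduced by upsampling.
    return layers.Conv2D(channels, 3, padding="same")(p4)
```

Repeating this step down the backbone yields a pyramid of maps that are all semantically strong but span several spatial resolutions.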

Figures in this chapter:

  • Object detectors draw boxes around objects in an image and label them.
  • An R-CNN first extracts region proposals, then classifies the proposals with a convnet (a CNN).
  • An example image from the COCO detection dataset. [2]
  • YOLO outputs as visualized in the first YOLO paper.
  • YOLO outputs a bounding box prediction and class label for each image region. [4]
  • Predictions for our sample image.
  • Every bounding box predicted by the YOLO model.
  • A feature pyramid network creates semantically interesting feature maps at different scales.
  • Predictions on a test image from the RetinaNet model.

Chapter summary

  • Object detection identifies and locates objects within an image using bounding boxes. It yields coarser information than image segmentation (boxes rather than pixel-level masks) but can be run much more efficiently.
  • There are two primary approaches to object detection:
    • Region-based Convolutional Neural Networks (R-CNNs), which are two-stage models that first propose regions of interest and then classify them with a convnet.
    • Single-stage detectors (like RetinaNet and YOLO), which perform both tasks in a single step. Single-stage detectors are generally faster and more efficient, making them suitable for real-time applications (e.g., self-driving cars).
  • YOLO computes two separate outputs simultaneously during training: candidate bounding boxes and a class probability map.
    • Each candidate bounding box is paired with a confidence score, which is trained to target the Intersection over Union of the predicted box and the ground truth box (see the IoU sketch following this summary).
    • The class probability map classifies different regions of an image as belonging to different objects.
  • RetinaNet builds on this idea by using a feature pyramid network (FPN), which combines features from multiple convnet layers to create feature maps at different scales, allowing it to more accurately detect objects of different sizes.
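For reference, the Intersection over Union used as the confidence target can be computed as below; this is a generic sketch assuming boxes in (x_min, y_min, x_max, y_max) format, not the chapter’s exact utility:

```python
import tensorflow as tf

def iou(box_a, box_b):
    """IoU of boxes given as (..., 4) tensors of (x_min, y_min, x_max, y_max)."""
    # Intersection rectangle; a negative extent means the boxes don't overlap.
    x1 = tf.maximum(box_a[..., 0], box_b[..., 0])
    y1 = tf.maximum(box_a[..., 1], box_b[..., 1])
    x2 = tf.minimum(box_a[..., 2], box_b[..., 2])
    y2 = tf.minimum(box_a[..., 3], box_b[..., 3])
    inter = tf.maximum(x2 - x1, 0.0) * tf.maximum(y2 - y1, 0.0)
    area_a = (box_a[..., 2] - box_a[..., 0]) * (box_a[..., 3] - box_a[..., 1])
    area_b = (box_b[..., 2] - box_b[..., 0]) * (box_b[..., 3] - box_b[..., 1])
    # Union = sum of areas minus the double-counted intersection.
    return inter / (area_a + area_b - inter + 1e-7)
```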

[1] The COCO 2017 detection dataset can be explored at https://cocodataset.org/. Most images in this chapter are from this dataset.

[3] Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CoRR (2015), https://arxiv.org/abs/1506.02640

FAQ

What is object detection, and what are its common applications?
Object detection locates and classifies objects in an image by drawing bounding boxes around them. Typical uses include counting instances, tracking objects across video frames, and cropping regions of interest for tasks like classification or OCR.

Why not always use image segmentation instead of detection?
Segmentation provides pixel-level masks and thus strictly more information than detection, but it is costlier to compute and far more expensive to label. If you don’t need pixel-precise boundaries, a detector is faster and cheaper to train and deploy.

What’s the difference between two-stage and single-stage object detectors?
Two-stage detectors (R-CNN family) first generate region proposals and then classify/refine them, which is accurate but computationally heavy. Single-stage detectors (e.g., YOLO, RetinaNet) predict boxes and classes in one pass, offering much higher speed with comparable accuracy for many use cases.

How does a two-stage R-CNN work, and why is it expensive?
An R-CNN first proposes thousands of candidate regions, then runs a convnet to classify and refine each one. Classifying so many patches per image makes inference slow and resource-intensive, which limits real-time and embedded applications.
How does the simplified YOLO model in this chapter make predictions?
The image is divided into a grid. At each grid cell, the model predicts a bounding box (x, y, w, h) and a confidence score, plus class probabilities. In this simplified version, there’s one box per grid cell (the original YOLO predicted multiple).

Why are bounding boxes normalized to the [0, 1] range?
Normalizing coordinates removes dependence on the original image size, simplifying preprocessing, batching, and loss computation across varied image dimensions.

How are YOLO training targets constructed for the grid?
Two targets are built: (1) a class map assigning a label to grid cells overlapping an object, and (2) a box tensor storing each box’s parameters in the cell that contains the box’s center, with confidence 1 for true boxes and 0 elsewhere.
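A minimal sketch of this target construction, assuming boxes already normalized to [0, 1] in (cx, cy, w, h) format and an illustrative 7×7 grid (names here are assumptions, not the book’s code):

```python
import numpy as np

GRID_SIZE = 7  # illustrative grid size

def make_targets(boxes, labels):
    """Build per-cell targets from normalized (cx, cy, w, h) boxes."""
    box_target = np.zeros((GRID_SIZE, GRID_SIZE, 5), dtype="float32")
    class_target = np.zeros((GRID_SIZE, GRID_SIZE), dtype="int32")
    for (cx, cy, w, h), label in zip(boxes, labels):
        # The cell containing the box center is responsible for this object.
        col = min(int(cx * GRID_SIZE), GRID_SIZE - 1)
        row = min(int(cy * GRID_SIZE), GRID_SIZE - 1)
        box_target[row, col] = [cx, cy, w, h, 1.0]  # confidence 1 for true boxes
        class_target[row, col] = label
    return box_target, class_target
```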
What loss functions are used, and how does IoU affect training?
The model uses sparse categorical crossentropy for class predictions and a custom box loss. The box loss weights the (x, y) and (√w, √h) errors more heavily, and sets the confidence target to the Intersection over Union (IoU) with the ground truth when a cell contains an object, and to 0 otherwise.
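The box loss might be sketched as follows, borrowing the weighting constants from the original YOLO paper (λ_coord = 5, λ_noobj = 0.5) as stand-ins for the chapter’s exact values, with an inline IoU helper for center-format boxes:

```python
import tensorflow as tf

LAMBDA_COORD = 5.0  # emphasize localization errors
LAMBDA_NOOBJ = 0.5  # down-weight confidence errors in the many empty cells

def iou_centered(a, b):
    """IoU of boxes given as (..., 4) tensors in (cx, cy, w, h) format."""
    a_min, a_max = a[..., :2] - a[..., 2:] / 2, a[..., :2] + a[..., 2:] / 2
    b_min, b_max = b[..., :2] - b[..., 2:] / 2, b[..., :2] + b[..., 2:] / 2
    inter = tf.reduce_prod(
        tf.maximum(tf.minimum(a_max, b_max) - tf.maximum(a_min, b_min), 0.0),
        axis=-1)
    union = (tf.reduce_prod(a[..., 2:], axis=-1)
             + tf.reduce_prod(b[..., 2:], axis=-1) - inter)
    return inter / (union + 1e-7)

def box_loss(y_true, y_pred):
    """y_true / y_pred: (batch, S, S, 5) tensors of (cx, cy, w, h, confidence)."""
    obj = y_true[..., 4]  # 1.0 where a cell contains an object, 0.0 elsewhere
    xy = tf.reduce_sum(tf.square(y_true[..., :2] - y_pred[..., :2]), axis=-1)
    wh = tf.reduce_sum(tf.square(
        tf.sqrt(tf.maximum(y_true[..., 2:4], 0.0))
        - tf.sqrt(tf.maximum(y_pred[..., 2:4], 0.0))), axis=-1)
    # Confidence regresses toward the predicted box's IoU with the ground
    # truth in occupied cells, and toward 0 in empty ones.
    conf_target = obj * iou_centered(y_true[..., :4], y_pred[..., :4])
    conf = tf.square(conf_target - y_pred[..., 4])
    return tf.reduce_mean(LAMBDA_COORD * obj * (xy + wh)
                          + obj * conf
                          + LAMBDA_NOOBJ * (1.0 - obj) * conf)
```

Taking square roots of width and height keeps errors on small boxes from being drowned out by large ones, and the two λ weights stop the many empty cells from dominating the gradient.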
How does tf.data help when training on COCO?
tf.data streams and preprocesses images on the fly, batching and prefetching to keep the GPU fed. This avoids loading the entire dataset into memory and reduces input pipeline bottlenecks during training.
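A minimal sketch of such a pipeline; the file pattern and the preprocess function are placeholders (a real detection pipeline would also construct the box and class targets per image):

```python
import tensorflow as tf

def preprocess(path):
    # Decode and resize one image on the fly; target construction omitted.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, (224, 224)) / 255.0

# Assumed local path to the COCO training images.
paths = tf.data.Dataset.list_files("coco/train2017/*.jpg", shuffle=True)
dataset = (paths
           .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))  # overlap input prep with GPU work
```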
What is RetinaNet’s feature pyramid network, and why does it help?
RetinaNet builds a feature pyramid by combining high-level semantic features with higher-resolution features from earlier layers via lateral connections. This multi-scale representation improves detection of both small and large objects. Recent YOLO versions adopt similar ideas.
