Overview

5 The birth of information theory: Shannon and the mathematics of uncertainty

The chapter recounts how Claude Shannon reframed uncertainty into a measurable resource and, in doing so, founded information theory. Building on earlier statistical ideas yet striking in a new direction, Shannon treated a message as a probabilistic sequence of symbols and defined information as the reduction of uncertainty. By introducing entropy—quantified in bits—and insisting that communication engineering concerns transmission rather than meaning, he supplied a universal yardstick that made “information” concrete, comparable, and optimizable across any channel.

Armed with this definition, Shannon articulated a complete end‑to‑end model—source, encoder, channel with noise, decoder, receiver—and turned the central trade‑offs of communication into mathematics. Source coding tied the limits of lossless compression to a source’s entropy; noisy‑channel coding showed that carefully added redundancy can make errors arbitrarily rare below a channel’s capacity; and capacity itself set the ceiling for reliable information rate. These results, extended through practical signaling and modulation schemes, solved urgent mid‑century engineering problems and still underwrite today’s digital infrastructure—from telephony and storage to satellite links, Wi‑Fi, and streaming.

The chapter also traces Shannon’s unintended legacy far beyond wires and radio. Entropy and mutual information became core tools in statistics, data science, and AI: decision trees and random forests split on information gain; feature selection ranks predictors by mutual information; clustering is assessed with entropy‑based measures such as purity and normalized mutual information; and neural networks learn by minimizing cross‑entropy, while representation learning compresses inputs to preserve high‑entropy structure and discard redundancy. In all these settings, the bit remains the unit and entropy the common currency for quantifying, managing, and ultimately learning from uncertainty.

Entropy quantifies the uncertainty of a message in bits. A fair coin resolves 1 bit of uncertainty, while a biased coin (say, a 90/10 split) conveys less (about 0.47 bits) because the outcome is more predictable. A fair die roll (about 2.58 bits) and a draw from a 52-card deck (about 5.7 bits) carry progressively higher entropy as the number of equally likely outcomes increases. As entropy rises, so do uncertainty, surprise, information gain, and channel capacity requirements. When entropy is low, redundancy allows greater opportunities for compression.
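
These figures follow directly from the entropy formula H = -Σ p·log2(p). A minimal sketch in plain Python (no external libraries) that reproduces each of the numbers above:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over nonzero p."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin    -> 1.0 bit
print(entropy([0.9, 0.1]))    # 90/10 coin   -> ~0.47 bits
print(entropy([1/6] * 6))     # fair die     -> ~2.58 bits
print(entropy([1/52] * 52))   # 52-card draw -> ~5.70 bits
```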
An illustration of representation learning as compression. On the left is the original raw object, containing many pixels and redundancies. In representation learning, an encoder compresses this object into a compact form, shown here on the right. The compressed version looks blurrier, but it preserves the high-entropy elements—the unpredictable parts that carry real information—while discarding low-entropy redundancies. This compressed representation is the “message” that can be transmitted or stored efficiently. Although not depicted here, a decoder can reconstruct an image that closely resembles the original, demonstrating that redundancy can be discarded without losing essential content.
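
A learned autoencoder is beyond the scope of a summary, but the encode-compress-decode idea can be sketched with truncated SVD standing in for the encoder. Everything here (the synthetic image, the dimensions, the component count k) is an illustrative assumption, not material from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "image": low-rank structure (the informative part) plus noise (redundancy).
structure = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))
image = structure + 0.05 * rng.normal(size=(64, 64))

# "Encoder": keep only the top-k singular components -> compact representation.
k = 4
U, s, Vt = np.linalg.svd(image, full_matrices=False)
code = U[:, :k] * s[:k]          # 64*k numbers (plus k basis rows) instead of 64*64

# "Decoder": reconstruct an approximation from the compact code.
reconstruction = code @ Vt[:k, :]

original_size = image.size
compressed_size = code.size + Vt[:k, :].size
error = np.linalg.norm(image - reconstruction) / np.linalg.norm(image)
print(f"{original_size} values -> {compressed_size} stored, relative error {error:.3f}")
```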

Summary

  • Claude Shannon redefined information not as meaning but as the resolution of uncertainty. A predictable message like “the sun will rise tomorrow” carries almost no information, while an unpredictable one like a coin flip resolves genuine doubt. This shift allowed information to be quantified in probabilistic terms rather than treated as a vague concept.
  • Borrowing from physics, Shannon formalized entropy as the average uncertainty in a source of messages. Measured in bits, entropy provides a precise yardstick: one bit for a fair coin flip, more for larger sets of equally likely outcomes. With entropy, engineers could assign numbers to unpredictability and compare the information content of different sources.
  • Low-entropy sequences, such as repeated letters, are highly redundant and can be compressed without loss. High-entropy sequences, by contrast, resist compression because every symbol is unpredictable. Shannon also showed that redundancy, when deliberately added through coding, enables error correction—balancing efficiency with reliability in noisy channels.
  • Shannon demonstrated that every channel has a maximum capacity, defined by its bandwidth and noise level, beyond which reliable transmission is impossible. By comparing entropy against channel capacity, engineers could design systems that transmit as much information as possible with minimal error. This principle still governs today’s telephony, Wi-Fi, satellite links, and digital media.
  • In the 1940s, telegraphy, telephony, and radio faced urgent problems of efficiency and reliability. By making uncertainty measurable and showing how entropy governs redundancy, coding, and channel limits, Shannon gave engineers the tools to tame noise, compress signals, and design communication systems that approach theoretical limits of performance.
  • Entropy quickly escaped engineering. In data science, it became the foundation of decision trees and random forests, where splits are chosen to maximize information gain. Mutual information extended this idea, guiding feature selection and quantifying relationships between variables. Entropy also became central to clustering, neural networks, and representation learning.
  • By treating information as measurable uncertainty and entropy as its unit, Shannon gave science and engineering a new common currency. His work solved practical problems of mid-20th-century communication while also laying the groundwork for the information age. Today, bits and entropy remain the backbone of both digital technology and modern artificial intelligence.

FAQ

What did Shannon redefine “information” to mean in this chapter?
He defined information as the reduction of uncertainty. By treating a message as one outcome selected from many possible outcomes with known probabilities, Shannon made information measurable. The average uncertainty of a source is quantified by entropy, measured in bits.

What is entropy, and why is it measured in bits?
Entropy is the average uncertainty of a source—the expected “surprise” before a message arrives. Measuring with base-2 logarithms makes the natural unit the bit, the resolution of one yes/no choice. Examples: a fair coin resolves 1 bit; a biased 90/10 coin conveys about 0.47 bits; a fair die about 2.58 bits; drawing a card from a 52-card deck about 5.7 bits.

Why does Shannon’s theory ignore meaning (semantics)?
Shannon’s framework is about transmission, not interpretation. It treats all messages uniformly and focuses on how uncertain we are before a message and how efficiently a channel can resolve that uncertainty. By stripping away meaning, information becomes a quantifiable commodity in bits, applicable to any message.
What is redundancy and why is it a “double-edged sword”?
Redundancy is predictable, repeated structure. It enables compression (strip predictable parts to save capacity) but also enables error correction (add structured redundancy so receivers can detect and fix errors). AAAAAA is highly compressible (low entropy); a string like W7dX8$ isn’t (high entropy). Shannon showed both compression and coding are governed by entropy.
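
The asymmetry is easy to observe with an off-the-shelf compressor; a quick sketch using Python's zlib (the byte strings are illustrative stand-ins for the examples above):

```python
import os
import zlib

low_entropy = b"A" * 1000        # highly redundant, like AAAAAA
high_entropy = os.urandom(1000)  # near-random bytes, like W7dX8$

print(len(zlib.compress(low_entropy)))   # ~17 bytes: redundancy squeezed out
print(len(zlib.compress(high_entropy)))  # ~1010 bytes: nothing to squeeze
```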
What is channel capacity and how does it relate to entropy?
Channel capacity is the maximum reliable information rate a channel can support, set by bandwidth and noise. Compare source entropy (bits needed) to capacity (bits per second available): if entropy exceeds capacity, you must compress or slow the rate; if you transmit below capacity with proper coding, reliable communication is achievable.
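
For the classic case of a bandlimited channel with Gaussian noise, this ceiling is the Shannon–Hartley capacity C = B log2(1 + S/N). A small sketch (the 3 kHz bandwidth and 30 dB SNR are made-up illustrative values):

```python
from math import log2

def capacity_bps(bandwidth_hz, snr_db):
    """Shannon-Hartley capacity: bits/second for a Gaussian channel."""
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * log2(1 + snr_linear)

# A telephone-like channel: 3 kHz of bandwidth at 30 dB signal-to-noise.
print(capacity_bps(3_000, 30))   # ~29,902 bits/second, the reliable-rate ceiling
```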
What is the noisy channel coding theorem and why was it surprising?
It states that for any noisy channel, if the transmission rate is below capacity, there exist codes that make error probability arbitrarily small. Noise doesn’t impose an absolute reliability floor—structured redundancy can beat it. This insight underpins practical codes like Hamming, Reed–Solomon, LDPC, and Turbo codes.
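
Shannon's proof concerns far better codes than this, but even the crudest structured redundancy shows errors being driven down rather than hitting a floor. A sketch of a three-fold repetition code over a simulated binary symmetric channel (all parameters illustrative):

```python
import random

def transmit(bit, flip_prob):
    """Binary symmetric channel: flip the bit with probability flip_prob."""
    return bit ^ (random.random() < flip_prob)

def send_with_repetition(bit, flip_prob, copies=3):
    """Repeat the bit over the channel, then decode by majority vote."""
    received = [transmit(bit, flip_prob) for _ in range(copies)]
    return sum(received) > copies // 2

random.seed(0)
trials, p = 100_000, 0.1
raw_errors = sum(transmit(1, p) != 1 for _ in range(trials))
coded_errors = sum(send_with_repetition(1, p) != 1 for _ in range(trials))
print(raw_errors / trials)     # ~0.10 error rate with no coding
print(coded_errors / trials)   # ~0.028 (= 3p^2(1-p) + p^3): already much lower
```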
What does the source coding (compression) theorem say?
It sets a hard limit: the average length of any lossless code cannot be less than the source’s entropy. Predictable (low-entropy) sources compress well; near-random (high-entropy) sources don’t. This explains why ZIP can’t shrink already compressed or random data and why lossy codecs trade redundancy and perceptual irrelevance for efficiency.
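
The bound is easiest to see for a source with power-of-two probabilities, where an optimal prefix code meets the entropy floor exactly. A sketch (the four-symbol source is an illustrative assumption):

```python
from math import log2

# Source: four symbols with dyadic (power-of-two) probabilities.
probs   = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
# An optimal prefix code for this source (Huffman-style lengths 1, 2, 3, 3).
lengths = {"A": 1, "B": 2, "C": 3, "D": 3}

entropy = -sum(p * log2(p) for p in probs.values())
avg_len = sum(probs[s] * lengths[s] for s in probs)

print(entropy)   # 1.75 bits/symbol: the floor for any lossless code
print(avg_len)   # 1.75 bits/symbol: this code achieves the floor
# A naive fixed-length code would spend log2(4) = 2 bits per symbol.
```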
How do decision trees and random forests use entropy?
At each split, trees measure node entropy and choose the feature/threshold that maximizes information gain (entropy reduction). The Gini index is an alternative impurity measure with similar behavior. Feature importance in trees/forests aggregates these entropy (or Gini) reductions: variables that consistently cut uncertainty most rank highest.
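
A hand-rolled sketch of that computation: entropy of the parent node minus the weighted entropy of the children. The tiny label lists are hypothetical:

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((k / n) * log2(k / n) for k in counts.values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting parent into left/right children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node: 5 positives, 5 negatives; the split isolates most positives.
parent = ["+"] * 5 + ["-"] * 5
left, right = ["+", "+", "+", "+"], ["+"] + ["-"] * 5
print(information_gain(parent, left, right))  # ~0.61 bits of uncertainty removed
```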
What is mutual information and how is it used beyond communications?
Mutual information measures how much knowing one variable reduces uncertainty about another. It guides feature selection (keep predictors that cut target uncertainty) and evaluates/optimizes clustering (e.g., purity, cluster entropy, normalized mutual information) by asking how much cluster assignments reveal about true categories.
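
Computed from a joint probability table, mutual information is I(X;Y) = Σ p(x,y) log2(p(x,y) / (p(x)p(y))). A NumPy sketch with hypothetical joint distributions:

```python
import numpy as np

def mutual_information(joint):
    """MI in bits from a joint probability table over two discrete variables."""
    px = joint.sum(axis=1, keepdims=True)     # marginal of X (rows)
    py = joint.sum(axis=0, keepdims=True)     # marginal of Y (columns)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

# Hypothetical joint distributions of a binary feature X and binary target Y.
dependent   = np.array([[0.4, 0.1],
                        [0.1, 0.4]])   # knowing X says a lot about Y
independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]]) # knowing X says nothing about Y

print(mutual_information(dependent))    # ~0.278 bits
print(mutual_information(independent))  # 0.0 bits
```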
How does entropy appear in deep learning and representation learning?
Neural networks are trained by minimizing cross-entropy between predicted and true distributions—directly reducing uncertainty in predictions. Representation learning (e.g., autoencoders, embeddings) compresses data by preserving high-entropy, informative structure and discarding low-entropy redundancy, echoing Shannon’s balance between efficiency and fidelity.
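
A minimal sketch of the cross-entropy computation itself (in bits, using base-2 logs; frameworks typically use natural logs, which differ only by a constant factor). The three predicted distributions are hypothetical:

```python
import numpy as np

def cross_entropy(true_dist, predicted):
    """Cross-entropy in bits: -sum(p_true * log2(p_predicted))."""
    return -float(np.sum(true_dist * np.log2(predicted)))

# One-hot target: the true class is the second of three.
target = np.array([0.0, 1.0, 0.0])

confident_right = np.array([0.05, 0.90, 0.05])
uncertain       = np.array([0.30, 0.40, 0.30])
confident_wrong = np.array([0.90, 0.05, 0.05])

print(cross_entropy(target, confident_right))  # ~0.15 bits: low loss
print(cross_entropy(target, uncertain))        # ~1.32 bits
print(cross_entropy(target, confident_wrong))  # ~4.32 bits: heavily penalized
```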
