The Ultimate Introduction to Big Data
Frank Kane
  • Course duration: 14h 29m

I love that the author demonstrates how to use each tool and technology in the course, and provides great examples. The comparison (pros & cons) of tools really helps to decide what to use in a project.

Dmytro Bekuzarov, Java tech lead, GlobalLogic
See it. Do it. Learn it! Businesses rely on data for decision-making, success, and survival. The volume of data companies can capture is growing every day, and big data platforms like Hadoop help store, manage, and analyze it. In The Ultimate Introduction to Big Data, big data guru Frank Kane introduces you to big data processing systems and shows you how they fit together. This liveVideo spotlights over 25 different technologies in over 14 hours of video instruction.

Distributed by Manning Publications

This course was created independently by big data expert Frank Kane and is distributed by Manning through our exclusive liveVideo platform.

Table of Contents

Learn all the buzzwords and install Hadoop

Introduction, and install Hadoop on your desktop!

Hadoop overview and history

Overview of the Hadoop ecosystem

Tips for using this course

Using Hadoop's core: HDFS and MapReduce

HDFS: what it is and how it works

Install the MovieLens dataset into HDFS using the Ambari UI

Install the MovieLens dataset into HDFS using the command line

MapReduce: what it is and how it works

How MapReduce distributes processing

MapReduce example: break down movie ratings by rating score

Installing Python, MRJob, and nano

Code up the ratings histogram MapReduce job and run it

Exercise: Rank movies by their popularity

Check your results against mine!

Programming Hadoop with Pig

Introducing Ambari

Introducing Pig

Find the oldest movie with a 5-star rating using Pig

Find old 5-star movies with Pig

More Pig Latin

Exercise: Find the most-rated, one-star movie

Compare your results to mine!

Programming Hadoop with Spark

Why Spark?

The Resilient Distributed Dataset (RDD)

Find the movie with the lowest average rating with RDDs

Datasets and Spark 2.0

Find the movie with the lowest average rating with DataFrames

Movie recommendations with MLLib

Exercise: Filter the lowest-rated movies by number of ratings

Check your results against mine!

Using relational data stores with Hadoop

What is Hive?

Use Hive to find the most popular movie

How Hive works

Exercise: Use Hive to find the movie with the highest average rating

Compare your solution to mine

Integrating MySQL with Hadoop

Install MySQL and import our movie data

Use Sqoop to import data from MySQL to HDFS/Hive

Use Sqoop to export data from Hadoop to MySQL

Using non-relational data stores with Hadoop

Why NoSQL?

What is HBase?

Import movie ratings into HBase

Use HBase with Pig to import data at scale

Cassandra overview

Installing Cassandra

Write Spark output into Cassandra

MongoDB overview

Install MongoDB and integrate it with Spark

Using the MongoDB shell

Choosing a database technology

Choose a database for a given problem

Querying your data interactively

Overview of Drill

Setting up Drill

Querying across multiple databases

Overview of Phoenix

Install Phoenix and query HBase with it

Integrate Phoenix with Pig

Overview of Presto

Install Presto and query Hive with it

Query both Cassandra and Hive using Presto

Managing your cluster

YARN explained

Tez explained

Use Hive on Tez and measure the performance benefit

Mesos explained

ZooKeeper explained

Simulating a failing master with ZooKeeper

Oozie explained

Set up a simple Oozie workflow

Zeppelin overview

Use Zeppelin to analyze movie ratings, part 1

Use Zeppelin to analyze movie ratings, part 2

Hue overview

Other technologies worth mentioning

Feeding data to your cluster

Kafka explained

Setting up Kafka and publishing some data

Publishing web logs with Kafka

Flume explained

Set up Flume and publish logs with it

Set up Flume to monitor a directory and store its data in HDFS

Analyzing streams of data

Spark Streaming: introduction

Analyze web logs published with Flume using Spark Streaming

Exercise: Monitor Flume-published logs for errors in real time

Solution: Aggregating HTTP access codes with Spark Streaming

Apache Storm: Introduction

Count words with Storm

Flink: an overview

Counting words with Flink

Designing real-world systems

The best of the rest

Review: how the pieces fit together

Understanding your requirements

Sample application: consume webserver logs and keep track of top sellers

Sample application: serving movie recommendations to a website

Exercise: Design a system to report web sessions per day

Solution: Design a system to report daily sessions

Learning more

Books and online resources

About the subject

Designed for data storage and processing, Hadoop is a reliable, fault-tolerant platform for distributed computing. The most celebrated features of this open source Apache project are HDFS, Hadoop’s highly scalable distributed file system, and the MapReduce data processing engine. Together, they can process vast amounts of data across large clusters. An ecosystem of hundreds of technologies has sprung up around Hadoop to answer the ever-growing demand for large-scale data processing solutions. Understanding the architecture of massive-scale data processing applications is an increasingly important and desirable skill, and you’ll have it when you complete this liveVideo course!
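
To give a feel for the MapReduce model mentioned above, here is a minimal sketch in plain Python of a ratings histogram, similar in spirit to the course's "break down movie ratings by rating score" example. It is not the course's code and does not use Hadoop or MRJob; it only simulates the map, shuffle, and reduce phases in a single process. The sample rows, and the names `SAMPLE_LINES`, `mapper`, `shuffle`, `reducer`, and `ratings_histogram`, are assumptions made for illustration, and the rows assume a tab-separated layout of user ID, movie ID, rating, and timestamp.

```python
from collections import defaultdict

# Hypothetical sample rows (made-up values), assuming a tab-separated
# layout of: user ID, movie ID, rating, timestamp.
SAMPLE_LINES = [
    "1\t50\t5\t881250949",
    "2\t50\t4\t881250950",
    "3\t172\t5\t881250951",
    "4\t133\t1\t881250952",
    "5\t172\t4\t881250953",
]


def mapper(line):
    """Map phase: emit one (rating, 1) pair per input line."""
    _user_id, _movie_id, rating, _timestamp = line.split("\t")
    yield rating, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(rating, counts):
    """Reduce phase: sum the counts collected for one rating value."""
    return rating, sum(counts)


def ratings_histogram(lines):
    """Run map, shuffle, and reduce in sequence over the given lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(ratings_histogram(SAMPLE_LINES))
```

On a real cluster the mapper and reducer would run in parallel on different nodes, with the framework handling the grouping step between them; this sketch keeps everything in one process only to show the shape of the computation.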

About the video

The Ultimate Introduction to Big Data teaches you how to design powerful distributed data applications. With lots of hands-on exercises, instructor Frank Kane goes beyond Hadoop to cover many related technologies, giving you valuable firsthand experience with modern data processing applications. You’ll learn to choose an appropriate data storage technology for your application and discover how Hadoop clusters are managed by YARN, Tez, Mesos, and other technologies. You’ll also experience the combined power of HDFS and MapReduce for storing and analyzing data at scale.

Using other key parts of the Hadoop ecosystem like Hive and MySQL, you’ll analyze relational data, and then tackle non-relational data analysis using HBase, Cassandra, and MongoDB. With Kafka, Sqoop, and Flume, you’ll make short work of publishing data to your Hadoop cluster. When you’re done, you’ll have a deep understanding of data processing applications on Hadoop and its distributed systems.


Suitable for software engineers, program managers, data analysts, database administrators, system architects, and everyone else with an interest in learning about Hadoop, its ecosystem, and how it relates to their work. Familiarity with the Linux command line would be helpful, along with some programming experience in Python or Scala.

What you will learn

  • Using HDFS and MapReduce for storing and analyzing data at scale
  • Analyzing relational data using Hive and MySQL
  • Creating scripts to process data on a Hadoop cluster using Pig and Spark
  • Using HBase, Cassandra, and MongoDB to analyze non-relational data
  • Querying data interactively with Drill, Phoenix, and Presto
  • Choosing an appropriate data storage technology for your application
  • Understanding how Hadoop clusters are managed by YARN, Tez, Mesos, ZooKeeper, Zeppelin, Hue, and Oozie
  • Publishing data to your Hadoop cluster using Kafka, Sqoop, and Flume
  • Consuming streaming data using Spark Streaming, Flink, and Storm

About the instructor

Frank Kane holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. He spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to millions of customers every day. Sundog Software, his own company specializing in virtual reality environment technology and teaching others about big data analysis, is his pride and joy.

Good source to get to know the big data tools better.