Spark GraphX in Action
Michael S. Malak and Robin East
  • June 2016
  • ISBN 9781617292521
  • 280 pages
  • printed in black & white

Learn complex graph processing from two experienced authors… A comprehensive guide.

Gaurav Bhardwaj, 3Pillar Global

Spark GraphX in Action starts out with an overview of Apache Spark and the GraphX graph processing API. This example-based tutorial then teaches you how to configure GraphX and how to use it interactively. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data.

Table of Contents

Part 1 SPARK and GRAPHS

1. Two important technologies: Spark and graphs

1.1. Spark: the step beyond Hadoop MapReduce

1.1.1. The elusive definition of Big Data

1.1.2. Hadoop: the world before Spark

1.1.3. Spark: in-memory MapReduce processing

1.2. Graphs: finding meaning from relationships

1.2.1. Uses of graphs

1.2.2. Types of graph data

1.2.3. Plain RDBMS inadequate for graphs

1.3. Putting them together for lightning fast graph processing: Spark GraphX

1.3.1. Property graph - adding richness

1.3.2. Graph partitioning - graphs meet Big Data

1.3.3. GraphX lets you choose: graph parallel or data parallel

1.3.4. Various ways GraphX fits into a processing flow

1.3.5. GraphX vs. other systems

1.3.6. Storing the graphs: distributed file storage vs. graph database

1.4. Summary

2. GraphX Quick Start

2.1. Getting set up and getting data

2.2. Interactive GraphX querying using the Spark Shell

2.3. PageRank example

2.4. Summary

3. Some Fundamentals

3.1. Scala, the native language of Spark

3.1.1. Scala philosophy: conciseness and expressiveness

3.1.2. Functional Programming

3.1.3. Inferred typing

3.1.4. Class declaration

3.1.5. Map and Reduce

3.1.6. Everything is a function

3.1.7. Java interoperability

3.2. Spark

3.2.1. Distributed in-memory data: RDDs

3.2.2. Laziness

3.2.3. Cluster requirements and terminology

3.2.4. Serialization

3.2.5. Common RDD operations

3.2.6. Hello World with Spark and sbt

3.3. Graph terminology

3.3.1. Basics

3.3.2. RDF graphs vs. property graphs

3.3.3. Adjacency matrix

3.3.4. Graph querying systems

3.4. Summary

Part 2 CONNECTING VERTICES

4. GraphX Basics

4.1. Vertex and Edge classes

4.2. Mapping operations

4.2.1. Simple graph transformation

4.2.2. Map/Reduce

4.2.3. Iterated Map/Reduce

4.3. Serialization/Deserialization

4.3.1. Reading/writing binary format

4.3.2. JSON format

4.3.3. GEXF format for Gephi visualization software

4.4. Graph generation

4.4.1. Deterministic graphs

4.4.2. Random graphs

4.5. Pregel API

4.6. Summary

5. Built-in Algorithms

5.1. Seek out authoritative nodes: PageRank

5.1.1. PageRank algorithm explained

5.1.2. Invoking PageRank in GraphX

5.1.3. Personalized PageRank

5.2. Measuring connectedness: Triangle Count

5.2.1. Uses of Triangle Count

5.2.2. Slashdot friends and foes example

5.3. Find the fewest hops: Shortest Paths

5.4. Finding isolated populations: Connected Components

5.4.1. Predicting social circles

5.5. Reciprocated love only, please: Strongly Connected Components

5.6. Community detection: LabelPropagation

5.7. Summary

6. Other Useful Graph Algorithms

6.1. Your Own GPS: Shortest Paths with Weights

6.2. Travelling Salesman: Greedy Algorithm

6.3. Route Utilities: Minimum Spanning Trees

6.3.1. Deriving Taxonomies with Word2Vec and Minimum Spanning Trees

6.4. Summary

7. Machine Learning

7.1. Supervised, Unsupervised, Semi-Supervised Learning

7.2. Recommend a Movie: SVDPlusPlus

7.2.1. Explanation of the Koren formula

7.3. Using GraphX With MLlib

7.3.1. Determine Topics: Latent Dirichlet Allocation

7.3.2. Detect Spam: LogisticRegressionWithSGD

7.3.3. Image Segmentation (for Computer Vision) Using Power Iteration Clustering

7.4. Poor Man’s Training Data: Graph-Based Semi-Supervised Learning

7.4.1. K-Nearest Neighbors Graph Construction

7.4.2. Semi-Supervised Learning Label Propagation

7.5. Summary

Part 3 OVER THE ARC

8. The Missing Algorithms

8.1. Missing Basic Graph Operations

8.1.1. Common Sense Subgraphs

8.1.2. Merge two graphs

8.2. Reading RDF Graph Files

8.2.1. Matching Vertices and Constructing the Graph

8.2.2. Improving Performance with IndexedRDD, the RDD HashMap

8.3. Poor Man’s Graph Isomorphism - Finding Missing Wikipedia Infobox Items

8.4. Global Clustering Coefficient: Compare Connectedness

8.5. Summary

9. Performance and Monitoring

9.1. Monitoring your Spark Application

9.1.1. How Spark runs your Application

9.1.2. Understanding your application run-time with Spark monitoring

9.1.3. History server

9.2. Configuring Spark

9.2.1. Utilizing All CPU cores

9.3. Spark Performance Tuning

9.3.1. Speeding up Spark with caching and persistence

9.3.2. Checkpointing

9.3.3. Reducing memory pressure with serialization

9.4. Graph Partitioning

9.5. Summary

10. Other Languages and Tools

10.1. Other Languages

10.1.1. Using GraphX with Java 7

10.1.2. Using GraphX with Java 8

10.1.3. Whether GraphX May Gain Python or R Bindings in the Future

10.2. Other Visualization Tool: Apache Zeppelin plus d3.js

10.3. Almost a Database: Spark JobServer

10.3.1. Example: Query Slashdot friends degree of separation

10.3.2. More on using Spark JobServer

10.4. Using SQL with Spark graphs with GraphFrames

10.4.1. Getting GraphFrames, plus GraphX interoperability

10.4.2. Using SQL for convenience and performance

10.4.3. Searching for vertices with Cypher subset

10.4.4. Slightly more complex isomorphic searching on YAGO

10.5. Summary

Appendixes

Appendix A: Installing Spark

A.1. On a Local Virtual Machine: CDH QuickStart VM

A.1.1. VirtualBox Tweaks

A.2. Onto your laptop and Hadoopless: Linux or OSX

A.2.1. On a Custom Local Virtual Machine

A.3. In the Cloud: Amazon Web Services

Appendix B: Gephi Visualization Software

B.1. Laying Out Your Environment

B.2. Basic Recipe

B.3. Key Settings

B.3.1. Layout Window

B.3.2. Preview Settings Window

Appendix C: Resources: Where to Go for More

C.1. Spark

C.1.1. Apache Mailing Lists

C.1.2. Databricks forums

C.1.3. Conference and Meetup Videos

C.1.4. Jira

C.1.5. Twitter

C.1.6. spark-packages.org

C.1.7. AMPLab

C.1.8. Google Scholar Alerts

C.1.9. Author blogs

C.2. Scala

C.3. Graphs

Appendix D: List of Scala Tips in this Book

About the Technology

GraphX is a powerful graph processing API for the Apache Spark analytics engine that lets you draw insights from large datasets. GraphX gives you unprecedented speed and capacity for running massively parallel and machine learning algorithms.

About the book

Spark GraphX in Action begins with the big picture of what graphs can be used for. This example-based tutorial teaches you how to use GraphX interactively. You'll start with a crystal-clear introduction to building big data graphs from regular data, and then explore the problems and possibilities of implementing graph algorithms and architecting graph processing pipelines. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data.

What's inside

  • Understanding graph technology
  • Using the GraphX API
  • Developing algorithms for big graphs
  • Machine learning with graphs
  • Graph visualization

About the reader

Readers should be comfortable writing code. Experience with Apache Spark and Scala is not required.

About the authors

Michael Malak has worked on Spark applications for Fortune 500 companies since early 2013. Robin East has worked as a consultant to large organizations for over 15 years and is a data scientist at Worldpay.


The best resource to go from GraphX novice to expert in the least amount of time.

Justin Fister, PaperRater

A must-read for anyone serious about large-scale graph data mining!

Antonio Magnaghi, OpenMail

Reveals the awesome and elegant capabilities of working with linked data for large-scale datasets.

Sumit Pal, Independent consultant