Parallel and High Performance Computing
Robert Robey and Yuliana Zamora
  • MEAP began May 2019
  • Publication in Spring 2021 (estimated)
  • ISBN 9781617296468
  • 600 pages (estimated)
  • printed in black & white

"This is an authoritative, comprehensive, and detailed introduction to parallel computing."

Domingo Salazar
Complex calculations, like training deep learning models or running large-scale simulations, can take an extremely long time. Efficient parallel programming can save hours—or even days—of computing time. Parallel and High Performance Computing shows you how to deliver faster run-times, greater scalability, and increased energy efficiency to your programs by mastering parallel techniques for multicore processor and GPU hardware.

About the Technology

Modern computing hardware comes equipped with multicore CPUs and GPUs that can execute many instruction streams simultaneously. Parallel computing takes advantage of this now-standard architecture to perform multiple operations at the same time, offering the potential for applications that run faster, are more energy efficient, and can scale to tackle problems that demand large computational capability. But to get these benefits, you must change the way you design and write software. Taking advantage of the tools, algorithms, and design patterns created specifically for parallel processing is essential to building top-performing applications.
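
To give a flavor of what this looks like in code, here is a minimal sketch (illustrative only, not a listing from the book) of a serial C loop turned parallel with a single OpenMP directive, so that each core works on its own slice of the iterations:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    /* Illustrative vector-addition sketch; the array size is arbitrary. */
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* The pragma asks the compiler to split these iterations across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[0] = %g using up to %d threads\n", c[0], omp_get_max_threads());
    return 0;
}

Compiled with an OpenMP-capable compiler (for example, gcc -fopenmp), the loop's iterations are divided among the available cores.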

About the book

Parallel and High Performance Computing is an irreplaceable guide for anyone who needs to maximize application performance and reduce execution time. Parallel computing experts Robert Robey and Yuliana Zamora take a fundamental approach to parallel programming, providing novice practitioners the skills needed to tackle any high-performance computing project with modern CPU and GPU hardware. Get under the hood of parallel computing architecture and learn to evaluate hardware performance, scale up your resources to tackle larger problem sizes, and deliver a level of energy efficiency that makes high performance possible on hand-held devices. When you’re done, you’ll be able to build parallel programs that are reliable, robust, and require minimal code maintenance.

This book is unique in its breadth, with discussions of parallel algorithms, techniques for successfully developing parallel programs, and wide coverage of the most effective languages for the CPU and GPU. The programming paradigms include MPI, OpenMP threading, and vectorization for the CPU. For the GPU, the book covers the OpenMP and OpenACC directive-based approaches and the native CUDA and OpenCL languages.
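
As a small taste of the MPI material, the sketch below (illustrative only, not one of the book's listings) shows the smallest possible MPI program, in which every process reports its rank within the communicator:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID within the communicator */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, nprocs);

    MPI_Finalize();
    return 0;
}

Built with an MPI compiler wrapper such as mpicc and launched with a parallel startup command such as mpirun -n 4, it prints one line per process.
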
Table of Contents

Part 1: Introduction to Parallel Computing

1 Why parallel computing

1.1 Why should you learn about parallel computing?

1.1.1 What are the potential benefits of parallel computing?

1.1.2 Parallel computing cautions

1.2 The fundamental laws of parallel computing

1.2.1 The limit to parallel computing: Amdahl’s Law

1.2.2 Breaking through the parallel limit: Gustafson-Barsis’s Law

1.3 How does parallel computing work?

1.3.1 Walk through a sample application

1.3.2 A hardware model for today’s heterogeneous parallel systems

1.3.3 The application/software model for today’s heterogeneous parallel systems

1.4 Categorizing parallel approaches

1.5 Parallel strategies

1.6 Parallel speedup vs comparative speedups: two different measures

1.7 What will you learn in this book?

1.7.1 Exercises

1.8 Summary

2 Planning for parallel

2.1 Approaching a new project: the preparation

2.1.1 Version control: creating a safety vault for your parallel code

2.1.2 Test suites: the first step to creating a robust, reliable application

2.1.3 Finding and fixing memory issues

2.1.4 Improving code portability

2.2 Profiling step: probing the gap between system capabilities and application performance

2.3 Planning step: a foundation for success

2.3.1 Exploring with benchmarks and mini-apps

2.3.2 Design of the core data structures and code modularity

2.3.3 Algorithms: redesign for parallel

2.4 Implementation step: where it all happens

2.5 Commit step: wrapping it up with quality

2.6 Further explorations

2.6.1 Additional reading

2.6.2 Exercises

2.7 Summary

3 Performance limits and profiling

3.1 Know your application’s potential performance limits

3.2 Determine your hardware capabilities: benchmarking

3.2.1 Tools for gathering system characteristics

3.2.2 Calculating theoretical maximum FLOPS

3.2.3 The memory hierarchy and theoretical memory bandwidth

3.2.4 Empirical measurement of bandwidth and flops

3.2.5 Calculating the machine balance between flops and bandwidth

3.3 Characterizing your application: profiling

3.3.1 Profiling tools

3.3.2 Empirical measurement of processor clock frequency and energy consumption

3.3.3 Tracking memory during runtime

3.4 Further explorations

3.4.1 Additional reading

3.4.2 Exercises

3.5 Summary

4 Data design and performance models

4.1 Performance data structures: data-oriented design

4.1.1 Multidimensional arrays

4.1.2 Array of Structures (AOS) versus Structures of Arrays (SOA)

4.1.3 Array of Structure of Arrays (AOSOA)

4.2 Three C’s of cache misses: compulsory, capacity, conflict

4.3 Simple performance models: a case study

4.3.1 Full matrix data representations

4.3.2 Compressed sparse storage representations

4.4 Advanced performance models

4.5 Network messages

4.6 Further explorations

4.6.1 Additional reading

4.6.2 Exercises

4.7 Summary

5 Parallel algorithms and patterns

5.1 Algorithm analysis for parallel computing applications

5.2 Parallel algorithms: what are they?

5.3 What is a hash function?

5.4 Spatial hashing: a highly parallel algorithm

5.4.1 Using perfect hashing for spatial mesh operations

5.4.2 Using compact hashing for spatial mesh operations

5.5 Prefix sum (scan) pattern and its importance in parallel computing

5.5.1 Step-efficient parallel scan operation

5.5.2 Work-efficient parallel scan operation

5.5.3 Parallel scan operations for large arrays

5.6 Parallel global sum: addressing the problem of associativity

5.7 Future of parallel algorithm research

5.8 Further explorations

5.8.1 Additional reading

5.8.2 Exercises

5.9 Summary


6 Vectorization: FLOPs for free

6.1 Vectorization and Single-Instruction, Multiple-Data (SIMD) overview

6.3 Vectorization methods

6.3.1 Optimized libraries provide performance for little effort

6.3.2 Auto-vectorization: the easy way to vectorization speed-up (most of the time)

6.3.3 Teaching the compiler through hints: pragmas and directives

6.3.4 Crappy loops, we got them: use vector intrinsics

6.3.5 Not for the faint of heart: using assembler code for vectorization

6.4 Programming style for better vectorization

6.5 Compiler flags relevant for vectorization for various compilers

6.6 OpenMP SIMD directives for better portability

6.7 Further explorations

6.7.1 Additional reading

6.7.2 Exercises

6.8 Summary

7 OpenMP that performs

7.1 OpenMP introduction

7.1.1 OpenMP concepts

7.1.2 A very simple OpenMP program

7.2 Typical OpenMP use cases: loop-level, high-level, and MPI+OpenMP

7.2.1 Loop-level OpenMP for quick parallelization

7.2.2 High-level OpenMP for better parallel performance

7.2.3 MPI + OpenMP for extreme scalability

7.3 Examples of standard loop-level OpenMP

7.3.1 Loop-level OpenMP: vector addition example

7.3.2 Stream triad example

7.3.3 Loop-level OpenMP: stencil example

7.3.4 Performance of loop-level examples

7.3.5 Reduction example of a global sum using OpenMP threading

7.3.6 Potential loop-level OpenMP issues

7.4 Variable scope is critically important in OpenMP for correctness

7.5 Function-level OpenMP: making a whole function thread parallel

7.6 Improving parallel scalability with high-level OpenMP

7.6.1 How to implement high-level OpenMP

7.6.2 Example of implementing high-level OpenMP

7.7 Hybrid threading and vectorization with OpenMP

7.8 Advanced examples using OpenMP

7.8.1 Stencil example with a separate pass for the x and y directions

7.8.2 Kahan summation implementation with OpenMP threading

7.8.3 Threaded implementation of the prefix scan algorithm

7.9 Threading tools essential for robust implementations

7.9.1 Using Allinea MAP to get a quick high-level profile of your application

7.9.2 Finding your thread race conditions with Intel Inspector

7.10 Example of a task-based support algorithm

7.11 Further explorations

7.11.1 Additional reading

7.11.2 Exercises

7.12 Summary

8 MPI: the parallel backbone

8.1 The basics for an MPI program

8.1.1 Basic MPI function calls for every MPI program

8.1.2 Compiler wrappers for simpler MPI programs

8.1.3 Using parallel startup commands

8.1.4 Minimum working example of an MPI program

8.2 The send and receive commands for process-to-process communication

8.3 Collective communication: a powerful component of MPI

8.3.1 Using a barrier to synchronize timers

8.3.2 Using the broadcast to handle small file input

8.3.3 Using a reduction to get a single value from across all processes

8.3.4 Using gather to put order in debug printouts

8.3.5 Using scatter and gather to send data out to processes for work

8.4 Data parallel examples

8.4.1 Stream triad to measure bandwidth on the node

8.4.2 Ghost cell exchanges in a two-dimensional mesh

8.4.3 Ghost cell exchanges in a three-dimensional stencil calculation

8.5 Advanced MPI functionality to simplify code and enable optimizations

8.5.1 Using custom MPI datatypes for performance and code simplification

8.5.2 Cartesian topology support in MPI

8.5.3 Performance tests of ghost cell exchange variants

8.6 Hybrid MPI+OpenMP for extreme scalability

8.6.1 Hybrid MPI+OpenMP benefits

8.6.2 MPI+OpenMP example

8.7 Further explorations

8.7.1 Additional reading

8.7.2 Exercises

8.8 Summary


9 GPU architectures and concepts

9.1 The CPU-GPU system as an accelerated computational platform

9.1.1 Integrated GPUs: an underutilized option on commodity-based systems

9.1.2 Dedicated GPUs: the workhorse option

9.2 The GPU and the thread engine

9.2.1 The compute unit is the multiprocessor

9.2.2 Processing elements are the individual processors

9.2.3 Multiple data operations by each processing element

9.2.4 Calculating the peak theoretical flops for some leading GPUs

9.3 Characteristics of GPU memory spaces

9.3.1 Calculating theoretical peak memory bandwidth

9.3.2 Measuring the GPU stream benchmark

9.3.3 Roofline performance model for GPUs

9.3.4 Using the mixbench performance tool to choose the best GPU for a workload

9.4 The PCI bus: CPU-GPU data transfer overhead

9.4.1 Theoretical bandwidth of the PCI bus

9.4.2 A benchmark application for PCI bandwidth

9.5 Multi-GPU platforms and MPI

9.5.1 Optimizing the data movement from one GPU across a network to another

9.5.2 A higher-performance alternative to the PCI bus

9.6 Potential benefits of GPU accelerated platforms

9.6.1 Reducing time-to-solution

9.6.2 Reducing energy use with GPUs

9.6.3 Reduction in cloud computing costs with GPUs

9.7 When to use GPUs

9.8 Further explorations

9.8.1 Additional reading

9.8.2 Exercises

9.9 Summary

10 GPU programming model

10.1 GPU programming abstractions: a common framework

10.1.1 Data decomposition into independent units of work: an NDRange or grid

10.1.2 Work groups provide a right-sized chunk of work

10.1.3 Subgroups, warps or wavefronts execute in lockstep

10.1.4 Work item: the basic unit of operation

10.1.5 SIMD or vector hardware

10.2 The code structure for the GPU programming model

10.2.1 "Me" programming: the concept of a parallel kernel

10.2.2 Thread indices: mapping the local tile to the global world

10.2.3 Index sets

10.2.4 How to address memory resources in your GPU programming model

10.3 Optimizing GPU resource usage

10.3.1 How many registers does my kernel use?

10.3.2 Occupancy: making more work available for work-group scheduling

10.4 Reduction pattern requires synchronization across work groups

10.5 Asynchronous computing through queues (streams)

10.6 Developing a plan to parallelize an application for GPUs

10.6.1 Case 1: 3D atmospheric simulation

10.6.2 Case 2: Unstructured mesh application

10.7 Further explorations

10.7.1 Additional reading

10.7.2 Exercises

10.8 Summary

11 Directive-based GPU programming

11.1 Process to apply directives and pragmas for a GPU implementation

11.2 OpenACC: the easiest way to run on your GPU

11.2.1 Compiling OpenACC code

11.2.2 Parallel compute regions in OpenACC for accelerating computations

11.2.3 Using directives to reduce data movement between the CPU and the GPU

11.2.4 Optimizing the GPU kernels

11.2.5 Summary of performance results for stream triad

11.2.6 Advanced OpenACC techniques

11.3 OpenMP: the heavyweight champ enters the world of accelerators

11.3.1 Compiling OpenMP code

11.3.2 Generating parallel work on the GPU with OpenMP

11.3.3 Creating data regions to control data movement to the GPU with OpenMP

11.3.4 Optimizing OpenMP for GPUs

11.3.5 Advanced OpenMP for GPUs

11.4 Further explorations

11.4.1 Additional reading

11.4.2 Exercises

11.5 Summary

12 GPU languages: getting down to basics

12.1 Features of a native GPU programming language

12.2 CUDA and HIP GPU languages: the low-level performance option

12.2.1 Writing and building your first CUDA application

12.2.2 A reduction kernel in CUDA: life gets complicated

12.2.3 Hipifying the CUDA code

12.3 OpenCL for a portable open source GPU language

12.3.1 Writing and building your first OpenCL application

12.3.2 Reductions in OpenCL

12.4 SYCL: an experimental C++ implementation goes mainstream

12.5 Higher-level languages for performance portability

12.5.1 Kokkos: a performance portability ecosystem

12.5.2 RAJA for a more adaptable performance portability layer

12.6 Further explorations

12.6.1 Additional reading

12.6.2 Exercises

12.7 Summary

13 GPU profiling and tools

13.1 Profiling tools overview

13.2 How to select a good workflow

13.3 Example problem: shallow water simulation

13.4 A sample of a profiling workflow

13.4.1 Run the shallow water code

13.4.2 Profile the CPU code

13.4.3 Add OpenACC compute directives

13.4.4 Add data movement directives

13.4.5 The Nvidia Nsight suite of tools can be a powerful development aid

13.4.6 CodeXL for the AMD GPU ecosystem

13.5 Don’t get lost in the swamp: focus on the important metrics

13.5.1 Occupancy

13.5.2 Issue efficiency

13.5.3 Achieved bandwidth

13.6 Containers and virtual machines provide alternate workflows

13.6.1 Docker containers as a workaround

13.6.2 Virtual machines using VirtualBox

13.7 Cloud options: a flexible and portable capability

13.8 Further explorations

13.8.1 Additional reading

13.8.2 Exercises

13.9 Summary

Part 4: High Performance Computing Ecosystem

14 Affinity: truce with the kernel

14.1 Why is affinity important?

14.2 Discovering your architecture

14.3 Thread affinity with OpenMP

14.4 Process affinity with MPI

14.4.1 Default process placement with OpenMPI

14.4.2 Taking control: Basic techniques for specifying process placement in OpenMPI

14.4.3 Affinity is more than just process binding: the full picture

14.5 Affinity for MPI plus OpenMP

14.6 Controlling affinity from the command line

14.6.1 Using hwloc-bind to assign affinity

14.6.2 likwid-pin: an affinity tool in the likwid tool suite

14.7 The future: setting and changing affinity at runtime

14.7.1 Setting affinities in your executable

14.7.2 Changing your process affinities during runtime

14.8 Further explorations

14.8.1 Additional reading

14.8.2 Exercises

14.9 Summary

15 Batch schedulers: bringing order to chaos

16 File operations for a parallel world

17 Tools and resources


Appendix A: References

A.1 Chapter 1

A.2 Chapter 2

A.3 Chapter 3

A.4 Chapter 4

A.5 Chapter 5

Appendix B: Solutions to Exercises

B.1 Chapter 1

B.2 Chapter 2

B.3 Chapter 3

B.4 Chapter 4

B.5 Chapter 5

B.6 Chapter 6

B.7 Chapter 7

B.8 Chapter 8

B.9 Chapter 9

B.10 Chapter 10

B.11 Chapter 11

Appendix C: Glossary

What's inside

  • Steps for planning a new parallel project
  • Choosing the right data structures and algorithms
  • Addressing underperforming kernels and loops
  • The differences in CPU and GPU architecture

About the reader

For experienced programmers with proficiency in a high performance computing language such as C, C++, or Fortran.

About the authors

Robert Robey has been active in the field of parallel computing for over 30 years. He works at Los Alamos National Laboratory and previously worked at the University of New Mexico, where he started the Albuquerque High Performance Computing Center. Yuliana Zamora has lectured on efficient programming of modern hardware at national conferences, based on her work developing applications running on tens of thousands of processing cores and the latest GPU architectures.
