Stanford CS149, Fall 2021
PARALLEL COMPUTING

From smartphones, to multi-core CPUs and GPUs, to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach the parallel programming techniques necessary to use these machines effectively. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course will cover both parallel hardware and software design.

Basic Info
Tues/Thurs 3:15-4:45pm
All lectures are virtual
Instructors: Kayvon Fatahalian and Kunle Olukotun
See the course info page for more info on policies and logistics.
Fall 2021 Schedule
Sep 21
Challenges of parallelizing code, motivations for parallel chips, processor basics
Sep 23
Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth
Sep 28
Ways of thinking about parallel programs and their corresponding hardware implementations, ISPC programming
Sep 30
Thought process of parallelizing a program in data parallel and shared address space models
Oct 05
Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing
Oct 07
Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
Oct 12
CUDA programming abstractions and how they are implemented on modern GPUs
Oct 14
Data-parallel operations like map, reduce, scan, prefix sum, groupByKey
Oct 19
Producer-consumer locality, RDD abstraction, Spark implementation and scheduling
Oct 21
Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing
Oct 26
Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics
Oct 28
Implementation of locks, fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers
Nov 02
No class (Stanford Election Day Holiday)
Go vote!
Nov 04
Motivation for transactions, design space of transactional memory implementations
Nov 09
Finishing up transactional memory, focusing on implementations of STM and HTM
Nov 11
Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs
Nov 16
Performance/productivity motivations for DSLs, case study on Halide image processing DSL
Nov 18
Domain-specific frameworks for graph processing, streaming graph processing, graph compression, DRAM basics
Nov 30
Programming for Hardware Specialization
Programming reconfigurable hardware like FPGAs and CGRAs
Dec 02
Efficiently Evaluating DNNs (+ Course Wrap Up)
Scheduling conv layers, exploiting precision and sparsity, DNN accelerators (e.g., GPU TensorCores, TPU)
Programming Assignments
Oct 1 Assignment 1: Analyzing Parallel Program Performance on a Quad-Core CPU
Oct 19 Assignment 2: Scheduling Task Graphs
Nov 5 Assignment 3: A Simple Renderer in CUDA
Nov 29 Assignment 4: Big Graph Processing in OpenMP
Dec 3 Extra Credit: Implement Matrix Multiplication as Fast as You Can
Written Assignments
Sep 30 Written Assignment 1
Oct 7 Written Assignment 2
Oct 26 Written Assignment 3
Nov 10 Written Assignment 4
Nov 30 Written Assignment 5