Stanford CS149, Winter 2019

From smartphones, to multi-core CPUs and GPUs, to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach the parallel programming techniques necessary to effectively utilize these machines. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course will cover both parallel hardware and software design.

Basic Info
Tues/Thurs 4:30-6:00pm
Room 420-040
Instructors: Kunle Olukotun and Kayvon Fatahalian
See the course info page for more info on course policies and logistics.
Winter 2019 Schedule
Jan 8
Motivations for parallel chip decisions, challenges of parallelizing code
Jan 10
Forms of parallelism: multicore, SIMD, threading + understanding latency and bandwidth
Jan 15
Ways of thinking about parallel programs and their corresponding hardware implementations, ISPC programming
Jan 17
Thought process of parallelizing a program in data parallel and shared address space models
Jan 22
Achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing
Jan 24
Program Optimization 2: Locality and Communication
Message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
Jan 29
GPU architecture and CUDA Programming
CUDA programming abstractions, and how they are implemented on modern GPUs
Jan 31
Data-Parallel Programming
Data parallel thinking: map, reduce, scan, prefix sum, groupByKey
Feb 5
Snooping-Based Cache Coherence
Definition of memory coherence, invalidation-based coherence using MSI and MESI, false sharing
Feb 7
Memory Consistency
Consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics
Feb 12
Midterm Exam
Feb 14
Implementing Synchronization
Machine-level atomic operations, implementing locks, implementing barriers, deadlock/livelock/starvation
Feb 19
Fine-Grained Synchronization and Lock-Free Programming
Fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers
Feb 21
Transactional memory
Motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM
Feb 26
Distributed Computing using Spark
Producer-consumer locality, RDD abstraction, Spark implementation and scheduling
Feb 28
Heterogeneous Parallelism and Hardware Specialization
Energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, mobile SoCs
Mar 5
Domain-Specific Programming Systems
Motivation for DSLs, case studies on Halide and TBD DSL
Mar 7
Domain-Specific Programming for Parallel Graph Processing
GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression
Mar 12
Applications Talk TBD
Topic TBD: scaling a web site, DNN performance tuning, etc.
Mar 14
Course Wrap Up
Have a great spring break!
Programming Assignments
Jan 18 Assignment 1: Analyzing Parallel Program Performance on a Quad-Core CPU
Jan 29 Assignment 2: A Multi-Threaded Web Server
Feb 11 Assignment 3: Parallel Graph Processing
Feb 26 Assignment 4: A Simple Renderer in CUDA
Mar 7 Assignment 5: Big Data Processing in Spark