Course Overview

Large datasets pose difficulties across the machine learning pipeline. They are difficult to visualize, and it can be hard to determine what sorts of errors and biases may be present in them. They are computationally expensive to process, and the cost of learning is often hard to predict. For instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.

This course is intended to provide students with practical knowledge of, and experience with, the issues involved in working with large datasets. Among the topics considered are: data cleaning, visualization, and pre-processing at scale; principles of parallel and distributed computing for machine learning; techniques for scalable deep learning; analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity; and methods for low-latency inference. Students will gain experience with common large-scale computing libraries and infrastructure, including Apache Spark and TensorFlow.
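As a taste of the distributed computing principles covered early in the course, the classic MapReduce word-count example can be sketched in plain Python. This is a single-machine simulation for illustration only (the function names are ours, not a real framework's API); in a real system, the map, shuffle, and reduce phases would each run in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    return chain.from_iterable(
        ((word, 1) for word in doc.split()) for doc in documents
    )

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework would
    # before handing each key's values to a reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # -> 3
```

In Spark the same computation is a two-line chain of transformations (a flatMap followed by a reduceByKey), which is one reason the course uses it as its primary large-scale computing tool.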


Students are required to have taken a CMU introductory machine learning course (10-401, 10-601, 10-701, or 10-715). A strong background in programming will also be necessary; suggested prerequisites include 15-210, 15-214, or equivalent. Students are expected to be familiar with Python or to learn it during the course.


There will be no required textbooks, though we may suggest additional reading in the schedule below.


We will use Piazza for class discussions. Please go to this Piazza website to join the course forum (note: you must use an email account to join the forum). We strongly encourage students to post on this forum rather than emailing the course staff directly; this is more efficient for both students and staff. Students should use Piazza for course-related questions and discussions.

The course Academic Integrity Policy must be followed on the message boards at all times. Do not post or request homework solutions! Also, please be polite.

Course Staff

Teaching Assistants

Kushagr Arora
OH: Thursdays 3-4pm

Saket Chaudhary
OH: Fridays 3-4pm

Jiayong Hu
OH: Tuesdays 1:30-2:30pm

Anwen Huang
OH: Mondays 2-4pm

Tian Li
OH: Wednesdays 11am-12pm

Daniel Mo
OH: Mondays 4-6pm

Zach (Zeyu) Peng
OH: Mondays 12-2pm

Kuo Tian
OH: Fridays 1-3pm

Grading Policy

Grades will be based on the following components:

Academic Integrity Policy

Group studying and collaborating on problem sets are encouraged; working together is a great way to understand new material. Students are free to discuss the homework problems with anyone, under the conditions listed in the course policy.

Students are encouraged to read CMU's Policy on Cheating and Plagiarism.

A Note on Self Care

Please take care of yourself. Do your best to maintain a healthy lifestyle this semester by eating well, exercising, avoiding drugs and alcohol, getting enough sleep, and taking some time to relax. This will help you achieve your goals and cope with stress. All of us benefit from support during times of struggle. You are not alone. Besides the instructors, who are here to help you succeed, there are many helpful resources available on campus, and learning how to ask for help is an important part of the college experience. Asking for support sooner rather than later is often helpful.

If you or anyone you know experiences academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 412-268-2922 or visit their website. Consider reaching out to a friend, faculty member, or family member you trust for help getting connected to the support that can help.


This course is based in part on material developed by William Cohen, Barnabas Poczos, Ameet Talwalkar, and Anthony Joseph.

Schedule (Subject to Change)

Date Topics Resources HW
1/13 Introduction MLSys: The New Frontier of ML Systems
1/15 Distributed Computing, MapReduce
1/17 Recitation: Spark topology basics + setup with Databricks
(Slides (Tian), Slides (Heather), Lab0)
1/20 No class (MLK Day)
1/22 Intro to Spark HW1 released
1/24 Recitation: Spark Transformations and Actions
(Lab2 (Notebook), Slides)
1/27 Data Cleaning
Spark: Joins, Structure, and DataFrames
1/29 Data Visualization Visualization for ML
A Tutorial on PCA
1/31 Recitation: Spark RDDs and DataFrames
(Lab3 (Notebook), Slides are those from Monday's lecture)
2/3 ML Review Deep Learning, Ch. 5.2-5.4
Math for ML (review)
HW1 due
2/5 Distributed Linear Regression, Part I HW2 released
2/7 Recitation: Linear Algebra Review
2/10 Distributed Linear Regression, Part II
2/12 Adv. Distributed Optimization
2/14 Recitation: Learning Rate Optimization
(Lab4 (Notebook), Slides)
2/17 Distributed Logistic Regression HW2 due
2/19 Partitioning and Locality HW3 released
2/21 Recitation: Probability Review
2/24 Large-Scale Data Structures
2/26 PCA
3/2 Project Proposals HW3 due
3/4 In-Class Midterm
3/9 No class (Spring Break)
3/11 No class (Spring Break)
3/16 All CMU classes cancelled
3/18 Deep Learning Deep Learning, Ch. 6
William Cohen's Autodiff Notes
3/23 ML Frameworks + TensorFlow
3/25 ML Hardware + TensorFlow Performant, scalable models in TensorFlow 2 with tf.function & tf.distribute (TF World '19) [Video]
3/30 Optimization for DL Deep Learning, Ch. 8
4/1 Efficient Hyperparameter Tuning
Guest lecture: Liam Li
4/6 Parallel/Distributed DL
4/8 Low Latency Inference
4/13 TVM & DL Compilers
Guest lecture: Tianqi Chen
4/15 Productionizing Large-Scale ML
4/20 Federated Learning
4/22 Project Poster Session
4/27 TBA
4/29 In-Class Final