Course Overview

Large datasets pose difficulties across the machine learning pipeline. They are difficult to visualize, and it can be hard to determine what sorts of errors and biases they may contain. They are computationally expensive to process, and the cost of learning is often hard to predict: for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.
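The in-memory versus out-of-core distinction above can be made concrete with a small sketch (plain Python, synthetic data; the function names are our own, not part of any course material): a streaming computation keeps constant state per pass, so it still works when the dataset no longer fits in memory.

```python
def mean_in_memory(values):
    """Materializes the whole dataset at once: O(n) memory."""
    data = list(values)
    return sum(data) / len(data)

def mean_streaming(values):
    """One pass with a running sum and count: O(1) memory.

    `values` can be a generator reading from a file far larger than RAM.
    """
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total / count

if __name__ == "__main__":
    stream = (float(i) for i in range(1_000_000))  # stands in for a large dataset
    print(mean_streaming(stream))  # 499999.5
```

Both functions compute the same answer; the difference is that the streaming version never holds more than two numbers of state, which is the kind of memory/disk analysis this course takes up.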

This course is intended to provide students with practical knowledge of, and experience with, the issues that arise when working with large datasets. Among the topics considered are: data cleaning, visualization, and pre-processing at scale; principles of parallel and distributed computing for machine learning; techniques for scalable deep learning; analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity; and methods for low-latency inference. Students will gain experience with common large-scale computing libraries and infrastructure, including Apache Spark and TensorFlow.
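As a taste of the parallel and distributed computing portion of the course, the MapReduce pattern (which students will use in practice through Spark) can be sketched on a single machine. This toy word count, with made-up documents and helper names of our own choosing, mirrors the map, shuffle, and reduce phases that a real framework distributes across workers:

```python
from collections import defaultdict

def map_phase(doc):
    # Mapper: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: combine each key's values; for word count, just sum them.
    return {key: sum(values) for key, values in groups.items()}

docs = ["large datasets large models", "scalable machine learning"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["large"])  # 2
```

In a real system each phase runs in parallel on many machines, and the shuffle involves network communication, which is where the communication-complexity analysis mentioned above comes in.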


Students are required to have taken a CMU introductory machine learning course (10-401, 10-601, 10-701, or 10-715). A strong background in programming will also be necessary; suggested prerequisites include 15-210, 15-214, or equivalent. Students are expected to be familiar with Python or to learn it during the course.


There will be no required textbooks, though we may suggest additional reading in the schedule below.


We will use Piazza for class discussions. Please go to this Piazza website to join the course forum (note: you must use an email account to join the forum). We strongly encourage students to post on this forum rather than emailing the course staff directly, as this is more efficient for both students and staff. Students should use Piazza to:

The course Academic Integrity Policy must be followed on the message boards at all times. Do not post or request homework solutions! Also, please be polite.

Course Staff

Teaching Assistants

Kushagr Arora
Office Hours: Thursdays 3-4pm, GHC 5th floor commons

Saket Chaudhary
Office Hours: Fridays 3-5pm, GHC 5th floor commons

Jiayong Hu
Office Hours: Tuesdays 1:30-2:30pm, GHC 5th floor commons

Anwen Huang
Office Hours: Mondays 2-4pm, GHC 5th floor commons

Tian Li
Office Hours: Wednesdays 11am-12pm, GHC 5th floor commons

Daniel Mo
Office Hours: Mondays 4-6pm, GHC 5th floor commons

Zach (Zeyu) Peng
Office Hours: Mondays 12-2pm, Location TBA

Kuo Tian
Office Hours: Fridays 1-3pm, GHC 5th floor kitchen area

Grading Policy

Grades will be based on the following components:

Academic Integrity Policy

Group studying and collaborating on problem sets are encouraged; working together is a great way to understand new material. Students are free to discuss the homework problems with anyone under the following conditions:

Students are encouraged to read CMU's Policy on Cheating and Plagiarism.

A Note on Self Care

Please take care of yourself. Do your best to maintain a healthy lifestyle this semester by eating well, exercising, avoiding drugs and alcohol, getting enough sleep, and taking some time to relax. This will help you achieve your goals and cope with stress. All of us benefit from support during times of struggle. You are not alone. Besides the instructors, who are here to help you succeed, there are many helpful resources available on campus, and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.

If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 412-268-2922 or visit their website. Consider reaching out to a friend, faculty member, or family member you trust for help getting connected to the support that can help.


This course is based in part on material developed by William Cohen, Barnabas Poczos, Ameet Talwalkar, and Anthony Joseph.

Schedule (Subject to Change)

Date Topics Reading HW
1/13 Introduction MLSys: The New Frontier of ML Systems
1/15 Distributed Computing, MapReduce
1/17 Recitation: Spark topology basics + setup with Databricks (Slides (Tian), Slides (Heather), Lab0)
1/20 No class (MLK Day)
1/22 Intro to Spark HW1 released
1/27 Data cleaning
1/29 Data summarization, exploration, visualization
2/3 ML Review HW1 due
2/5 Distributed Linear Regression, Part I
2/10 Distributed Linear Regression, Part II
2/12 Adv. Distributed Optimization
2/17 Distributed Logistic Regression
2/19 Large-Scale Data Structures
2/24 PCA, Part I
2/26 PCA, Part II
3/2 TBA
3/4 In-Class Midterm
3/9 No class (Spring Break)
3/11 No class (Spring Break)
3/16 Deep Learning
3/18 TVM & DL Compilers (Guest lecture: Tianqi Chen)
3/23 Intro to TensorFlow
3/25 Computing with GPUs
3/30 Efficient Hyperparameter Tuning (Guest lecture: Liam Li)
4/1 Advanced DL, Part I
4/6 Advanced DL, Part II
4/8 Low Latency Inference, Part I
4/13 Low Latency Inference, Part II
4/15 Productionizing Large-Scale ML
4/20 Federated Learning
4/22 Project Poster Session
4/27 TBA
4/29 In-Class Final