Course Overview

Large datasets pose difficulties across the machine learning pipeline. They are difficult to visualize, and it can be hard to determine what sorts of errors and biases may be present in them. They are computationally expensive to process, and the cost of learning is often hard to predict---for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.
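To make the memory point concrete, here is a minimal, illustrative sketch (not course material) contrasting an in-memory computation with a one-pass streaming version; only the latter remains feasible once the data no longer fits in RAM. The file format and function names here are invented for the example.

```python
import os
import tempfile

def mean_in_memory(path):
    # Loads every value into a list first: memory grows with the dataset size.
    with open(path) as f:
        values = [float(line) for line in f]
    return sum(values) / len(values)

def mean_streaming(path):
    # One pass with O(1) memory: keep only a running total and a count.
    total, count = 0.0, 0
    with open(path) as f:
        for line in f:
            total += float(line)
            count += 1
    return total / count

# Tiny demo file with the values 1..100 (mean = 50.5); for data larger
# than RAM, only the streaming version remains practical.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(str(i) for i in range(1, 101)))
    path = f.name

m_inmem = mean_in_memory(path)
m_stream = mean_streaming(path)
os.remove(path)
```

Both functions return the same answer on this toy file; the difference is that the in-memory version's footprint scales linearly with the input, which is exactly the kind of cost analysis this course addresses.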

This course is intended to provide students with practical knowledge of, and experience with, the issues involved in working with large datasets. Among the topics considered are: data cleaning, visualization, and pre-processing at scale; principles of parallel and distributed computing for machine learning; techniques for scalable deep learning; analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity; and methods for low-latency inference. Students will gain experience with common large-scale computing libraries and infrastructure, including Apache Spark and TensorFlow.
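As one small taste of the scalability techniques listed above, the "hashing trick" (which appears later in the schedule alongside logistic regression) maps an unbounded feature space into a fixed-size vector without storing a vocabulary. This is an illustrative pure-Python sketch, not course code; the dimension and signing scheme are chosen arbitrarily.

```python
import hashlib

def hash_features(tokens, dim=32):
    # Map arbitrarily many string features into a fixed-size vector.
    # Using md5 (rather than Python's built-in hash) keeps the result
    # deterministic across runs.
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        index = h % dim
        # A signed update (one hash bit picks +1/-1) reduces collision bias.
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0
        vec[index] += sign
    return vec

v = hash_features("large datasets need fixed size features".split())
```

Because the vector length is fixed regardless of how many distinct features appear, memory usage stays bounded even on web-scale text, at the cost of occasional hash collisions.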


Students are required to have taken a CMU introductory machine learning course (10-301, 10-315, 10-601, 10-701, or 10-715). A strong background in programming will also be necessary; suggested prerequisites include 15-210, 15-214, or equivalent. Students are expected to be familiar with Python or learn it during the course.


There will be no required textbooks, though we may suggest additional reading in the schedule below.


We will use Piazza for class discussions. Please go to this Piazza website to join the course forum (note: you must use an email account to join). We strongly encourage students to post on this forum rather than emailing the course staff directly (this is more efficient for both students and staff). Students should use Piazza for all course-related questions and discussion.

The course Academic Integrity Policy must be followed on the message boards at all times. Do not post or request homework solutions! Also, please be polite.

Course Staff

Manik Bhandari
OH: Tues 3pm-4pm,

Amala Deshmukh
OH: Mon 10am-11am,

Tuhina Gupta
OH: Fri 11am-12pm,

Vivek Gupta
OH: Mon 1pm-2pm,

Anmol Jagetia
OH: Tues 11am-12pm,

Alec Jasen
OH: Fri 3pm-4pm,

Ignacio Maronna
OH: Wed 12pm-1pm,

Alex Schneidman
OH: Thurs 10am-11am,

Baljit Singh
OH: Wed 11am-12pm,

Karun Thankachan
OH: Thurs 4pm-5pm,

Grading Policy

Grades will be based on the following components: problem sets, two exams, and a mini-project.

10-605 vs. 10-805: All assignments, grading, and expectations will be the same for 10-605 and 10-805---except for the mini-project. Students enrolled in 10-805 will be expected to complete a more involved mini-project, requiring roughly twice the work of the mini-project for 10-605.

Gradescope: We will use Gradescope to collect PDF submissions of each problem set. Upon uploading your PDF, Gradescope will ask you to identify which page(s) contain your solution for each problem---this is a great way to double-check that you haven't left anything out.

Regrade Requests: If you believe an error was made during grading, you’ll be able to submit a regrade request on Gradescope. ***For each homework, regrade requests will be open for only 1 week after the grades have been published.*** This is to encourage you to check the feedback you’ve received early!

Academic Integrity Policy

Group studying and collaborating on problem sets are encouraged; working together is a great way to understand new material. Students are free to discuss the homework problems with anyone, provided they follow the course Academic Integrity Policy. Students are encouraged to read CMU's Policy on Cheating and Plagiarism.

A Note on Self Care

Please take care of yourself. Do your best to maintain a healthy lifestyle this semester by eating well, exercising, getting enough sleep, and taking some time to relax. This will help you achieve your goals and cope with stress. All of us benefit from support during times of struggle. You are not alone. Besides the instructors, who are here to help you succeed, there are many helpful resources available on campus, and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.

If you or anyone you know experiences academic stress, difficult life events, or feelings of anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 412-268-2922 or visit their website. Consider reaching out to a friend, faculty member, or family member you trust for help getting connected to the support that can help.


This course is based in part on material developed by Heather Miller, William Cohen, Anthony Joseph, and Barnabas Poczos.

Previous course: 10-405/10-605, Spring 2020.

Schedule (Subject to Change)

Date Lecture Resources Announcements
Data Pre-Processing and Visualization, Distributed Computing
Sep 1 Introduction (slides, video)
Sep 3 Distributed Computing, Spark (slides, video) HW1 released
Sep 4 Recitation: Intro to Databricks, Spark (slides, video) Lab0
Sep 8 Visualization, PCA (slides, video) Tutorial on PCA; JL Theorem
Sep 10 Nonlinear Dimensionality Reduction (slides, video) t-SNE
Sep 11 Recitation: Spark cont. (video) Lab1
Basics of Large-Scale / Distributed Machine Learning
Sep 15 Distributed Linear Regression, part I (slides, video) HW1 due; HW2 released
Sep 17 Distributed Linear Regression, part II (slides, video)
Sep 18 Recitation: Linear Algebra Review (slides, video1, video2) Lab2
Sep 22 Kernel Approximations (slides, video)
Sep 24 Snorkel: Programming Training Data (video), Guest Lecture: Paroma Varma; Snorkel blog
Sep 25 Recitation: HW Review (slides, video) Lab2
Sep 29 Logistic Regression, Hashing (slides, video) Hash kernels (I, II) HW2 due; HW3 released
Oct 1 Randomized Algorithms (slides, video) Count-min sketch
Oct 2 Recitation: Probability Review (slides, video)
Oct 6 Practice Exam
Oct 8 Distributed Trees (slides, video) HW3 due
Oct 9 Recitation: HW Review (video)
Oct 13 Exam I
Scalable Deep Learning: Training, Tuning, and Inference
Oct 15 Deep Learning, Autodiff (slides, video) Deep Learning, Ch. 6; TensorFlow Quickstart; Demo (video)
Oct 20 DL Frameworks, Hardware (slides, video) Mini-projects released
Oct 22 Large-Scale Optimization (slides, video) Optimization for Large-Scale ML HW4 released
Oct 27 Optimization for DL (slides, video) Deep Learning, Ch. 8 Mini-projects groups due
Oct 29 Parallel/Distributed DL (slides, video)
Oct 30 Recitation: Cloud Services, Learning Rates (slides, video)
Nov 3 No Class (Election Day)
Nov 5 Hyperparameter Tuning (slides, video) blog1, blog2 HW4 due; HW5 released
Nov 6 Recitation: HW Review (video)
Nov 10 Neural Architecture Search (slides, video) blog, RSWS Mini-project proposals due
Nov 12 Inference, Model Compression (slides, video)
Nov 13 Recitation: Mini-project Review (video) slides
Advanced Topics & Guest Lectures
Nov 17 Research Challenges in Large-Scale ML Systems at Facebook (video), Guest Lecture: Kim Hazelwood
Nov 19 Productionizing Large-Scale ML, Guest Lecture: Angela Jiang (video) HW5 due
Nov 20 Recitation: HW Review (video)
Nov 24 Mini-project Check-ins
Nov 26 No Class (Thanksgiving)
Dec 1 Federated Learning (slides, video)
Dec 3 ModelDB, Guest Lecture: Manasi Vartak (video) Mini-projects due
Dec 8 Course Summary (slides, course review, exam II review)
Dec 10 Exam II