CMU 10-405/10-605

Course Overview

Large datasets pose difficulties across the machine learning pipeline. They are difficult to visualize and introduce computational, storage, and communication bottlenecks during data preprocessing and model training. Moreover, high capacity models often used in conjunction with large datasets introduce additional computational and storage hurdles during model training and inference. This course is intended to provide a student with the mathematical, algorithmic, and practical knowledge of issues involving learning with large datasets. Among the topics considered are: data cleaning, visualization, and pre-processing at scale; principles of parallel and distributed computing for machine learning; techniques for scalable deep learning; analysis of programs in terms of memory, computation, and (for parallel methods) communication complexity; and methods for low-latency inference.

Prerequisites

Students are required to have taken a CMU introductory machine learning course (10-301, 10-315, 10-601, 10-701, or 10-715). A strong background in programming will also be necessary; suggested prerequisites include 15-210, 15-214, or equivalent. Students are expected to be familiar with Python or learn it during the course.

Textbooks

There will be no required textbooks, though we may suggest additional reading in the schedule below.

Course Components

The requirements of this course consist of homework assignments, a mini-project (10-605 only), and two exams. Attending lectures is not mandatory but is highly encouraged. The grading breakdown is as follows:

22.5% Exam 1
22.5% Exam 2
40% Homework (5 Assignments, 8% Each)
15% Mini-Project (10-605 only)
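
As a rough illustration of how the weights above combine for a 10-605 student, the following sketch computes a weighted course percentage. The function name `course_grade` and the assumption that every score is a percentage between 0 and 100 are hypothetical choices for this example; it ignores late penalties, curving, and any other adjustment, and it does not cover 10-405, whose weighting is not listed above.

```python
# Weights copied from the grading breakdown above (10-605 version).
WEIGHTS = {
    "exam1": 0.225,
    "exam2": 0.225,
    "homework": 0.40,      # 5 assignments, 8% each
    "mini_project": 0.15,
}


def course_grade(exam1, exam2, homeworks, mini_project):
    """Return a weighted course percentage (0-100) for a 10-605 student.

    Assumption: all scores are percentages in [0, 100], and `homeworks`
    holds the five equally weighted homework scores.
    """
    if len(homeworks) != 5:
        raise ValueError("expected exactly 5 homework scores")
    homework_avg = sum(homeworks) / len(homeworks)
    return (
        WEIGHTS["exam1"] * exam1
        + WEIGHTS["exam2"] * exam2
        + WEIGHTS["homework"] * homework_avg
        + WEIGHTS["mini_project"] * mini_project
    )


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(course_grade(80, 90, [100, 90, 80, 70, 60], 100))
```

Because each homework carries the same 8% weight, averaging the five scores and applying the 40% weight is equivalent to weighting each assignment individually.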

10-405, 10-605 Differences

10-605 students will perform a mini-project, while 10-405 students will not.

Exams

You are required to attend all in-person exams. The exams will be given during class. Please plan your travel accordingly, as we will not be able to accommodate individual travel needs (e.g. by offering the exam early).

If you have an unavoidable conflict with an exam (e.g. an exam in another course), notify us by filling out the exam conflict form, which will be released on Piazza a few weeks before the exam.

Homework

The homeworks will be divided into two components: programming and written. The programming assignments will ask you to implement ML algorithms from scratch; they emphasize understanding of real-world applications of ML, building end-to-end systems, and experimental design. The written assignments will focus on core concepts, “on-paper” implementations of classic learning algorithms, derivations, and understanding of theory.

Mini Project (10-605 only)

Students will form groups and complete a mini-project. Mini-project details will be released later in the semester.

Piazza

We will use Piazza for class discussions. Please go to this Piazza website to join the course forum (note: you must use a cmu.edu email account to join). We strongly encourage students to post on this forum rather than emailing the course staff directly (this will be more efficient for both students and staff). Students should use Piazza to:

Ask clarifying questions about the course material.
Share useful resources with classmates (so long as they do not contain homework solutions).
Look for students to form study groups.
Answer questions posted by other students to solidify your own understanding of the material.

The course Academic Integrity Policy must be followed on the message boards at all times. Do not post or request homework solutions! Also, please be polite.

Gradescope

We use Gradescope to collect PDF submissions of open-ended questions on the homework (e.g. mathematical derivations, plots, short answers). The course staff will manually grade your submission, and you’ll receive personalized feedback explaining your final marks.

You will also submit your code for programming questions on the homework to Gradescope. After uploading your code, our grading scripts will autograde your assignment by running your program on a VM. This provides you with immediate feedback on the performance of your submission.

Regrade Requests

If you believe an error was made during manual grading, you’ll be able to submit a regrade request on Gradescope. For each homework, regrade requests will be open for only **1 week** after the grades have been published. This is to encourage you to check the feedback you’ve received early!

Course Staff

Instructional Staff

Ameet Talwalkar

Geoff Gordon

Schedule (Subject to Change)

Date	Lecture	Resources	Announcements
Data Pre-Processing and Visualization, Distributed Machine Learning
Jan 17	Introduction (slides, recording)		HW1 Released
Jan 19	Recitation 1: Introduction to PySpark and Databricks (slides, recording)	Lab Notebook
Jan 22	Distributed Systems, Map-Reduce (slides, recording)	Tutorial on PCA
Jan 24	Visualization: PCA (slides, recording)	JL Theorem
Jan 26	Recitation 2: Linear Algebra Review (slides, recording)	Lab Notebook	HW2 Released
Jan 29	Visualization: JL + t-SNE (slides, recording)		HW1 Due
Jan 31	Distributed Linear Regression (slides, recording)
Feb 2	Recitation 3: HW1 Review (recording)
Feb 5	Distributed PCA & Logistic Regression (slides, slides for Scaling Up, recording)
Feb 7	Scaling up linear models: Kernel form & Nyström (slides, recording)	Calculus Review; Hash kernels, I; Hash kernels, II; Count-min sketch
Feb 9	Recitation 4: AWS Setup (slides, recording)		HW3 Released
Feb 12	Hashing Theory, CMS (slides, recording)
Feb 14	LSH (slides, recording)		HW2 Due
Feb 16	Recitation 5: Homework 2 Solutions (recording)
Feb 19	Scaling up nonlinear models: Kernel machines & random Fourier features (slides, recording)	LSH
Feb 21	Randomized Linear Algebra (slides, slides2, recording)		HW3 Due
Feb 23	Recitation 6: Practice Exam 1 Solutions (slides, recording)	Homework 3 Solutions (video)
Feb 26	Exam 1
Feb 28	No class		HW4 Released
Mar 1	Recitation 7: HW4 Part A Tutorial (recording)
Mar 4	Spring Break (No Classes)
Mar 6	Spring Break (No Classes)
Mar 8	Spring Break (No Classes)
Scalable Deep Learning: Training, Tuning, and Inference
Mar 11	Deep learning, DL frameworks, and Hardware (slides, recording)
Mar 13	Optimization for DL (GD, SGD) (slides, slides2, recording)		HW4 Part A Due
Mar 15	Recitation 8: TensorFlow Tutorial (slides, code, recording)
Mar 18	Optimization for DL (Momentum, Adam) (slides, recording)		HW4 Part B Due, HW5 Released, Mini-project Released
Mar 20	Guest lecture: Alex Cabrera (recording)
Mar 22	Recitation 9: Optimization & Learning Rates (slides, code, recording)
Mar 25	Inference + Model Compression (slides, recording)
Mar 27	Hyperparameter Tuning (slides, recording)
Mar 29	Recitation 10: Homework 4 Solutions (recording)		Mini Project Survey Due
Apr 1	Parallel learning, Duality (slides, recording)
Advanced Topics
Apr 3	AutoML for Diverse Tasks (slides, recording)		HW5 Due
Apr 5	Recitation 11: Homework 5 Solutions (recording)
Apr 8	Scaling Laws for Foundation Models (slides, recording)
Apr 10	Training LLMs from Scratch at Mosaic, Guest Lecture: Jonathan Frankle
Apr 12	Spring Carnival (No Classes)
Apr 15	RL with large Data (slides, recording)
Apr 17	RL with large Data continued (recording)		Mini-Projects Due
Apr 19	Recitation 12: Exam 2 Office Hours
Apr 22	Course Summary and Exam Review (slides, recording)
Apr 24	Exam 2


10-405/10-605: ML with Large Datasets, Spring 2024
