Course Overview

Large datasets pose difficulties across the machine learning pipeline. They are difficult to visualize and introduce computational, storage, and communication bottlenecks during data preprocessing and model training. Moreover, the high-capacity models often used in conjunction with large datasets introduce additional computational and storage hurdles during model training and inference. This course is intended to provide students with the mathematical, algorithmic, and practical knowledge needed to address issues that arise when learning with large datasets. Among the topics considered are: data cleaning, visualization, and pre-processing at scale; principles of parallel and distributed computing for machine learning; techniques for scalable deep learning; analysis of programs in terms of memory, computation, and communication complexity; and methods for low-latency inference.

Prerequisites

Students are required to have taken a CMU introductory machine learning course (10-301, 10-315, 10-601, 10-701, or 10-715). A strong background in programming will also be necessary; suggested prerequisites include 15-210, 15-214, or equivalent. Students are expected to be familiar with Python or learn it during the course.

Textbooks

There will be no required textbooks, though we may suggest additional reading in the schedule below.

Course Components

The requirements of this course consist of participating in lectures, homework assignments, a mini-project, and two exams. The grading breakdown is:

Exams

You are required to attend all exams in person. The exams will be given during class. Please plan your travel accordingly, as we will not be able to accommodate individual travel needs (e.g. by offering the exam early). If you have an unavoidable conflict with an exam (e.g. an exam in another course), you must notify us at least 3 weeks before the exam.

Homework

The homeworks will be divided into two components: programming and written. The programming assignments will ask you to implement ML algorithms from scratch; they emphasize understanding of real-world applications of ML, building end-to-end systems, and experimental design. The written assignments will focus on core concepts, “on-paper” implementations of classic learning algorithms, derivations, and understanding of theory.

Mini-Project

Students will create groups and participate in a mini-project. Mini-Project details will be released later in the semester.

Quizzes

Participation will be measured via in-class quizzes. We will not provide make-up quizzes for days that you miss class. However, your lowest 4 quiz grades will be dropped.

Piazza

We will use Piazza for class discussions. Please go to this Piazza website to join the course forum (note: you must use a cmu.edu email account to join). We strongly encourage students to post on this forum rather than emailing the course staff directly (this will be more efficient for both students and staff). Students should use Piazza to:

The course Academic Integrity Policy must be followed on the message boards at all times. Do not post or request homework solutions! Also, please be polite.

Gradescope

We use Gradescope to collect PDF submissions of open-ended questions on the homework (e.g. mathematical derivations, plots, short answers). The course staff will manually grade your submission, and you’ll receive personalized feedback explaining your final marks.

You will also submit your code for programming questions on the homework to Gradescope. After uploading your code, our grading scripts will autograde your assignment by running your program on a VM. This provides you with immediate feedback on the performance of your submission.

Regrade Requests

If you believe an error was made during manual grading, you’ll be able to submit a regrade request on Gradescope. For each homework, regrade requests will be open for only **1 week** after the grades have been published. This is to encourage you to check the feedback you’ve received early!

Course Staff

Instructional Staff

Virginia Smith

Jacob Rast

Teaching Assistants

Christopher Berman
OH: Tuesdays, 3pm to 4pm

Zijun Ding
OH: Thursdays, 4pm to 5pm

Juan Hernandez Gomez
OH: Mondays, 2pm to 3pm

Atharva Anand Joshi
OH: Fridays, 1:30pm to 2:30pm

Runzhe Liang
OH: Mondays, 3pm to 4pm

Jin Rong Song
OH: Thursdays, 3pm to 4pm

Glenn Xu
OH: Wednesdays, 12:30pm to 1:30pm

Yiyan (Vivian) Zhai
OH: Fridays, 4pm to 5pm




Schedule (Subject to Change)

| Date | Class Type | Topic | Resources | Announcements |
| --- | --- | --- | --- | --- |
| Mon Aug 26 | Lecture | Introduction (slides, video) | | HW1 Released |
| Data Pre-Processing and Visualization, Distributed Machine Learning | | | | |
| Wed Aug 28 | Lecture | Distributed Computing, Spark (slides, video) | | |
| Fri Aug 30 | Recitation | Recitation 1: PySpark and Databricks (slides, video) | Lab Notebook | |
| Mon Sep 2 | No Class | No class (Labor Day) | | |
| Wed Sep 4 | Lecture | Visualization, Dimensionality Reduction (slides, video) | Visualization for ML; Tutorial on PCA; JL Theorem; t-SNE | |
| Fri Sep 6 | Recitation | Recitation 2: Linear Algebra Review (slides, video) | Lab Notebook | |
| Mon Sep 9 | Lecture | Distributed Linear Regression, part I (slides, video) | | HW1 due, HW2 Released |
| Wed Sep 11 | Lecture | Distributed Linear Regression, part II (slides, video) | | |
| Fri Sep 13 | Recitation | Recitation 3: Cloud Computing / AWS (slides, Part A, Part B) | | |
| Mon Sep 16 | Lecture | Kernel Approximations (slides, video) | Kernels background; RFF; Nystrom | |
| Wed Sep 18 | Lecture | Logistic Regression, Hashing, part I (slides, video) | | |
| Fri Sep 20 | Recitation | Recitation 4: Probability Review (slides, video) | | |
| Mon Sep 23 | Lecture | Logistic Regression, Hashing, part II (slides, video) | Hash kernels, I; Hash kernels, II | HW2 due, HW3 Released |
| Wed Sep 25 | Lecture | Randomized Algorithms (slides, video) | Count-min sketch; LSH | |
| Fri Sep 27 | Recitation | No recitation | | |
| Mon Sep 30 | Lecture | Pratyush Maini (CMU, Datology) (video) | | |
| Wed Oct 2 | Lecture | Practice Exam 1 | | |
| Fri Oct 4 | Recitation | Recitation 5: Practice Exam 1 Solutions (video) | | HW3 due |
| Mon Oct 7 | Lecture | Distributed Trees (slides, video) | | |
| Wed Oct 9 | Exam | Exam 1 | | |
| Mon Oct 14 | No Class | Fall Break (No Classes) | | |
| Wed Oct 16 | No Class | Fall Break (No Classes) | | |
| Scalable Deep Learning: Training, Tuning, and Inference | | | | |
| Mon Oct 21 | Lecture | Deep Learning Intro (slides, video) | | |
| Wed Oct 23 | Lecture | Large-Scale Optimization (slides, video) | | HW4 Released |
| Fri Oct 25 | Recitation | Recitation 6: PyTorch Tutorial (video) | Lab Notebook | |
| Mon Oct 28 | Lecture | Optimization for DL (slides, video) | | |
| Wed Oct 30 | Lecture | Parallel/Distributed DL (slides, video) | | |
| Fri Nov 1 | Recitation | Recitation 7: Optimization & Learning Rates (slides, video) | Lab Notebook | |
| Mon Nov 4 | Lecture | ML Hardware / Efficient Finetuning (slides, video) | | HW4 due |
| Wed Nov 6 | Lecture | Hyperparameter Tuning (slides, video) | | |
| Fri Nov 8 | Recitation | Recitation 8: HW5 Techniques and Experimental Design | | |
| Mon Nov 11 | Lecture | Inference, Model Compression | | |
| Wed Nov 13 | Lecture | Federated Learning | | Mini-Project Survey Due |
| Fri Nov 15 | Recitation | Recitation 9: TBA | | |
| Mon Nov 18 | Lecture | Matt Fredrickson (CMU, Gray Swan AI) | | |
| Wed Nov 20 | Lecture | Scaling Laws & Synthetic Data | | |
| Fri Nov 22 | Recitation | Recitation 10: TBA | | |
| Mon Nov 25 | Lecture | Safety at Scale | | |
| Wed Nov 27 | No Class | Thanksgiving Break (No Classes) | | |
| Fri Nov 29 | No Class | Thanksgiving Break (No Classes) | | Mini-Project Final Report Due |
| Mon Dec 2 | Lecture | Course Summary | | |
| Wed Dec 4 | Exam | Exam 2 | | |


General Policies

Late Homework Policy

You have 4 total grace days that can be used to submit late homework assignments without penalty. We will automatically keep a tally of these grace days for you; they will be applied greedily. You may not use more than 2 grace days on any single homework assignment. Additionally, please note:

Extensions

In general, we do not grant extensions on assignments. There are several exceptions: For any of the above situations, you may request an extension by emailing Jacob Rast (jrast@andrew.cmu.edu). The email should be sent as soon as you are aware of the conflict and at least 5 days prior to the deadline. In the case of an emergency, no notice is needed.

Audit Policy

Official auditing of the course (i.e. taking the course for an “Audit” grade) is not permitted this semester.

Unofficial auditing of the course (i.e. watching the lectures online or attending them in person) is welcome and permitted without prior approval. Unofficial auditors will not be given access to course materials such as homework assignments and exams.

Pass/Fail Policy

Pass/Fail is allowed in this class; no permission is required from the course staff. The grade cutoff for a Pass will depend on your program. Be sure to check with your program / department as to whether you can count a Pass/Fail course towards your degree requirements.

Accommodations for Students with Disabilities

If you have a disability and have an accommodations letter from the Office of Disability Resources, I encourage you to discuss your accommodations and needs with Jacob Rast as early in the semester as possible. I will work with you to ensure that accommodations are provided as appropriate. If you suspect that you may have a disability and would benefit from accommodations but are not yet registered with the Office of Disability Resources, I encourage you to contact them at access@andrew.cmu.edu.

Academic Integrity Policies

Read this Carefully

Collaboration among Students

Previously Used Assignments

Some of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions, or elsewhere. Solutions to them may be, or may have been, available online, or from other people or sources. It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before. It is explicitly forbidden to search for these problems or their solutions on the internet. You must solve the homework assignments completely on your own. We will be actively monitoring your compliance. Collaboration with other students who are currently taking the class is allowed, but only under the conditions stated above.

Generative AI

Students are encouraged to use generative AI tools (such as ChatGPT) the same way they would interact with any other collaborator. In particular, any substantive collaboration should be disclosed, and all solutions should be entirely the student’s own work. To ensure the latter, any notes or results from a collaboration should remain closed while writing up the solution, so that no material is accidentally transferred.

Policy Regarding “Found Code”

You are encouraged to read books and other instructional materials, both online and offline, to help you understand the concepts and algorithms taught in class. These materials may contain example code or pseudo code, which may help you better understand an algorithm or an implementation detail. However, when you implement your own solution to an assignment, you must put all materials aside, and write your code completely on your own, starting “from scratch”. Specifically, you may not use any code you found or came across. If you find or come across code that implements any part of your assignment, you must disclose this fact in your collaboration statement.

Duty to Protect One’s Work

Students are responsible for proactively protecting their work from copying and misuse by other students. If a student’s work is copied by another student, the original author is also considered to be at fault and in gross violation of the course policies. It does not matter whether the author allowed the work to be copied or was merely negligent in preventing it from being copied. When overlapping work is submitted by different students, both students will be punished.

To protect future students, do not post your solutions publicly, either during the course or afterwards.

Penalties for Violations of Course Policies

All violations of course policies (even the first one) will be reported to the university authorities (your Department Head, Associate Dean, Dean of Student Affairs, etc.) as an official Academic Integrity Violation and will carry severe penalties.
  1. The penalty for the first violation is a one-and-a-half letter grade reduction. For example, if your final letter grade for the course was to be an A-, it would become a C+.
  2. The penalty for the second violation is failure in the course, and can even lead to dismissal from the university.


Acknowledgments

This course is based in part on material developed by Ameet Talwalkar, Geoffrey Gordon, Heather Miller, Barnabas Poczos, William Cohen, and Anthony Joseph.

Previous courses: 10-405/10-605, Spring 2024; 10-605/10-805, Fall 2023; 10-405/10-605, Spring 2023; 10-605/10-805, Fall 2022; 10-405/10-605, Spring 2022; 10-605/10-805, Fall 2021; 10-405/10-605, Spring 2021; 10-605/10-805, Fall 2020; 10-405/10-605, Spring 2020.