Course Overview

Large datasets pose difficulties across the machine learning pipeline. They are difficult to visualize, and it can be hard to determine what sorts of errors and biases may be present in them. They are computationally expensive to process, and the cost of learning is often hard to predict---for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.
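To make the memory point concrete, here is a minimal, illustrative sketch (not course material) contrasting an in-memory computation with a one-pass streaming version; only the latter remains feasible once the data no longer fits in RAM. The file format and function names here are invented for the example.

```python
import os
import tempfile

def mean_in_memory(path):
    # Loads every value into a list first: memory grows with the dataset size.
    with open(path) as f:
        values = [float(line) for line in f]
    return sum(values) / len(values)

def mean_streaming(path):
    # One pass with O(1) memory: keep only a running total and a count.
    total, count = 0.0, 0
    with open(path) as f:
        for line in f:
            total += float(line)
            count += 1
    return total / count

# Tiny demo file with the values 1..100 (mean = 50.5); for data larger
# than RAM, only the streaming version remains practical.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(str(i) for i in range(1, 101)))
    path = f.name

m_inmem = mean_in_memory(path)
m_stream = mean_streaming(path)
os.remove(path)
```

Both functions return the same answer on this toy file; the difference is that the in-memory version's footprint scales linearly with the input, which is exactly the kind of cost analysis this course addresses.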

This course is intended to provide students with practical knowledge of, and experience with, the issues involved in working with large datasets. Among the topics considered are: data cleaning, visualization, and pre-processing at scale; principles of parallel and distributed computing for machine learning; techniques for scalable deep learning; analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity; and methods for low-latency inference. Students will gain experience with common large-scale computing libraries and infrastructure, including Apache Spark and TensorFlow.
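As one small taste of the scalability techniques listed above, the "hashing trick" (which appears later in the schedule alongside logistic regression) maps an unbounded feature space into a fixed-size vector without storing a vocabulary. This is an illustrative pure-Python sketch, not course code; the dimension and signing scheme are chosen arbitrarily.

```python
import hashlib

def hash_features(tokens, dim=32):
    # Map arbitrarily many string features into a fixed-size vector.
    # Using md5 (rather than Python's built-in hash) keeps the result
    # deterministic across runs.
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        index = h % dim
        # A signed update (one hash bit picks +1/-1) reduces collision bias.
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0
        vec[index] += sign
    return vec

v = hash_features("large datasets need fixed size features".split())
```

Because the vector length is fixed regardless of how many distinct features appear, memory usage stays bounded even on web-scale text, at the cost of occasional hash collisions.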


Students are required to have taken a CMU introductory machine learning course (10-301, 10-315, 10-601, 10-701, or 10-715). A strong background in programming will also be necessary; suggested prerequisites include 15-210, 15-214, or equivalent. Students are expected to be familiar with Python or learn it during the course.


There will be no required textbooks, though we may suggest additional reading in the schedule below.


We will use Piazza for class discussions. Please go to this Piazza website to join the course forum (note: you must use an email account to join). We strongly encourage students to post on this forum rather than emailing the course staff directly (this is more efficient for both students and staff). Students should use Piazza for all course-related questions and discussion.

The course Academic Integrity Policy must be followed on the message boards at all times. Do not post or request homework solutions! Also, please be polite.

Course Staff

Manik Bhandari
OH: Tues 3pm-4pm,

Amala Deshmukh
OH: Mon 10am-11am,

Tuhina Gupta
OH: Fri 11am-12pm,

Vivek Gupta
OH: Mon 1pm-2pm,

Anmol Jagetia
OH: Tues 11am-12pm,

Alec Jasen
OH: Fri 3pm-4pm,

Ignacio Maronna
OH: Wed 12pm-1pm,

Alex Schneidman
OH: Thurs 10am-11am,

Baljit Singh
OH: Wed 11am-12pm,

Karun Thankachan
OH: Thurs 4pm-5pm,

Grading Policy

Grades will be based on the following components: problem sets, two exams, and a mini-project.

10-605 vs. 10-805: All assignments, grading, and expectations will be the same for 10-605 and 10-805---except for the mini-project. Students enrolled in 10-805 will be expected to complete a more involved mini-project, requiring roughly twice the work of the mini-project for 10-605.

Gradescope: We will use Gradescope to collect PDF submissions of each problem set. Upon uploading your PDF, Gradescope will ask you to identify which page(s) contain your solution for each problem---this is a great way to double-check that you haven't left anything out.

Regrade Requests: If you believe an error was made during grading, you’ll be able to submit a regrade request on Gradescope. ***For each homework, regrade requests will be open for only 1 week after the grades have been published.*** This is to encourage you to check the feedback you’ve received early!

Academic Integrity Policy

Group studying and collaborating on problem sets are encouraged; working together is a great way to understand new material. Students are free to discuss the homework problems with anyone, provided they follow the course Academic Integrity Policy. Students are encouraged to read CMU's Policy on Cheating and Plagiarism.

A Note on Self Care

Please take care of yourself. Do your best to maintain a healthy lifestyle this semester by eating well, exercising, getting enough sleep, and taking some time to relax. This will help you achieve your goals and cope with stress. All of us benefit from support during times of struggle. You are not alone. Besides the instructors, who are here to help you succeed, there are many helpful resources available on campus, and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.

If you or anyone you know experiences academic stress, difficult life events, or feelings of anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 412-268-2922 or visit their website. Consider reaching out to a friend, faculty member, or family member you trust for help getting connected to the support that can help.


This course is based in part on material developed by Heather Miller, William Cohen, Anthony Joseph, and Barnabas Poczos.

Previous course: 10-405/10-605, Spring 2020.

Schedule (Subject to Change)

Date Lecture Resources Announcements
Data Pre-Processing and Visualization, Distributed Computing
Sep 1 Introduction (slides, video)
Sep 3 Distributed Computing, Spark (slides, video) HW1 released
Sep 4 Recitation: Intro to Databricks, Spark (slides, video) Lab0
Sep 8 Visualization, PCA (slides, video) Tutorial on PCA; JL Theorem
Sep 10 Nonlinear Dimensionality Reduction (slides, video) t-SNE
Sep 11 Recitation: Spark cont. (video) Lab1
Basics of Large-Scale / Distributed Machine Learning
Sep 15 Distributed Linear Regression, part I (slides, video) HW1 due; HW2 released
Sep 17 Distributed Linear Regression, part II (slides, video)
Sep 18 Recitation: Linear Algebra Review (slides, video1, video2) Lab2
Sep 22 Kernel Approximations (slides, video)
Sep 24 Snorkel: Programming Training Data (video), Guest Lecture: Paroma Varma; Snorkel blog
Sep 25 Recitation: HW Review (slides, video) Lab2
Sep 29 Logistic Regression, Hashing (slides, video) Hash kernels (I, II) HW2 due; HW3 released
Oct 1 Randomized Algorithms (slides, video) Count-min sketch
Oct 2 Recitation: Probability Review (slides, video)
Oct 6 Practice Exam
Oct 8 Distributed Trees (slides, video) HW3 due
Oct 9 Recitation: HW Review (video)
Oct 13 Exam I
Scalable Deep Learning: Training, Tuning, and Inference
Oct 15 Deep Learning, Autodiff (slides, video) Deep Learning, Ch. 6; TensorFlow Quickstart; Demo (video)
Oct 20 DL Frameworks, Hardware (slides, video) Mini-projects released
Oct 22 Large-Scale Optimization (slides, video) Optimization for Large-Scale ML HW4 released
Oct 27 Optimization for DL (slides, video) Deep Learning, Ch. 8 Mini-projects groups due
Oct 29 Parallel/Distributed DL (slides, video)
Oct 30 Recitation: Cloud Services, Learning Rates (slides, video)
Nov 3 No Class (Election Day)
Nov 5 Hyperparameter Tuning (slides, video) blog1, blog2 HW4 due; HW5 released
Nov 6 Recitation: HW Review (video)
Nov 10 Neural Architecture Search (slides, video) blog, RSWS Mini-project proposals due
Nov 12 Inference, Model Compression (slides, video)
Nov 13 Recitation: Mini-project Review (video) slides
Advanced Topics & Guest Lectures
Nov 17 Research Challenges in Large-Scale ML Systems at Facebook (video), Guest Lecture: Kim Hazelwood
Nov 19 Productionizing Large-Scale ML, Guest Lecture: Angela Jiang (video) HW5 due
Nov 20 Recitation: HW Review (video)
Nov 24 Mini-project Check-ins
Nov 26 No Class (Thanksgiving)
Dec 1 Federated Learning (slides, video)
Dec 3 ModelDB, Guest Lecture: Manasi Vartak (video) Mini-projects due
Dec 8 Course Summary (slides, course review, exam II review)
Dec 10 Exam II