Large datasets pose difficulties across the machine learning pipeline. They are difficult to visualize, and it can be hard to determine what sorts of errors and biases may be present in them. They are computationally expensive to process, and the cost of learning is often hard to predict---for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.
This course is intended to provide students with practical knowledge of, and experience with, the issues involving large datasets. Among the topics considered are: data cleaning, visualization, and pre-processing at scale; principles of parallel and distributed computing for machine learning; techniques for scalable deep learning; analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity; and methods for low-latency inference. Students will gain experience with common large-scale computing libraries and infrastructure, including Apache Spark and TensorFlow.
Students are required to have taken a CMU introductory machine learning course (10-401, 10-601, 10-701, or 10-715). A strong background in programming will also be necessary; suggested prerequisites include 15-210, 15-214, or equivalent. Students are expected to be familiar with Python or learn it during the course.
Textbooks
There will be no required textbooks, though we may suggest additional reading in the schedule below.
We will use Piazza for class discussions. Please go to this Piazza website to join the course forum (note: you must use a cmu.edu email account to join). We strongly encourage students to post on this forum rather than emailing the course staff directly (this will be more efficient for both students and staff). Students should use Piazza to:
- Ask clarifying questions about the course material.
- Share useful resources with classmates (so long as they do not contain homework solutions).
- Look for students to form study groups.
- Answer questions posted by other students to solidify your own understanding of the material.
OH: Tues 3pm-4pm, cmu.zoom.us/j/99109599745
OH: Mon 10am-11am, cmu.zoom.us/j/92361553984
OH: Fri 11am-12pm, cmu.zoom.us/j/4379984841
OH: Mon 2pm-3pm, cmu.zoom.us/j/4973281089
OH: Tues 11am-12pm, cmu.zoom.us/j/7298849766
OH: Fri 3pm-4pm, cmu.zoom.us/j/99957839794
OH: Wed 12pm-1pm, cmu.zoom.us/j/97733681817
OH: Thurs 10am-11am, cmu.zoom.us/j/98350996321
OH: Wed 11am-12pm, cmu.zoom.us/j/97360856788
OH: Thurs 4pm-5pm, cmu.zoom.us/j/4126264519
Grades will be based on the following components:
- Assignments (40%): There will be 5 homework assignments. Each assignment will have equal weight.
- Late submissions will not be accepted.
- There is one exception to this rule: You are given 2 "late days" (self-granted 24-hr extensions) that you can use to give yourself extra time without penalty. At most one late day can be used per assignment.
- Mini-Project (15%): The mini-project is an opportunity to get hands-on experience applying machine learning at scale.
- You must work in teams of 2-3 people.
- There will be two deliverables: a project proposal and project report.
- Additional details to follow.
- Exam I (20%) and Exam II (20%): These exams will cover material from the lectures and assignments.
- Quizzes (5%): We will have short weekly quizzes on Canvas that correspond to lecture material. When calculating your final grade, we will drop your lowest weekly quiz grade.
- Bonus: On Piazza, the students with the most “endorsed answers” can earn bonus points.
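As a rough illustration of how the components above combine, here is a minimal sketch in Python. The function name `final_grade` and the 0-100 score scale are assumptions made for this example, not part of the course policy. Piazza bonus points are left out because their size is not specified here.

```python
from statistics import mean

# Component weights as listed above (they sum to 1.0).
WEIGHTS = {
    "assignments": 0.40,
    "mini_project": 0.15,
    "exam_1": 0.20,
    "exam_2": 0.20,
    "quizzes": 0.05,
}


def final_grade(homeworks, mini_project, exam_1, exam_2, quizzes):
    """Return a weighted course score on a 0-100 scale.

    All inputs are assumed to be percentages in [0, 100]; that scale is an
    illustrative assumption, not something the syllabus specifies.
    """
    # Homework assignments carry equal weight, so a plain average suffices.
    hw_avg = mean(homeworks)
    # Drop the single lowest quiz score before averaging.
    kept = sorted(quizzes)[1:] if len(quizzes) > 1 else list(quizzes)
    quiz_avg = mean(kept)
    return (
        WEIGHTS["assignments"] * hw_avg
        + WEIGHTS["mini_project"] * mini_project
        + WEIGHTS["exam_1"] * exam_1
        + WEIGHTS["exam_2"] * exam_2
        + WEIGHTS["quizzes"] * quiz_avg
    )
```

For example, `final_grade([80, 90, 100, 70, 60], 90, 70, 80, [50, 100, 90])` averages the homework to 80, drops the quiz score of 50, and combines the rest using the weights above.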
10-605 vs. 10-805: All assignments, grading, and expectations will be the same for 10-605 and 10-805---except for the mini-project. Students enrolled in 10-805 will be expected to complete a more involved mini-project, requiring roughly twice the work of the mini-project for 10-605.
Gradescope: We will use Gradescope to collect PDF submissions of each problem set. Upon uploading your PDF, Gradescope will ask you to identify which page(s) contains your solution for each problem---this is a great way to double check that you haven’t left anything out.
Regrade Requests: If you believe an error was made during grading, you’ll be able to submit a regrade request on Gradescope. ***For each homework, regrade requests will be open for only 1 week after the grades have been published.*** This is to encourage you to check the feedback you’ve received early!
Academic Integrity Policy
Group studying and collaborating on problem sets are encouraged; working together is a great way to understand new material. Students are free to discuss the homework problems with anyone under the following conditions:
- Students must submit their own homework solutions and understand the solutions that they submit.
- Students must list the names of their collaborators (i.e., anyone with whom the assignment was discussed).
- Students may not use old homework solutions from other classes under any circumstances, unless the instructor grants special permission.
A Note on Self Care
Please take care of yourself. Do your best to maintain a healthy lifestyle this semester by eating well, exercising, getting enough sleep, and taking some time to relax. This will help you achieve your goals and cope with stress. All of us benefit from support during times of struggle. You are not alone. Besides the instructors, who are here to help you succeed, there are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.
If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 412-268-2922 and visit their website at https://www.cmu.edu/counseling/. Consider reaching out to a friend, faculty, or family member you trust for help getting connected to the support that can help.
Previous course: 10-405/10-605, Spring 2020.
Schedule (Subject to Change)
|Sep 1||Introduction (slides, video)|
|Sep 3||Distributed Computing, Spark (slides, video)||HW1 released|
|Sep 4||Recitation: Intro to Databricks, Spark (slides, video)||Lab0|
|Sep 8||Visualization, PCA (slides, video)|
|Sep 10||Nonlinear Dimensionality Reduction (slides, video)|
|Sep 11||Recitation: Spark cont. (video)||Lab1|
|Sep 15||Distributed Linear Regression, part I (slides, video)||HW1 due|
|Sep 17||Distributed Linear Regression, part II (slides, video)|
|Sep 18||Recitation: Linear Algebra Review (slides, video1, video2)||Lab2|
|Sep 22||Kernel Approximations (slides, video)|
Snorkel: Programming Training Data (video)
Guest Lecture: Paroma Varma
|Sep 25||Recitation: HW Review (slides, video)||Lab2|
|Sep 29||Distributed Trees||HW2 due|
|Oct 1||Logistic Regression, Hashing|
|Oct 6||Randomized Algorithms|
|Oct 8||TBA||HW3 due|
|Oct 13||Exam I|
|Oct 15||Deep Learning, Autodiff||HW4 released|
|Oct 20||TensorFlow, DL Hardware|
|Oct 22||Large-Scale Optimization, part I|
|Oct 27||Large-Scale Optimization, part II|
|Oct 29||Parallel/Distributed DL||HW4 due|
|Nov 3||No Class (Election Day)|
|Nov 5||Hyperparameter Tuning||Mini-project proposals due|
|Nov 10||Inference, Model Compression|
|Nov 12||Neural Architecture Search|
|Nov 17||Guest Lecture: Kim Hazelwood|
|Nov 19||Productionizing Large-Scale ML||HW5 due|
|Nov 26||No Class (Thanksgiving)|
|Dec 1||Federated Learning|
Guest Lecture: Manasi Vartak
|Dec 8||Course Summary|
|Dec 10||Exam II|