Large datasets pose difficulties across the machine learning pipeline. They are difficult to visualize, and it can be hard to determine what sorts of errors and biases may be present in them. They are computationally expensive to process, and the cost of learning is often hard to predict. For instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.
This course is intended to provide students with practical knowledge of, and experience with, the issues involving large datasets. Among the topics considered are: data cleaning, visualization, and pre-processing at scale; principles of parallel and distributed computing for machine learning; techniques for scalable deep learning; analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity; and methods for low-latency inference. Students will gain experience with common large-scale computing libraries and infrastructure, including Apache Spark and TensorFlow.
Students are required to have taken a CMU introductory machine learning course (10-401, 10-601, 10-701, or 10-715). A strong background in programming will also be necessary; suggested prerequisites include 15-210, 15-214, or equivalent. Students are expected to be familiar with Python or learn it during the course.
Textbooks
There will be no required textbooks, though we may suggest additional reading in the schedule below.
We will use Piazza for class discussions. Please go to this Piazza website to join the course forum (note: you must use a cmu.edu email account to join the forum). We strongly encourage students to post on this forum rather than emailing the course staff directly (this will be more efficient for both students and staff). Students should use Piazza to:
- Ask clarifying questions about the course material.
- Share useful resources with classmates (so long as they do not contain homework solutions).
- Look for students to form study groups.
- Answer questions posted by other students to solidify your own understanding of the material.
OH: Thursdays 3-4pm, cmu.zoom.us/j/795420637
OH: Fridays 3-4pm, cmu.zoom.us/j/243447197
OH: Tuesdays 1:30-2:30pm, cmu.zoom.us/j/415105898
OH: Mondays 2-4pm, cmu.zoom.us/j/109213097
OH: Wednesdays 11am-12pm, cmu.zoom.us/j/590701416
OH: Mondays 4-6pm, cmu.zoom.us/j/986459270
Zach (Zeyu) Peng
OH: Mondays 12-2pm, cmu.zoom.us/j/583470275
OH: Fridays 1-3pm, cmu.zoom.us/j/894607086
Grades will be based on the following components:
- Assignments (25%): There will be 5 homework assignments. Each assignment will have equal weight.
- Late submissions will not be accepted.
- There is one exception to this rule: You are given 2 "late days" (self-granted 24-hr extensions) which you can use to give yourself extra time without penalty. At most one late day can be used per assignment.
- There is one TA responsible for each assignment, as indicated in the schedule below. Direct all communication regarding the assignment to this TA.
- Midterm (20%) and Final (25%): These in-person exams will cover material from the lectures and assignments.
- Project (25%): The project is an opportunity to get hands-on experience applying machine learning at scale. We will not consider projects that can easily be executed on a laptop.
- You must work in teams of 4-6 people.
- There will be two deliverables: a project proposal and project report.
- Additional details to follow.
- Class Participation (5%): Participation will be recorded via in-class quizzes that will be carried out in most classes. To get full credit for class participation you need to attend at least 80% of the lectures based on the polls we conduct.
- Bonus: Students who provide the most “endorsed answers” on Piazza can earn bonus points.
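To see how the weights above combine, here is a minimal Python sketch. The function name `course_grade` and the 0-100 score scale are assumptions made for illustration, not part of the official grading procedure; bonus points are left out because their size is not specified here.

```python
# Component weights (percent of the final grade), copied from the list above.
WEIGHTS = {
    "assignments": 25,
    "midterm": 20,
    "final": 25,
    "project": 25,
    "participation": 5,
}


def course_grade(homework_scores, midterm, final, project, participation):
    """Combine component scores (each assumed to be on a 0-100 scale).

    Hypothetical helper: the homework assignments carry equal weight,
    so the assignment component is their plain average.
    """
    assignments = sum(homework_scores) / len(homework_scores)
    scores = {
        "assignments": assignments,
        "midterm": midterm,
        "final": final,
        "project": project,
        "participation": participation,
    }
    # Weighted sum, divided by 100 because the weights are percentages.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    # Example with made-up scores: five homeworks, then the other components.
    print(course_grade([80, 90, 100, 70, 60], midterm=75, final=85,
                       project=90, participation=100))
```

The listed weights sum to 100, so a student with a perfect score in every component receives 100 before any bonus.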
Academic Integrity Policy
Group studying and collaborating on problem sets are encouraged; working together is a great way to understand new material. Students are free to discuss the homework problems with anyone under the following conditions:
- Students must submit their own homework solutions and understand the solutions that they submit.
- Students must list the names of their collaborators (i.e., anyone with whom the assignment was discussed).
- Students may not use old homework solutions from other classes under any circumstances, unless the instructor grants special permission.
A Note on Self Care
Please take care of yourself. Do your best to maintain a healthy lifestyle this semester by eating well, exercising, avoiding drugs and alcohol, getting enough sleep, and taking some time to relax. This will help you achieve your goals and cope with stress. All of us benefit from support during times of struggle. You are not alone. Besides the instructors, who are here to help you succeed, there are many helpful resources available on campus, and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.
If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 412-268-2922 and visit their website at https://www.cmu.edu/counseling/. Consider reaching out to a friend, faculty, or family member you trust for help getting connected to the support that can help.
Schedule (Subject to Change)
|1/13||Introduction||MLSys: The New Frontier of ML Systems|
|1/15||Distributed Computing, MapReduce|
|1/17||Recitation: Spark topology basics + setup with Databricks (Slides (Tian), Slides (Heather), Lab0)|
|1/20||No class (MLK Day)|
|1/22||Intro to Spark||HW1 released|
|1/24||Recitation: Spark Transformations and Actions (Lab2 (Notebook), Slides)|
Spark: Joins, Structure, and DataFrames
|1/29||Data Visualization||Visualization for ML; A Tutorial on PCA|
|1/31||Recitation: Spark RDDs and DataFrames (Lab3 (Notebook); slides are those from Monday's lecture)|
|2/3||ML Review||Deep Learning, Ch. 5.2-5.4; Math for ML (review)|
|2/5||Distributed Linear Regression, Part I||HW2 released|
|2/7||Recitation: Linear Algebra Review|
|2/10||Distributed Linear Regression, Part II|
|2/12||Adv. Distributed Optimization|
|2/14||Recitation: Learning Rate Optimization (Lab4 (Notebook), Slides)|
|2/17||Distributed Logistic Regression||HW2 due|
|2/19||Partitioning and Locality||HW3 released|
|2/21||Recitation: Probability Review|
|2/24||Large-Scale Data Structures|
|3/2||Project Proposals||HW3 due|
|3/9||No class (Spring Break)|
|3/11||No class (Spring Break)|
|3/16||All CMU classes cancelled|
|3/18||Deep Learning||Deep Learning, Ch. 6; William Cohen's Autodiff Notes|
|3/23||ML Frameworks + TensorFlow|
|3/25||ML Hardware + TensorFlow||Performant, scalable models in TensorFlow 2 with tf.data, tf.function & tf.distribute (TF World '19) [Video]|
|3/30||Optimization for DL||Deep Learning, Ch. 8|
|4/1||Efficient Hyperparameter Tuning (guest lecture: Liam Li)|
|4/8||Low Latency Inference|
|4/13||TVM & DL Compilers (guest lecture: Tianqi Chen)|
|4/15||Productionizing Large-Scale ML|
|4/22||Project Poster Session|