CMU 10-405/10-605

Course Overview

Large datasets pose difficulties across the machine learning pipeline. They are difficult to visualize, and it can be hard to determine what sorts of errors and biases may be present in them. They are computationally expensive to process, and the cost of learning is often hard to predict---for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.

This course is intended to give students practical knowledge of, and experience with, the issues that arise when working with large datasets. Among the topics considered are: data cleaning, visualization, and pre-processing at scale; principles of parallel and distributed computing for machine learning; techniques for scalable deep learning; analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity; and methods for low-latency inference. Students will gain experience with common large-scale computing libraries and infrastructure, including Apache Spark and TensorFlow.

Prerequisites

Students are required to have taken a CMU introductory machine learning course (10-401, 10-601, 10-701, or 10-715). A strong background in programming will also be necessary; suggested prerequisites include 15-210, 15-214, or equivalent. Students are expected to be familiar with Python or learn it during the course.

Textbooks

There will be no required textbooks, though we may suggest additional reading in the schedule below.

Piazza

We will use Piazza for class discussions. Please go to this Piazza website to join the course forum (note: you must use a cmu.edu email account to join the forum). We strongly encourage students to post on this forum rather than emailing the course staff directly (this will be more efficient for both students and staff). Students should use Piazza to:

Ask clarifying questions about the course material.
Share useful resources with classmates (so long as they do not contain homework solutions).
Look for students to form study groups.
Answer questions posted by other students to solidify your own understanding of the material.

The course Academic Integrity Policy must be followed on the message boards at all times. Do not post or request homework solutions! Also, please be polite.

Course Staff

Teaching Assistants

Kushagr Arora
OH: Thursdays 3-4pm

Saket Chaudhary
OH: Fridays 3-4pm

Jiayong Hu
OH: Tuesdays 1:30-2:30pm

Anwen Huang
OH: Mondays 2-4pm

Tian Li
OH: Wednesdays 11am-12pm

Daniel Mo
OH: Mondays 4-6pm

Zach (Zeyu) Peng
OH: Mondays 12-2pm

Kuo Tian
OH: Fridays 1-3pm

Grading Policy

Grades will be based on the following components:

Assignments (25%): There will be 5 homework assignments. Each assignment will have equal weight.
- Late submissions will not be accepted.
- There is one exception to this rule: You are given 2 "late days" (self-granted 24-hr extensions) which you can use to give yourself extra time without penalty. At most one late day can be used per assignment.
- There is one TA responsible for each assignment, as indicated in the schedule below. Direct all communication regarding the assignment to this TA.
Midterm (20%) and Final (25%): These in-person exams will cover material from the lectures and assignments.
Project (25%): The project is an opportunity to get hands-on experience applying machine learning at scale. We will not consider projects that can easily be executed on a laptop.

- You must work in teams of 4-6 people.
- There will be two deliverables: a project proposal and a project report.
- Additional details to follow.

Class Participation (5%): Participation will be recorded via in-class quizzes that will be carried out in most classes. To get full credit for class participation you need to attend at least 80% of the lectures based on the polls we conduct.
Bonus: On Piazza, the students with the most “endorsed answers” can earn bonus points.
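
As a rough illustration of how the component weights above combine, the sketch below computes a weighted course score. It assumes each component is scored on a 0-100 scale and leaves out the Piazza bonus, whose size is not specified here; the names WEIGHTS and weighted_score are hypothetical.

```python
# Component weights from the grading policy above (they sum to 100%).
WEIGHTS = {
    "assignments": 0.25,
    "midterm": 0.20,
    "final": 0.25,
    "project": 0.25,
    "participation": 0.05,
}


def weighted_score(scores):
    """Combine per-component scores (assumed to be on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {
        "assignments": 92,
        "midterm": 80,
        "final": 85,
        "project": 90,
        "participation": 100,
    }
    print(round(weighted_score(example), 2))
```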

Academic Integrity Policy

Group studying and collaborating on problem sets are encouraged; working together is a great way to understand new material. Students are free to discuss the homework problems with anyone under the following conditions:

Students must submit their own homework solutions and understand the solutions that they submit.
Students must list the names of their collaborators (i.e., anyone with whom the assignment was discussed).
Students may not use old homework solutions from other classes under any circumstances, unless the instructor grants special permission.

Students are encouraged to read CMU's Policy on Cheating and Plagiarism.

A Note on Self Care

Please take care of yourself. Do your best to maintain a healthy lifestyle this semester by eating well, exercising, avoiding drugs and alcohol, getting enough sleep, and taking some time to relax. This will help you achieve your goals and cope with stress. All of us benefit from support during times of struggle. You are not alone. Besides the instructors, who are here to help you succeed, there are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.

If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 412-268-2922 and visit their website at https://www.cmu.edu/counseling/. Consider reaching out to a friend, faculty, or family member you trust for help getting connected to the support that can help.

Acknowledgments

This course is based in part on material developed by William Cohen, Barnabas Poczos, Ameet Talwalkar, and Anthony Joseph.

Schedule (Subject to Change)

Date	Topics	Resources	HW
1/13	Introduction	MLSys: The New Frontier of ML Systems
1/15	Distributed Computing, MapReduce
1/17	Recitation: Spark topology basics + setup with Databricks (Slides (Tian), Slides (Heather), Lab0)
1/20	No class (MLK Day)
1/22	Intro to Spark		HW1 released
1/24	Recitation: Spark Transformations and Actions (Lab2 (Notebook), Slides)
1/27	Data Cleaning Spark: Joins, Structure, and DataFrames
1/29	Data Visualization	Visualization for ML; A Tutorial on PCA; t-SNE
1/31	Recitation: Spark RDDs and DataFrames (Lab3 (Notebook), Slides are those from Monday's lecture)
2/3	ML Review	Deep Learning, Ch. 5.2-5.4; Math for ML (review)	HW1 due
2/5	Distributed Linear Regression, Part I		HW2 released
2/7	Recitation: Linear Algebra Review (Slides)
2/10	Distributed Linear Regression, Part II
2/12	Adv. Distributed Optimization
2/14	Recitation: Learning Rate Optimization (Lab4 (Notebook), Slides)
2/17	Distributed Logistic Regression		HW2 due
2/19	Partitioning and Locality		HW3 released
2/21	Recitation: Probability Review (Slides)
2/24	Large-Scale Data Structures
2/26	PCA
3/2	Project Proposals		HW3 due
3/4	*In-Class Midterm*
3/9	No class (Spring Break)
3/11	No class (Spring Break)
3/16	*All CMU classes cancelled*
3/18	Deep Learning	Deep Learning, Ch. 6; William Cohen's Autodiff Notes
3/23	ML Frameworks + TensorFlow
3/25	ML Hardware + TensorFlow	Performant, scalable models in TensorFlow 2 with tf.data, tf.function & tf.distribute (TF World '19) [Video]
3/30	Optimization for DL	Deep Learning, Ch. 8
4/1	Efficient Hyperparameter Tuning Guest lecture: Liam Li	MLD blog post	HW4 released
4/6	Parallel/Distributed DL
4/8	Inference and Model Compression
4/10	Project check-ins		HW4 due, HW5 released
4/13	TVM & DL Compilers Guest lecture: Tianqi Chen
4/15	Productionizing Large-Scale ML
4/20	Federated Learning		HW5 due
4/22	*Project Presentations I*
4/24	*Project Presentations II*
4/27	Course summary
4/29	*In-Class Final*

Completed Projects from Spring 2020

DeepGenre: Deep Neural Networks for Genre Classification in Literary Works

In this project, we attempt to address a multi-label text classification problem that predominately features very long text-based inputs. In particular, we focus on using the Gutenberg Project dataset and use the main text of an e-book to infer its genre. This is motivated partly because many e-books, especially new ones, may have few or no labeled genres; an automated approach would help curators and librarians assign correct genres for better cataloguing of library resources. We propose to use feature engineering combined with a distributed approach using deep neural networks to tackle long textual inputs. We evaluate and benchmark various models according to our custom metrics in order to determine their effectiveness.

Predicting Hotness from Million Song Dataset

Andrew Alini, Christopher Benson, Harsh Jain, Sabyasachi Mohanty, Varun Natu

The pace of music production is ever increasing, so it can be hard for those in the music industry to consume and rate new music as it is released. To address this, we created a system that rates the “hotness” of a new, unseen song using the song's MFCC audio features. We trained on the Million Song Dataset and took advantage of distributed ML frameworks such as SparkML, TensorFlow, and Horovod.

Automated Road Network Extraction and Route Travel Time Estimation from Satellite Imagery

Phalguna Dasaratha Mankar, Karan Vasant Hebbar, Sameed Qureshi, Vedant Sanil, Zeeshan Ashraf Shaikh, Sharath Srikanth Chellappa

We focused on data from the SpaceNet Challenge. The aim of this challenge is to build models at scale that are able to use satellite imagery to not only detect the network of roads but also provide an estimate of the travel time along the different routes. With this data, we aimed to answer the following questions: (1) Is it possible to use satellite images to accurately identify roads? (2) Is it possible to build a network (graph structure) from the identified roads? (3) Is it possible to estimate travel times using the graph network (on the detected edges)?

The data that we had was first divided into the image data and the metadata. The image data consisted of approximately 2500 annotated images in total for training, and approximately 930 test images. The images were in TIF format. The metadata consisted of GeoJSON data, linestring data (road graphs), and TIF geodata images. For the GeoTiff processing we used the GDAL and CV2 packages.

With this processed data, the end goal of our model was to segment out roads from the satellite images and predict travel times for the roads. The processed 8-bit image was first fed into 4 separately trained UNet-inspired models in parallel. The models have a ResNet34 encoder and a U-Net decoder. These models output the segmentation masks of the roads. For robustness, the output from the four models was superimposed to create a final segmentation output. The segmented image was then smoothed out, small gaps were closed, and spurious connections were removed. An attempt was also made to clean hanging edges and connect terminal vertices near non-connected nodes. From the final cleaned segmented image, we extracted the skeleton using skimage. From the skeleton, the graph was extracted using sknw, a Python library that converts a skeleton to a graph.
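
The superimposition step described above can be sketched as follows. This is only one plausible reading: it assumes that each model outputs a per-pixel road probability and that "superimposed" means a pixel-wise average followed by a threshold (a pixel-wise union would be another reading). The function name and the 0.5 threshold are hypothetical and not taken from the project.

```python
def superimpose_masks(masks, threshold=0.5):
    """Average per-pixel road probabilities across models, then binarize.

    masks: list of equally sized 2-D lists of floats in [0, 1], one per model.
    threshold: hypothetical cutoff for calling a pixel "road".
    """
    if not masks:
        raise ValueError("need at least one mask")
    rows, cols = len(masks[0]), len(masks[0][0])
    combined = []
    for r in range(rows):
        row = []
        for c in range(cols):
            mean_prob = sum(mask[r][c] for mask in masks) / len(masks)
            row.append(1 if mean_prob >= threshold else 0)
        combined.append(row)
    return combined


if __name__ == "__main__":
    # Four toy 2x2 "model outputs" standing in for the four segmentation models.
    outputs = [
        [[0.9, 0.1], [0.6, 0.4]],
        [[0.8, 0.2], [0.4, 0.4]],
        [[0.7, 0.0], [0.6, 0.3]],
        [[1.0, 0.1], [0.6, 0.2]],
    ]
    print(superimpose_masks(outputs))
```

Averaging lets a single model's false positive be outvoted by the other three, which is one way an ensemble can add robustness.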

We stored the data in an AWS S3 bucket and accessed it from an EC2 instance. We trained the model for 50 epochs on an AWS g4xdn.large EC2 instance and GCP VM instances, which took around 3-4 hours. With this, we were able to accurately identify the roads from the satellite images and build a road network from the identified roads.

Exploring Relationships Between Subreddits

Reddit is a forum where people can comment on many different topics organized by subreddits. The question we set out to answer is: what are the relationships between different subreddits across time, and how do we interpret them? For this project, our methods included TF-IDF for tokenization, LDA topic modeling and PCA for dimensionality reduction, and t-SNE and K-means clustering for evaluation. We used Dataproc on Google Cloud Platform (GCP) as our cluster service provider and a 403 GB public Reddit dataset available via GCP’s BigQuery platform. We found that (1) the size of clusters and the membership of subreddits change over time, and (2) LDA outperforms PCA in terms of interpretability but underperforms in terms of silhouette score.

Predicting the Stock Market with Reddit Comments

The question we wanted to answer was whether we could predict stock market trends using Reddit comments. We used sentiment analysis and PCA to preprocess the data and used logistic regression with hyperparameter search to find the best-performing model. We used PySpark to train our models on AWS EMR machines and an AWS S3 bucket to store datasets. Eventually, our best-performing model achieved around 65% accuracy on the testing dataset. One important lesson we learned is that large-scale machine learning tasks can be time-consuming in terms of both implementation and training, so it is best to design and plan carefully ahead of time.

Year and Decade Prediction on Million Song with Spark

We try to predict the year and decade of a song based on its timbre features in the Million Song Dataset. We explore different dimension reduction techniques (PCA and t-SNE) and different distributed ML models, including Logistic Regression, Naive Bayes, and Random Forest. We use Spark with MLlib to build our pipeline and run the distributed models on AWS EMR. We find that the decade pattern is clearer than the year pattern in the visualization results, which is consistent with the experimental result that predicting the decade is much easier. Another interesting observation is that in the difficult setting (predicting the year), Random Forest performs much better than the other methods (but also costs much more to train), whereas in the easy setting (predicting the decade), the gap between models is smaller and a lightweight model is preferred.

Comparison of dimension reduction algorithms on Million Song Dataset

This project compared dimension reduction methods using the Million Song Dataset. The algorithms compared include PCA, AutoEncoder, and LargeVis. The metrics we used include basic ones like running time, memory usage, and model linearity. A regression model was also run on the dimension-reduced data to predict song hotness, and the RMSE and Pearson R value of the regression were used as additional metrics for comparing the algorithms. We discovered that PCA was the most efficient method in this case because it was much faster, took less memory, and preserved more information. As a built-in algorithm in the pyspark library, PCA is also much easier to use.

Hotness Prediction on Million Song Dataset

In this project, we evaluate the performance of multiple models in predicting song hotness based on a large number of features in the Million Song Dataset. We mainly focus on three machine learning models: Linear Regression, Decision Tree Regression, and Gradient Boosting Regression. We use AWS EBS for data storage and run Spark on EMR instances for distributed training. Our results show the pros and cons of each model in terms of accuracy, time, storage, communication costs, ease of use, and interpretability.

Genre Classification on Million Song Dataset

We used machine learning techniques to predict genre labels based on songs’ metadata and audio features. We compared the performance of Logistic Regression, Random Forest, and Neural Networks on genre classification. Early-stage data pre-processing was run on EC2, while later model training and testing were done in PySpark with MLlib running on AWS EMR. Logistic Regression performed best, with a cross-validation accuracy of 0.9351, while Random Forest reached 0.6018 and the Neural Network 0.6934. We have strong reasons to believe that the data might be linearly separable in high dimensions (e.g., separable by high-dimensional planes). In such situations, models with linear classification boundaries, such as logistic regression, tend to perform better than those with non-linear classification boundaries (Random Forest, Neural Networks).

NIH Chest X-ray Image Classification

We used NIH chest X-ray images to make disease predictions. For data preprocessing, we applied data augmentation (horizontal flips and rotations) and a patient-level dataset split. We used TensorFlow and compared three CNN models (ResNet50, MobileNetV2, and EfficientNet-B4) on both accuracy and efficiency. The data preprocessing and training were done in multiple steps on AWS. We found that MobileNet provided both the best accuracy and the best efficiency, while EfficientNet, which gives state-of-the-art results on ImageNet, did not perform as well on our task.

Predicting Reddit Score with ML

In this project, we try to predict the score of a Reddit comment as a regression problem, given information about that comment and other information in the thread. We applied linear regression, random forests, and multi-layer perceptrons to this task. We utilized Microsoft Azure and Google Cloud Platform to process our dataset and run our experiments. We found that random forests had the best performance and achieved a mean absolute error of only 3.63 points.

New York Cab Fare Prediction

Rohit Prakash Barnwal, Bharat Gaind, Abhinav Gupta, Yu-Ning Huang, Anmol Jagetia, Ignacio Maronna Musetti

The goal of this project was to efficiently perform fare prediction on NYC cab data, which totals 140 GB, and to explore whether augmentation with supplementary datasets provides additional information that improves the ML model's performance. Our team wrote the code in Spark and PySpark for data cleaning and employed clever tricks to join the diverse datasets. A linear regression model was a good fit for the data, and we used Spark's MLlib library to fit the model. This allowed us to train the model in a distributed environment with 11 instances of m5.xlarge on the AWS EMR cluster. We achieved an RMSE of 3.847 for our prediction model and provide an interesting analysis of model performance with different fields in the data. The project was a great hands-on experience with big data and with deploying a large-scale machine learning pipeline.

Forecasting Web Page View using Bidirectional LSTM with Auxiliary Features

In this project, we explored the possibility of forecasting click trends of webpages using past page view series (~365 GB) from the Wikistats PageView Dataset. Aiming to leverage both local contextual information and global periodic patterns, we constructed a machine learning model consisting of three parts: a bidirectional LSTM model, a dense layer fed by extracted auxiliary features (such as mean, std, and location of spikes), and an output layer. All raw data is stored in an Amazon S3 bucket; data cleaning and processing are completed on Amazon Elastic MapReduce (Amazon EMR), using Spark and Python 3 for parallel processing; model building, training, and validation are completed on EC2, using TensorFlow 2.1.0 and the Keras APIs. As a result, our model successfully learned the general curvature of webpage page views and reduced the average Mean Absolute Error to 10.
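
To make the auxiliary features mentioned above more concrete, here is a minimal sketch of how a mean, a standard deviation, and spike locations might be extracted from one page view series. The spike rule (a point more than spike_z standard deviations above the mean), the default of 2.0, and the function name are assumptions for illustration; the project's actual feature definitions are not given here.

```python
import statistics


def auxiliary_features(series, spike_z=2.0):
    """Summarize one page-view series as mean, std, and spike positions.

    A "spike" is taken to be any point more than spike_z standard deviations
    above the series mean. Both the rule and the default are assumptions.
    """
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        spikes = []  # a flat series has no spikes under this rule
    else:
        spikes = [i for i, v in enumerate(series) if (v - mean) / std > spike_z]
    return {"mean": mean, "std": std, "spike_positions": spikes}


if __name__ == "__main__":
    daily_views = [10, 10, 10, 10, 10, 10, 10, 10, 10, 100]
    print(auxiliary_features(daily_views))
```

Summary features like these describe the whole series at once, which complements a recurrent model that reads the series step by step.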

Stack Exchange Answer Ranking and Accepted Answer Prediction

In this project, we use machine learning techniques to predict potential accepted answers and to evaluate the quality of answers (answer rankings) on Stack Exchange (Q&A websites). The ML techniques we used include TF-IDF for textual similarity, a base RoBERTa model for contextual representations, random forests, XGBoost with LambdaMART, linear regression, logistic regression, and MLPs. We ran our experiments on GCP and AWS using Spark (PySpark) and a number of Python libraries, including xmltodict, matplotlib, multiprocessing, joblib, pandas, numpy, sklearn, transformers, xgboost4j, and seaborn. For the accepted answer prediction task, we achieved an 80.03% question-wise accuracy with BERT, and for the answer ranking task, we achieved 89.76% avgNDCG with XGBoost.

Popularity Prediction for Reddit Comments

Aditya Anantharaman, Atabak Ashfaq, Manik Bhandari, Preetansh Goyal, Pratik Jayarao, Siddhanth Pillay

The project aims to explore the usefulness of NLP features from the comment body and additional Reddit features in predicting the popularity of comments. We formulated the problem as both a classification problem (popular or not) and a regression problem (upvote prediction) and used Logistic Regression, Random Forests, and Linear SVC for the prediction tasks. The project was implemented using Azure and Databricks, and we relied on PySpark/MLlib for distributed processing, along with Python libraries like NLTK for feature engineering. The results show that NLP features can help predict comment scores while also showing scope for improvement, which motivates exploring parent and contextual features. One interesting takeaway was that adding more data beyond a certain point did not improve the performance of our simple models like Logistic Regression.

Recommender System on Amazon Ratings

Dipak Krishnan, Vinay Sisodiya Sisodiya, Shuo Wang, Yue Yin, Varun Baldwa, Theodore Li

We built a recommender system on the Amazon ratings dataset, which contains 233 million reviews. We used collaborative filtering and content-based filtering to train the model. We used Spark and Databricks on small experimental datasets and did our training on an Amazon EMR cluster with 7-10 m5xlarge nodes. Given that the overall ratings in our dataset are skewed, we judge our model successful with respect to the goal of training a meaningful recommendation system, as shown by its RMSE and MAE. Large-scale data can cause various unexpected problems in an ML pipeline.

Million Song Dataset: Year and Culture

In this project, we want to predict the creation year of a song. We explore and compare three machine learning models (Logistic Regression, Naive Bayes, and Decision Tree) and one deep learning model (Multilayer Perceptron). The models are implemented on the Microsoft Azure and AWS platforms. Among these models, the Multilayer Perceptron performs best, with 61.8% accuracy.

Music Genre Classification on Million Song Dataset

The problem we are interested in is how different models perform on music genre classification according to different metrics. The ML methods we chose are Logistic Regression, Random Forest, Naive Bayes, and Neural Network, and the metrics include accuracy, time, memory, per-label precision, and ease of use. We use AWS EMR and EC2 as servers, S3 as intermediate data storage, and Apache Spark with MLlib as our distributed ML environment. The final results show that LR and NN have the highest accuracy, RF gives users more flexibility to trade off accuracy against time and memory, and Naive Bayes consumes the least time and memory but does not perform well on accuracy, since it is a simple probabilistic model. Other interesting takeaways are that some features, such as “artist name”, are more important in genre classification (i.e., have the largest weights), and that the accuracy of simple models may differ greatly depending on how features are pre-partitioned.

Song Popularity Prediction on Million Song Dataset

In this project, our group creates a machine learning pipeline to predict song popularity, also known as song hotness, for the Million Song Dataset, which includes 1,000,000 songs. After data cleaning, PCA and feature engineering are performed to generate the training, validation, and test datasets, which are stored in an AWS S3 bucket. Linear regression, random forest, and gradient boosted tree models are trained and evaluated on AWS EMR using PySpark and MLlib. The best-performing model is the gradient boosted tree model, which has a mean absolute error of less than 0.15. This demonstrates the pipeline's effectiveness and accuracy on large-scale datasets.

Dimension-Reduction Methods for Image Classification

This project presents a systematic comparison of dimensionality reduction techniques in the context of large-scale image classification. Given the nature of large image datasets, it is not immediately clear what the best way to reduce the dimension of the images would be. Therefore, we chose to implement 3 different dimension-reduction techniques (PCA, KPCA, and Deep Autoencoding) and compare their performance with respect to selected metrics (e.g., runtime, memory usage, scalability, reconstruction quality, classification performance) using Spark and TensorFlow. Ultimately, we found that PCA had the best performance with respect to our metrics, but Deep Autoencoding was the most scalable. However, further study is needed to corroborate our results.

Temporal Topic models for Reddit Score Prediction

We set out to see whether, given the features observable at the time a Reddit post is created, it would be possible to predict the score the post would eventually attain, and to figure out the infrastructure and pipeline design required to support such a production-level large-scale ML system.

To predict the score of Reddit posts, we used the Reddit Post Dataset (300 GB), from which we extracted metadata (time, subreddit) and content (title and post text) to create features (month-level topic models, word2vec embeddings, sentiment) and train regression models: Linear Regression, Random Forests, and Gradient Boosted Trees. Their performance was compared using RMSE and R2 as metrics.

The raw data was pulled from Google BigQuery into the five-node (n1-standard-4, 15 GB RAM, 4 vCPUs) Dataproc cluster that we used for data processing (2 days and 6 hours for 200 GB) and machine learning (~100s for linear regression, ~90s for the random forest, ~550s for GBTree).

The results varied from subreddit to subreddit, with RMSE ranging from 0.09 (OneAmericaNews subreddit) to 8287 (gifs subreddit) and R2 values on appropriate subreddits ranging from 0.16 (ethtrader) to -0.12 (Futurology).
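
For readers unfamiliar with how these two metrics behave, the following sketch implements the standard definitions of RMSE and R2 on toy data. It is an illustration only and not the project's evaluation code; the function names are hypothetical. It shows that R2 is 0 for a model that always predicts the mean of the targets and becomes negative for a model that does worse than that baseline.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    scores = [0, 0, 0, 4]             # toy, skewed targets (mostly zero)
    mean_baseline = [1, 1, 1, 1]      # always predict the mean
    worse_than_mean = [2, 2, 2, 2]    # systematically too high
    print(r2_score(scores, mean_baseline))    # R2 of the mean baseline
    print(r2_score(scores, worse_than_mean))  # negative R2
    print(rmse(scores, worse_than_mean))
```

Because RMSE is measured in the units of the score itself, it is not directly comparable between subreddits whose scores live on very different scales, whereas R2 is normalized by the variance of the targets.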

We found that the heavily skewed scores (most scores are zero) and the temporal relevancy of Reddit posts require a more nuanced feature engineering approach. Also, we learned the phases of a production ML pipeline (from Data Ingestion to Model Scoring) and got a handle on estimating the infrastructure required to support it.

Song Recommendation System on Million Song Dataset

Deeptha Anil Kumar, Bhumi Dinesh Bhanushali, Varsha Kuppur Rajendra, Kathan Nilesh Mehta

Our main aim in this project was to build an ML pipeline for the Million Song Dataset that recommends, in an optimized way, the top songs a user would prefer to listen to. For this, we performed a comparative analysis of three machine learning techniques based on computation time and precision. These methods were a popularity-based model, a content-based model, and a collaborative model. We used AWS to load the data, extracted a CSV version of it, and uploaded it to the Microsoft Azure cloud platform to use it at scale with Databricks. The collaborative model gave the best precision, followed by the content-based model and the popularity-based model. In terms of computation time, the popularity-based model was naive, simple to implement, and took the least time to train, followed by the collaborative model and the content-based model. Apart from the comparative study, we also learned how to deal with large real-world datasets; the majority of our time was dedicated to data loading, data cleaning, and feature extraction rather than to running each ML technique.
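
As an illustration of why a popularity-based model is the simplest of the three to implement, here is a minimal single-machine sketch. It assumes the listening data is available as (user_id, song_id) pairs and recommends the globally most-played songs a user has not yet heard; the schema and the function name are hypothetical, and the project's actual Databricks implementation is not shown here.

```python
from collections import Counter


def recommend_popular(play_events, user, k=3):
    """Recommend the k most-played songs that `user` has not listened to.

    play_events: iterable of (user_id, song_id) pairs (hypothetical schema).
    """
    events = list(play_events)
    play_counts = Counter(song for _, song in events)
    already_heard = {song for u, song in events if u == user}
    ranked = [
        song for song, _ in play_counts.most_common()
        if song not in already_heard
    ]
    return ranked[:k]


if __name__ == "__main__":
    events = [
        ("u1", "a"), ("u2", "a"), ("u3", "a"),
        ("u1", "b"), ("u2", "b"),
        ("u3", "c"),
    ]
    print(recommend_popular(events, "u3"))        # u3 has already heard a and c
    print(recommend_popular(events, "new_user"))  # no history: global ranking
```

"Training" here is a single counting pass over the data, which is consistent with the short training time reported above; the trade-off is that every user with the same listening history gaps gets the same recommendations.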

NYC Taxi Fare Prediction

Zeeshan Ahmed, Abhishek Bamotra, Vivek Gupta, Jimmy Herman, Baljit Singh, Pranav Thombre

We sought to answer the following questions: What is the projected total fare for a taxi trip given the time, date, and pickup & drop-off location? What features are most relevant in predicting a taxi fare? What degree of prediction accuracy is satisfactory? What other metrics should be considered when selecting our model? We evaluated 4 ML models: Linear Regression, Decision Tree, Random Forest, and Gradient Boosted Trees. All of our work was completed on Spark using the ML library and the PySpark API. We used both Azure Databricks and AWS EMR for our project. We found that the Random Forest model had the lowest RMSE in fare prediction, while the linear regression model had the fastest training time, lowest inference latency, and smallest model size and was the most interpretable. However, across all metrics, the decision tree was the most balanced model. Through this project we learned valuable lessons in handling large datasets and managing cloud resources and expenses, and we gained experience with distributed ML on AWS and Azure, configuring Spark executors, mitigating failure scenarios, and much more.

Million Song Dataset for Music Mode and Genre Classification

The main goal of this project is to use the Million Song Dataset to evaluate the performance of different machine learning methods in specific contexts. Specifically, we are curious about whether it is possible to predict general information about a song, such as mode and genre, given some musical features and artist information about the song. We therefore chose four machine learning models (logistic regression, random forest, gradient boosted trees, and multi-layer perceptron) to predict music mode and to classify songs by genre. Multiple tools and platforms were used in this project, such as AWS S3 and Azure Databricks.

Finding Patterns in Darknet Market Listings

Arthur Dzieniszewski, Gregory Howe, Jennifer Lee, Ulani Qi, Alexander Schneidman

Our goal for this project was to find interpretable patterns or useful groupings among the darknet market listings. To solve this problem, we used several dimensionality reduction techniques, including PCA, t-SNE, and VDSH. We ran all of our computation and data preprocessing on AWS EMR clusters and AWS CUDA instances. Our most successful technique, VDSH, revealed that the listings could be clustered into smaller groups where the nearest neighbor would be a listing for a similar product.

10-405/10-605: ML with Large Datasets, Spring 2020