Project proposal #1

This assignment consists of running a full machine learning (ML) pipeline using several Python libraries for big data. The four main objectives are:

identify performance bottlenecks when using a specific library
identify syntactic differences among the different libraries
identify operations that are best suited to one particular library
get acquainted with these libraries and learn which parts of pandas, scikit-learn and NumPy each one supports

Your first task is to repeat the experiments found here. In these experiments, the authors compare Koalas (PySpark) and Dask on various database-like operations.
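Since the comparison rests on execution times per operation, a small timing helper can keep the measurements consistent across libraries. The following is only a sketch using the standard library; `time_operation`, `summarize` and `count_long_trips` are hypothetical names, and the list-based operation is a stand-in for a real dataframe operation.

```python
import statistics
import time


def time_operation(operation, repeats=3):
    """Run `operation` several times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce raw timings to figures that are easy to compare across libraries."""
    return {
        "min_s": min(timings),
        "mean_s": statistics.mean(timings),
        "runs": len(timings),
    }


# Hypothetical stand-in for a database-like operation (a filtered count).
def count_long_trips():
    distances = [i % 50 for i in range(100_000)]
    return sum(1 for d in distances if d > 10)


report = summarize(time_operation(count_long_trips, repeats=3))
print(report)
```

With lazily evaluated libraries such as Dask or PySpark, the timed callable has to force the computation (for example by calling `.compute()` on a Dask result), otherwise only the construction of the task graph is measured.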

You may use GCP to create a cluster of machines similar to AWS i3.4xlarge instances and a single machine similar to an AWS i3.16xlarge instance, as far as your credits allow.

After repeating those experiments, your next task is to modify them to integrate code that uses Modin, JobLib and RapidsAI.
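As one possible starting point for the JobLib part, the sketch below splits an aggregation into per-chunk tasks and combines the partial results. The chunk values and the helper `chunk_fare_stats` are hypothetical; in a real run each chunk could be one file or one partition of the dataset.

```python
from joblib import Parallel, delayed


def chunk_fare_stats(fares):
    """Return (sum, count) for one chunk so partial results can be combined."""
    return sum(fares), len(fares)


# Hypothetical fare chunks standing in for files or partitions.
chunks = [[5.0, 7.5, 10.0], [12.0, 8.0], [52.0, 6.5, 9.0]]

# prefer="threads" keeps this sketch lightweight; Joblib's default
# process-based backend is usually the better fit for CPU-bound Python code.
partials = Parallel(n_jobs=2, prefer="threads")(
    delayed(chunk_fare_stats)(chunk) for chunk in chunks
)

# Combine the per-chunk results into a global mean.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean_fare = total / count
print(mean_fare)
```

Modin is usually tried in a different way: existing pandas code is kept and only the import is changed to `import modin.pandas as pd`. Whether every operation used in the experiments is supported that way is part of what this assignment asks you to find out.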

In addition to running your experiments with the NYC taxi driver dataset, choose two other datasets: one smaller and one larger than the NYC taxi dataset. They can be samples of the taxi dataset or entirely different datasets.
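If the extra datasets are derived from the taxi data, pandas sampling is one way to produce them. The sketch below is an assumption about how this could be done, not a requirement: the small synthetic frame stands in for the real data, and building the larger dataset by sampling with replacement is just one option.

```python
import pandas as pd

# Hypothetical stand-in for the taxi data; column values are synthetic.
taxi = pd.DataFrame({
    "trip_distance": [float(i % 30) for i in range(1000)],
    "fare_amount": [2.5 + 2.0 * (i % 30) for i in range(1000)],
})

# Smaller dataset: a 10% random sample without replacement.
smaller = taxi.sample(frac=0.1, random_state=0)

# Larger dataset: resampling with replacement (frac > 1 requires replace=True).
# This duplicates rows, so it adds volume but no new information.
larger = taxi.sample(frac=2.0, replace=True, random_state=0).reset_index(drop=True)

print(len(smaller), len(taxi), len(larger))
```

An oversampled dataset is adequate for measuring execution times, but it should not be used to judge model quality, because duplicated rows can end up in both the training and the test split.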

For the taxi dataset, the machine learning task is to build a model to predict the target variable "fare_amount".

Suggested ML models: XGBRegressor and LogisticRegression. Both make predictions, but the second performs classification, so to use it you need to discretize the "fare_amount" variable.
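One common way to discretize a continuous target is quantile binning. The sketch below uses `pandas.qcut` on a handful of made-up fare values; the choice of three classes is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical fare values standing in for the real "fare_amount" column.
fares = pd.Series([2.5, 5.0, 7.5, 10.0, 20.0, 52.0], name="fare_amount")

# Quantile-based binning yields roughly balanced classes:
# 0 = low fares, 1 = medium fares, 2 = high fares.
fare_class = pd.qcut(fares, q=3, labels=False)

print(fare_class.tolist())
```

Quantile bins keep the classes balanced, whereas fixed-width bins (`pandas.cut`) are easier to interpret but can leave some classes nearly empty when the fares are skewed.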

A full ML pipeline consists of:

reading the data
preprocessing (which may include cleaning, filtering, feature selection, etc.)
training and validation (use cross-validation and tune parameters)
testing
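The four stages above can be sketched end to end with scikit-learn. Everything in this example is an assumption made for illustration: the data is synthetic, the two feature columns are hypothetical, the target is discretized at the median, and the parameter grid is deliberately tiny.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Reading the data: a synthetic stand-in with hypothetical columns.
rng = np.random.default_rng(0)
distance = rng.uniform(0.5, 20.0, size=400)
passengers = rng.integers(1, 5, size=400)
fare = 2.5 + 2.0 * distance + rng.normal(0.0, 1.0, size=400)

# 2. Preprocessing: filter implausible rows, then build features and target.
keep = fare > 0
X = np.column_stack([distance[keep], passengers[keep]])
y = (fare[keep] > np.median(fare[keep])).astype(int)  # discretized target

# Hold out a test set before any tuning takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 3. Training and validation: 5-fold cross-validated grid search.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(
    pipeline, {"model__C": [0.1, 1.0, 10.0]}, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

# 4. Testing: score the tuned model once on the held-out data.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Placing the scaler inside the `Pipeline` means it is refit on each training fold during cross-validation, so no statistics from the validation folds leak into training.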

Define scoring metrics to evaluate the models (accuracy, precision, recall, F-measure, error, etc.).
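As an illustration, the metrics listed above are available in `sklearn.metrics`. The labels and fare values below are made up, and the choice of MAE and RMSE as error measures for the regression model is an assumption.

```python
import math

from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Classification case (e.g. a discretized fare class); hypothetical labels.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
classification_scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

# Regression case (predicting "fare_amount" directly); hypothetical values.
fare_true = [10.0, 20.0, 30.0]
fare_pred = [12.0, 18.0, 33.0]
regression_scores = {
    "mae": mean_absolute_error(fare_true, fare_pred),
    "rmse": math.sqrt(mean_squared_error(fare_true, fare_pred)),
}

print(classification_scores)
print(regression_scores)
```

For a target with more than two classes, `precision_score`, `recall_score` and `f1_score` need an explicit `average` argument such as `"macro"` or `"weighted"`.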

Suggested structure for the report (it can be a notebook with comments):

Brief background on PySpark, Dask, Modin, JobLib, Rapids and Koalas
Materials and methods

Machines used and their characteristics
Datasets description

Experiment #1: repeat NYC taxi driver dataset study

report comparisons of execution times for each operation defined in the blog

Experiment #2

Run all datasets using Dask+Modin, Dask+Rapids, Dask+Modin+Rapids and Koalas
You may need to use cProfile or yappi to profile your code (pycallgraph may not work because the code is somewhat complex)

Discussion and conclusions
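
For the profiling step mentioned under Experiment #2, a minimal cProfile session could look like the sketch below. `preprocess` and `run_pipeline` are hypothetical stand-ins for the code under study; only the `cProfile` and `pstats` calls come from the standard library.

```python
import cProfile
import io
import pstats


def preprocess(rows):
    """Hypothetical preprocessing step: drop non-positive fares."""
    return [r for r in rows if r > 0]


def run_pipeline():
    """Hypothetical stand-in for the pipeline being profiled."""
    rows = [float(i % 40) - 1.0 for i in range(200_000)]
    cleaned = preprocess(rows)
    return sum(cleaned) / len(cleaned)


profiler = cProfile.Profile()
profiler.enable()
mean_fare = run_pipeline()
profiler.disable()

# Sort by cumulative time and keep the ten most expensive entries.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
profile_report = buffer.getvalue()
print(profile_report)
```

cProfile only observes the Python process in which it runs, so time spent in worker processes, on the JVM side of Spark, or on a GPU is likely to appear only as waiting time in the driver code. Those parts may need the library's own diagnostics in addition to the profile.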