The file EVENTS.csv.gz contains records of events collected from
patients during their stay in the ICU (Intensive Care Unit).
Compressed, the file occupies 4.2 GB.
You need to perform a full data analysis and learning exercise with
this data, possibly using the machine learning pipeline we studied in
class. You do not need to use Apache Beam, but you are required to use
a suitable tool to process this data efficiently.
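One suitable approach is to stream the gzipped CSV in chunks so it never has to fit in memory at once. The sketch below is only illustrative: the column names (SUBJECT_ID, HADM_ID, ITEMID, VALUE) are assumptions based on the schema mentioned later, and a tiny in-memory file stands in for the real 4.2 GB EVENTS.csv.gz.

```python
import gzip
import io
import pandas as pd

# Stand-in for the real compressed file: a few rows, gzipped in memory.
# With the real file, pass its path to read_csv instead of this buffer.
raw = b"SUBJECT_ID,HADM_ID,ITEMID,VALUE\n1,10,100,5.0\n1,10,101,7.0\n2,20,100,3.0\n"
buf = io.BytesIO(gzip.compress(raw))

# Process the file chunk by chunk, accumulating per-patient event counts
# without ever loading the whole table into memory.
counts = {}
for chunk in pd.read_csv(buf, compression="gzip", chunksize=2):
    for sid, n in chunk.groupby("SUBJECT_ID").size().items():
        counts[sid] = counts.get(sid, 0) + n

print(counts)  # event count per SUBJECT_ID
```

The same streaming pattern scales to the full file by raising `chunksize`; only the running aggregates are kept in memory.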
Among the tasks you will perform are:
- Statistical analysis and visualization of the data for each
patient (SUBJECT_ID). Note: one patient can have more than one
hospital admission (HADM_ID).
An example visualization is shown below.
- Prediction of length of stay. For this task you will need to choose
a window size over which to train your data (too many days will delay
decisions about patients once the system is deployed; too few days
will probably produce a very shortsighted predictor; choose with
care).
Use the resources you learned wisely (MapReduce, PySpark, BigQuery,
multiprocessing, multithreading, pipelines, etc.) to analyze this
data.
More tables can be found in the bucket directory, and information
about the database schema, with a description of the columns, can be
found here.
You should hand in a report of your work (either a PDF or a
commented and annotated notebook).
Deadline: June 1st, 2024