Ciência de Dados em Larga Escala (Large-Scale Data Science)
Dates of partial evaluations
TEST #1: April 10th (FC1 226 - Math Dept) 9:00-10:50
TEST #2: May 29th (FC4 126 - Biology Dept) 9:00-10:50
-
-
Dates of final evaluations
NORMAL (Regular season): June 14th (time: 9-12, room FC1 219)
RECURSO (Appeal season): July 1st (time: 9-12, rooms FC6 142/FC6 146)
(keep checking here for updates)
-
Projects (Deadline to submit your work: June 1st)
Three kinds of projects are proposed: one oriented to performance analysis, a second one oriented to the construction of infrastructures and a third one on a machine learning application.
You can propose your own project, as long as it covers the course materials.
Please organize yourselves into groups of at most two students and choose one kind of project.
Proposed Projects:
-
Please tell me here which project your group chose, or whether you are proposing a different one.
-
-
-
-
Uploaded 5/03/24 at 11:31
-
-
13/03: Clouds (cont. from slide 26), Virtualization, MapReduce
-
-
-
The most recent survey of the many types of scheduling algorithms studied in the context of MapReduce computations.
(Its exclusion criteria are not very good, but that matters less than the contents.)
-
20/03: Apache Spark
-
-
27/03: Happy Easter!
-
03/04: FCUP activities day - no classes
-
10/04: TEST #1 (FC1 226 - Math Dept) 9:00-10:50
Topics:
- Concept of cloud and types of clouds
- Concept of virtualization
- Types of computer architectures
- programming models, data distribution
- advantages and disadvantages
- characteristics
- programming models, data distribution
- Exercises given in practical classes
Suggested book chapters and sections (Cloud Computing - Theory and Practice, by Dan Marinescu, 1st edition; chapters and sections may change if you use the second edition):
- C1: intro, s1.3, s1.4, s1.5, s1.6, s1.7
- C2: intro, s2.1, s2.2, s2.9, s2.10
- C3: intro, s3.2, s3.7, s3.8, s3.9, s3.10
- C4: intro, s4.1, s4.2, s4.6, s4.7, s4.8, s4.9, s4.10
- C5: intro, s5.1, s5.2, s5.3, s5.4
- C6: intro
17/04: Apache Beam, Dask & co.
-
-
-
24/04: modin, joblib (from slide 22 of the Dask slides above) and Graph Neural Networks
-
-
01/05: Labor Day (holiday - no class)
-
08/05: Academic week (Semana Académica) - no class
-
15/05: An alternative to GNNs / Programming for GPUs
-
-
-
22/05: Programming for GPUs (cont.) and opportunities for parallelization in ML
-
-
29/05: TEST #2 (FC4 126 - Biology Dept) 9:00-10:50
Contents:
- Data distribution and schedulers
- apache beam, dask, modin, joblib
- GNNs and pytorch geometric
- cupy, numba, cudnn, rapids-ai
Review the links suggested in theoretical and practical classes
Review practical classes
-
-
-
-
-
Upload your PDF or HTML report for this class here.
-
-
-
Upload your notebook here with your comments and explanations.
-
-
After redeeming your coupon and checking for your credits (look for "credits" in the menu on the left), start exploring the platform.
Install the Google Cloud toolkit.
Try the platform.
-
-
-
-
-
-
- Follow this pyspark tutorial.
- After finishing, reimplement your wordcount program using pyspark and compare it with the sequential and Apache Beam implementations.
- pyspark supports pandas and can run pandas-like operations in parallel.
- A good tutorial on pyspark and dataframes.
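As a reference point for the comparison suggested above, here is a minimal sketch of a sequential word count, together with the same computation written as separate map and reduce steps. It uses only the Python standard library and is not the course's reference solution. The function names (`tokenize`, `wordcount_sequential`, `wordcount_mapreduce_style`), the tokenization rule and the sample input are all assumptions made for illustration.

```python
import re
from collections import Counter
from functools import reduce


def tokenize(line):
    """Lowercase a line and split it into words (assumed rule: letters and apostrophes)."""
    return re.findall(r"[a-z']+", line.lower())


def wordcount_sequential(lines):
    """Baseline: a single pass over the input, updating one counter."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


def wordcount_mapreduce_style(lines):
    """Same result, expressed as independent map and reduce steps."""
    # Map step: each line is turned into its own partial count.
    partials = map(lambda line: Counter(tokenize(line)), lines)
    # Reduce step: merge the partial counts; Counter addition sums per word.
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical sample input, just to exercise both versions.
sample = ["the quick brown fox", "the lazy dog", "The fox"]
sequential = wordcount_sequential(sample)
mapreduce = wordcount_mapreduce_style(sample)
print(sequential.most_common(2))
```

The second version has roughly the shape of a `flatMap` followed by `reduceByKey` in pyspark, so it can help when checking that the parallel implementation returns the same counts as the sequential one before comparing running times.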
-
-
-
Review exercises
-
Exercises from the book by Dan Marinescu:
- C2: Problems 3, 6, 9
- C3: Problem 1 (applied to the Google Cloud Platform), Problems 2, 3, 4, 5, 7, 8, 9, 10
- C4: Problems 4, 5, 6, 7, 8, 9
- C5: Problems 1, 2, 3, 4
- C6: Problems 7, 8
- C8: Problems 3, 5
Review past mini-tasks
-
-
-
-
-
-
-
Accelerating dataframe operations using cuDF.
-