Introduction to data processing with Python and Spark
Keywords
Classification | Keyword
CNAEF | Informatics
Instance: 2023/2024 - SP (from 17-05-2024 to 08-06-2024)
Cycles of Study/Courses
Teaching Staff - Responsibilities
Teaching language
Portuguese
Objectives
The course is designed for people with basic Python programming knowledge who aim to develop skills in analyzing large volumes of data.
Learning outcomes and competences
At the end of the course, trainees should have acquired programming knowledge of Apache Spark and should be able to implement data analysis algorithms, namely:
- Know the MapReduce programming model and build basic programs using transformations and actions;
- Know the RDD and DataFrame models;
- Process structured data with Spark SQL;
- Read and write data files.
Working method
In person
Program
- Introduction to the MapReduce programming model;
- The HDFS storage model and RDD representation;
- Actions and transformations;
- Processing with Key-Value pairs;
- Definition of lambda functions;
- Processing flow with DAGs;
- Configuration of the parallelism level;
- Introduction to Spark SQL;
- Reading and writing structured files;
- Handling missing and incorrect data;
- Structured operations.
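As an illustration of the first program topics (the MapReduce model, key-value pairs and lambda functions), the following is a minimal sketch in plain Python, not Spark itself. The sample lines and the helper names `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical; they only mirror, on a small in-memory list, the pattern that Spark applies to distributed data.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample input; in Spark this would typically be an RDD.
lines = ["spark makes big data simple", "big data needs big tools"]

def map_phase(records):
    """Emit one (word, 1) key-value pair per word."""
    return [(word, 1) for line in records for word in line.split()]

def shuffle_phase(pairs):
    """Group the values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, func):
    """Combine the values of each key with a binary function."""
    return {key: reduce(func, values) for key, values in groups.items()}

# The lambda plays the role of the function passed to a reduce-by-key step.
word_counts = reduce_phase(shuffle_phase(map_phase(lines)), lambda a, b: a + b)
print(word_counts["big"])  # 3
```

In PySpark the same word count is usually written as a chain of transformations followed by an action, with the grouping step handled by the framework.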
Mandatory literature
Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee; Learning Spark: Lightning-Fast Data Analytics, 2nd ed., O'Reilly, 2020. ISBN: 978-1492050049
Teaching methods and learning activities
The course runs in person and comprises 21 hours of theoretical-practical contact.
The theoretical-practical sessions are supported by projected content and dedicated course notes.
During the presentation sessions, small PySpark programs will be developed interactively from worked examples. The theoretical-practical classes take place in a computer laboratory, where trainees are encouraged to solve short sets of varied problems with support from the trainer.
Evaluation Type
Distributed evaluation with final exam
Assessment Components
Designation | Weight (%)
Exam | 80.00
Laboratory work | 20.00
Total | 100.00
Amount of time allocated to each course unit
Designation | Time (hours)
Self-study | 60.00
Class attendance | 21.00
Total | 81.00
Eligibility for exams
No requirements
Calculation formula of final grade
NCP - Grade of the Practical Component
NE - Grade of the Exam
Final Grade = 0.2 x NCP + 0.8 x NE (between 0 and 20)
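The formula above can be checked with a short worked example. This is a minimal sketch: the function name `final_grade` and the sample grades are hypothetical, both components are assumed to be on the 0 to 20 scale, and no rounding rule is applied because none is stated.

```python
def final_grade(ncp: float, ne: float) -> float:
    """Return 0.2 x NCP + 0.8 x NE, with both grades on a 0-20 scale."""
    for name, value in (("NCP", ncp), ("NE", ne)):
        if not 0 <= value <= 20:
            raise ValueError(f"{name} must be between 0 and 20, got {value}")
    return 0.2 * ncp + 0.8 * ne

# Hypothetical grades: 16 in the practical component, 12 in the exam.
example = final_grade(ncp=16, ne=12)
print(round(example, 2))  # 12.8
```

Because the exam carries 80% of the weight, a four-point difference in NE shifts the final grade by 3.2 points, while the same difference in NCP shifts it by only 0.8.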