Big Data Engineering
Keywords
Classification | Keyword
CNAEF | Informatics
Instance: 2024/2025 - 2S 
Cycles of Study/Courses
Acronym | No. of Students | Study Plan | Curricular Years | Credits UCN | Credits ECTS | Contact hours | Total Time
MECD | 26 | Syllabus | 1 | - | 6 | 42 | 162
M.EGI | 30 | Syllabus | 1 | - | 6 | 42 | 162
Teaching Staff - Responsibilities
Teaching language
Suitable for English-speaking students
Objectives
Extracting information from large data sets, known as "big data", has been a driver for many companies, large and small, in recent years, and it poses a specific set of challenges that this course addresses. The goal of this curricular unit is to familiarize the student with the major paradigms, challenges, and approaches to developing big data applications and systems.
Learning outcomes and competences
After completing this curricular unit, the student should:
1) be able to distinguish the different concepts that support parallel and distributed computing including data processing;
2) understand existing big data storage and processing architectures and systems;
3) be able to develop big data applications, namely search-based applications and learning-based applications, and characterize their performance;
4) be able to identify and discuss challenges in developing and using big data applications and models.
Working method
In person
Pre-requirements (prior knowledge) and co-requirements (common knowledge)
Programming; learning from data.
Program
1. Fundamental concepts of parallel computing: performance measurements, memory management and data location, limitations of parallel computing and parallel programming models.
2. Models for parallel programming with data: MapReduce model; key-value data organization; relation to the Hadoop Distributed File System (HDFS) and to resource management; Spark model, resilient distributed datasets (RDDs), actions and transformations, DAG-based execution, and broadcast variables.
3. Application development and performance characterization: search (Spark SQL) and learning (Spark MLlib); debugging, measurement, and tuning of tasks, jobs, and stages in Spark.
4. Challenges in developing and using big data applications and models, including privacy and anonymity, learning result interpretability, bias, and vulnerabilities.
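As an illustration of the MapReduce model and key-value data organization listed in topic 2, the following is a minimal single-process sketch in plain Python. It is not Hadoop or Spark code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit one (word, 1) key-value pair per word in the line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence (hypothetical helper)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["big data", "big systems"])` counts each word across both lines. The point of the sketch is the shape of the computation: the map and reduce functions work on independent keys, which is what allows a framework to run them in parallel.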
Mandatory literature
Hien Luu; Beginning Apache Spark 3, Apress, 2021. ISBN: 978-1-4842-7382-1
Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee; Learning Spark: Lightning-Fast Big Data Analysis, 2nd Edition, O'Reilly, 2020. ISBN: 978-1492050049
Complementary Bibliography
Tom White; Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, 4th Edition, O'Reilly, 2015. ISBN: 978-1491901632
Tomcy John, Pankaj Misra; Data Lake for Enterprises: Lambda Architecture for building enterprise data systems, Packt Publishing, 2017. ISBN: 978-1787281349
Teaching methods and learning activities
1) Exploration of the fundamental concepts of parallel programming with data, big data programming models and system architectures, and applications, through: a) lectures; b) autonomous search for scientific papers and use case reports; c) a flipped classroom approach, with self-study of previously identified content followed by discussion of these concepts in class.
2) Autonomous exploration, presentation, and discussion of scientific papers.
3) Project including the specification, development, testing, and performance characterization of big data applications using the technologies and concepts discussed in the course.
Evaluation Type
Distributed evaluation with final exam
Assessment Components
Designation | Weight (%)
Exam | 50.00
Practical or project work | 50.00
Total: | 100.00
Amount of time allocated to each course unit
Designation | Time (hours)
Project development | 60.00
Autonomous study | 60.00
Class attendance | 42.00
Total: | 162.00
Eligibility for exams
Developing the project and attending class.
Calculation formula of final grade
CF = 0.5*T + 0.5*P; if (T < 10.0 or P < 10.0) then CF = MIN(CF, 9.0)
T - test
P - project
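The formula above can be sketched in Python as follows. The function name `final_grade` is hypothetical, and the sketch assumes T and P are given as numbers on the same scale as the 10.0 threshold in the formula.

```python
def final_grade(t: float, p: float) -> float:
    """Compute CF from the test grade T and the project grade P.

    Both components weigh 50%. If either component is below 10.0,
    the final grade is capped at 9.0.
    """
    cf = 0.5 * t + 0.5 * p
    if t < 10.0 or p < 10.0:
        cf = min(cf, 9.0)
    return cf
```

For example, with T = 8 and P = 18 the weighted average is 13.0, but because T is below 10.0 the final grade is capped at 9.0.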
Special assessment (TE, DA, ...)
Students taking exams under special regimes are expected to submit the project required for this course beforehand, as ordinary students do. Students not attending classes have to submit and present their work by the established deadlines. These latter students should take the initiative to arrange periodic meetings with the teacher to report on work progress.
Classification improvement
Classification improvement is carried out through a single individual assessment with two components: 1. the appeal exam; 2. an additional component that assesses the skills evaluated through the work developed in the distributed evaluation. The improvement of the final grade takes place in the corresponding appeal period of the current edition of the course or of the subsequent one.