Big Data Engineering

Code:

MECD07

Acronym:

EGD

Keywords
Classification	Keyword
CNAEF	Informatics

Instance: 2024/2025 - 2S

Active?	Yes
Responsible unit:	Department of Informatics Engineering
Course/CS Responsible:	Master in Data Science and Engineering

Cycles of Study/Courses

Acronym	No. of Students	Study Plan	Curricular Years	Credits UCN	Credits ECTS	Contact hours	Total Time
MECD	26	Syllabus	1	-	6	42	162
M.EGI	30	Syllabus	1	-	6	42	162

Teaching Staff - Responsibilities

Teacher	Responsibility
Jorge Manuel Gomes Barbosa

Teaching - Hours

Recitations:

3,00

Type	Teacher	Classes	Hour
Recitations	Totals	2	6,00
Recitations	Jorge Manuel Gomes Barbosa		6,00

Teaching language

Suitable for English-speaking students

Objectives

Extracting information from large data sets, known as "big data", has been a driver for many companies, large and small, in recent years and has imposed a specific set of challenges that this course addresses. The goal of this curricular unit is to familiarize the student with the major paradigms, challenges, and approaches to developing big data applications and systems.

Learning outcomes and competences

After completing this curricular unit, the student should:
1) be able to distinguish the different concepts that support parallel and distributed computing including data processing;
2) understand existing big data storage and processing architectures and systems;
3) be able to develop big data applications, namely search-based applications and learning-based applications, and characterize their performance;
4) be able to identify and discuss challenges in developing and using big data applications and models.

Working method

In person

Pre-requirements (prior knowledge) and co-requirements (common knowledge)

Programming; learning from data.

Program

1. Fundamental concepts of parallel computing: performance measurements, memory management and data location, limitations of parallel computing and parallel programming models.
2. Models for parallel programming with data: MapReduce model; key-value data organization; relationship with the Hadoop Distributed File System (HDFS) and with resource management; Spark model, resilient distributed datasets (RDDs), actions and transformations, DAG-based execution, and variable broadcasting.
3. Application development and performance characterization: search (Spark SQL) and learning (Spark MLlib); debugging, measurement, and tuning of tasks, jobs, and stages in Spark.
4. Challenges in developing and using big data applications and models, including privacy and anonymity, learning result interpretability, bias, and vulnerabilities.
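
As an illustration of the MapReduce model and key-value data organization named in item 2, the following is a minimal single-process sketch in plain Python. It is not Hadoop or Spark code and is not taken from the course materials; the function names (map_phase, shuffle, reduce_phase, word_mapper, sum_reducer, word_count) and the sample input are hypothetical and chosen only for this example. In a real cluster the three phases would run distributed across many machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def map_phase(records: Iterable[str],
              mapper: Callable[[str], Iterable[Pair]]) -> Iterator[Pair]:
    """Apply the mapper to every input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group all emitted values by key (done over the network in a real cluster)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Collapse the list of values of each key into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line: str) -> Iterator[Pair]:
    """Hypothetical mapper: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key: str, values: List[int]) -> int:
    """Hypothetical reducer: add up the counts collected for one word."""
    return sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Word count expressed as map -> shuffle -> reduce."""
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)


# Made-up sample input, for illustration only.
lines = ["big data big ideas", "data beats ideas"]
counts = word_count(lines)
print(counts)
```

The mapper and reducer are the only application-specific parts; the map, shuffle, and reduce stages are generic, which is what lets a framework parallelize them without knowing what the application computes.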

Mandatory literature

Hien Luu; Beginning Apache Spark 3, Apress, 2021. ISBN: 978-1-4842-7382-1
Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee; Learning Spark -- Lightning-Fast Big Data Analysis, 2nd Edition, O'Reilly, 2020. ISBN: 978-1492050049

Complementary Bibliography

Tom White; Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale 4th Edition, O'Reilly, 2015. ISBN: 978-1491901632
Tomcy John, Pankaj Misra; Data Lake for Enterprises: Lambda Architecture for building enterprise data systems, Packt Publishing, 2017. ISBN: 978-1787281349

Teaching methods and learning activities

1) Exploration of the fundamental concepts in parallel programming with data, big data programming models and system architectures, and applications, through: a) lectures; b) autonomous search for scientific papers and use case reports; c) a flipped classroom technique, with self-study of previously identified content followed by discussion of these concepts in the classroom.
2) Autonomous exploration, presentation, and discussion of scientific papers.
3) Project including the specification, development, testing, and performance characterization of big data applications using the technologies and concepts discussed in the course.

Evaluation Type

Distributed evaluation with final exam

Assessment Components

Designation	Weight (%)
Exam	50,00
Practical or project work	50,00
Total:	100,00

Amount of time allocated to each course unit

Designation	Time (hours)
Project development	60,00
Independent study	60,00
Class attendance	42,00
Total:	162,00

Eligibility for exams

Developing the project and attending class.

Calculation formula of final grade

CF = 0.5*T + 0.5*P; if (T < 10.0 or P < 10.0) then CF = MIN(CF, 9.0)
T - test
P - project
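
The formula above can be sketched as a small Python function. This is an illustrative reading of the rule, not an official grade calculator: the function name final_grade is hypothetical, the grades are assumed to be on a 0-20 scale (the scale is not stated in this section), and no rounding is applied because the formula does not specify any.

```python
def final_grade(test: float, project: float) -> float:
    """Final grade CF from the test grade T and the project grade P.

    Assumption: both grades are on a 0-20 scale; no rounding is applied.
    """
    cf = 0.5 * test + 0.5 * project
    # A grade below 10.0 in either component caps the final grade at 9.0.
    if test < 10.0 or project < 10.0:
        cf = min(cf, 9.0)
    return cf


print(final_grade(14.0, 16.0))  # both components at or above 10.0
print(final_grade(8.0, 18.0))   # test below 10.0, so the cap applies
print(final_grade(9.0, 7.0))    # average already below the cap
```

The second call shows the effect of the cap: the plain average would be 13.0, but because the test grade is below 10.0 the final grade is limited to 9.0.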

Special assessment (TE, DA, ...)

Students taking exams under special regimes are expected to submit the project required for this course beforehand, as ordinary students do. Students not attending classes must submit and present their work by the established deadlines. These latter students should take the initiative to arrange periodic meetings with the teacher to report on work progress.

Classification improvement

Classification improvement is carried out through a single individual assessment with two components: 1. the appeal (resit) exam; 2. an additional component that assesses the skills evaluated through the work developed in the distributed evaluation. The improvement of the final grade takes place in the corresponding appeal period of the current edition of the course or of the subsequent one.


Copyright 1996-2025 © Faculdade de Engenharia da Universidade do Porto
Page generated on: 2025-06-14 at 20:17:51