Big Data Engineering

Code:	MECD07
Acronym:	EGD

Keywords
Classification	Keyword
CNAEF	Informatics

Instance: 2021/2022 - 2S (from 26-02-2022 to 16-07-2022)

Active?	Yes
Responsible unit:	Department of Electrical and Computer Engineering
Course/CS Responsible:	Master in Data Science and Engineering

Cycles of Study/Courses

Acronym	No. of Students	Study Plan	Curricular Years	Credits UCN	Credits ECTS	Contact hours	Total Time
MECD	27	Syllabus	1	-	6	42	162

Last updated on 2021-07-27.

Fields changed: Objectives, Learning outcomes and competences, Teaching methods and learning activities, Complementary Bibliography, Program, Assessment Components and Time Allocation, Mandatory literature, Calculation formula of final grade

Teaching language

Suitable for English-speaking students

Objectives

Extracting information from large data sets, known as “big data”, has driven many large and small companies in recent years and poses a specific set of challenges, which this course addresses. The goal of this curricular unit is to familiarize the student with the major paradigms, challenges, and approaches to developing big data applications and systems.

Learning outcomes and competences

After completing this curricular unit, the student should:
1) be able to distinguish the different concepts that support parallel and distributed computing including data processing;
2) understand existing big data storage and processing architectures and systems;
3) be able to develop big data applications, namely search-based applications and learning-based applications, and characterize their performance;
4) be able to identify and discuss challenges in developing and using big data applications and models.

Working method

In person

Pre-requirements (prior knowledge) and co-requirements (common knowledge)

Programming; learning from data.

Program

1. Fundamental concepts of parallel computing: performance measurements, types of processors, memory management and data location, limitations of parallel computing, types of parallelism, stages in parallelization, parallel programming models, and data parallelism.
2. Models for parallel programming with data: CUDA/GPU model, organization in threads and mapping to multi-dimensional data; MapReduce model, key-value data organization, execution stages, speculative execution, relation with the Hadoop Distributed File System (HDFS) and with resource management; Spark model, resilient distributed datasets (RDDs), broadcast variables, streaming mode.
3. Application development and performance characterization: search (Hadoop Pig, Spark SQL) and learning (Spark MLlib, deep learning on GPU/TensorFlow); debugging, measurement, and tuning of tasks, jobs, and stages in Spark, Hadoop, and TensorFlow.
4. Challenges in developing and using big data applications and models, including privacy and anonymity, learning result interpretability, bias, and vulnerabilities.
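
As a rough illustration of the key-value data organization and execution stages named in item 2, the following is a minimal single-process sketch in plain Python. It is not Hadoop or Spark code and is not taken from the course material: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the word-count task are assumptions chosen for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit one (key, value) pair per word in a single input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group together all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into one result."""
    return key, sum(values)


def word_count(documents):
    """Run the three stages in sequence over a list of text records."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data big systems", "data systems at scale"]
    print(word_count(docs))
```

In a real MapReduce system the map and reduce calls would run on different machines and the shuffle would move data across the network; here everything runs in one process so that only the data flow is visible.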

Mandatory literature

Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee; Learning Spark: Lightning-Fast Data Analytics, 2nd Edition, O'Reilly, 2020. ISBN: 978-1492050049
David B. Kirk and Wen-mei W. Hwu; Programming Massively Parallel Processors - A Hands-on Approach, Morgan Kaufmann, 2017. ISBN: 978-0128119860

Complementary Bibliography

Tom White; Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, 4th Edition, O'Reilly, 2015. ISBN: 978-1491901632
Tomcy John, Pankaj Misra; Data Lake for Enterprises: Lambda Architecture for building enterprise data systems, Packt Publishing, 2017. ISBN: 978-1787281349

Teaching methods and learning activities

1) Exploration of the fundamental concepts in parallel programming with data, big data programming models and system architectures, and applications, through a) lectures, b) autonomous search for scientific papers, use case reports, and other information available online, c) flipped classroom technique with self-learning of previously identified content and with later discussion of these concepts in the classroom.
2) Autonomous exploration, presentation, and discussion of scientific papers.
3) Project including the specification, development, testing, and performance characterization of big data applications using the technologies and concepts discussed in the course.

Evaluation Type

Distributed evaluation without final exam

Assessment Components

Designation	Weight (%)
Exam	50.00
Practical or project work	50.00
Total:	100.00

Amount of time allocated to each course unit

Designation	Time (hours)
Project development	60.00
Independent study	60.00
Class attendance	42.00
Total:	162.00

Eligibility for exams

Developing the project and attending class.

Calculation formula of final grade

CF = 0.5*T + 0.5*P; if (T < 10.0 or P < 10.0) then CF = MIN(CF, 9.0)
T - test
P - project
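
The rule above can be sketched as a small Python function. The function name, the parameter names, and the default arguments are assumptions for illustration; only the equal weights, the 10.0 minimum on each component, and the 9.0 cap come from the formula.

```python
def final_grade(test, project, minimum=10.0, cap=9.0):
    """Return CF for a test grade T and a project grade P.

    Both components weigh 50%. If either one is below the minimum,
    the final grade is capped (hypothetical parameter names).
    """
    cf = 0.5 * test + 0.5 * project
    if test < minimum or project < minimum:
        cf = min(cf, cap)
    return cf


if __name__ == "__main__":
    print(final_grade(12.0, 14.0))  # both components at or above the minimum
    print(final_grade(9.0, 18.0))   # test below the minimum, so the cap applies
```

For example, T = 9.0 and P = 18.0 give a weighted average of 13.5, but because T is below 10.0 the final grade is limited to 9.0.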

Classification improvement

The classification of the project can be improved in the next occurrence of the course. The test grade can be improved in the re-sit exam.


Copyright 1996-2025 © Faculdade de Engenharia da Universidade do Porto
Page generated on: 2025-06-17 at 05:28:19