Big Data Engineering
Keywords
Classification | Keyword
CNAEF | Informatics
Instance: 2021/2022 - 2S (from 26-02-2022 to 16-07-2022)
Cycles of Study/Courses
Acronym | No. of Students | Study Plan | Curricular Years | Credits UCN | Credits ECTS | Contact hours | Total Time
MECD | 27 | Syllabus | 1 | - | 6 | 42 | 162
Teaching language
Suitable for English-speaking students
Objectives
Extracting information from large data sets, known as “big data”, has been a driver for many large and small companies in recent years and poses a specific set of challenges, which this course addresses. The goal of this curricular unit is to familiarize the student with the major paradigms, challenges, and approaches to developing big data applications and systems.
Learning outcomes and competences
After completing this curricular unit, the student should:
1) be able to distinguish the different concepts that support parallel and distributed computing including data processing;
2) understand existing big data storage and processing architectures and systems;
3) be able to develop big data applications, namely search-based applications and learning-based applications, and characterize their performance;
4) be able to identify and discuss challenges in developing and using big data applications and models.
Working method
In person
Prerequisites (prior knowledge) and co-requisites (common knowledge)
Programming; learning from data.
Program
1. Fundamental concepts of parallel computing: performance measurements, types of processors, memory management and data location, limitations of parallel computing, types of parallelism, stages in parallelization, parallel programming models, and data parallelism.
2. Models for parallel programming with data: CUDA/GPU model, organization in threads and mapping to multi-dimensional data; MapReduce model, key-value data organization, execution stages, speculative execution, relation with the Hadoop distributed file system and with resource management; Spark model, resilient distributed datasets, broadcast variables, streaming mode.
3. Application development and performance characterization: search (Hadoop Pig, Spark SQL) and learning (Spark MLlib, deep learning on GPU/TensorFlow); debugging, measurement, and tuning of tasks, jobs, and stages in Spark, Hadoop, and TensorFlow.
4. Challenges in developing and using big data applications and models, including privacy and anonymity, learning result interpretability, bias, and vulnerabilities.
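As an illustration of the MapReduce model listed in item 2 (key-value data organization and execution stages), the following is a minimal single-process sketch in plain Python. It is not Hadoop or Spark code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for this example, and a real framework would run each stage in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map stage: emit one (word, 1) key-value pair per word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle stage: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce stage: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence (illustrative only)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count(["big data", "big compute"]))
```

Because the map and reduce functions act on one line or one key at a time with no shared state, a framework is free to distribute them across workers; only the shuffle stage requires moving data between them.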
Mandatory literature
Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee;
Learning Spark: Lightning-Fast Data Analytics, 2nd Edition, O'Reilly, 2020. ISBN: 978-1492050049
David B. Kirk and Wen-mei W. Hwu;
Programming Massively Parallel Processors - A Hands-on Approach, Morgan Kaufmann, 2017. ISBN: 978-0128119860
Complementary Bibliography
Tom White;
Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, 4th Edition, O'Reilly, 2015. ISBN: 978-1491901632
Tomcy John, Pankaj Misra;
Data Lake for Enterprises: Lambda Architecture for building enterprise data systems, Packt Publishing, 2017. ISBN: 978-1787281349
Teaching methods and learning activities
1) Exploration of the fundamental concepts in parallel programming with data, big data programming models and system architectures, and applications, through a) lectures, b) autonomous search for scientific papers, use case reports, and other information available online, c) flipped classroom technique with self-learning of previously identified content and with later discussion of these concepts in the classroom.
2) Autonomous exploration, presentation, and discussion of scientific papers.
3) Project including the specification, development, testing, and performance characterization of big data applications using the technologies and concepts discussed in the course.
Evaluation Type
Distributed evaluation without final exam
Assessment Components
Designation | Weight (%)
Exam | 50.00
Practical or project work | 50.00
Total: | 100.00
Amount of time allocated to each course unit
Designation | Time (hours)
Project development | 60.00
Independent study | 60.00
Class attendance | 42.00
Total: | 162.00
Eligibility for exams
Developing the project and attending class.
Calculation formula of final grade
CF = 0.5*T + 0.5*P; if (T < 10.0 or P < 10.0) then CF = MIN(CF, 9.0)
T - test
P - project
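The final grade formula above can be sketched in Python as follows. This is only an illustration: the function name `final_grade` is hypothetical, and the sketch assumes that T and P are both expressed on a 0 to 20 scale, which the formula implies but does not state.

```python
def final_grade(test: float, project: float) -> float:
    """Compute CF from the test grade T and the project grade P.

    Assumption: both grades are on a 0-20 scale.
    """
    # Equal weighting of the two components.
    grade = 0.5 * test + 0.5 * project
    # A component below 10.0 caps the final grade at 9.0.
    if test < 10.0 or project < 10.0:
        grade = min(grade, 9.0)
    return grade


if __name__ == "__main__":
    print(final_grade(14.0, 16.0))  # both components at or above 10.0
    print(final_grade(8.0, 18.0))   # test below 10.0, so the cap applies
```

For example, T = 8.0 and P = 18.0 give a weighted average of 13.0, but because T is below 10.0 the final grade is capped at 9.0.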
Classification improvement
The project grade can be improved in the next occurrence of the course. The test grade can be improved by taking one re-sit exam.