# Statistics and Data Analysis

- Code: M4114
- Acronym: M4114
- Level: 400

Keywords

| Classification | Keyword |
|---|---|
| Official | Mathematics |

## Instance: 2020/2021 - 2S

- Active? Yes
- Responsible unit: Department of Mathematics
- Course/CS Responsible: Master's degree in Data Science

### Cycles of Study/Courses

| Acronym | No. of Students | Study Plan | Curricular Years | Credits UCN | Credits ECTS | Contact hours | Total Time |
|---|---|---|---|---|---|---|---|
| M:DS | 24 | Official Study Plan since 2018_M:DS | 1 | - | 6 | 42 | 162 |

### Teaching Staff - Responsibilities

Teacher Responsibility
Joaquim Fernando Pinto da Costa

### Teaching - Hours

Theoretical and practical: 3.00

| Type | Teacher | Classes | Hours |
|---|---|---|---|
| Theoretical and practical | Totals | 1 | 3.00 |
| | Joaquim Fernando Pinto da Costa | | 2.00 |
| | Maria Paula de Pinho de Brito Duarte Silva | | 1.00 |

Teaching language: English

### Objectives

Train students in multivariate data analysis methods in order to extract essential information from a potentially voluminous set of data with a focus on supervised and unsupervised learning methods.

### Learning outcomes and competences

1. Understanding the theoretical foundations of the methodologies taught.
2. Ability to extract essential information from a set of real data, using the methodologies taught.

And in particular:
- Recognize different problems of unsupervised and supervised classification and solve them with the methods addressed, using the R software.
- Prepare, solve and present computational data mining projects in which the various models presented are discussed, evaluated and compared in concrete cases.
- Solve computational and non-computational exercises on the methodologies addressed.

Face-to-face

### Pre-requirements (prior knowledge) and co-requirements (common knowledge)

Prior knowledge of random variables, probability distributions, sample statistics, confidence intervals and hypothesis tests is required. These are the usual contents of an introductory course on Probability and Statistics for undergraduate students.

### Program

- Brief summary of random vectors. Multivariate normal distribution.
- Resampling methods.
- Selection of Linear Models and Regularization (Ridge and Lasso Regression). Bias-variance tradeoff.
- Feature screening for ultrahigh dimensional predictors.
- Clustering: partition methods, hierarchical methods, probabilistic methods and model-based clustering.
- Statistical decision theory. Bayes rules of minimum error and minimum cost.
- Logistic regression.
- Nonparametric estimation of probability density functions: kernel and k-th nearest neighbor methods.
- Factorial Analysis: Principal Component Analysis, Simple and Multiple Correspondence Analysis.
- Multidimensional Scaling.

### Mandatory literature

James, Gareth; An Introduction to Statistical Learning. ISBN: 978-1-4614-7137-0
Everitt, Brian S.; Applied Multivariate Data Analysis. ISBN: 978-0-470-71117-0
000040365. ISBN: 0-387-95284-5

### Complementary Bibliography

000098707. ISBN: 978-0-521-86116-8
Sharma, Subhash; Applied Multivariate Techniques. ISBN: 0-471-31064-6
Hair Jr., Joseph F.; Multivariate Data Analysis. ISBN: 0-13-515309-3
Fan, Jianqing; Li, Runze; Zhang, Cun-Hui; Statistical Foundations of Data Science. Chapman and Hall/CRC, 1st edition, 2019. ISBN: 978-1466510845

### Teaching methods and learning activities

Classes combine theory and practice, with several application examples, and always make use of statistical software. The software used will be SPSS or the free programming language R (depending on the master's course).

Software: R Project

### Keywords

Physical sciences > Mathematics > Statistics

### Evaluation Type

Distributed evaluation without final exam

### Assessment Components

| Designation | Weight (%) |
|---|---|
| Test | 60.00 |
| Written assignment | 40.00 |
| Total | 100.00 |

### Amount of time allocated to each course unit

| Designation | Time (hours) |
|---|---|
| Independent study | 120.00 |
| Class attendance | 42.00 |
| Total | 162.00 |

### Eligibility for exams

Attendance is not mandatory. The computational project, which the students must present, is mandatory.

### Calculation formula of final grade

1. Assessment is distributed, without a final exam. There is, however, an exam in the second evaluation period (“época de recurso”).

2. Exam in the second evaluation period (“época de recurso”): students who have failed the tests and project (final mark below 9.5) may sit this exam and take one or both parts. For each part, the final score is the higher of the test mark and the exam mark. The mark obtained in the written assignment/project cannot be improved in any evaluation period.

The subject is divided into two parts. Each part consists of a practical work and a test/exam. For each student, the marks for the works and the tests are given by:
Score_of_works = max(2/3*Work1 + 1/3*Work2, 1/2*Work1 + 1/2*Work2)
Score_of_tests = max(2/3*Test1 + 1/3*Test2, 1/2*Test1 + 1/2*Test2)

Final score = 0.6*Score_of_tests + 0.4*Score_of_works. The same procedure applies to the two parts of the exam.

Approval requires Score_of_tests (or the corresponding score from the two parts of the exam), calculated by the above formula, to be at least 7.0 (on a scale of 0 to 20).
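
The grading rules above can be sketched as a short calculation. This is only an illustrative sketch, not an official calculator: the function names `combine` and `final_grade` are hypothetical, and it assumes that approval needs both the 7.0 minimum on the test score and a final score of at least 9.5 (the threshold implied by the resit rule).

```python
def combine(part1: float, part2: float) -> float:
    """Best of the 2/3-1/3 and 1/2-1/2 weightings of the two parts."""
    return max(2 / 3 * part1 + 1 / 3 * part2, 1 / 2 * part1 + 1 / 2 * part2)


def final_grade(work1: float, work2: float, test1: float, test2: float):
    """Return (final_score, approved) on the 0-20 scale.

    Assumption: approval requires score_of_tests >= 7.0 and
    final_score >= 9.5; the page states both thresholds but does not
    spell out how they combine.
    """
    score_of_works = combine(work1, work2)
    score_of_tests = combine(test1, test2)
    final_score = 0.6 * score_of_tests + 0.4 * score_of_works
    approved = score_of_tests >= 7.0 and final_score >= 9.5
    return final_score, approved


# Hypothetical marks, for illustration only.
score, approved = final_grade(work1=16, work2=14, test1=12, test2=9)
print(round(score, 2), approved)
```

Because of the `max`, the first part counts for 2/3 when it is the better of the two marks, and each part counts for 1/2 otherwise. For a resit, each test mark would first be replaced by the higher of the test and exam marks for that part before calling `final_grade`.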

The practical works consist of analysing a real dataset with the methods taught, using software.
They should be done in groups of 2 students.

### Classification improvement

Improvement of the final mark: students who have passed and sit the exam (“época de recurso”) to improve their final mark must take both parts. The mark obtained in the written assignment/project cannot be improved in any evaluation period. The evaluation formula is the same (see above).