Code: | M4114 | Acronym: | M4114 | Level: | 400 |
Keywords | |
---|---|
Classification | Keyword |
Official | Mathematics |
Active? | Yes |
Responsible unit: | Department of Mathematics |
Course/CS Responsible: | Master in Data Science |
Acronym | No. of Students | Study Plan | Curricular Years | Credits UCN | Credits ECTS | Contact hours | Total Time |
---|---|---|---|---|---|---|---|
M:DS | 25 | Official Study Plan since 2018_M:DS | 1 | - | 6 | 42 | 162 |
Train students in multivariate data analysis methods, with a focus on supervised and unsupervised learning, so that they can extract the essential information from a potentially large data set.
1. Understanding the theoretical foundations of the methodologies taught.
2. Ability to extract essential information from a set of real data, using the methodologies taught.
And in particular:
- Recognize different unsupervised and supervised classification problems and solve them with the methods addressed, using the R software.
- Prepare, solve and present computational data mining projects, in which the various models presented are discussed, evaluated and compared in concrete cases.
- Solve computational and non-computational exercises on the methodologies addressed.
Prior knowledge of random variables, probability distributions, sample statistics, confidence intervals and hypothesis tests is required. These are the usual contents of an introductory undergraduate course on Probability and Statistics.
Brief summary of random vectors. Multivariate normal distribution.
Resampling methods.
Selection of Linear Models and Regularization (Ridge and Lasso Regression). Bias-variance tradeoff.
Feature screening for ultrahigh dimensional predictors.
Clustering: partition methods, hierarchical methods, probabilistic methods and model-based clustering.
Statistical decision theory. Bayes rules of minimum error and minimum cost.
Linear and quadratic discriminant analysis.
Logistic regression.
Nonparametric estimation of probability density functions: kernel and k-th nearest neighbor methods.
Factorial Analysis: Principal Component Analysis, Simple and Multiple Correspondence Analysis.
Multidimensional Scaling.
Classes will be both theoretical and practical, with several application examples, and will always make use of statistical software. The software used will be SPSS or the free programming language R (depending on the master's course).
Designation | Weight (%) |
---|---|
Test | 60.00 |
Written assignment | 40.00 |
Total: | 100.00 |
Designation | Time (hours) |
---|---|
Independent study | 120.00 |
Class attendance | 42.00 |
Total: | 162.00 |
1. Assessment is distributed, with a final examination. There is also an exam in the second evaluation period ("época de recurso").
2. Grade improvement: students who want to improve their final classification can sit the second exam ("época de recurso"), and they must complete both parts. The mark for the practical work cannot be improved.
The subject is divided into two parts: Part I, corresponding to about 2/3 of the classes, and Part II, corresponding to about 1/3. Each part consists of a practical work and an exam. For each student, the marks for the works and the exams are given by:
Score_of_works = max(2/3*Work1 + 1/3*Work2, 1/2*Work1 + 1/2*Work2)
Score_of_exams = max(2/3*Exam1 + 1/3*Exam2, 1/2*Exam1 + 1/2*Exam2)
Final_score = 0.6*Score_of_exams + 0.4*Score_of_works
The same procedure applies to the two parts of the second exam ("época de recurso").
Approval is subject to Score_of_exams being at least 7.0 points (on a scale of 0 to 20).
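As an illustration only, the scoring rule above can be sketched in Python. The function and parameter names (`weighted_best`, `final_score`, `exam_minimum`) are hypothetical and not part of the official regulations. The sketch checks only the minimum exam score stated here; it does not model rounding or any other approval condition, since none is given in this description.

```python
def weighted_best(part1: float, part2: float) -> float:
    """Return the more favourable of the 2/3-1/3 and 1/2-1/2 weightings."""
    return max(2 / 3 * part1 + 1 / 3 * part2, 0.5 * part1 + 0.5 * part2)


def final_score(work1: float, work2: float, exam1: float, exam2: float,
                exam_minimum: float = 7.0) -> tuple[float, bool]:
    """Compute the final score (0 to 20) and whether the exam minimum is met.

    Assumption: all marks are on the 0 to 20 scale and no rounding is applied.
    """
    score_of_works = weighted_best(work1, work2)
    score_of_exams = weighted_best(exam1, exam2)
    final = 0.6 * score_of_exams + 0.4 * score_of_works
    return final, score_of_exams >= exam_minimum


# Example: works marked 15 and 12, exams marked 9 and 14.
final, exam_ok = final_score(15, 12, 9, 14)
print(round(final, 2), exam_ok)
```

In this example the works are better served by the 2/3-1/3 weighting (14 versus 13.5), while the exams are better served by the equal weighting (11.5 versus about 10.67), so the final score is 0.6*11.5 + 0.4*14 = 12.5 and the exam minimum is met.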
Each practical work consists of the analysis of a real data set with the methods taught, using software.
It should be done in groups of 2 students.