Abstract (EN):
Machine learning is increasingly used in the most diverse applications and domains, whether in healthcare to predict pathologies or in the financial sector to detect fraud. Data utility is one of the linchpins of efficiency and accuracy in machine learning. However, when data contain personal information, full access may be restricted by laws and regulations aimed at protecting individuals' privacy. Therefore, data owners must ensure that any shared data preserves such privacy. Removing or transforming private information (de-identification) is among the most common techniques. Intuitively, one can anticipate that reducing detail or distorting information would result in losses in predictive model performance, which might be reflected in a trade-off. However, previous work on classification tasks with de-identified data generally shows that predictive performance can be preserved in specific applications, and thus no trade-off has been demonstrated. In this paper, we aim to evaluate the existence of a trade-off between data privacy and predictive performance in classification tasks. We leverage a large set of privacy-preserving techniques and learning algorithms to assess re-identification ability and the impact of the transformed variants on predictive performance. Unlike the previous literature, we confirm that the higher the level of privacy (the lower the re-identification risk), the greater the impact on predictive performance, pointing towards clear evidence of a trade-off.
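A minimal sketch of the kind of comparison the abstract describes: apply a simple de-identification step (here, generalisation of a quasi-identifier by binning) and compare a classifier's accuracy on the original and the transformed data. This is only an assumed illustration, not the paper's actual pipeline; the toy dataset, column names, bin width, and the choice of scikit-learn's RandomForestClassifier are all hypothetical.

```python
# Hedged sketch: illustrates comparing predictive performance on original
# vs. de-identified data. Dataset and parameters are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy dataset with a quasi-identifier ("age") and a binary target.
df = pd.DataFrame({
    "age":    [23, 35, 46, 52, 29, 61, 38, 44, 57, 33] * 20,
    "income": [30, 55, 70, 80, 40, 90, 60, 65, 85, 50] * 20,
    "target": [0, 1, 1, 1, 0, 1, 0, 1, 1, 0] * 20,
})

def generalise_age(frame, bin_width=20):
    """De-identification by generalisation: replace the exact age with the
    lower bound of a coarse interval, reducing re-identification risk."""
    out = frame.copy()
    out["age"] = (out["age"] // bin_width) * bin_width
    return out

def evaluate(frame):
    """Train a classifier and return hold-out accuracy."""
    X, y = frame[["age", "income"]], frame["target"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

acc_original = evaluate(df)
acc_private = evaluate(generalise_age(df))
print(f"original: {acc_original:.3f}  de-identified: {acc_private:.3f}")
```

In the paper's setting, this comparison is repeated over many privacy-preserving transformations and learning algorithms, alongside an assessment of re-identification risk for each transformed variant.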
Language:
English
Type (Faculty Evaluation):
Scientific
No. of pages:
12