Learning Word Embeddings from the Portuguese Twitter Stream: A Study of Some Practical Aspects

Title
Learning Word Embeddings from the Portuguese Twitter Stream: A Study of Some Practical Aspects
Type
Article in International Conference Proceedings Book
Year
2017
Authors
Eugénio Oliveira
(Author)
FEUP
Pedro Saleiro
(Author)
FEUP
Luís António Diniz Fernandes de Morais Sarmento
(Author)
Other
Eduarda Mendes Rodrigues
(Author)
FEUP
Carlos Soares
(Author)
FEUP
Conference proceedings: International
Pages: 880-891
18th EPIA Conference on Artificial Intelligence, EPIA 2017
5–8 September 2017
Indexing
Scientific classification
CORDIS: Physical sciences > Computer science > Informatics ; Physical sciences > Computer science
FOS: Engineering and technology > Electrical engineering, Electronic engineering, Information engineering
Other information
Authenticus ID: P-00M-YFF
Abstract (EN): This paper describes a preliminary study for producing and distributing a large-scale database of embeddings from the Portuguese Twitter stream. We start by experimenting with a relatively small sample and focusing on three challenges: volume of training data, vocabulary size and intrinsic evaluation metrics. Using a single GPU, we were able to scale up vocabulary size from 2048 words embedded and 500K training examples to 32768 words over 10M training examples while keeping a stable validation loss and approximately linear trend on training time per epoch. We also observed that using less than 50% of the available training examples for each vocabulary size might result in overfitting. Results on intrinsic evaluation show promising performance for a vocabulary size of 32768 words. Nevertheless, intrinsic evaluation metrics suffer from over-sensitivity to their corresponding cosine similarity thresholds, indicating that a wider range of metrics need to be developed to track progress. © Springer International Publishing AG 2017.
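The abstract notes that intrinsic evaluation metrics are over-sensitive to their cosine similarity thresholds. A minimal sketch of such a threshold-based intrinsic metric is shown below; the toy vectors, word pairs, and the 0.5 threshold are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def intrinsic_accuracy(embeddings, related_pairs, threshold=0.5):
    """Fraction of related word pairs whose cosine similarity meets the
    threshold. The threshold choice drives the score, which is the
    sensitivity issue the abstract describes."""
    hits = sum(
        1
        for w1, w2 in related_pairs
        if cosine_similarity(embeddings[w1], embeddings[w2]) >= threshold
    )
    return hits / len(related_pairs)

# Toy 3-d embeddings (hypothetical, not from the paper's model).
emb = {
    "bom":   np.array([0.9, 0.1, 0.0]),
    "ótimo": np.array([0.8, 0.2, 0.1]),
    "mau":   np.array([-0.7, 0.3, 0.2]),
}
pairs = [("bom", "ótimo"), ("bom", "mau")]
print(intrinsic_accuracy(emb, pairs, threshold=0.5))  # → 0.5
```

Sweeping `threshold` over a range and plotting the resulting scores is one way to see the over-sensitivity the authors report: small threshold shifts can change the metric substantially.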
Language: English
Type (Professor's evaluation): Scientific
Documents
No documents are associated with this publication.