Semi-Automatic Creation of a Reference News Corpus for Fine-Grained Multi-Label Scenarios

Type
Article in International Conference Proceedings Book
Year
2011
Authors
sarmento, l
(Author)
Other
oliveira, e
(Author)
FEUP
Conference proceedings International
Pages: 749-754
6th Iberian Information Systems and Technologies Conference
Chaves, Portugal, June 15-18, 2011
Indexing
Publication in ISI Web of Knowledge: 0 citations
Publication in Scopus: 0 citations
Scientific classification
FOS: Natural sciences > Computer and information sciences
Other information
Authenticus ID: P-002-ZHS
Abstract (EN): In this paper we tackle the problem of creating a reference corpus for the classification of news items in fine-grained multi-label scenarios. These scenarios are particularly challenging for text classification techniques, and the availability of reference corpora is an important bottleneck for developing and testing new classification strategies. We propose a semi-automatic approach for creating a reference corpus that uses three auxiliary classification methods (one based on Support Vector Machines, one based on Nearest Neighbor classifiers, and one based on a dictionary-based classification heuristic) to suggest to human annotators topic-related labels that describe different facets of the news item being annotated. Using this approach, we semi-automatically produce a corpus of 1,600 news items with 865 different labels, with an average of 3.63 labels per news item. We evaluate the contribution of each auxiliary classification method to the annotation process and conclude that: (i) none of the methods alone is capable of suggesting all relevant labels, (ii) the dictionary-based classification heuristic contributes significantly, and (iii) the Nearest Neighbor classifier performs very efficiently in the most extreme multi-label part of the problem and is robust to the very unbalanced item-to-class distribution.
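Illustrative sketch (not taken from the paper): the abstract describes pooling label suggestions from three auxiliary methods and presenting them to a human annotator. The Python below shows one way such pooling could look. The function names, the toy lexicon, and the two stub classifiers are hypothetical stand-ins; the actual SVM, Nearest Neighbor, and dictionary components are not specified in this record.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# A suggester maps the text of a news item to a set of candidate labels.
Suggester = Callable[[str], Set[str]]


def pool_suggestions(text: str, suggesters: Dict[str, Suggester]) -> Dict[str, Set[str]]:
    """Map each candidate label to the names of the methods that proposed it."""
    pooled: Dict[str, Set[str]] = defaultdict(set)
    for name, suggest in suggesters.items():
        for label in suggest(text):
            pooled[label].add(name)
    return dict(pooled)


def mean_labels_per_item(annotations: List[Set[str]]) -> float:
    """Average number of accepted labels per annotated item."""
    if not annotations:
        return 0.0
    return sum(len(labels) for labels in annotations) / len(annotations)


# Hypothetical stand-ins for the three auxiliary methods (assumptions for illustration).
def dictionary_suggester(text: str) -> Set[str]:
    lexicon = {"election": "politics", "goal": "football", "bank": "finance"}
    words = text.lower().split()
    return {label for term, label in lexicon.items() if term in words}


def svm_stub(text: str) -> Set[str]:
    return {"politics"} if "minister" in text.lower() else set()


def knn_stub(text: str) -> Set[str]:
    return {"politics", "finance"} if "budget" in text.lower() else set()


suggesters = {"svm": svm_stub, "knn": knn_stub, "dictionary": dictionary_suggester}
pooled = pool_suggestions("Minister defends budget before the election", suggesters)

# The annotator sees every pooled label and which methods proposed it.
for label, sources in sorted(pooled.items()):
    print(label, sorted(sources))
```

Recording which method proposed each label is what makes it possible to measure, after annotation, how much each auxiliary method contributed to the accepted labels.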
Language: English
Type (Professor's evaluation): Scientific
Contact: jft@fe.up.pt; las@fe.up.pt; eco@fe.up.pt
No. of pages: 6
Related Publications

Of the same authors

Comparing Verb Synonym Resources for Portuguese (2010)
Article in International Conference Proceedings Book
teixeira, j; sarmento, l; oliveira, e
A Bootstrapping Approach for Training a NER with Conditional Random Fields (2011)
Article in International Conference Proceedings Book
teixeira, j; sarmento, l; oliveira, e