ACE-2005-PT: corpus for event extraction in Portuguese

Title
ACE-2005-PT: corpus for event extraction in Portuguese
Type
Article in International Conference Proceedings Book
Year
2024
Authors
Cunha, Luís Filipe
(Author)
Other
Silvano, Maria da Purificação
(Author)
FLUP
Campos, Ricardo
(Author)
Other
Jorge, Alípio
(Author)
FCUP
Conference proceedings International
Pages: 661-666
SIGIR 2024: The 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
Washington, 2024
Indexing
Crossref
Other information
Abstract (EN): Event extraction is an NLP task that commonly involves identifying the central word (trigger) for an event and its associated arguments in text. ACE-2005 is widely recognised as the standard corpus in this field. While other corpora, like PropBank, primarily focus on annotating predicate-argument structure, ACE-2005 provides comprehensive information about the overall event structure and semantics. However, its limited language coverage restricts its usability. This paper introduces ACE-2005-PT, a corpus created by translating ACE-2005 into Portuguese, with European and Brazilian variants. To speed up the process of obtaining ACE-2005-PT, we rely on automatic translators. This, however, poses some challenges related to automatically identifying the correct alignments between multi-word annotations in the original text and in the corresponding translated sentence. To address these challenges, we developed an alignment pipeline that incorporates several alignment techniques: lemmatization, fuzzy matching, synonym matching, multiple translations and a BERT-based word aligner. To measure the alignment effectiveness, a subset of annotations from the ACE-2005-PT corpus was manually aligned by an expert linguist. This subset was then compared against our pipeline's results, which achieved exact and relaxed match scores of 70.55% and 87.55%, respectively. As a result, we successfully generated a Portuguese version of the ACE-2005 corpus, which has been accepted for publication by LDC.
Language: English
Type (Professor's evaluation): Scientific
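Illustrative note: the abstract names fuzzy matching as one of the alignment techniques and reports exact and relaxed match scores. The Python sketch below shows one way such a step could look, and is not the authors' pipeline. The function names, the 0.8 threshold, the sliding-window search and the reading of "relaxed match" as any span overlap are all assumptions made for illustration; the abstract does not define them.

```python
from difflib import SequenceMatcher


def best_fuzzy_span(annotation_tokens, sentence_tokens, threshold=0.8):
    """Find the token window in the translated sentence that is most similar
    to the separately translated annotation.

    Returns (start, end, score) with a half-open token span, or None when no
    window reaches the (assumed) similarity threshold.
    """
    target = " ".join(annotation_tokens).lower()
    best = None
    n = len(sentence_tokens)
    # Try windows of roughly the annotation's length (one token shorter or longer).
    min_size = max(1, len(annotation_tokens) - 1)
    max_size = len(annotation_tokens) + 1
    for size in range(min_size, max_size + 1):
        for start in range(0, n - size + 1):
            candidate = " ".join(sentence_tokens[start:start + size]).lower()
            score = SequenceMatcher(None, target, candidate).ratio()
            if best is None or score > best[2]:
                best = (start, start + size, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


def exact_match(pred_span, gold_span):
    """Exact match: predicted and gold spans share the same boundaries."""
    return pred_span == gold_span


def relaxed_match(pred_span, gold_span):
    """Relaxed match (assumed definition): the two half-open spans overlap."""
    if pred_span is None or gold_span is None:
        return False
    return pred_span[0] < gold_span[1] and gold_span[0] < pred_span[1]


# Hypothetical example: the trigger translated in isolation ("atacar")
# differs in inflection from the form used in the translated sentence.
sentence = "As tropas atacaram a cidade ontem".split()
found = best_fuzzy_span(["atacar"], sentence)
```

Here the window covering "atacaram" is selected because it is the closest string to "atacar" above the threshold. A full pipeline as described in the abstract would combine such a step with lemmatization, synonym matching, multiple translations and a word aligner.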
Documents
File name Description Size
3626772.3657872 1266.77 KB