
Building resources for Emakhuwa: machine translation and news classification benchmarks

Title
Building resources for Emakhuwa: machine translation and news classification benchmarks
Type
Article in International Conference Proceedings Book
Year
2024
Authors
Ali, Felermino
(Author)
Conference proceedings International
Pages: 14842-14857
2024 Conference on Empirical Methods in Natural Language Processing
Miami, 2024
Indexing
Scopus - 0 Citations
Crossref
Other information
Authenticus ID: P-018-3EK
Abstract (EN): This paper introduces a comprehensive collection of NLP resources for Emakhuwa, Mozambique’s most widely spoken language. The resources include the first manually translated news bitext corpus between Portuguese and Emakhuwa, news topic classification datasets, and monolingual data. We detail the process and challenges of acquiring this data and present benchmark results for machine translation and news topic classification tasks. Our evaluation examines the impact of different data types (originally clean text, post-corrected OCR, and back-translated data) and the effects of fine-tuning from pre-trained models, including those focused on African languages. Our benchmarks demonstrate good performance in news topic classification and promising results in machine translation. We fine-tuned multilingual encoder-decoder models using real and synthetic data and evaluated them on our test set and the FLORES evaluation sets. The results highlight the importance of incorporating more data and the potential for future improvements. All models, code, and datasets are available in the https://huggingface.co/LIACC repository under the CC BY 4.0 license.
Language: English
Type (Professor's evaluation): Scientific
Documents
File name Description Size
2024.emnlp-main.824 401.76 KB
Related Publications

Of the same authors

Expanding FLORES+ benchmark for more low-resource settings: Portuguese-Emakhuwa machine translation evaluation (2024)
Article in International Conference Proceedings Book
Ali, Felermino; Cardoso, Henrique Lopes; Sousa-Silva, Rui
Detecting loanwords in Emakhuwa: an extremely low-resource Bantu language exhibiting significant borrowing from Portuguese (2024)
Article in International Conference Proceedings Book
Ali, Felermino; Cardoso, Henrique Lopes; Sousa-Silva, Rui