Abstract (PT):
Abstract (EN):
As part of the Open Language Data Initiative
shared tasks, we have expanded the FLORES+
evaluation set to include Emakhuwa, a low-resource
language widely spoken in Mozambique.
We translated the dev and devtest sets
from Portuguese into Emakhuwa, and we detail
the translation process and quality assurance
measures used. Our methodology involved
various quality checks, including post-editing
and adequacy assessments. The resulting
datasets consist of multiple reference
sentences for each source. We present baseline
results from training a Neural Machine
Translation system and fine-tuning existing
multilingual translation models. Our findings
suggest that spelling inconsistencies remain
a challenge in Emakhuwa. Additionally,
the baseline models underperformed on this
evaluation set, underscoring the necessity for
further research to enhance machine translation
quality for Emakhuwa. The data is publicly
available at https://huggingface.co/datasets/LIACC/Emakhuwa-FLORES
Language:
English
Type (Professor's evaluation):
Scientific
Contact:
Available at: https://arxiv.org/abs/2408.11457