Go to:
Logótipo
Comuta visibilidade da coluna esquerda
Você está em: Start > Publications > View > Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus
Publication

Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus

Title
Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus
Type
Article in International Conference Proceedings Book
Year
2024
Authors
de Jesus G.
(Author)
Other
The person does not belong to the institution. The person does not belong to the institution. The person does not belong to the institution. View Authenticus page Without ORCID
Sérgio Nunes
(Author)
FEUP
View Personal Page You do not have permissions to view the institutional email. Search for Participant Publications View Authenticus page View ORCID page
Conference proceedings International
Pages: 4368-4380
Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Torino, 2024
Indexing
Other information
Authenticus ID: P-010-MB8
Abstract (EN): This paper proposes Labadain Crawler, a data collection pipeline tailored to automate and optimize the process of constructing textual corpora from the web, with a specific target to low-resource languages. The system is built on top of Nutch, an open-source web crawler and data extraction framework, and incorporates language processing components such as a tokenizer and a language identification model. The pipeline efficacy is demonstrated through successful testing with Tetun, one of Timor-Leste's official languages, resulting in the construction of a high-quality Tetun text corpus comprising 321.7k sentences extracted from over 22k web pages. The contributions of this paper include the development of a Tetun tokenizer, a Tetun language identification model, and a Tetun text corpus, marking an important milestone in Tetun text information retrieval.
Language: English
Type (Professor's evaluation): Scientific
No. of pages: 12
Documents
We could not find any documents associated to the publication.
Recommend this page Top
Copyright 1996-2025 © Centro de Desporto da Universidade do Porto I Terms and Conditions I Acessibility I Index A-Z
Page created on: 2025-10-25 04:53:25 | Privacy Policy | Personal Data Protection Policy | Whistleblowing | Electronic Yellow Book