Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus

Title

Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus

Type

Article in International Conference Proceedings Book

Date

2024

Authors

de Jesus G.

(Author)

Other

Sérgio Nunes

(Author)

FEUP

Conference proceedings International

Title: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Pages: 4368-4380

Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Torino, 2024

Indexing

Scopus - 1 Citation

Other information

Authenticus ID: P-010-MB8

Abstract (EN): This paper proposes Labadain Crawler, a data collection pipeline tailored to automate and optimize the construction of textual corpora from the web, specifically targeting low-resource languages. The system is built on top of Nutch, an open-source web crawler and data extraction framework, and incorporates language processing components such as a tokenizer and a language identification model. The pipeline's efficacy is demonstrated through testing with Tetun, one of Timor-Leste's official languages, which resulted in a high-quality Tetun text corpus comprising 321.7k sentences extracted from over 22k web pages. The contributions of this paper include a Tetun tokenizer, a Tetun language identification model, and a Tetun text corpus, marking an important milestone in Tetun text information retrieval.

Language: English

Type (Professor's evaluation): Scientific

No. of pages: 12
