Tokenizing micro-blogging messages using a text classification approach

Title

Tokenizing micro-blogging messages using a text classification approach

Type

Article in International Conference Proceedings

Date

2010

Authors

Laboreiro, G.

(Author)

Other

The person does not belong to the institution. No Authenticus or ORCID profile.

Sarmento, L.

(Author)

Other

The person does not belong to the institution. No Authenticus or ORCID profile.

Teixeira, J.

(Author)

Other

The person does not belong to the institution. No Authenticus or ORCID profile.

Oliveira, E.

(Author)

FEUP

International Conference Proceedings

Title: International Conference on Information and Knowledge Management, Proceedings

Pages: 81-87

Proceedings of the Fourth Workshop on Analytics for Noisy Unstructured Text Data, AND 2010, Toronto, Ontario, Canada, October 26th, 2010 (in conjunction with CIKM 2010)

Indexing

ISI Web of Knowledge

Scopus - 1 citation

Other Information

Authenticus ID: P-007-WYN

DOI: 10.1145/1871840.1871853

Abstract (EN): The automatic processing of microblogging messages may be problematic, even in the case of very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(ò-ó)", "(=∧-∧=)"), non-standard letter casing (e.g. "dr. Fred") and unusual punctuation (e.g. "â¯. ..", "!!?", "","). Additionally, spelling errors are abundant (e.g. "I;m"), and we can frequently find more than one language (with different tokenization requirements) in the same short message. To be efficient in such an environment, manually developed rule-based tokenizer systems have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages, which addresses complex cases successfully and which is relatively simple to set up and maintain. For that, we created a corpus consisting of 2500 manually tokenized Twitter messages (a task that is simple for human annotators) and we trained an SVM classifier for separating tokens at certain discontinuity characters. For comparison, we created a baseline rule-based system designed specifically for dealing with typical problematic situations. Results show that we can achieve F-measures of 96% with the classification-based approach, well above the performance obtained by the baseline rule-based tokenizer (85%). Also, subsequent analysis allowed us to identify typical tokenization errors, which we show can be partially solved by adding some additional descriptive examples to the training corpus and re-training the classifier. © 2010 ACM.
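The idea of treating tokenization as classification at discontinuity characters can be illustrated with a small sketch. It only shows how a manually tokenized message might be turned into labelled decision points that a classifier (an SVM in the paper) could then be trained on. Everything specific here is an assumption made for illustration: the abstract does not say which characters count as discontinuities, which features were used, or how decisions were encoded, so `is_discontinuity`, `candidate_boundaries`, `features`, `gold_labels` and `apply_decisions` are hypothetical names, and no classifier is trained.

```python
def is_discontinuity(ch):
    # Assumption: any character that is not a letter, digit or whitespace.
    return not (ch.isalnum() or ch.isspace())


def candidate_boundaries(text):
    """Indices i where a split between text[i-1] and text[i] must be decided."""
    return [
        i for i in range(1, len(text))
        if not text[i - 1].isspace() and not text[i].isspace()
        and (is_discontinuity(text[i - 1]) or is_discontinuity(text[i]))
    ]


def features(text, i, window=2):
    """Hypothetical features: character classes in a window around boundary i."""
    def char_class(ch):
        if ch.isalpha():
            return "L"
        if ch.isdigit():
            return "D"
        if ch.isspace():
            return "S"
        return ch  # punctuation is kept as its own class
    padded = " " * window + text + " " * window
    j = i + window
    return tuple(char_class(c) for c in padded[j - window:j + window])


def gold_labels(text, gold_tokens):
    """Map each candidate boundary to True if the annotator split there."""
    ends, total = set(), 0
    for tok in gold_tokens:
        total += len(tok)
        ends.add(total)  # token end, counted over non-space characters
    cands = set(candidate_boundaries(text))
    labels, seen = {}, 0  # seen = non-space characters before index i
    for i, ch in enumerate(text):
        if i in cands:
            labels[i] = seen in ends
        if not ch.isspace():
            seen += 1
    return labels


def apply_decisions(text, split_at):
    """Rebuild tokens from existing whitespace plus the chosen split points."""
    pieces = []
    for i, ch in enumerate(text):
        if i in split_at:
            pieces.append(" ")
        pieces.append(ch)
    return "".join(pieces).split()


message = "gr8 day, dr. Fred!!?"
gold = ["gr8", "day", ",", "dr.", "Fred", "!!?"]
labels = gold_labels(message, gold)
training_pairs = [(features(message, i), y) for i, y in sorted(labels.items())]
print(training_pairs)
print(apply_decisions(message, {i for i, y in labels.items() if y}))
```

In this framing the classifier never sees whole tokens. It only answers a yes/no question at each undecided boundary, which is why adding a few extra annotated examples, as the abstract describes, can change its behaviour on a specific kind of error without any rule being rewritten.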

Language: English

Type (Faculty Evaluation): Scientific

Number of pages: 7

Documents

No document associated with the publication was found.
