Resumo (PT):
It is dicult to determine the country of origin of the author
of a short message based only on the text. This is an even
more complex problem when more than one country uses
the same native language. In this paper, we address the
specic problem of detecting the two main variants of the
Portuguese language - European and Brazilian - in Twitter
micro-blogging data, by proposing and evaluating a set
of high-precision features. We follow an automatic classication
approach using a Nave Bayes classier, achieving 95%
accuracy. We nd that our system is adequate for real-time
tweet classication.
Abstract (EN):
It is difficult to determine the country of origin of the author of a short message based only on the text. This is an even more complex problem when more than one country uses the same native language. In this paper, we address the specific problem of detecting the two main variants of the Portuguese language - European and Brazilian - in Twitter micro-blogging data, by proposing and evaluating a set of high-precision features. We follow an automatic classification approach using a Naïve Bayes classifier, achieving 95% accuracy. We find that our system is adequate for real-time tweet classification. Copyright 2013 ACM.
Language:
English
Type (Professor's evaluation):
Scientific
No. of pages:
6