Abstract (EN):
Classifying web queries into a set of categories is crucial for understanding the intent behind a query, contextualizing the search and providing more relevant results to the user. However, web queries are typically short and ambiguous, which makes query classification a non-trivial problem. In this article, we present a new automatic approach for identifying and characterizing queries in the health domain. The method uses search engine hit counts through a semantic similarity measure called Normalized Google Distance (NGD), combined with Support Vector Machines, to classify queries along three dimensions: health-related, severity and semantic type. To evaluate our methods, we used two datasets in different languages, Portuguese and English, and built another dataset for evaluating the last dimension. Overall, the results achieved were satisfactory. The most generic classification obtained better results than the more specific ones. The NGD proved to be a valuable asset in query classification.
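For readers unfamiliar with the measure, the standard NGD definition can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `ngd`, its parameters, and the index size `n` are assumptions, and the hit counts would have to be obtained from a search engine separately.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance computed from raw hit counts.

    fx, fy : hits for each term on its own
    fxy    : hits for pages containing both terms
    n      : assumed total number of indexed pages (hypothetical value)
    """
    # If either term is unseen, or the terms never co-occur,
    # the distance is treated as infinite.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

Values near 0 indicate terms that almost always appear together, and larger values indicate weaker association. One plausible use, stated here only as an assumption, is to compute the distance between a query and a few health-related seed terms and pass those distances as features to a Support Vector Machine.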
Language:
English
Type (Professor's evaluation):
Scientific
No. of pages:
4