Abstract (EN):
E-commerce has become an essential aspect of modern life, providing consumers globally with convenience and accessibility. However, the high volume of short and noisy product descriptions in text streams of massive e-commerce platforms translates into an increased number of clusters, presenting challenges for standard model-based stream clustering algorithms. Standard LDA-based methods often lead to clusters dominated by single elements, effectively failing to manage datasets with varied cluster sizes. Our proposed Community-Based Topic Modeling with Contextual Outlier Handling (CB-TMCOH) algorithm introduces an approach to outlier detection in text data using transformer models for similarity calculations and graph-based clustering. This method efficiently separates outliers and improves clustering in large text datasets, demonstrating its utility not only in e-commerce applications but also proving effective for news and tweets datasets.
Language:
English
Type (Professor's evaluation):
Scientific
No. of pages:
11