A Study of Clustering Approaches Applied to Customer Reviews in the Digital Era
DOI:
https://doi.org/10.31357/ait.v4i02.8023
Keywords:
Clustering, Customer Segmentation, Language Model, Marketing, Text Analysis
Abstract
The digital revolution has reshaped the landscape of business transactions, with online platforms generating vast amounts of text data through customer reviews. This paper explores the transformative potential of harnessing this data for customer segmentation, comparing traditional methods such as Term Frequency-Inverse Document Frequency (TF-IDF) and Bag-of-Words (BoW) with state-of-the-art Large Language Models (LLMs) for sentence embeddings. The primary objective is to identify the most effective approach for customer segmentation based on textual data by conducting a comprehensive analysis using clustering approaches. The study investigates the impact of LLMs, specifically BERT, RoBERTa, XLNet, and MPNet, in contrast to TF-IDF and BoW. Through experimentation and evaluation metrics, including the Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index, the research sheds light on the nuanced effectiveness of each method. While LLMs, particularly RoBERTa, demonstrate superior clustering performance, the study acknowledges the subtle impact of spelling correction on these models. The findings provide valuable insights for businesses seeking to understand customer sentiments and preferences, enabling more targeted and personalized strategies in the dynamic digital age. This research contributes to the evolving field of customer analytics by offering a comparative analysis of clustering approaches, laying the foundation for future advancements in text-based customer segmentation.
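The comparison described in the abstract can be illustrated with a minimal sketch: vectorize reviews with BoW and TF-IDF, cluster them, and score each result with the Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index. This is not the authors' pipeline. The toy reviews, the choice of k-means, the number of clusters, and the helper name `evaluate` are all assumptions made for illustration, and scikit-learn is assumed to be installed.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Hypothetical toy reviews; the study's dataset is not reproduced here.
reviews = [
    "fast delivery and great packaging",
    "delivery was fast, packaging was great",
    "quick delivery, well packaged",
    "battery died after two days",
    "battery life is poor and it died quickly",
    "poor battery, stopped working after a week",
]


def evaluate(vectorizer, texts, n_clusters=2, seed=0):
    """Vectorize texts, cluster them with k-means, and return three internal metrics.

    k-means and n_clusters=2 are assumptions; the abstract does not name
    the clustering algorithm or the number of segments.
    """
    # Convert the sparse document-term matrix to a dense array so that
    # all three metric functions accept it.
    features = vectorizer.fit_transform(texts).toarray()
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(features)
    return {
        # Ranges from -1 to 1; higher means better-separated clusters.
        "silhouette": silhouette_score(features, labels),
        # Non-negative; lower means more compact, better-separated clusters.
        "davies_bouldin": davies_bouldin_score(features, labels),
        # Ratio of between- to within-cluster dispersion; higher is better.
        "calinski_harabasz": calinski_harabasz_score(features, labels),
    }


results = {
    "BoW": evaluate(CountVectorizer(), reviews),
    "TF-IDF": evaluate(TfidfVectorizer(), reviews),
}

for name, scores in results.items():
    print(name, {metric: round(float(value), 3) for metric, value in scores.items()})
```

To compare against sentence embeddings in the same way, the `features` array would be replaced by an embedding matrix with one row per review, produced by whichever language model is under test; the clustering and scoring steps stay unchanged.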
License
Copyright (c) 2025 M.N.S. Tissera, P.P.G.D. Asanka, R.A.C.P. Rajapakse

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The authors hold the copyright of their manuscripts, and all articles are circulated under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, as long as the original work is properly cited.
The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations. The authors are responsible for securing any permissions needed for the reuse of copyrighted materials included in the manuscript.