Learning to Use Normalization Techniques for Preprocessing and Classification of Text Documents

Authors

  • K.M.G.S. Karunarathna, Department of Economics and Statistics, Faculty of Social Sciences and Languages, Sabaragamuwa University of Sri Lanka, Belihuloya
  • R.A.H.M. Rupasingha, Department of Economics and Statistics, Faculty of Social Sciences and Languages, Sabaragamuwa University of Sri Lanka, Belihuloya

Abstract

Text classification is one of the most important areas of natural language processing. In this task, text documents are assigned to categories according to the researcher’s purpose. The first phase of text classification is text preprocessing, in which cleaning and preparing the text data are the key tasks, and normalization techniques play a major role in accomplishing them. Several kinds of normalization techniques are available. This research focuses on different normalization techniques and how to apply them in text preprocessing. Normalization techniques reduce the number of words in text files and convert words to a standard form. This helps in analyzing unstructured text and improves the efficiency and performance of the text classification process. For text classification, it is important to extract the most reliable and relevant words from the text files, because successful classification depends on feature extraction. This study uses lowercasing, tokenization, stop word removal, and lemmatization as normalization techniques. These techniques were evaluated on 200 English-language text documents from two domains, namely formal news articles and informal letters, obtained from the Internet. The experimental results, which compare the text files before and after normalization, show the effectiveness of normalization techniques for the preprocessing and classification of text documents. Based on the comparison, we identified that these normalization techniques help to clean and prepare text data for effective and accurate text document classification.
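
The four normalization steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word set and the lemma lookup table below are tiny hypothetical stand-ins for the full resources a real lemmatizer and stop-word list would provide, and the regular-expression tokenizer is a deliberate simplification.

```python
import re

# Assumption: a tiny illustrative stop-word list; real lists are much larger.
STOP_WORDS = {"the", "a", "an", "is", "are", "were", "in", "on", "of", "and", "to"}

# Assumption: a toy lemma lookup; a real lemmatizer relies on a full dictionary.
LEMMAS = {
    "articles": "article",
    "letters": "letter",
    "written": "write",
    "studies": "study",
}


def normalize(text):
    """Apply lowercasing, tokenization, stop word removal, and lemmatization."""
    # 1. Lowercasing: "News" and "news" become the same token.
    lowered = text.lower()
    # 2. Tokenization: keep runs of letters and digits, dropping punctuation.
    tokens = re.findall(r"[a-z0-9]+", lowered)
    # 3. Stop word removal: discard very common words that carry little meaning.
    content = [t for t in tokens if t not in STOP_WORDS]
    # 4. Lemmatization: map each word to its base form when the table knows it.
    return [LEMMAS.get(t, t) for t in content]


example = "The news articles and the letters were written in English."
print(normalize(example))
```

In this sketch the ten-word example sentence shrinks to five normalized tokens, which mirrors the before-and-after comparison the abstract describes.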

KEYWORDS: Preprocessing, Normalization, Techniques, Cleaning documents, Text classification

Published

2022-07-15