Detection of Malicious URLs using Machine Learning based on Lexical Features
DOI:
https://doi.org/10.31357/contre.v1i1.7384
Keywords:
malicious URLs, lexical features, cybercrime, classifiers, machine learning
Abstract
As the digital world evolves, the risk of valuable information being exposed to unauthorized parties is increasing. One common threat is the use of malicious Uniform Resource Locators (URLs), which are fraudulent links spread across various platforms such as social media and email. Traditional methods of identifying these URLs, such as blacklisting and heuristic search, rely heavily on syntax or keyword matching and struggle to keep up with the evolving tactics of cyber attackers. Hence, this paper proposes a solution for detecting malicious URLs and their types based on lexical features. Lexical features of a URL are the components that convey semantic and lexical meaning. These can include domain names, path lengths, special characters, and other elements that can be analyzed for patterns or anomalies. In our proposed method, we use 23 different lexical features that focus on the semantic and lexical meaning of the URLs. Exploratory Data Analysis (EDA) is used to filter the most important lexical features that effectively contribute to predictions. With these carefully curated features, we treat the problem as a multi-class classification task and assess the performance of three distinct classifiers: Random Forest, a pure bagging technique that currently stands as the domain's best solution, and XGBoost and LightGBM, both of which use boosting techniques. With the proposed method, we achieve over 93% accuracy for all three classifiers, with Random Forest achieving the best performance.
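To make the idea of lexical features concrete, the following is a minimal Python sketch of how a few such features might be computed from a raw URL string. It is an illustration only: the function name `lexical_features`, the feature names, and the `SPECIAL_CHARS` set are assumptions made here, not the 23 features used in the paper, which the abstract does not enumerate.

```python
from urllib.parse import urlparse

# Hypothetical set of characters to count; the paper's exact choice is not stated.
SPECIAL_CHARS = "@-_?=&%."


def lexical_features(url: str) -> dict:
    """Compute a handful of illustrative lexical features from a URL string."""
    # urlparse only recognises the host when a scheme is present,
    # so prepend one for scheme-less inputs (an assumption of this sketch).
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "hostname_length": len(host),
        "path_length": len(parsed.path),
        "digit_count": sum(ch.isdigit() for ch in url),
        "special_char_count": sum(ch in SPECIAL_CHARS for ch in url),
        "hostname_dot_count": host.count("."),
        # Crude check: a host made only of digits and dots looks like an IPv4 address.
        "has_ip_host": host.replace(".", "").isdigit(),
    }


if __name__ == "__main__":
    for name, value in lexical_features("http://192.168.0.1/login.php?id=1").items():
        print(f"{name}: {value}")
```

A dictionary of this kind, computed per URL, could then be assembled into a feature table and passed to any of the classifiers named in the abstract.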