Comparative Analysis of Popular Supervised Machine Learning Algorithms for Detecting Malicious Universal Resource Locators

Authors

  • Zambia Diko, University of Fort Hare, Computer Science Department, Alice, 5700, Eastern Cape, South Africa
  • Khulumani Sibanda, Walter Sisulu University, Applied Informatics and Mathematical Sciences Department, Buffalo City Campus, East London, 5200, Eastern Cape, South Africa

DOI:

https://doi.org/10.13052/jcsm2245-1439.13513

Keywords:

Malicious universal resource locators, URLs, detection, random forest, RF, light gradient boosting, LightGBM, extreme gradient boosting, XGBoost, supervised machine learning algorithms

Abstract

Malicious Universal Resource Locators (URLs), also referred to as malicious websites, have become a serious concern for the cyber security administrators of organisations, institutions, agencies, businesses and companies. These websites host malware, spam, drive-by links and phishing content. Internet users worldwide visit such sites and fall victim to cybercrimes such as the theft of credit card credentials, personal information, and monetary savings or investments. Many researchers have attempted to design and implement responses to the malicious URL threat. The approaches fall largely into two groups: traditional approaches (blacklisting and heuristics) and data-driven approaches (statistical methods, machine learning methods, data mining methods, and deep learning methods). In some instances, there are divergent views on which algorithm is best for building models. To our knowledge, few works have comparatively analysed the performance of the machine learning algorithms that various authors have identified as the most suitable for building detection models. This study therefore focused on the Light Gradient Boosting (LightGBM), Extreme Gradient Boosting (XGBoost) and Random Forest (RF) algorithms. For the study’s experiments, a malicious URLs dataset was downloaded from Kaggle.com. The results demonstrated that hostname_length was the most important feature when building malicious URL detection models with all three algorithms. The results also revealed two further important features, count_www and count_dir, when using Extreme Gradient Boosting and Random Forest. Future work will explore hybrid models that combine the advantages of various algorithms to improve performance. Other models that will be considered include Support Vector Machines, neural networks and deep learning models.
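
The three features named in the abstract are lexical properties of the URL string. The sketch below shows one plausible way to compute them in Python using only the standard library. The abstract does not give the feature definitions, so the ones used here are assumptions: hostname length in characters, occurrences of "www" anywhere in the URL, and "/" separators in the path. The function name extract_features is hypothetical, and this is not presented as the authors' implementation.

```python
from urllib.parse import urlparse


def extract_features(url: str) -> dict:
    """Compute three lexical URL features (assumed definitions, see lead-in)."""
    # urlparse only recognises the host when a scheme or a leading "//"
    # is present, so bare URLs such as "example.com/login" get a "//" prefix.
    parsed = urlparse(url if "://" in url else "//" + url)
    hostname = parsed.hostname or ""
    return {
        # Assumed: number of characters in the host part (port excluded).
        "hostname_length": len(hostname),
        # Assumed: occurrences of "www" anywhere in the URL, case-insensitive.
        "count_www": url.lower().count("www"),
        # Assumed: number of "/" separators in the path component.
        "count_dir": parsed.path.count("/"),
    }


if __name__ == "__main__":
    for sample in ("http://www.example.com/a/b/index.html", "example.com/login"):
        print(sample, extract_features(sample))
```

Rows of such features, paired with benign or malicious labels, could then be passed to a tree-based classifier such as Random Forest, XGBoost or LightGBM, whose feature importance scores give a ranking of the kind reported in the abstract.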

Author Biographies

Zambia Diko, University of Fort Hare, Computer Science Department, Alice, 5700, Eastern Cape, South Africa

Zambia Diko received the Bachelor of Science (Hons) degree in computer science from the University of Fort Hare. She recently completed her master's degree with distinction at the same university, supervised by Prof Khulumani Sibanda. Her research interests are in machine learning modelling and artificial intelligence.

Khulumani Sibanda, Walter Sisulu University, Applied Informatics and Mathematical Sciences Department, Buffalo City Campus, East London, 5200, Eastern Cape, South Africa

Khulumani Sibanda is the Head of Department and Ass. Professor in the Department of Applied Informatics and Mathematical Sciences, Faculty of Engineering, Built Environment and Information Technology, at Walter Sisulu University in South Africa. He holds a PhD in Computer Science, and his research interests are in machine learning and artificial intelligence. Much of his research has been on prediction and classification models. He has supervised 5 PhD and 21 MSc students to completion.

Published

2024-09-03

How to Cite

1. Diko Z, Sibanda K. Comparative Analysis of Popular Supervised Machine Learning Algorithms for Detecting Malicious Universal Resource Locators. JCSANDM [Internet]. 2024 Sep. 3 [cited 2024 Oct. 14];13(05):1105-28. Available from: https://journals.riverpublishers.com/index.php/JCSANDM/article/view/25723

Section

Cyber Security Issues and Solutions