Improving Phishing Website Detection Using a Hybrid Two-level Framework for Feature Selection and XGBoost Tuning
DOI:
https://doi.org/10.13052/jwe1540-9589.2237
Keywords:
XGBoost, artificial intelligence, web security, swarm intelligence, metaheuristics optimization, firefly algorithm
Abstract
In the last few decades, the World Wide Web has become a necessity that offers numerous services to end users. The number of online transactions grows daily, and so does the number of malicious actors. Machine learning plays a vital role in the majority of modern security solutions. To further improve web security, this paper proposes a hybrid approach based on the eXtreme Gradient Boosting (XGBoost) machine learning model, optimized by an improved version of the well-known firefly algorithm, a swarm intelligence metaheuristic. The improved firefly algorithm is employed in a two-tier framework, also developed as part of this research, to perform both feature selection and tuning of the XGBoost hyperparameters. The introduced hybrid model is evaluated on three well-known, publicly available phishing website datasets: the first two were provided by Mendeley Data, while the third was acquired from the University of California, Irvine (UCI) machine learning repository. The performance of the newly introduced algorithms is also compared against cutting-edge metaheuristics utilized within the same framework. Additionally, the best-performing models were subjected to SHapley Additive exPlanations (SHAP) analysis to determine the impact of each feature on model decisions. The obtained results suggest that the proposed hybrid solution outperforms the other approaches and that it represents a promising solution in the domain of web security.
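The two-tier idea described in the abstract can be illustrated with a rough sketch. The code below is not the paper's improved firefly algorithm; it is the basic firefly algorithm in its commonly published form (attractiveness decaying as beta0 * exp(-gamma * r^2), plus a small random step), and the solution encoding is an assumption made for illustration: one flat vector whose first n_features entries select features and whose last two entries map to two XGBoost-style hyperparameters. The function names, parameter ranges, and the 0.5 threshold are hypothetical. In a real setup, the objective would train an XGBoost classifier on the selected columns and return its validation error; here any callable can be supplied.

```python
import math
import random


def decode(solution, n_features):
    """Split one flat solution vector into a feature mask and hyperparameters.

    Assumed encoding (hypothetical, not taken from the paper): the first
    n_features values in [0, 1] are thresholded at 0.5 to select features;
    the next two values are scaled to a learning rate and a tree depth.
    """
    mask = [v > 0.5 for v in solution[:n_features]]
    lr_raw, depth_raw = solution[n_features:n_features + 2]
    params = {
        "learning_rate": 0.01 + lr_raw * (0.9 - 0.01),  # assumed range 0.01..0.9
        "max_depth": 3 + int(round(depth_raw * 7)),     # assumed range 3..10
    }
    return mask, params


def firefly_minimize(objective, dim, n_fireflies=15, n_iter=40,
                     alpha=0.2, beta0=1.0, gamma=1.0, seed=0):
    """Basic firefly algorithm minimizing `objective` on [0, 1]^dim."""
    rng = random.Random(seed)
    swarm = [[rng.random() for _ in range(dim)] for _ in range(n_fireflies)]
    cost = [objective(x) for x in swarm]
    best_idx = min(range(n_fireflies), key=cost.__getitem__)
    best_x, best_cost = list(swarm[best_idx]), cost[best_idx]

    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if cost[j] < cost[i]:  # j is brighter (lower cost): move i toward j
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    moved = [
                        # attraction term plus random step, clipped to [0, 1]
                        min(1.0, max(0.0, a + beta * (b - a)
                                     + alpha * (rng.random() - 0.5)))
                        for a, b in zip(swarm[i], swarm[j])
                    ]
                    swarm[i] = moved
                    cost[i] = objective(moved)
                    if cost[i] < best_cost:
                        best_x, best_cost = list(moved), cost[i]
        alpha *= 0.97  # gradually shrink the random step
    return best_x, best_cost
```

To use the sketch, call firefly_minimize with an objective that decodes each candidate via decode, evaluates a model on the selected features with the decoded hyperparameters, and returns an error value to minimize. Searching features and hyperparameters in one vector is only one possible design; the two levels could also be optimized in separate stages.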