Phisher Fighter: Website Phishing Detection System Based on URL and Term Frequency-Inverse Document Frequency Values




Phishing, Machine Learning, Logistic Regression, Random Forest, Support Vector Machine, TF-IDF


Phishing is a common cybercrime in which attackers use fictitious websites to trick unsuspecting, trusting individuals into revealing their unique and sensitive information. The primary intention of this kind of cybercrime is to gain access to the victims' personal or classified information. The obtained data comprises information that can be used to identify an individual. The stolen personal or sensitive information is commonly sold on online dark markets, where it is subsequently bought by identity thieves. Depending on the sensitivity and importance of the stolen information, the price of a single piece can vary from a few dollars to thousands of dollars. Machine learning (ML) and deep learning (DL) are powerful methods for analysing and countering these phishing attacks. A machine learning based phishing detection system is proposed to protect websites and users from such attacks. To improve the results, the TF-IDF (Term Frequency-Inverse Document Frequency) values of webpages are employed within the system. ML methods such as LR (Logistic Regression), RF (Random Forest), SVM (Support Vector Machine), NB (Naive Bayes) and SGD (Stochastic Gradient Descent) are applied for training and testing on the obtained dataset. As a result, a robust phishing website detection system is developed with 90.68% accuracy.
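As a rough illustration of the TF-IDF weighting mentioned above, the following minimal sketch computes term weights for a few tokenized pages. It assumes the textbook definition (term frequency multiplied by the logarithm of the inverse document frequency); the abstract does not state which TF-IDF variant the system uses, and the function name `tf_idf` and the toy `pages` corpus are hypothetical and not taken from the paper's dataset.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumes the textbook form: tf = count / document length,
    idf = log(N / document frequency). Other variants exist.
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus of page tokens (illustrative only).
pages = [
    ["verify", "your", "account", "login", "now"],
    ["welcome", "to", "your", "account", "dashboard"],
    ["latest", "news", "and", "weather"],
]
scores = tf_idf(pages)
```

In this sketch, a term that appears in only one page (such as "verify") receives a higher weight than a term shared across pages (such as "your"). Vectors of this kind could then serve as input features for classifiers like those listed in the abstract.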



Author Biographies

E. Sri Vishva, Vellore Institute of Technology, Vellore, Tamil Nadu, India

E. Sri Vishva is currently pursuing a bachelor’s degree in Computer Science and Engineering at Vellore Institute of Technology, Vellore. He is currently a junior student in the School of Computer Science and Engineering. His research areas include Information Security and Machine Learning.

D. Aju, Vellore Institute of Technology, Vellore, Tamil Nadu, India

D. Aju received his Ph.D. in Computer Science and Engineering from Vellore Institute of Technology, Vellore, India. He received his M.Tech. degree in Computer Science and IT from Manonmaniam Sundaranar University, Tirunelveli, and his M.C.A. degree from Madras University, India. He is presently working as an Associate Professor at Vellore Institute of Technology in the Department of Information Security, School of Computer Science and Engineering. He has published more than 30 research articles in reputed international peer-reviewed journals and has served as a reviewer for a few international peer-reviewed journals. He has more than 16 years of teaching and research experience. He received research awards in consecutive years from 2014 to 2019 for his outstanding contribution towards research and publication at Vellore Institute of Technology. His research interests include Digital Image Processing, Medical Imaging, Computer Graphics, Cyber Security and Digital Forensics.

