Classification of Phishing Email Using Word Embedding and Machine Learning Techniques


  • Somesha M., Information Security Research Lab, Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, Karnataka, India, 575025
  • Alwyn R. Pais, Information Security Research Lab, Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, Karnataka, India, 575025



Email phishing detection, Word embedding, Machine Learning, Word2Vec, FastText, TF-IDF, Count Vectorization


Email phishing is a cyber-attack that causes substantial financial damage to corporate and commercial organizations. A phishing email is a special type of spam that tricks the user into disclosing personal information, which the attacker then uses to access the user's digital assets. A phishing attack is generally triggered by emailing links to spoofed websites that collect sensitive information. The APWG survey suggests that existing countermeasures remain ineffective and insufficient for detecting phishing attacks. Hence, there is a need for an efficient mechanism to detect phishing emails and give the common user better security against such attacks. The existing open-source datasets are limited in diversity and therefore do not capture the real picture of the attack, so a real-time input dataset is needed to design accurate email anti-phishing solutions. In the current work, a real-time in-house corpus of phishing and legitimate emails is created, and efficient techniques are proposed to detect phishing emails using word embedding and machine learning algorithms. The proposed system uses only four email header-based heuristics to classify emails. The proposed framework pairs six word embedding techniques with five machine learning classifiers to identify the best-performing combination. Across the combinations evaluated, Random Forest consistently performed best: with FastText (CBOW) it achieved an accuracy of 99.50% with a false positive rate of 0.053%, with TF-IDF an accuracy of 99.39% with a false positive rate of 0.4%, and with Count Vectorizer an accuracy of 99.18% with a false positive rate of 0.98%, respectively, for the three datasets used.
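As a rough illustration of two of the vectorization schemes named above (Count Vectorization and TF-IDF) and of the two reported metrics, the following is a minimal standard-library sketch, not the authors' implementation. The sample header strings, the function names, the smoothed IDF formula, and the convention that phishing is the positive class (so that false positive rate = FP / (FP + TN)) are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_vectorize(docs):
    """Return (vocabulary, rows) where each row holds raw term counts."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows


def tfidf(rows):
    """Weight raw counts by a smoothed inverse document frequency.

    Assumed variant: idf = ln((1 + N) / (1 + df)) + 1.
    """
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    df = [sum(1 for r in rows if r[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    return [[r[j] * idf[j] for j in range(n_terms)] for r in rows]


def accuracy_and_fpr(y_true, y_pred):
    """Labels (assumed): 1 = phishing (positive), 0 = legitimate (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


if __name__ == "__main__":
    # Hypothetical header text; the real corpus and header fields differ.
    headers = [
        "Verify your account",
        "Meeting agenda",
        "verify account now",
    ]
    vocab, counts = count_vectorize(headers)
    weights = tfidf(counts)
    print(vocab)
    print(counts[0])
    print([round(w, 3) for w in weights[0]])
```

In a full pipeline, the resulting vectors would be passed to a classifier such as Random Forest, and the predictions scored with accuracy_and_fpr; that step is omitted here to keep the sketch dependency-free.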



Author Biographies

Somesha M., Information Security Research Lab, Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, Karnataka, India, 575025

Somesha M. is an Assistant Professor and Head of the Department of Computer Science and Engineering, Government Engineering College, Karwar, Karnataka, India. He received his B.E. (CSE) from Bangalore University, India, and his M.Tech. (CNE) from NIE Mysore, India. He is currently pursuing a Ph.D. in the Department of Computer Science & Engineering, National Institute of Technology Karnataka (NITK), Surathkal. His areas of interest include Information Security, Computer Networks, and Cyber Security.

Alwyn R. Pais, Information Security Research Lab, Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, Karnataka, India, 575025

Alwyn R. Pais is an Associate Professor and Research Guide in the Department of Computer Science and Engineering, National Institute of Technology Karnataka (NITK). He received his B.Tech. (CSE) from Mangalore University, India, his M.Tech. (CSE) from IIT Bombay, India, and his Ph.D. (CSE) from NITK, Surathkal. His areas of interest include Information Security, Image Processing, and Computer Vision.

