Detecting Spam E-mails with Content and Weight-based Binomial Logistic Model

Authors

  • Richa Indu, Department of Computer Science and Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India
  • Sushil Chandra Dimri, Department of Computer Science and Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India

DOI:

https://doi.org/10.13052/jwe1540-9589.2271

Keywords:

Logistic regression, malicious advertisements, maximum likelihood estimation, spam e-mails

Abstract

Spam e-mails are continuously increasing and pose a serious threat to networks and their users. Several efficient detection methods are available, yet spam continues to evolve unpredictably. Considering this, the proposed approach addresses the problem of spam detection by combining traditional content-matching criteria with a modified version of the binomial logistic algorithm. The work generates seven categories for content matching, starting from three basic categories: special words, adult content, and specific symbols and digits. The remaining four categories are derived from the possible combinations of these basic categories. The words selected for each category are carefully curated based on the human psychology of action and reaction. A weight is then assigned to each category to signify its importance, and a threshold criterion is applied before the binomial logistic algorithm is run, which not only increases the efficiency of the proposed algorithm but also reduces the rate of misclassification. The proposed model is tested on six separate datasets of the Enron Spam Corpus, where the maximum and minimum accuracies achieved in spam e-mail classification are 98.31% and 92.575%, respectively. The AUC_ROC scores for the entire Spam Corpus range between 0.927 and 0.983. The proposed algorithm is also compared with other spam detection methods that use logistic regression. Finally, the suggested method can adequately handle a large sample size without compromising efficacy, which is measured using accuracy, precision, recall, F-measure, and AUC_ROC score.
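
The category construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word lists, the weights, the threshold value, the coefficients `B0` and `B1`, and all function names are hypothetical placeholders, and treating the threshold as a pre-filter that skips the logistic step is an assumption about how the criterion is applied. The sketch only shows that three basic categories yield seven categories in total (three singles, three pairs, one triple) and how a category weight could feed a binomial logistic function.

```python
from itertools import combinations
import math
import re

# Hypothetical word lists; the paper's curated lists are not reproduced here.
BASIC = {
    "special_words": {"free", "winner", "urgent", "offer"},
    "adult_content": {"adult", "xxx"},
    "symbols_digits": {"$", "!", "%", "100"},
}


def build_categories(basic):
    """Return every non-empty combination of the basic category names."""
    names = sorted(basic)
    return [
        combo
        for r in range(1, len(names) + 1)
        for combo in combinations(names, r)
    ]


CATEGORIES = build_categories(BASIC)  # 3 singles + 3 pairs + 1 triple = 7

# Assumed weighting: a category that combines more basic categories
# gets a larger weight. The paper's actual weights may differ.
WEIGHTS = {combo: float(len(combo)) for combo in CATEGORIES}

THRESHOLD = 1.0      # assumed cut-off applied before the logistic step
B0, B1 = -4.0, 2.5   # hypothetical coefficients; in practice fitted by MLE


def score_email(text):
    """Return the weight of the category matched by the e-mail text."""
    tokens = set(re.findall(r"[a-z]+|\d+|[^\w\s]", text.lower()))
    hits = tuple(sorted(
        name for name, vocab in BASIC.items() if tokens & vocab
    ))
    return WEIGHTS.get(hits, 0.0)  # 0.0 when no category matches


def spam_probability(text):
    """Apply the threshold first, then the logistic (sigmoid) function."""
    score = score_email(text)
    if score < THRESHOLD:
        return 0.0  # below threshold: treated as ham, model not evaluated
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * score)))
```

With these placeholder values, a message such as "FREE offer!!! win $ now" matches two basic categories and receives a probability above 0.5, while a message with no matching tokens is filtered out by the threshold before the logistic step.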

Author Biographies

Richa Indu, Department of Computer Science and Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India

Richa Indu is currently pursuing a Ph.D. in Computer Science and Engineering at Graphic Era Deemed to be University, Dehradun. She completed an M.Tech. (Hons.) in Computer Science and Engineering at Uttarakhand Technical University, Dehradun, and an M.Sc. (Gold medalist) in Information Technology at Hemavati Nandan Bahuguna University, Srinagar (Garhwal), India. She has published six papers in conferences and journals. Her research interests include machine learning, programming languages, data science, and algorithm design.

Sushil Chandra Dimri, Department of Computer Science and Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India

Sushil Chandra Dimri is currently a professor in the CSE Department at Graphic Era Deemed to be University. He received an M.Tech. from IIT Dhanbad and a Ph.D. in Computer Science from Kumaon University, Nainital, Uttarakhand, India. He has 22 years of experience teaching undergraduate and postgraduate degree courses. He is the author of many books and has published more than 60 papers in national and international conferences and journals. His areas of interest are algorithm design, resource optimization, machine learning, and computer graphics.

Published

2024-02-03

How to Cite

Indu, R., & Dimri, S. C. (2024). Detecting Spam E-mails with Content and Weight-based Binomial Logistic Model. Journal of Web Engineering, 22(07), 939–960. https://doi.org/10.13052/jwe1540-9589.2271

Section

Articles