Efficient Pre-Processing Techniques for Improving Classifiers Performance


  • S. Nickolas Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu 620015, India
  • K. Shobha High Performance Computing Lab, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu 620015, India https://orcid.org/0000-0002-6208-2705




Keywords

Data mining, data pre-processing, decision trees, Expectation Maximization (EM) algorithms, neural networks.


Abstract

Data pre-processing plays a vital role in the data mining life cycle for achieving quality outcomes. This paper experimentally demonstrates the importance of data pre-processing for obtaining highly accurate classifier results: missing values are imputed with a novel method, CLUSTPRO; highly correlated features are selected using Correlation-based Variable Selection (CVS); and imbalanced data are handled with the Synthetic Minority Over-sampling Technique (SMOTE). The proposed CLUSTPRO method uses Random Forest (RF) and Expectation Maximization (EM) algorithms to impute missing values. The imputed results are evaluated with standard evaluation metrics, and CLUSTPRO outperforms existing state-of-the-art imputation methods. The combined approach of imputation, feature selection, and imbalanced-data handling yields an improvement in classification accuracy (AUC) of 40%–50% compared with results obtained without any pre-processing.
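The three pre-processing stages described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: CLUSTPRO combines Random Forest and EM and is not reproduced here, so the imputation step uses a simple column-mean fill as a stand-in; the CVS stage is approximated by a plain Pearson-correlation filter; and `smote` is a simplified version of the SMOTE interpolation idea.

```python
import numpy as np

def mean_impute(X):
    """Stand-in for CLUSTPRO imputation: fill each NaN with its column mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = np.take(col_means, cols)
    return X

def correlated_features(X, y, threshold=0.3):
    """Correlation-based filter (CVS-like): keep features whose absolute
    Pearson correlation with the target meets the threshold."""
    keep = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) >= threshold:
            keep.append(j)
    return keep

def smote(X_min, n_new, k=3, rng=None):
    """Simplified SMOTE: interpolate between a random minority sample and
    one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(nbrs)
        gap = rng.random()              # interpolation fraction in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

In practice the same pipeline shape (impute, then filter features, then oversample the minority class before training) can be assembled from library components, e.g. scikit-learn imputers and the imbalanced-learn SMOTE implementation.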



Author Biographies

S. Nickolas, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu 620015, India

S. Nickolas is a Professor in the Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India. He received his M.E. in Computer Science from REC, Trichy, in 1992 and his Ph.D. from NIT, Trichy, in 2007. He is the Professor In-Charge of the Massively Parallel Programming Laboratory, NVIDIA CUDA Teaching Centre, NIT, Trichy. His research interests include Evolutionary Algorithms, Data Mining, Big Data Analytics, Distributed Computing, Cloud Computing and Software Metrics.

K. Shobha, High Performance Computing Lab, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu 620015, India

K. Shobha is a Research Scholar in the Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India. Her research interests include Data Mining, Big Data Analytics, Cloud Computing, Software Metrics and Computer Networks.


References

K. Shobha and S. Nickolas. Analysis of importance of pre-processing in prediction of hypertension. CSI Transactions on ICT, 6(2):209–214, 2018.

Hamza Turabieh, Amer Abu Salem, and Noor Abu-El-Rub. Dynamic L-RNN recovery of missing data in IoMT applications. Future Generation Computer Systems, 89:575–583, 2018.

Amir Momeni, Matthew Pincus, and Jenny Libien. Imputation and missing data. In Introduction to Statistical Methods in Pathology, pages 185–200. Springer, 2018.

Unai Garciarena and Roberto Santana. An extensive analysis of the interaction between missing data types, imputation methods, and supervised classifiers. Expert Systems with Applications, 89:52–65, 2017.

Donald B Rubin. Multiple imputation for nonresponse in surveys, volume 81. John Wiley & Sons, 2004.

Barry L Ford. An overview of hot-deck procedures. Incomplete data in sample surveys, 2(Part IV):185–207, 1983.

Saiedeh Haji-Maghsoudi, Azam Rastegari, Behshid Garrusi, and Mohammad Reza Baneshi. Addressing the problem of missing data in decision tree modeling. Journal of Applied Statistics, 45(3):547–557, 2018.

Suhani Sen, Madhabananda Das, and Rajdeep Chatterjee. Estimation of incomplete data in mixed dataset. In Progress in Intelligent Computing Techniques: Theory, Practice, and Applications, pages 483–492. Springer, 2018.

Min Chen, Yixue Hao, Kai Hwang, Lu Wang, and Lin Wang. Disease prediction by machine learning over big data from healthcare communities. IEEE Access, 5:8869–8879, 2017.

Kezban Yagci Sokat, Irina S Dolinskaya, Karen Smilowitz, and Ryan Bank. Incomplete information imputation in limited data environments with application to disaster response. European Journal of Operational Research, 269(2):466–485, 2018.

Heikki Junninen, Harri Niska, Kari Tuppurainen, Juhani Ruuskanen, and Mikko Kolehmainen. Methods for imputation of missing values in air quality data sets. Atmospheric Environment, 38(18):2895–2907, 2004.

José M Jerez, Ignacio Molina, Pedro J García-Laencina, Emilio Alba, Nuria Ribelles, Miguel Martín, and Leonardo Franco. Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artificial Intelligence in Medicine, 50(2):105–115, 2010.

Danh V Nguyen, Naisyin Wang, and Raymond J Carroll. Evaluation of missing value estimation for microarray data. Journal of Data Science, 2(4):347–370, 2004.

Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001.

Gerhard Tutz and Shahla Ramzan. Improved methods for the imputation of missing data by nearest neighbor methods. Computational Statistics & Data Analysis, 90:84–99, 2015.

Kay I Penny and Thomas Chesney. Imputation methods to deal with missing values when data mining trauma injury data. In 28th International Conference on Information Technology Interfaces, 2006., pages 213–218. IEEE, 2006.

Suzan Arslanturk, Mohammad-Reza Siadat, Theophilus Ogunyemi, Kim Killinger, and Ananias Diokno. Analysis of incomplete and inconsistent clinical survey data. Knowledge and Information Systems, 46(3):731–750, 2016.

Geert JMG Van der Heijden, A Rogier T Donders, Theo Stijnen, and Karel GM Moons. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. Journal of Clinical Epidemiology, 59(10):1102–1109, 2006.

Imran Kurt, Mevlut Ture, and A Turhan Kurum. Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease. Expert Systems with Applications, 34(1):366–374, 2008.

Daniel LaFreniere, Farhana Zulkernine, David Barber, and Ken Martin. Using machine learning to predict hypertension from a clinical dataset. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–7. IEEE, 2016.

Eric P Xing, Michael I Jordan, Richard M Karp, et al. Feature selection for high-dimensional genomic microarray data. In ICML, volume 1, pages 601–608. Citeseer, 2001.

Yiming Yang and Jan O Pedersen. A comparative study on feature selection in text categorization. In ICML, volume 97, page 35, 1997.

Yong Rui, Thomas S Huang, and Shih-Fu Chang. Image retrieval: Current techniques, promising directions, and open issues. Journal of Visual Communication and Image Representation, 10(1):39–62, 1999.

Kiansing Ng and Huan Liu. Customer retention via data mining. Artificial Intelligence Review, 14(6):569–590, 2000.

Lei Yu and Huan Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 856–863, 2003.

Sanmay Das. Filters, wrappers and a boosting-based hybrid for feature selection. In ICML, volume 1, pages 74–81, 2001.

Ron Kohavi and George H John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

Mark Andrew Hall. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999.

Azlyna Senawi, Hua-Liang Wei, and Stephen A Billings. A new maximum relevance-minimum multicollinearity (MRMMC) method for feature selection and ranking. Pattern Recognition, 67:47–61, 2017.

Guodong Zhao, Yan Wu, Fuqiang Chen, Junming Zhang, and Jing Bai. Effective feature selection using feature vector graph for classification. Neurocomputing, 151:376–389, 2015.

Cheng-Lung Huang and Chieh-Jen Wang. A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications, 31(2):231–240, 2006.

Surya S Durbha, Roger L King, and Nicolas H Younan. Wrapper-based feature subset selection for rapid image information mining. IEEE Geoscience and Remote Sensing Letters, 7(1):43–47, 2009.

Pablo Bermejo, Jose A Gámez, and Jose M Puerta. A GRASP algorithm for fast hybrid (filter-wrapper) feature subset selection in high-dimensional datasets. Pattern Recognition Letters, 32(5):701–711, 2011.

Saúl Solorio-Fernández, J Ariel Carrasco-Ochoa, and José Fco Martínez-Trinidad. A new hybrid filter–wrapper feature selection method for clustering based on ranking. Neurocomputing, 214:866–880, 2016.

Satyam Maheshwari, Jitendra Agrawal, and Sanjeev Sharma. New approach for classification of highly imbalanced datasets using evolutionary algorithms. Intl. J. Sci. Eng. Res, 2:1–5, 2011.

Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

Kai Ming Ting. An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, 14(3):659–665, 2002.

Konstantinos Veropoulos, Colin Campbell, Nello Cristianini, et al. Controlling the sensitivity of support vector machines. In Proceedings of the international joint conference on AI, volume 55, page 60, 1999.

Riccardo Poli, Stefano Cagnoni, Riccardo Livi, Giuseppe Coppini, and Guido Valli. A neural network expert system for diagnosing and treating hypertension. Computer, 24(3):64–71, 1991.

Justin B Echouffo-Tcheugui, G David Batty, Mika Kivimäki, and Andre P Kengne. Risk models to predict hypertension: a systematic review. PLoS ONE, 8(7):e67370, 2013.

Mevlut Ture, Imran Kurt, A Turhan Kurum, and Kazim Ozdamar. Comparing classification techniques for predicting essential hypertension. Expert Systems with Applications, 29(3):583–588, 2005.

Gail A Carpenter and Stephen Grossberg. Adaptive resonance theory. Springer, 2017.

Daniel J Stekhoven and Peter Bühlmann. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118, 2011.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.

Natalie K Donovan, Kenneth A Foster, and Carlos Alberto Parra Salinas. Analysis of green coffee quality using hermetic bag storage. Journal of Stored Products Research, 80:1–9, 2019.

Daniel J Stekhoven. missForest: Nonparametric missing value imputation using random forest. Astrophysics Source Code Library, 2015.

Craig K Enders. Using the expectation maximization algorithm to estimate coefficient alpha for scales with item-level missing data. Psychological Methods, 8(3):322, 2003.

Liang Zhao, Zhikui Chen, Zhennan Yang, and Yueming Hu. A hybrid method for incomplete data imputation. In 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, pages 1725–1730. IEEE, 2015.

Aditya Dubey and Akhtar Rasool. Data mining based handling missing data. In 2019 Third International conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC), pages 483–489. IEEE, 2019.