Data Quality Assessment and Recommendation of Feature Selection Algorithms: An Ontological Approach

Authors

  • Aparna Nayak, SFI Centre for Research Training in Machine Learning, School of Computer Science, Technological University Dublin, Dublin, Republic of Ireland
  • Bojan Božić, SFI Centre for Research Training in Machine Learning, School of Computer Science, Technological University Dublin, Dublin, Republic of Ireland
  • Luca Longo, SFI Centre for Research Training in Machine Learning, School of Computer Science, Technological University Dublin, Dublin, Republic of Ireland

DOI:

https://doi.org/10.13052/jwe1540-9589.2219

Keywords:

Data quality, feature selection algorithm, meta-features, ontology, recommendation

Abstract

Feature selection plays an important role in machine learning and data mining. Identifying the feature selection algorithm that best removes irrelevant and redundant features is a complex task. This research addresses it by recommending a feature selection algorithm based on dataset meta-features. The main contribution of the work is the use of Semantic Web principles to develop a recommendation model for feature selection algorithms. Dataset meta-features are modeled in a domain ontology, and a set of Semantic Web Rule Language (SWRL) predictive rules is proposed to recommend a feature selection algorithm. The result of this research is the feature selection algorithm recommendation based on data characteristics and quality (FSDCQ) ontology, which not only supports recommendation but also identifies data points with data quality violations. An experiment is conducted on classification datasets from the UCI repository to evaluate the proposed ontology. The usefulness and effectiveness of the proposed method are evaluated by comparing it with k-NN, a recommendation method widely used in the literature. Results show that the ontology-based recommendations are as good as those of the k-NN model, with added benefits.
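As a rough illustration of the k-NN baseline mentioned in the abstract, the sketch below computes a few simple meta-features for a dataset and recommends an algorithm by majority vote among the nearest known datasets. It is not the authors' implementation: the chosen meta-features, the function names, the knowledge-base entries, and the algorithm labels are all assumptions made for this example, and a real system would likely normalize the meta-features before measuring distance.

```python
import math
from collections import Counter


def meta_features(rows, labels):
    """Return [n_instances, n_features, n_classes, class_entropy].

    This particular set of meta-features is an illustrative assumption.
    """
    counts = Counter(labels)
    n = len(labels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return [float(n), float(len(rows[0])), float(len(counts)), entropy]


def recommend_knn(query, knowledge_base, k=3):
    """Majority vote among the k nearest datasets in meta-feature space."""
    ranked = sorted(knowledge_base, key=lambda item: math.dist(query, item[0]))
    votes = Counter(algorithm for _, algorithm in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical knowledge base: (meta-feature vector, best-known algorithm).
# The values and labels are placeholders, not results from the paper.
KNOWLEDGE_BASE = [
    ([150.0, 4.0, 3.0, 1.58], "ReliefF"),
    ([200.0, 5.0, 3.0, 1.50], "ReliefF"),
    ([5000.0, 60.0, 2.0, 1.00], "InfoGain"),
    ([4800.0, 55.0, 2.0, 0.95], "InfoGain"),
    ([5200.0, 70.0, 2.0, 0.99], "InfoGain"),
]

# Toy dataset: six instances, four features, three balanced classes.
toy_rows = [[1.0, 2.0, 3.0, 4.0]] * 6
toy_labels = ["a", "a", "b", "b", "c", "c"]
toy_query = meta_features(toy_rows, toy_labels)
toy_recommendation = recommend_knn(toy_query, KNOWLEDGE_BASE, k=3)
```

In the ontology-based approach the abstract describes, the same meta-features would instead be asserted as properties of a dataset individual, and SWRL rules would infer the recommendation.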

Author Biographies

Aparna Nayak, SFI Centre for Research Training in Machine Learning, School of Computer Science, Technological University Dublin, Dublin, Republic of Ireland

Aparna Nayak received her M.Tech degree from Manipal Academy of Higher Education, India. She has more than seven years of teaching experience. She is currently pursuing her Ph.D. at the Technological University Dublin, specializing in knowledge graphs. Her current research interests include machine learning and knowledge graphs.

Bojan Božić, SFI Centre for Research Training in Machine Learning, School of Computer Science, Technological University Dublin, Dublin, Republic of Ireland

Bojan Božić is a Lecturer in Computer Science at TU Dublin. He has worked on European research projects such as SANY (Sensor Web Enablement), TaToo (Tagging Tools for Semantic Discovery), Europeana Creative (Cultural Heritage), PELAGIOS (Linked Data), and C2-SENSE (Sensor Web and Interoperability). He has also contributed to the H2020 project ALIGNED, modelling data and software engineering processes through ontologies and annotations for the Dacura platform. His current research interests are Semantic Web, machine learning, and natural language processing.

Luca Longo, SFI Centre for Research Training in Machine Learning, School of Computer Science, Technological University Dublin, Dublin, Republic of Ireland

Luca Longo is a curious individual, deeply devoted to and passionate about science. He strives for excellence and contribution to knowledge. He received his doctoral degree in Artificial Intelligence at Trinity College Dublin after a bachelor's and a master's degree in Computer Science, Statistics and Health Informatics. He is actively engaged in the dissemination of scientific material to the public, as his TEDx talks demonstrate. He has received various awards for both his research and his teaching. With his team of doctoral and post-doctoral students, he conducts fundamental research in explainable artificial intelligence, defeasible reasoning, and non-monotonic argumentation. He also performs applied research in machine learning and predictive data analytics, mainly applied to the problem of mental workload modelling.

Published

2023-04-20

How to Cite

Nayak, A., Božić, B., & Longo, L. (2023). Data Quality Assessment and Recommendation of Feature Selection Algorithms: An Ontological Approach. Journal of Web Engineering, 22(01), 175–196. https://doi.org/10.13052/jwe1540-9589.2219

Section

ICWE2022