Classification of Firewall Log Files with Different Algorithms and Performance Analysis of These Algorithms
DOI:
https://doi.org/10.13052/jwe1540-9589.2344
Keywords:
Firewalls, log files, classification, performance metrics, the Simple Cart algorithm
Abstract
Classifying firewall log files makes it possible to analyse potential threats and decide on appropriate rules to prevent them. In this study, firewall log files are therefore classified using different classification algorithms, and the performance of the algorithms is evaluated using performance metrics. The dataset was prepared from the log files of a firewall and filtered to remove any personal data. It consisted of 12 attributes in total, of which the action attribute was selected as the class. In the performance evaluation, the Simple Cart and NB Tree algorithms made the best predictions, each achieving an accuracy of 99.84%. Decision Stump had the worst prediction performance, with an accuracy of 79.68%. Because the classes in the dataset did not contain equal numbers of instances, the Matthews correlation coefficient (MCC) was also used as a performance metric. The Simple Cart, BF Tree, FT Tree, J48 and NB Tree algorithms achieved the highest average MCC values. However, the other algorithms did not predict the reset-both class successfully, whereas the Simple Cart algorithm made the best predictions for it. The values of the other performance metrics used in this study also support this conclusion. The Simple Cart algorithm is therefore recommended for classifying firewall log files. Because each firewall brand creates and maintains log files in its own format, a prefiltering and parsing approach is also needed to process different log files. For this reason, the study also proposes a novel prefiltering and parsing approach that processes log files with different structures and creates structured datasets from them.
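The reason for reporting the Matthews correlation coefficient alongside accuracy on an imbalanced dataset can be illustrated with a small sketch. The code below is not the study's implementation or data: the class counts are invented, the `allow` label is a hypothetical majority class (only `reset-both` is named in the abstract), and the per-class MCC is computed one-vs-rest, which is an assumption about how the per-class values were obtained.

```python
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcc_one_vs_rest(y_true, y_pred, positive):
    """Binary MCC for one class, treating every other class as negative."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken as 0 when the denominator is 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented, imbalanced toy data: 95 "allow" records, 5 "reset-both" records.
y_true = ["allow"] * 95 + ["reset-both"] * 5

# A trivial classifier that always predicts the majority class.
majority_pred = ["allow"] * 100
majority_acc = accuracy(y_true, majority_pred)
majority_mcc = mcc_one_vs_rest(y_true, majority_pred, "reset-both")

# A classifier that labels every record correctly.
perfect_mcc = mcc_one_vs_rest(y_true, list(y_true), "reset-both")

print(f"majority accuracy: {majority_acc:.2f}")
print(f"majority MCC (reset-both): {majority_mcc:.2f}")
print(f"perfect MCC (reset-both): {perfect_mcc:.2f}")
```

In this toy case the majority-class predictor reaches high accuracy while its MCC for the minority class is zero, which is the kind of gap that accuracy alone would hide.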