Abstract
In big data anomaly detection, traditional methods often struggle with complex noise and interference, which motivates combining several techniques. This paper proposes a detection framework that integrates multiple strategies around the random forest algorithm to improve the accuracy and efficiency of anomaly detection. First, the data are preprocessed using exponential smoothing to suppress noise and potential outliers and ensure data quality. Next, k-means clustering is used to group the data for subsequent processing. Principal component analysis (PCA) is then used for feature extraction, reducing the data dimension while retaining the main features and thereby minimising the impact of redundant information. On this basis, multiple decision trees are constructed using the random forest algorithm, and ensemble learning and random sampling strategies are employed to enhance the stability and accuracy of the model. After several iterations and weight updates, the final model outputs the classification results and completes anomaly detection. Experiments show that the proposed method completes data processing within 25 seconds, with an accuracy of 92.3% and a false positive rate of only 4.3%, supporting its performance and practical value in a big data environment.
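As a rough illustration of the preprocessing step and of how the two reported metrics are defined, the following minimal sketch assumes simple (single) exponential smoothing and the usual confusion-matrix definitions of accuracy and false positive rate. The function names, the smoothing factor alpha, and the toy data are illustrative assumptions and are not taken from the paper; the clustering, PCA, and random forest stages are not shown.

```python
from typing import List, Sequence, Tuple


def exponential_smoothing(series: Sequence[float], alpha: float = 0.3) -> List[float]:
    """Simple exponential smoothing (assumed variant).

    s[0] = x[0];  s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    for x in series:
        if not smoothed:
            # Initialise the smoothed series with the first observation.
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


def accuracy_and_fpr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, false positive rate); label 1 = anomaly, 0 = normal."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(y_true)
    # FPR is the share of normal samples wrongly flagged as anomalies.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


if __name__ == "__main__":
    # Toy series with one spike; alpha = 0.5 is an arbitrary illustrative choice.
    print(exponential_smoothing([10.0, 10.0, 50.0, 10.0], alpha=0.5))
    # Toy labels: four normal samples followed by four anomalies.
    print(accuracy_and_fpr([0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]))
```

On the toy inputs, the spike of 50 is damped to 30 in the smoothed series, and the label example gives an accuracy of 0.875 with a false positive rate of 0.25. A smaller alpha smooths more aggressively but responds more slowly to genuine level changes.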

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Copyright (c) 2026 Journal of Cyber Security and Mobility
