Web Crawling Algorithm Fusing TF-IDF and Word2Vec Feature Extraction

Authors

  • Xinyue Feng School of Electronic Information, Foshan Polytechnic, Foshan, 528137, China

DOI:

https://doi.org/10.13052/jwe1540-9589.2452

Keywords:

Network information, retrieval, internet worm, TF-IDF, Word2Vec, data extraction

Abstract

Current research focuses on how to efficiently extract and crawl network information because, with the growth of the Internet, network information is becoming more and more diverse. To address the problem of incorrect data extraction and topic judgment of web crawlers, this study proposes a novel approach based on a file inverse frequency algorithm and Word2Vec feature extraction. The new method improves the retrieval capability of web crawlers by using the file inverse frequency algorithm and uses Word2Vec to extract data features, which improves the data extraction capability of current crawlers. The results showed that the F1 values of the research use model were 25.8% and 26.2% higher than those of the digital filtering algorithm, respectively. The total number of localization resources for the research use strategy was 2800 and the network coverage was 81%, which was 12% higher than the optimal strategy. The research use strategy had a shorter retrieval time and the model could recognize the vocabulary of the keywords. Finally, the model used by the research also had a good model processing capability when compared to other models. In summary, the new model built by the research can improve the data retrieval ability and data extraction ability of the web crawler, which provides new research ideas for future web information extraction.

Downloads

Download data is not yet available.

Author Biography

Xinyue Feng, School of Electronic Information, Foshan Polytechnic, Foshan, 528137, China

Xinyue Feng is a lecturer and postgraduate. She received her Bachelor’s degree in Computer Science and Technology from North University of China in 2012 and her Master’s degree in Software Engineering from North University of China in 2015. From 2022 to now, she has been a doctoral candidate in Electrical and Computer Engineering at Maha Salakan University, Thailand, in the research direction of data mining and data analysis. From 2015 to now she has been a full-time teacher at Foshan Polytechnic. She has published 8 academic articles, participated in 6 scientific research projects, authorized 8 patents, published 3 textbooks, and published 2 other academic research and achievements.

References

K. Manjari, R. Sumanth, S. Rousha, and J. Devi. “Extractive Text Summarization from Web pages using Selenium and TF-IDF algorithm,” Proc. Int. Conf. Trends Electron. Inform. (ICOEI), vol. 15, no. 2, pp. 648–652, April, 2020, DOI: 10.1109/ICOEI48184.2020.9142938.

A. Jalilifard, V. F. Caridá, A. F. Mansano, R. S. Cristo, and F. P. da Fonseca. “Semantic sensitive TF-IDF to determine word relevance in documents,” Adv. Comput. Netw. Commun., vol. 2021, no. 2, pp. 327–337, June, 2021, DOI: 10.1007/978-981-33-6987-0_27.

F. Lan. “Research on text similarity measurement hybrid algorithm with term semantic information and TF-IDF method,” Adv. Multimed., vol. 23, no. 5, pp. 2022–2023, April, 2022, DOI: 10.1155/2022/7923262.

M. Suma and P. Madhumathy. “Brakerski-Gentry-Vaikuntanathan fully homomorphic encryption cryptography for privacy preserved data access in cloud assisted Internet of Things services using glow-worm swarm optimization,” Trans. Emerg. Telecommun. Tech., vol. 33, no. 12, pp. 4641–4642, December, 2022, DOI: 10.1002/ett.4641.

M. Aqeel, F. Ali, M. W. Iqbal, T. A. Rana, M. Arif, and M. R. Auwul. “A review of security and privacy concerns in the internet of things (IoT),” J. Sensors, vol. 2022, no. 10, pp. 29–30, September, 2022, DOI: 10.1155/2022/5724168.

Y. Deng, Y. Pei, and C. Li. “Parameter estimation of a susceptible–infected–recovered–dead computer worm model,” Simul., vol. 98, no. 3, pp. 209–220, March, 2022, DOI: 10.1177/00375497211009576.

A. R. Lubis, M. K. Nasution, O. S. Sitompul, and E. M. Zamzami. “The effect of the TF-IDF algorithm in times series in forecasting word on social media,” Indones. J. Electr. Eng. Comput. Sci., vol. 22, no. 2, pp. 976–984, February, 2021, DOI: 10.11591/ijeecs.v22.i2.pp976-984.

L. Cheng, Y. Yang, K. Zhao, and Z. Gao. “Research and improvement of TF-IDF algorithm based on information theory,” Proc. Int. Conf. Comput. Eng. Netw. (CENet), vol. 13, no. 6, pp. 608–616, April, 2020, DOI: 10.1007/978-3-030-14680-1_67.

X. Ao, X. Yu, D. Liu, and H. Tian. “News keywords extraction algorithm based on TextRank and classified TF-IDF,” Proc. Int. Wireless Commun. Mob. Comput. (IWCMC), vol. 15, no. 6, pp. 1364–1369, June, 2020, DOI: 10.1109/IWCMC48107.2020.9148491.

W. Zhuohao, W. Dong, and L. Qing. “Keyword Extraction from Scientific Research Projects Based on SRP-TF-IDF,” Chin. J. Electron., vol. 30, no. 4, pp. 652–657, April, 2021, DOI: 10.1049/cje.2021.05.007.

R. Rawat, V. Mahor, S. Chirgaiya, R. N. Shaw, and A. Ghosh. “Analysis of darknet traffic for criminal activities detection using TF-IDF and light gradient boosted machine learning algorithm,” Innov. Electr. Electron. Eng., vol. 2021, no. 5, pp. 671–681, May, 2021, DOI: 10.1007/978-981-16-0749-3_53.

S. Rahman, K. H. Talukder, and S. K. Mithila. “An empirical study to detect cyberbullying with TF-IDF and machine learning algorithms,” Proc. Int. Conf. Electron. Commun. Inf. Tech. (ICECIT), vol. 2021, no. 14, pp. 1–4, September, 2021, DOI: 10.1109/ICECIT54077.2021.9641251.

H. Yu, Y. Ji, and Q. Li. “Student sentiment classification model based on GRU neural network and TF-IDF algorithm,” J. Intell. Fuzzy Syst., vol. 40, no. 2, pp. 2301–2311, February, 2021, DOI: 10.3233/JIFS-189227.

Z. Jiang, B. Gao, Y. He, Y. Han, P. Doyle, and Q. Zhu. “Text classification using novel term weighting scheme-based improved TF-IDF for internet media reports,” Math. Probl. Eng., vol. 2021, no. 4, pp. 1–30, Mar, 2021, DOI: 10.1155/2021/6619088.

T. Korkmaz, A. Çetinkaya, H. Aydın, and M. A. Barışkan. “Analysis of whether news on the Internet is real or fake by using deep learning methods and the TF-IDF algorithm,” Int. Adv. Res. Eng. J., vol. 5, no. 1, pp. 31–41, April, 2021, DOI: 10.35860/iarej.779019.

M. Mohammed and N. Omar. “Question classification based on Bloom’s taxonomy cognitive domain using modified TF-IDF and word2vec,” PLoS ONE, vol. 15, no. 3, pp. 19–20, March, 2020, DOI: 10.1371/journal.pone.0230442.

V. D. Antonio, S. Efendi, and H. Mawengkang. “Sentiment analysis for covid-19 in Indonesia on Twitter with TF-IDF featured extraction and stochastic gradient descent,” Int. J. Nonlinear Anal. Appl., vol. 13, no. 1, pp. 1367–1373, January, 2022, DOI: 10.22075/IJNAA.2021.5735.

G. Yunanda, D. Nurjanah, and S. Meliana. “Recommendation system from microsoft news data using TF-IDF and cosine similarity methods,” Build. Inform. Tech. Sci. (BITS), vol. 4, no. 1, pp. 277–284, Jun, 2022, DOI: 10.47065/bits.v4i1.1670.

S. Amin, M. I. Uddin, S. Hassan, A. Khan, N. Nasser, A. Alharbi, and H. Alyami. “Recurrent neural networks with TF-IDF embedding technique for detection and classification in tweets of dengue disease,” IEEE Access, vol. 8, no. 5, pp. 131522–131533, May, 2020, DOI: 10.1109/ACCESS.2020.3009058.

Y. Li, and H. Ning. “Multi-feature keyword extraction method based on TF-IDF and Chinese grammar analysis,” Proc. Int. Conf. Mach. Learn. Intell. Syst. Eng. (MLISE), vol. 2021, no. 9, pp. 362–365, November, 2021, DOI: 10.1109/MLISE54096.2021.00075.

J. Li. “A comparative study of keyword extraction algorithms for English texts,” J. Intell. Syst., vol. 30, no. 1, pp. 808–815, Jul, 2021, DOI: 10.1515/jisys-2021-0040.

J. Qin, Z. Zhou, Y. Tan, X., and Z. He. “A big data text coverless information hiding based on topic distribution and TF-IDF,” Int. J. Digit. Crime Forensics, vol. 13, no. 4, pp. 40–56, Jul, 2021, DOI: 10.4018/IJDCF.20210701.oa4.

I. Ghozali, M. F. Asy’ari, S. Triarjo, H. M. Ramadhani, H. Studiawan, and A. M. Shiddiqi. “A Novel SQL Injection Detection Using Bi-LSTM and TF-IDF,” Proc. 7th Int. Conf. Inf. Netw. Technol. (ICINT), vol. 21, no. 6, pp. 16–22, May, 2022, DOI: 10.1109/ICINT55083.2022.00010.

G. Di Gennaro, A. Buonanno, and F. A. Palmieri. “Considerations about learning Word2Vec,” J. Supercomput., vol. 2021, no. 1, pp. 1–6, 2021, DOI: 10.1007/s11227-021-03743-2.

D. E. Cahyani and I. Patasik. “Performance comparison of tf-idf and word2vec models for emotion text classification,” Bull. Electr. Eng. Inform., vol. 10, no. 5, pp. 2780–2788, October, 2021, DOI: 10.11591/eei.v10i5.3157.

B. Jang, M. Kim, G. Harerimana, S. U. Kang, and J. W. Kim. “Bi-LSTM model to increase accuracy in text classification: Combining Word2vec CNN and attention mechanism,” Appl. Sci., vol. 10, no. 17, pp. 5841–5842, August, 2020, DOI: 10.3390/app10175841.

S. Thavareesan and S. Mahesan. “Sentiment lexicon expansion using Word2vec and fastText for sentiment prediction in Tamil texts,” Proc. Moratuwa Eng. Res. Conf. (MERCon), vol. 2020, no. 28, pp. 272–276, July, 2020, DOI: 10.1109/MERCon50084.2020.9185369.

R. Kurnia, Y. Tangkuman, and A. Girsang. “Classification of user comment using word2vec and SVM classifier,” Int. J. Adv. Trends Comput. Sci. Eng., vol. 9, no. 1, pp. 643–648, February, 2020, DOI: 10.30534/ijatcse/2020/90912020.

A. Mallik and S. Kumar. “Word2Vec and LSTM based deep learning technique for context-free fake news detection,” Multimed. Tools Appl., vol. 83, no. 1, pp. 919–940, January, 2024, DOI: 10.1007/s11042-023-15364-3.

P. Rakshit and A. Sarkar. “A supervised deep learning-based sentiment analysis by the implementation of Word2Vec and GloVe Embedding techniques,” Multimed. Tools Appl., vol. 202, no. 9, pp. 1–34, April, 2024, DOI: 10.1007/s11042-024-19045-7.

P. Preethi and H. R. Mamatha, “Region-Based Convolutional Neural Network for Segmenting Text in Epigraphical Images,” Artif. Intell. Appl., vol. 1, no. 2, pp. 119–127, September, 2023, DOI: 10.47852/bonviewAIA2202293.

Downloads

Published

2025-08-26

How to Cite

Feng, X. . (2025). Web Crawling Algorithm Fusing TF-IDF and Word2Vec Feature Extraction. Journal of Web Engineering, 24(05), 713–738. https://doi.org/10.13052/jwe1540-9589.2452

Issue

Section

Articles