Research on the Methods and Key Techniques of Web Archive Oriented Social Media Information Collection




Social media, web archive, information collection, key techniques, long-term preservation


The collection and preservation of social media information is an active topic in web archiving. This paper compares the different methods for collecting social media information, analyzes the key techniques in the three main stages of the collection process (collection, evaluation, and preservation), and proposes solutions to the problems that arise in them. The analysis identifies the collection method best suited to social media information. The proposed solutions include a multiplexing mechanism to work around the limits that social websites impose on API call frequency, the naive Bayesian algorithm for spam filtering, and MongoDB-based distributed storage for the large volume of collected data.
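As an illustration of the spam-filtering step named in the abstract, the following is a minimal sketch of a multinomial naive Bayes classifier with Laplace smoothing. It is not the paper's implementation: the class name `NaiveBayesSpamFilter`, the whitespace tokenizer, the `"spam"`/`"ham"` labels, and the toy posts are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real posts
    # (especially Chinese microblogs) would need a proper segmenter.
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Hypothetical multinomial naive Bayes filter with Laplace smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        if min(self.doc_counts.values()) == 0:
            raise ValueError("each class needs at least one training post")
        total_docs = sum(self.doc_counts.values())
        # log P(label): class prior estimated from training frequencies
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocab)
        for tok in tokenize(text):
            if tok in self.vocab:  # words never seen in training are ignored
                count = self.word_counts[label][tok]
                # log P(tok | label) with Laplace smoothing
                score += math.log((count + self.alpha) / denom)
        return score

    def classify(self, text):
        # Pick the label with the highest posterior log-score.
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


if __name__ == "__main__":
    spam_filter = NaiveBayesSpamFilter()
    spam_filter.train("win free prize click link", "spam")
    spam_filter.train("free followers click now", "spam")
    spam_filter.train("library archive preserves microblog posts", "ham")
    spam_filter.train("new exhibition at the city archive", "ham")
    print(spam_filter.classify("click for free prize"))
    print(spam_filter.classify("the archive preserves posts"))
```

In a collection pipeline, a filter of this kind would sit in the evaluation stage, so that posts classified as spam are discarded before they reach storage. A production system would train on a labeled corpus rather than a handful of toy posts.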



Author Biography

Xinping Huang, School of Management, Jilin University, Changchun 130012, China

Xinping Huang received the Ph.D. degree in information science from Jilin University, Changchun, China, in 2017. He is currently an associate professor with the Department of Information Management, School of Management, Jilin University, China. His current research interests include information management systems and web archiving.

