A K-means Text Clustering Algorithm Based on Subject Feature Vector

Authors

  • Ji Duo, Criminal Investigation Police University of China, China, https://orcid.org/0000-0003-0959-4573
  • Peng Zhang, Institute of Information Engineering, Chinese Academy of Sciences, China
  • Liu Hao, Criminal Investigation Police University of China, China

DOI:

https://doi.org/10.13052/jwe1540-9589.20612

Keywords:

K-means, initial points, decision graph, iterative class center, subject feature vector.

Abstract

K-means is one of the most popular clustering algorithms, but its results are sensitive to the choice of initial points and to the number of clusters. In addition, the iterative class center, computed as the mean of all points in a cluster, is itself one of the factors that affect clustering performance. In this paper, representative initial points are selected according to a decision graph built from the local density and distance of each point. We then propose an improved k-means text clustering algorithm in which the iterative class center is composed of a subject feature vector, which avoids the influence of noise. Experiments show that the initial points are selected successfully and that the clustering results improve by 3%, 5%, 2% and 7%, respectively, over the traditional k-means clustering algorithm on four experimental corpora from Fudan and Sougou.
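
The initial-point selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a density-peaks style decision graph in which each point gets a local density `rho` (the number of neighbours within a cutoff `d_c`) and a distance `delta` (the distance to the nearest point of higher density), and it ranks points by the product `rho * delta`. The function names, the Euclidean metric, the cutoff kernel and the ranking rule are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def decision_graph(X, d_c):
    """Return (rho, delta) for every row of X (assumed definitions).

    rho[i]   : number of other points closer than the cutoff d_c.
    delta[i] : distance to the nearest point that ranks higher in density;
               for the top-ranked point, its largest distance to any point.
    """
    # Pairwise Euclidean distances (an assumption; text data often uses cosine).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (dist < d_c).sum(axis=1) - 1  # subtract 1 to exclude the point itself

    # Rank points by decreasing density; a stable sort breaks ties by index.
    order = np.argsort(-rho, kind="stable")
    delta = np.empty(len(X))
    delta[order[0]] = dist[order[0]].max()
    for pos in range(1, len(X)):
        i = order[pos]
        delta[i] = dist[i, order[:pos]].min()
    return rho, delta


def select_initial_points(X, k, d_c):
    """Pick the k points with the largest rho * delta as initial centers."""
    rho, delta = decision_graph(X, d_c)
    gamma = rho * delta
    return np.argsort(gamma)[::-1][:k]
```

Points that are both dense and far from any denser point stand out in the upper right of the decision graph, so the returned indices could be passed as the starting centers of a standard k-means run. The cutoff `d_c` has to be tuned to the data.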

Author Biographies

Ji Duo, Criminal Investigation Police University of China, China

Ji Duo received his M.En degree from Northeast University. He is an associate professor in the Department of Cyber Crime Investigation, Criminal Investigation Police University of China. His research interests mainly include text mining, machine translation, and network public opinion analysis. His research results have been published in more than 20 academic journals and conferences in China and abroad, and he has won the First Prize of the Liaoning Science and Technology Progress Award and the First Prize of the Aviation Science and Technology Progress Award of the China Aviation Society.

Peng Zhang, Institute of Information Engineering, Chinese Academy of Sciences, China

Peng Zhang received his PhD degree from the Institute of Computing Technology, Chinese Academy of Sciences. He is an associate professor at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests mainly include social computing and data mining. His research results have been published in more than 60 academic journals and conferences in China and abroad, and he is a member of the Youth Innovation Promotion Association of the Chinese Academy of Sciences.

Liu Hao, Criminal Investigation Police University of China, China

Liu Hao received his M.En degree from Criminal Investigation Police University of China. He is a senior experimentalist in the Network Information Center, Criminal Investigation Police University of China. His research interests mainly include smart campus construction and campus information.

Published

2021-10-18

Section

Articles