A K-means Text Clustering Algorithm Based on Subject Feature Vector
Keywords: k-means, initial points, decision graph, iterative class center, subject feature vector.
As one of the most popular clustering algorithms, k-means is sensitive to the choice of initial points and the number of clusters. In addition, computing the iterative class center as the mean of all points in a cluster is one of the factors that degrade clustering performance. In this paper, representative initial points are selected according to a decision graph built from the local density and distance of each point. We then propose an improved k-means text clustering algorithm in which the iterative class center is composed of a subject feature vector, which avoids the influence of noise. Experiments show that the initial points are selected successfully and that the clustering results improve by 3%, 5%, 2% and 7%, respectively, over the traditional k-means clustering algorithm on four experimental corpora from Fudan and Sougou.
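The initial-point selection summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Euclidean distance, a hard cutoff `d_c` for counting neighbors, and a ranking by the product of local density and distance. The function name `select_initial_points` and its parameters are hypothetical.

```python
import numpy as np

def select_initial_points(X, k, d_c):
    """Return indices of k candidate initial points (illustrative sketch only)."""
    # Pairwise Euclidean distances (assumption: the paper may use another measure).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Local density: number of other points closer than the cutoff d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    n = len(X)
    delta = np.empty(n)
    for i in range(n):
        # Distance to the nearest point that has a strictly higher density.
        denser = np.where(rho > rho[i])[0]
        if denser.size:
            delta[i] = dist[i, denser].min()
        else:
            # Densest point(s): use the largest distance to any other point.
            delta[i] = dist[i].max()
    # Points with both high density and high distance stand out in the
    # decision graph; rank them by the product (an assumed scoring rule).
    gamma = rho * delta
    return np.argsort(-gamma)[:k]
```

Points that are dense but far from any denser point appear as outliers in the upper right of the decision graph, which is why they are plausible cluster representatives. In this sketch, ties in density are not broken, so a real implementation would need a tie-breaking rule.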