ONTOLOGY-ASSISTED DISCOVERY OF HIERARCHICAL TOPIC CLUSTERS ON THE SOCIAL WEB
Keywords:
Ontology, hierarchical clustering, topic modeling, community detection, Social Web
Communicated by: M. Gaedke & C. Bizer
Abstract
Discovering and clustering users by their topics of interest on the Social Web can enhance various applications, such as user recommendation and expert finding. Traditional approaches, such as topic modeling based on latent semantic analysis or k-means document clustering, run into issues when content is sparse, when the number of existing topics is unknown, or when the topics sought are hierarchical in nature. In this paper, we propose a method for ontology-assisted topic clustering, in which we map Social Web user content to ontological classes to overcome sparsity. Using a novel ranking technique for calculating the topical similarity between individuals at different topic scopes, we construct graphs on which we apply a quasi-clique algorithm to find topic clusters at that scope, without having to pre-define a target number of topics. Our approach allows (1) the topic scope to be controlled in order to discover general or specific topics; (2) the automatic labeling of clusters with tags that are both human- and machine-understandable; and (3) graphs to be clustered recursively in order to generate a hierarchy of topics. The approach is evaluated against ground truths of Twitter users and the 20-newsgroups dataset, which is commonly used in document clustering research. We compare our approach to standard and Twitter-specific latent Dirichlet allocation (LDA), hierarchical LDA, and standard and hierarchical k-means clustering. Results show that our method outperforms regular LDA by up to 24.7%, Twitter-LDA by up to 11.9%, and k-means by up to 26.7% on Social Web content. On a dataset of traditional documents, it performs comparably to these approaches, depending on several factors. Additionally, our method can discover the appropriate number and composition of topics at a given topic scope, whereas k-means clustering cannot account for differences in scope.
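The pipeline summarized above (users mapped to ontology classes, a similarity graph, clusters found without a preset topic count, clusters labeled with shared classes) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: Jaccard overlap stands in for the paper's ranking technique, which the abstract does not specify; a simple greedy density heuristic stands in for the quasi-clique algorithm; and the user names, class names, and the `threshold` and `gamma` parameters are hypothetical.

```python
from itertools import combinations

# Hypothetical users, each already mapped to a set of ontology classes.
profiles = {
    "alice": {"Sport", "SoccerPlayer", "SoccerClub"},
    "bob": {"Sport", "SoccerPlayer", "SoccerClub"},
    "carol": {"Sport", "SoccerClub"},
    "dave": {"MusicalArtist", "Album"},
    "erin": {"MusicalArtist", "Album", "Band"},
}

def jaccard(a, b):
    """Assumed stand-in similarity: overlap of two class sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_graph(profiles, threshold):
    """Connect two users when their similarity reaches the threshold."""
    graph = {u: set() for u in profiles}
    for u, v in combinations(profiles, 2):
        if jaccard(profiles[u], profiles[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph

def density(graph, nodes):
    """Fraction of possible edges that are present among the nodes."""
    n = len(nodes)
    if n < 2:
        return 1.0
    edges = sum(1 for u, v in combinations(nodes, 2) if v in graph[u])
    return edges / (n * (n - 1) / 2)

def greedy_quasi_cliques(graph, gamma):
    """Greedy heuristic: grow a cluster while its density stays >= gamma."""
    remaining = set(graph)
    clusters = []
    while remaining:
        # Seed with the best-connected remaining user (name breaks ties).
        seed = max(sorted(remaining), key=lambda u: len(graph[u] & remaining))
        cluster = {seed}
        while True:
            best, best_density = None, -1.0
            for cand in sorted(remaining - cluster):
                if not graph[cand] & cluster:
                    continue  # only consider users adjacent to the cluster
                d = density(graph, cluster | {cand})
                if d >= gamma and d > best_density:
                    best, best_density = cand, d
            if best is None:
                break
            cluster.add(best)
        clusters.append(cluster)
        remaining -= cluster
    return clusters

def label(profiles, cluster):
    """Label a cluster with the classes shared by all of its members."""
    return set.intersection(*(profiles[u] for u in cluster))

graph = build_graph(profiles, threshold=0.5)
clusters = greedy_quasi_cliques(graph, gamma=0.8)
labels = [label(profiles, c) for c in clusters]
```

In this sketch, raising or lowering `threshold` loosely plays the role of narrowing or widening the topic scope, and the number of clusters falls out of the graph structure rather than being fixed in advance. Rerunning the same steps on the members of a single cluster with a stricter threshold would be one way to approximate the recursive, hierarchical variant.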