UNSUPERVISED KEYWORD EXTRACTION FROM MICROBLOG POSTS VIA HASHTAGS

Authors

  • LIN LI, School of Computer Science & Technology, Wuhan University of Technology; Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology, Wuhan, 430070, China
  • JINHANG LIU, School of Computer Science & Technology, Wuhan University of Technology, Wuhan, 430070, China
  • YUEQING SUN, School of Computer Science & Technology, Wuhan University of Technology, Wuhan, 430070, China
  • GUANDONG XU, Advanced Analytics Institute, University of Technology, Sydney, NSW 2007, Australia
  • JINGLING YUAN, School of Computer Science & Technology, Wuhan University of Technology, Wuhan, 430070, China
  • LUO ZHONG, School of Computer Science & Technology, Wuhan University of Technology, Wuhan, 430070, China

Keywords:

Keyword Extraction, Microblog Post, Hashtag, Topic Model, Random Walk

Abstract

Nowadays, huge amounts of text are generated for social networking purposes on the Web. Keyword extraction from such texts, for example microblog posts, benefits many applications such as advertising, search, and content filtering. Unlike traditional web pages, a microblog post usually carries special social features such as hashtags, which are topical in nature and generated by users. Extracting keywords related to hashtags can reflect the intents of users and thus provides a better understanding of post content. In this paper, we propose a novel unsupervised keyword extraction approach for microblog posts that treats hashtags as topical indicators. Our approach consists of two hashtag-enhanced algorithms. The first is a topic model algorithm that infers topic distributions biased toward hashtags over a collection of microblog posts; words are ranked by their average topic probabilities. This algorithm not only finds the topics of a collection but also extracts hashtag-related keywords. The second is a random-walk-based algorithm. It first builds a weighted word-post graph that takes the posts themselves into account. A hashtag-biased random walk is then applied to this graph, which guides the algorithm to extract keywords according to hashtag topics. Finally, the ranking score of a word is given by its stationary probability after a number of iterations. We evaluate the proposed approach on a collection of real Chinese microblog posts. Experiments show that our approach achieves higher precision than traditional approaches that do not consider hashtags. Combining the two algorithms performs even better than either algorithm alone.
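To make the random-walk idea in the abstract concrete, the following is a minimal Python sketch of a hashtag-biased walk on a word-post graph. It is not the authors' implementation: the abstract does not specify the edge weights, the restart scheme, or the damping factor, so the term-frequency weights, the restart distribution concentrated on hashtag words, the `damping=0.85` default, and the function name `hashtag_biased_walk` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def hashtag_biased_walk(posts, hashtags, damping=0.85, iterations=50):
    """Score words by a random walk with restart on a word-post graph.

    posts    : list of token lists, one per microblog post
    hashtags : set of tokens treated as hashtag (topic) seeds
    All weighting choices here are illustrative assumptions.
    """
    # Build an undirected bipartite graph: word nodes <-> post nodes.
    # Edge weight = term frequency of the word in the post (assumption).
    weights = defaultdict(dict)
    for i, tokens in enumerate(posts):
        for word, tf in Counter(tokens).items():
            weights[("w", word)][("p", i)] = tf
            weights[("p", i)][("w", word)] = tf
    nodes = list(weights)

    # Restart distribution: uniform over hashtag word nodes (assumption).
    # Without any hashtag in the graph, fall back to a uniform restart.
    seeds = [n for n in nodes if n[0] == "w" and n[1] in hashtags] or nodes
    restart = {n: 0.0 for n in nodes}
    for n in seeds:
        restart[n] = 1.0 / len(seeds)

    # Power iteration toward the stationary distribution.
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for u in nodes:
            total = sum(weights[u].values())
            for v, w in weights[u].items():
                nxt[v] += damping * score[u] * w / total
        score = nxt

    # Keep word nodes only; post nodes are intermediate.
    return {n[1]: s for n, s in score.items() if n[0] == "w"}


if __name__ == "__main__":
    posts = [
        ["#ai", "model", "training"],
        ["#ai", "model"],
        ["lunch", "pizza"],
    ]
    scores = hashtag_biased_walk(posts, {"#ai"})
    for word in sorted(scores, key=scores.get, reverse=True):
        print(f"{word}\t{scores[word]:.4f}")
```

In this toy input, words that share posts with the hashtag accumulate probability mass, while words in posts unconnected to any hashtag lose it at each restart. That is the sense in which the walk is "biased" toward hashtag topics.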

Published

2018-01-01
