UNSUPERVISED KEYWORD EXTRACTION FROM MICROBLOG POSTS VIA HASHTAGS
Keywords:
Keyword Extraction, Microblog Post, Hashtag, Topic Model, Random Walk
Abstract
Huge amounts of text are now generated for social networking purposes on the Web. Keyword extraction from such texts, for example microblog posts, benefits many applications such as advertising, search, and content filtering. Unlike traditional web pages, a microblog post usually has special social features such as hashtags, which are topical in nature and generated by users. Extracting keywords related to hashtags can reflect the intents of users and thus provides a better understanding of post content. In this paper, we propose a novel unsupervised keyword extraction approach for microblog posts that treats hashtags as topical indicators. Our approach consists of two hashtag-enhanced algorithms. The first is a topic model algorithm that infers topic distributions biased toward hashtags on a collection of microblog posts; words are ranked by their average topic probabilities. This algorithm can not only find the topics of a collection but also extract hashtag-related keywords. The second is a random-walk-based algorithm. It first builds a word-post weighted graph by taking into account the posts themselves. Then a hashtag-biased random walk is applied on this graph, which guides the algorithm to extract keywords according to hashtag topics. Finally, the ranking score of a word is determined by its stationary probability after a number of iterations. We evaluate the proposed approach on a collection of real Chinese microblog posts. Experiments show that our approach is more effective in terms of precision than traditional approaches that do not consider hashtags. Combining the two algorithms performs even better than either algorithm alone.
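To make the random-walk idea concrete, the following is a minimal Python sketch of a hashtag-biased walk on a word-post graph. It is an illustration under stated assumptions, not the algorithm from the paper: the abstract does not specify the edge weights, the restart scheme, or the damping factor, so the term-frequency weights, the uniform restart over hashtag nodes, the value 0.85, and the function name `hashtag_biased_walk` are all hypothetical choices.

```python
from collections import defaultdict


def hashtag_biased_walk(posts, hashtags, damping=0.85, iterations=50):
    """Rank words by a random walk that restarts at hashtag nodes.

    posts:    list of token lists, one list per microblog post
    hashtags: set of tokens treated as topical restart nodes (assumption)
    """
    # Build an undirected bipartite graph between word nodes and post nodes.
    # Edge weight = term frequency of the word in the post (assumed weighting).
    weights = defaultdict(dict)
    for i, tokens in enumerate(posts):
        post = ("post", i)
        for tok in tokens:
            word = ("word", tok)
            weights[post][word] = weights[post].get(word, 0.0) + 1.0
            weights[word][post] = weights[word].get(post, 0.0) + 1.0
    nodes = list(weights)

    # Restart distribution: uniform over hashtag nodes present in the graph.
    # With no matching hashtag, fall back to an unbiased (uniform) restart.
    seeds = [("word", h) for h in hashtags if ("word", h) in weights]
    if not seeds:
        seeds = nodes
    restart = {n: 0.0 for n in nodes}
    for s in seeds:
        restart[s] = 1.0 / len(seeds)

    # Power iteration: each step keeps a (1 - damping) share at the restart
    # nodes and spreads the rest along weighted edges.
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            total = sum(weights[n].values())
            for m, w in weights[n].items():
                nxt[m] += damping * score[n] * w / total
        score = nxt

    # Report word nodes only, highest score first.
    words = {n[1]: s for n, s in score.items() if n[0] == "word"}
    return sorted(words.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    posts = [
        ["#nba", "lakers", "win"],
        ["#nba", "lakers", "trade"],
        ["weather", "rain"],
    ]
    for word, value in hashtag_biased_walk(posts, {"#nba"}):
        print(f"{word}\t{value:.4f}")
```

In this sketch, words that co-occur with the hashtag accumulate probability mass, while words in posts unconnected to any hashtag decay toward zero. The hashtag tokens themselves remain in the ranking and could be filtered out if only ordinary keywords are wanted.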