AN AUTOMATIC WEB NEWS ARTICLE CONTENTS EXTRACTION SYSTEM BASED ON RSS FEEDS

Authors

  • HAO HAN, Department of Computer Science, Tokyo Institute of Technology, Ookayama 2-12-1-W8-71, Meguro, Tokyo 152-8552, Japan
  • TOMOYA NORO, Department of Computer Science, Tokyo Institute of Technology, Ookayama 2-12-1-W8-71, Meguro, Tokyo 152-8552, Japan
  • TAKEHIRO TOKUDA, Department of Computer Science, Tokyo Institute of Technology, Ookayama 2-12-1-W8-71, Meguro, Tokyo 152-8552, Japan

Keywords:

Web News Article, Information Extraction, RSS Feed

Abstract

Extracting the contents of Web news articles is vital for news indexing and search services. Most traditional methods must analyze the layout of news pages in order to generate wrappers, either manually or automatically. This is costly and requires considerable maintenance when extraction runs over a long period of time. In this paper, we construct an automatic Web news article contents extraction system based on RSS feeds. We propose an effective and efficient algorithm that extracts news article contents from news pages without analyzing the news sites beforehand. To detect the news article contents, we calculate the relevance between the news title and each sentence in the news page. Our approach is applicable to general types of news RSS feeds and is independent of news page layout. Our experimental results show that our approach can extract news article contents automatically, accurately, and consistently.
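
The idea of scoring each sentence against the news title can be illustrated with a minimal Python sketch. The abstract does not specify the relevance measure, so the word-overlap score, the function names (`tokenize`, `relevance`, `extract_candidates`), and the 0.3 threshold are all assumptions made for illustration, not the authors' algorithm.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(title, sentence):
    """Assumed measure: fraction of distinct title words found in the sentence."""
    title_words = set(tokenize(title))
    if not title_words:
        return 0.0
    shared = title_words & set(tokenize(sentence))
    return len(shared) / len(title_words)


def extract_candidates(title, sentences, threshold=0.3):
    """Keep sentences whose relevance to the title meets the (assumed) threshold."""
    return [s for s in sentences if relevance(title, s) >= threshold]
```

In a system of this kind the title would come from the RSS item and the sentences from the linked news page; navigation text and copyright notices tend to share few words with the title and so score low under this simple measure.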

 




Published

2009-11-30

How to Cite

HAN, H., NORO, T., & TOKUDA, T. (2009). AN AUTOMATIC WEB NEWS ARTICLE CONTENTS EXTRACTION SYSTEM BASED ON RSS FEEDS. Journal of Web Engineering, 8(3), 268–284. Retrieved from https://journals.riverpublishers.com/index.php/JWE/article/view/4055
