AN AUTOMATIC WEB NEWS ARTICLE CONTENTS EXTRACTION SYSTEM BASED ON RSS FEEDS
Keywords:
Web News Article, Information Extraction, RSS Feed
Abstract
Extracting the contents of Web news articles is vital for news indexing and search services. Most traditional methods must analyze the layout of news pages to generate wrappers, either manually or automatically. This work is costly and requires considerable maintenance when extraction runs over a long period of time. In this paper, we construct an automatic Web news article contents extraction system based on RSS feeds. We propose an effective and efficient algorithm that extracts the news article contents from news pages without analyzing the news sites before extraction. To detect the news article contents, we calculate the relevance between the news title and each sentence in the news page. Our approach is applicable to general types of news RSS feeds and is independent of news page layout. Our experimental results show that our approach can extract the news article contents automatically, accurately, and consistently.
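The title-to-sentence relevance idea can be illustrated with a minimal sketch. The abstract does not specify the relevance measure, so the word-overlap score below is an assumption made purely for illustration, as are the function names, the threshold value, and the sample data. The title is assumed to come from the RSS feed item that links to the news page.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(title, sentence):
    """Fraction of distinct title words that also occur in the sentence.

    This overlap score is an illustrative assumption, not the measure
    used in the paper.
    """
    title_words = set(tokenize(title))
    if not title_words:
        return 0.0
    return len(title_words & set(tokenize(sentence))) / len(title_words)


def extract_candidates(title, sentences, threshold=0.15):
    """Keep the sentences whose relevance to the title reaches the threshold.

    The threshold is a hypothetical tuning parameter.
    """
    return [s for s in sentences if relevance(title, s) >= threshold]


# Hypothetical example: a title taken from an RSS item and the
# sentences found on the linked news page.
title = "City council approves new bridge"
sentences = [
    "Home | World | Sports | Weather",
    "The city council approved the new bridge on Monday.",
    "Subscribe to our newsletter for daily updates.",
    "Construction of the bridge will start next year.",
]
candidates = extract_candidates(title, sentences)
for sentence in candidates:
    print(sentence)
```

In this sketch the navigation bar and the newsletter prompt share no words with the title and are dropped, while the two article sentences are kept. A practical system would likely need a more robust measure, since body sentences that never repeat a title word would score zero here.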