A STRUCTURAL APPROACH TO EXTRACTING CHINESE POSITION RELATIONS FROM WEB PAGES
Keywords:
Position Relation, Relation Extraction, Structural File Segment
Abstract
Position relations, which describe the positions people hold in an organization, are a valuable source of competitive intelligence for enterprises. The rapid growth of data on the Web creates new opportunities to extract position relations of interest. In this paper, we propose a new algorithm for extracting position relations from the Web. The algorithm exploits a structural feature of position relations on the Web: a position relation is usually presented in a Web page as a table or a list. To capture this structural feature, we first introduce a structural coefficient for each Web page, which is then used to generate structural file segments for the page. A structural file segment consists of all candidate position relations that share a similar structure. We then employ a pattern-matching method to extract position relations from the structural file segments. Finally, we conduct experiments on a real data set of 6028 Chinese Web pages gathered through the Baidu search engine and evaluate the precision and recall of our approach. The experimental results show that our algorithm achieves a precision of over 96% and a recall of over 87%.
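The pipeline summarized above (group similarly structured rows into segments, then pattern-match each segment) can be illustrated with a minimal Python sketch. The abstract does not define the structural coefficient, so this sketch substitutes a simple delimiter-based signature as a stand-in. The names `signature`, `segments`, `extract`, and `ROW_PATTERN`, the `min_rows` threshold, and the "position: person" row layout are all assumptions made for illustration, not the authors' method.

```python
import re
from itertools import groupby

# Hypothetical row pattern: "<position><colon><person>", with an ASCII or full-width colon.
ROW_PATTERN = re.compile(r"^(?P<position>[^:：\s]+)\s*[:：]\s*(?P<person>[^:：\s]+)$")


def signature(line):
    """Coarse structural signature (an assumed stand-in for the structural coefficient).

    Every run of non-delimiter characters collapses to 'T'; delimiters are kept.
    """
    return re.sub(r"[^:：|,，\t]+", "T", line.strip())


def segments(lines, min_rows=2):
    """Group consecutive non-empty lines that share a signature.

    Groups with fewer than min_rows rows, or with no delimiter at all, are dropped.
    """
    result = []
    non_empty = (line for line in lines if line.strip())
    for sig, group in groupby(non_empty, key=signature):
        rows = list(group)
        if len(rows) >= min_rows and sig != "T":
            result.append(rows)
    return result


def extract(lines):
    """Return (person, position) pairs matched inside the structural segments."""
    pairs = []
    for segment in segments(lines):
        for row in segment:
            match = ROW_PATTERN.match(row.strip())
            if match:
                pairs.append((match.group("person"), match.group("position")))
    return pairs


if __name__ == "__main__":
    page_text = ["公司简介", "董事长：张三", "总经理：李四", "联系我们"]
    print(extract(page_text))
```

Requiring at least two rows with the same signature is one way to reflect the observation that position relations tend to appear in tables or lists rather than in isolated lines; a real system would presumably work on the page's markup structure rather than on flat text lines.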