GenDE: A CRF-Based Data Extractor

Authors

Mohammed Kayed, Khaled Shaalan

DOI:

https://doi.org/10.13052/jwe1540-9589.19342

Keywords:

Wrapper induction, data extractor, wrapper verifier, sequence labeling, CRFs model, JSON data extraction

Abstract

Web site schema detection and data extraction from the Deep Web have been studied extensively. However, few studies have focused on the more challenging tasks of wrapper verification and extractor generation. A wrapper verifier checks whether a new page from a site complies with the detected schema, so that the extractor can use the wrapper to get instances of the schema types. If the wrapper fails on the new page, a new wrapper/schema is regenerated by calling an unsupervised wrapper induction system. In this paper, a new data extractor called GenDE is proposed. It verifies the site schema and extracts data from Web pages using Conditional Random Fields (CRFs). The problem is solved by breaking down an observation sequence (a Web page) into simpler subsequences that are labeled using a CRF. Moreover, the system solves the problem of automatic data extraction from modern JavaScript sites in which data/schema are attached (on the client side) in JSON format. The experiments show encouraging results, with GenDE outperforming the CSP-based extractor algorithm (95% recall and 96% precision). It also performs well when tested on the SWDE benchmark dataset (84.91%).
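To make the idea of labeling page subsequences concrete, the following is a minimal sketch of linear-chain CRF decoding (Viterbi) over a tokenized page. It is an illustration only, not GenDE's implementation: the labels `TEMPLATE` and `DATA`, the single tag/text feature, the hand-set weights, and the use of `<li>` as a subsequence boundary are all assumptions made for the example. A trained CRF would learn its weights from labeled pages.

```python
from typing import Dict, List, Tuple

LABELS = ["TEMPLATE", "DATA"]  # hypothetical label set

# Hypothetical hand-set weights; a real CRF learns these from training pages.
EMISSION: Dict[Tuple[str, str], float] = {
    ("is_tag", "TEMPLATE"): 2.0, ("is_tag", "DATA"): -2.0,
    ("is_text", "TEMPLATE"): -1.0, ("is_text", "DATA"): 1.5,
}
TRANSITION: Dict[Tuple[str, str], float] = {
    ("TEMPLATE", "TEMPLATE"): 0.5, ("TEMPLATE", "DATA"): 0.2,
    ("DATA", "TEMPLATE"): 0.2, ("DATA", "DATA"): 0.1,
}


def feature(token: str) -> str:
    """Single illustrative feature: is the token an HTML tag or text?"""
    return "is_tag" if token.startswith("<") and token.endswith(">") else "is_text"


def viterbi(tokens: List[str]) -> List[str]:
    """Return the highest-scoring label sequence for one subsequence."""
    if not tokens:
        return []
    # score[i][y]: best score of any labeling of tokens[:i+1] ending in label y
    score = [{y: EMISSION[(feature(tokens[0]), y)] for y in LABELS}]
    back: List[Dict[str, str]] = []
    for tok in tokens[1:]:
        f = feature(tok)
        row: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: score[-1][p] + TRANSITION[(p, y)])
            row[y] = score[-1][prev] + TRANSITION[(prev, y)] + EMISSION[(f, y)]
            ptr[y] = prev
        score.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final label.
    path = [max(LABELS, key=lambda y: score[-1][y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def split_on(tokens: List[str], boundary: str) -> List[List[str]]:
    """Break a token sequence into subsequences that each start at `boundary`."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if tok == boundary and current:
            chunks.append(current)
            current = []
        current.append(tok)
    if current:
        chunks.append(current)
    return chunks


# Toy page: two records inside list items (made-up data).
page = ["<li>", "<b>", "Dune", "</b>", "9.99", "</li>",
        "<li>", "<b>", "Emma", "</b>", "4.50", "</li>"]

records = []
for chunk in split_on(page, "<li>"):
    labels = viterbi(chunk)
    records.append([t for t, y in zip(chunk, labels) if y == "DATA"])
```

Decoding each subsequence separately keeps every Viterbi pass short, which is the practical appeal of breaking a long page into simpler pieces before labeling. How GenDE actually chooses its subsequence boundaries and features is described in the full paper, not in this sketch.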

Author Biographies

Mohammed Kayed, Faculty of Computers and Artificial Intelligence, Beni-Suef University, Beni-Suef, Egypt, 62511

Mohammed Kayed received his M.Sc. degree in Computer Science from Minia University, Minia, Egypt, in 2002 and his Ph.D. degree in Computer Science from Beni-Suef University, Beni-Suef, Egypt, in 2007. From 2005 to 2006, he was a Research and Teaching Assistant in the Department of Computer Science and Information Engineering at the National Central University, Taiwan. From 2008 to 2015, he was an Assistant Professor in the IT Department, College of Applied Science, Sultanate of Oman. He is currently an Associate Professor and Head of the Computer Science Department, Faculty of Computers and Artificial Intelligence, Beni-Suef University, Egypt. He is the author of more than 25 articles. His research interests include Web mining, opinion mining, information extraction, and information retrieval.

Khaled Shaalan, Faculty of Engineering and IT, The British University in Dubai, Dubai, UAE

Khaled Shaalan is a full professor of Computer Science/Artificial Intelligence at the British University in Dubai (BUiD), UAE. He is an Honorary Fellow at the School of Informatics, University of Edinburgh (UoE), UK. Over the last two decades, Prof Khaled has contributed to a wide range of research topics in AI, Arabic NLP, knowledge management, health informatics, and educational technology. Prof Khaled has published 200+ refereed publications. His research is cited extensively worldwide, and his Google Scholar h-index is 35+. Prof Khaled has actively and extensively supported the local and international academic community. He serves as a chair of international conferences, an editor of journals and books, a keynote speaker, and an external member of promotion committees, among other roles.

Published

2020-06-13

How to Cite

Kayed, M., & Shaalan, K. (2020). GenDE: A CRF-Based Data Extractor. Journal of Web Engineering, 19(3-4), 371–404. https://doi.org/10.13052/jwe1540-9589.19342

Section

Articles