Dynamic Query Processing for Hidden Web Data Extraction From Academic Domain
Documents on the World Wide Web (WWW) can be classified as belonging to either the surface web or the hidden web. Surface web documents can be crawled and indexed by search engines and can therefore be returned in response to a user’s query. Hidden web documents, in contrast, can be neither crawled nor indexed by traditional search engines, owing to disconnected URLs, noindex tags, user authentication requirements, and content that is reachable only by submitting web forms. Moreover, because the relevant information is scattered across multiple web pages, users find it difficult to hop between pages to locate what they need. There is therefore a pressing need for hidden web crawlers that can extract data from hidden web databases and uncover this large part of the WWW. In this research, a novel framework, “Dynamic Query Processing for Hidden Web Data Extraction (DQPHDE)”, is proposed to extract such hidden web data and integrate it with data from the surface web to meet users’ requirements. DQPHDE uses clustering, semantic-based text mining, and a fuzzy rule-based system to carry out this task. The results of the proposed work were compared with existing academic search engines such as ‘Microsoft Academic’ and ‘Academia.edu’; the proposed framework outperforms them in fetching information and in integrating related information from other pages.