Dynamic Query Processing for Hidden Web Data Extraction From Academic Domain

Authors

  • Babita Ahuja MRCE, Faridabad, India
  • Anuradha Pillai J.C. Bose University of Science and Technology, YMCA, Faridabad, India
  • Deepika Punj J.C. Bose University of Science and Technology, YMCA, Faridabad, India https://orcid.org/0000-0001-8191-096X
  • Jyoti Verma J.C. Bose University of Science and Technology, YMCA, Faridabad, India https://orcid.org/0000-0002-6271-3001

DOI:

https://doi.org/10.13052/jwe1540-9589.19782

Keywords:

Surface Web, Hidden Web, Dynamic Query Processing, Text Summarization, Semantic Fuzzy Rules

Abstract

Documents on the World Wide Web (WWW) can be classified as belonging to either the surface web or the hidden web. Surface web documents are both crawlable and indexable by search engines, so they can be returned to users in response to their queries. Hidden web documents, in contrast, are neither crawlable nor indexable by traditional search engines because of disconnected URLs, no-index tags, user authentication, and web form processing. Moreover, because the information is scattered across multiple web pages, users find it difficult to hop between pages to find what they need. Hence, there is a dire need for hidden web crawlers that can extract data from hidden web databases and uncover this large part of the WWW. In this research, a novel framework, “Dynamic Query Processing for Hidden Web Data Extraction (DQPHDE)”, is proposed to extract such hidden web data and integrate it with data from the surface web to meet users’ requirements. DQPHDE uses clustering, semantic-based text mining, and a fuzzy rule-based system to carry out this task. The results of the proposed work were compared with existing academic search engines such as ‘Microsoft Academic’ and ‘Academia.edu’; the proposed work outperforms them in fetching information and in integrating related information from other pages.
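The four barriers named in the abstract (disconnected URLs, no-index tags, user authentication, and web form processing) can be illustrated with a small sketch. This is not the DQPHDE framework or any part of it; it is a minimal heuristic, assuming the HTML of a page, its HTTP status code, and a count of known inbound links are already available. The function name `hidden_web_barriers` and its parameters are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser


class _BarrierScanner(HTMLParser):
    """Records markup that tends to keep a page out of conventional indexes."""

    def __init__(self):
        super().__init__()
        self.has_noindex = False
        self.has_form = False
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        # Attribute values may be None (e.g. <input disabled>), so normalise them.
        attr = {name: (value or "").lower() for name, value in attrs}
        if tag == "meta" and attr.get("name") == "robots":
            if "noindex" in attr.get("content", ""):
                self.has_noindex = True
        elif tag == "form":
            self.has_form = True
        elif tag == "input" and attr.get("type") == "password":
            self.has_password_field = True


def hidden_web_barriers(html, status_code=200, inbound_links=1):
    """Return the abstract's barrier names that apply to one page (heuristic only).

    Assumptions: `inbound_links` is supplied by some external link graph,
    and a 401/403 status or a password field is treated as a sign of
    user authentication.
    """
    scanner = _BarrierScanner()
    scanner.feed(html)

    barriers = []
    if inbound_links == 0:
        barriers.append("disconnected URL")
    if scanner.has_noindex:
        barriers.append("no-index tag")
    if status_code in (401, 403) or scanner.has_password_field:
        barriers.append("user authentication")
    if scanner.has_form:
        barriers.append("web form processing")
    return barriers


if __name__ == "__main__":
    page = (
        '<html><head><meta name="robots" content="noindex, nofollow"></head>'
        '<body><form action="/search"><input type="text" name="q"></form>'
        "</body></html>"
    )
    print(hidden_web_barriers(page))
```

A page that returns an empty list would, under these assumptions, be reachable by a conventional crawler; any non-empty result marks it as a candidate for hidden web extraction.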

Author Biographies

Babita Ahuja, MRCE, Faridabad, India

Babita Ahuja received her M.Tech (Computer Science and Engineering) in 2007 and her B.Tech (Computer Science) in 2004, both from Maharishi Dayanand University, Rohtak. She has over 12 years of experience teaching B.Tech and M.Tech courses. Her areas of interest include Operating Systems, Semantic Web, Web Technologies, Search Engines and Hidden Web. She has published 13 research papers in international journals and conferences.

Anuradha Pillai, J.C. Bose University of Science and Technology, YMCA, Faridabad, India

Anuradha Pillai is an Associate Professor in the Department of Computer Engineering, J.C. Bose University of Science and Technology, YMCA, Faridabad, Haryana, India. She received her Ph.D. in Computer Engineering from Maharishi Dayanand University, Rohtak. She has published more than 60 papers in reputed international journals and has successfully guided 4 Ph.D. students. Her subjects of interest include Data Mining, Information Retrieval, Hidden Web, Web Mining and Social Networks.

Deepika Punj, J.C. Bose University of Science and Technology, YMCA, Faridabad, India

Deepika Punj is an Assistant Professor in the Department of Computer Engineering at J.C. Bose University of Science and Technology, YMCA, Faridabad, India. She holds a Ph.D. in Computer Engineering and has 14 years of teaching experience. She has published more than 25 papers in reputed national and international journals. Her research interests include Data Mining, Deep Learning, Machine Learning and Internet Technologies.

Jyoti Verma, J.C. Bose University of Science and Technology, YMCA, Faridabad, India

Jyoti Verma received her Ph.D. in 2011. She has 17 years of teaching experience and 9 years of research experience, and has published 35 research articles in reputed journals. Her research interests include Information Retrieval, Web Mining and Big Data. She is currently an Associate Professor in the Department of Computer Engineering, J.C. Bose University of Science and Technology, Faridabad, India.

Published

2020-12-25

Section

Advanced Practice in Web Engineering