Crawling the Deep Web Using Asynchronous Advantage Actor Critic Technique

Authors

  • Kapil Madan Computer Science & Engineering Department, Punjab Engineering College (Deemed to be University), Sector 12, Chandigarh, India https://orcid.org/0000-0003-3594-0062
  • Rajesh Bhatia Computer Science & Engineering Department, Punjab Engineering College (Deemed to be University), Sector 12, Chandigarh, India

DOI:

https://doi.org/10.13052/jwe1540-9589.20314

Keywords:

Web crawler, deep web, reinforcement learning, A3C

Abstract

In the digital world, the World Wide Web is expanding rapidly. Nowadays, a growing number of data-centric websites require a mechanism to crawl their information. Information reachable through hyperlinks can easily be retrieved by general-purpose search engines, but a massive chunk of structured information remains invisible behind search forms. This hidden content is known as the deep web and is more structured than the surface web. Crawling the content of the deep web is very challenging because it requires filling the search forms with suitable queries. This paper proposes an innovative technique using Asynchronous Advantage Actor-Critic (A3C) to explore unidentified deep web pages. A3C is a policy-gradient deep reinforcement learning technique that parameterizes the policy and the value function based on the reward signal. It has one coordinator and several agents; the agents learn in different environments, push their local gradients to the coordinator, and thereby produce a more stable system. The proposed technique has been validated on the Open Directory Project (ODP). The experimental outcome shows that the proposed technique outperforms most of the prevailing techniques on various metrics such as average precision-recall, average harvest rate, and coverage ratio.
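To make the coordinator/worker update concrete, the following is a minimal sketch of an A3C worker update in PyTorch, assuming a generic discrete-action environment; the ActorCritic network, the worker_update helper, and the rollout format are illustrative assumptions rather than the crawler-specific design used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Shared trunk with a policy (actor) head and a value (critic) head."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy = nn.Linear(hidden, n_actions)  # action logits
        self.value = nn.Linear(hidden, 1)           # state-value estimate

    def forward(self, state):
        h = self.trunk(state)
        return self.policy(h), self.value(h)

def worker_update(local_net, global_net, optimizer, rollout, gamma=0.99):
    """One asynchronous A3C update: compute advantages on a local rollout,
    then push the local gradients to the shared (coordinator) parameters."""
    states, actions, rewards, bootstrap = rollout  # tensors + final value estimate
    logits, values = local_net(states)

    # n-step discounted returns, bootstrapped from the critic's last estimate
    R, returns = bootstrap, []
    for r in reversed(rewards.tolist()):
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(list(reversed(returns)))
    advantage = returns - values.squeeze(-1)

    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen * advantage.detach()).mean()          # actor term
    value_loss = advantage.pow(2).mean()                         # critic term
    entropy = -(log_probs.exp() * log_probs).sum(dim=1).mean()   # exploration bonus
    loss = policy_loss + 0.5 * value_loss - 0.01 * entropy

    local_net.zero_grad()   # clear stale worker gradients
    optimizer.zero_grad()   # clear stale gradients on the coordinator
    loss.backward()
    # Asynchronous step: copy local gradients into the coordinator's network,
    # whose parameters the shared optimizer actually updates.
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp.grad = lp.grad.clone()
    optimizer.step()
    local_net.load_state_dict(global_net.state_dict())  # re-sync the worker

In a full A3C setup, each worker would run this update in its own process (for example via torch.multiprocessing) against a shared optimizer built over global_net.parameters(), which is what yields the stabilizing effect of learning from several environments at once.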


Author Biographies

Kapil Madan, Computer Science & Engineering Department, Punjab Engineering College (Deemed to be University), Sector 12, Chandigarh, India

Kapil Madan is a Ph.D. student in the Department of Computer Science and Engineering at Punjab Engineering College (Deemed to be University), Chandigarh, India. He received his M.E. degree in Software Engineering from the Thapar Institute of Engineering and Technology (Deemed to be University), Patiala, India, and his B.Tech. degree in Computer Engineering from Kurukshetra University, Haryana, India. He has more than 8 years of teaching and research experience. His research areas include information retrieval, focused crawling, and reinforcement learning.

Rajesh Bhatia, Computer Science & Engineering Department, Punjab Engineering College (Deemed to be University), Sector 12, Chandigarh, India

Rajesh Bhatia is currently working as a Professor in the Department of Computer Science and Engineering at Punjab Engineering College (Deemed to be University), Chandigarh, India. He received his Ph.D. and M.E. degrees in Computer Science and Engineering from the Thapar Institute of Engineering and Technology (Deemed to be University), Patiala, India, and his B.Tech. degree from Dr. B. Ambedkar Marathwada University, Aurangabad, India. He has more than 25 years of teaching and research experience. His research areas include automated software debugging, semantic software clone detection, automated test case generation, information retrieval, and search-based software engineering. He is also undertaking various sponsored research projects. He has about 85 research publications in various reputed journals and conferences.

References

M. K. Bergman, “White Paper: The Deep Web: Surfacing Hidden Value,” J. Electron. Publ., vol. 7, no. 1, 2001.

I. Hernández, C. R. Rivero, and D. Ruiz, “Deep Web crawling: a survey,” World Wide Web, vol. 22, no. 4, pp. 1577–1610, Jul. 2019.

M. Kumar, R. Bhatia, and D. Rattan, “A survey of Web crawlers for information retrieval,” WIREs Data Min. Knowl. Discov., vol. 7, no. 6, p. e1218, 2017.

V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” 33rd Int. Conf. Mach. Learn. ICML 2016, vol. 4, pp. 2850–2869, 2016.

B. He, M. Patel, Z. Zhang, and K. C.-C. Chang, “Accessing the deep web,” Commun. ACM, vol. 50, no. 5, pp. 94–101, May 2007.

S. Raghavan and H. Garcia-Molina, “Crawling the Hidden Web,” in 27th VLDB Conference - Roma, Italy, 2001, pp. 1–10.

M. C. Moraes, C. A. Heuser, V. P. Moreira, and D. Barbosa, “Prequery Discovery of Domain-Specific Query Forms: A Survey,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 8, pp. 1830–1848, Aug. 2013.

G. Z. Kantorski, V. P. Moreira, and C. A. Heuser, “Automatic Filling of Hidden Web Forms,” ACM SIGMOD Rec., vol. 44, no. 1, pp. 24–35, May 2015.

Y. Ru and E. Horowitz, “Indexing the invisible web: a survey,” Online Inf. Rev., vol. 29, no. 3, pp. 249–265, 2005.

J. Madhavan, L. Afanasiev, L. Antova, and A. Halevy, “Harnessing the Deep Web: Present and Future,” Syst. Res., vol. 2, no. 2, pp. 50–54, 2009.

J. Madhavan, D. Ko, Ł. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy, “Google’s Deep Web crawl,” Proc. VLDB Endow., vol. 1, no. 2, pp. 1241–1252, Aug. 2008.

A. Ntoulas, P. Zerfos, and J. Cho, “Downloading textual hidden web content through keyword queries,” in Proc. 5th ACM/IEEE-CS Jt. Conf. Digit. Libr. (JCDL ’05), 2005, pp. 100–109.

P. Barrio and L. Gravano, “Sampling strategies for information extraction over the deep web,” Inf. Process. Manag., vol. 53, no. 2, pp. 1339–1351, 2017.

Y. Wang, J. Lu, J. Liang, J. Chen, and J. Liu, “Selecting queries from sample to crawl deep web data sources,” Web Intell. Agent Syst., vol. 10, no. 1, pp. 75–88, 2012.

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.

N. Zhou, J. Du, X. Yao, W. Cui, Z. Xue, and M. Liang, “A content search method for security topics in microblog based on deep reinforcement learning,” World Wide Web, vol. 23, no. 1, pp. 75–101, 2020.

E. Z. Liu, K. Guu, P. Pasupat, T. Shi, and P. Liang, “Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration,” in 6th International Conference on Learning Representations, ICLR 2018 – Conference Track Proceedings, 2018.

M. Kumar and R. Bhatia, “Hidden Webpages Detection Using Distributed Learning Automata,” J. Web Eng., vol. 17, no. 3–4, pp. 270–283, 2018.

J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi, “Learning State Representations for Query Optimization with Deep Reinforcement Learning,” DEEM’18 Int. Work. Data Manag. End-to-End Mach. Learn., 2018.

T. Shi, A. Karpathy, L. Fan, J. Hernandez, and P. Liang, “World of Bits: An open-domain platform for web-based agents,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 4834–4843.

Q. Zheng, Z. Wu, X. Cheng, L. Jiang, and J. Liu, “Learning to crawl deep web,” Inf. Syst., vol. 38, no. 6, pp. 801–819, Sep. 2013.

L. Singh and D. K. Sharma, “An architecture for extracting information from hidden web databases using intelligent agent technology through reinforcement learning,” in 2013 IEEE Conference on Information and Communication Technologies, 2013, pp. 292–297.

A. Asperti, D. Cortesi, and F. Sovrano, “Crawling in Rogue’s Dungeons with (Partitioned) A3C,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11331 LNCS, 2019, pp. 264–275.

S. Yang, B. Yang, H.-S. Wong, and Z. Kang, “Cooperative traffic signal control using Multi-step return and Off-policy Asynchronous Advantage Actor-Critic Graph algorithm,” Knowledge-Based Syst., vol. 183, p. 104855, 2019.

M. Chen, T. Wang, K. Ota, M. Dong, M. Zhao, and A. Liu, “Intelligent resource allocation management for vehicles network: An A3C learning approach,” Comput. Commun., vol. 151, pp. 485–494, 2020.

A. Sharma, Z. Parekh, and P. Talukdar, “Speeding up reinforcement learning-based information extraction training using asynchronous methods,” in EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings, 2017, pp. 2658–2663.

R. J. Williams, “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning,” Mach. Learn., vol. 8, no. 3, pp. 229–256, 1992.

Published

2021-06-10

How to Cite

Madan, K., & Bhatia, R. (2021). Crawling the Deep Web Using Asynchronous Advantage Actor Critic Technique. Journal of Web Engineering, 20(3), 879–902. https://doi.org/10.13052/jwe1540-9589.20314

Section

Articles