Crawling the Deep Web Using the Asynchronous Advantage Actor-Critic Technique
Keywords: Web crawler, deep web, reinforcement learning, A3C
In the digital world, the World Wide Web is expanding rapidly, and a growing number of data-centric websites require a mechanism to crawl their information. Information reachable through hyperlinks can easily be retrieved by general-purpose search engines, but a massive portion of structured information remains invisible behind search forms. This hidden information is known as the deep web and is more structured than the surface web. Crawling deep web content is challenging because it requires filling search forms with suitable queries. This paper proposes an innovative technique that uses the Asynchronous Advantage Actor-Critic (A3C) algorithm to explore unidentified deep web pages. A3C is a policy-gradient deep reinforcement learning method that parameterizes both the policy and the value function and updates them from the reward signal. It comprises one coordinator and several agents: the agents learn in different environments and push their local gradients to the coordinator, yielding a more stable learning system. The proposed technique has been validated on the Open Directory Project (ODP). The experimental results show that it outperforms most prevailing techniques on metrics such as average precision-recall, average harvest rate, and coverage ratio.
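The coordinator/agent structure described above can be illustrated with a minimal sketch. This is not the paper's implementation: it uses a hypothetical one-state "query selection" task (picking one of a few candidate query terms, where only one yields new records), a linear softmax policy, and sequential rather than truly asynchronous workers, purely to show how agents compute local advantage-based gradients and push them to a global coordinator. All names (`Coordinator`, `worker_episode`, `REWARDING_ACTION`) are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for query selection: the agent picks one of N_ACTIONS
# candidate query terms; only REWARDING_ACTION returns new records (reward 1).
N_ACTIONS = 4
REWARDING_ACTION = 2

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class Coordinator:
    """Holds the global policy parameters (theta) and value estimate (w)."""
    def __init__(self, n_actions, lr=0.1):
        self.theta = np.zeros(n_actions)  # policy logits
        self.w = 0.0                      # value of the single state
        self.lr = lr

    def apply_gradients(self, g_theta, g_w):
        # Workers push local gradients; the coordinator updates global weights.
        self.theta += self.lr * g_theta
        self.w += self.lr * g_w

def worker_episode(coord, rng):
    """One worker rollout: pull global weights, act, return local gradients."""
    theta, w = coord.theta.copy(), coord.w
    probs = softmax(theta)
    a = rng.choice(N_ACTIONS, p=probs)
    r = 1.0 if a == REWARDING_ACTION else 0.0
    advantage = r - w                     # one-step advantage: R - V(s)
    g_theta = advantage * (np.eye(N_ACTIONS)[a] - probs)  # policy gradient
    g_w = r - w                           # value-function gradient
    return g_theta, g_w

rng = np.random.default_rng(0)
coord = Coordinator(N_ACTIONS)
for step in range(2000):
    for _ in range(4):  # four "workers" (run sequentially here for clarity)
        coord.apply_gradients(*worker_episode(coord, rng))

# Probability the learned policy assigns to the rewarding query term.
print(softmax(coord.theta)[REWARDING_ACTION])
```

In real A3C the workers run in separate threads or processes against independent environment copies, which decorrelates their experience and stabilizes training without a replay buffer; the gradient-push structure, however, is exactly as sketched.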