TOPICAL CRAWLING ON THE WEB THROUGH LOCAL SITE-SEARCHES

Authors

  • YALING LIU, Department of Electrical Engineering and Computer Science, University of Kansas
  • ARVIN AGAH, Department of Electrical Engineering and Computer Science, University of Kansas

Keywords:

Web searching, topical resource discovery, topical crawling, local site-search

Abstract

In this paper, we investigate the feasibility of discovering topical resources by combining Web searches and local site-searches. Existing techniques for topical resource discovery fall into two groups: crawling the Web and searching the Web. The former typically analyses the linkage among Web pages to estimate the relevance of an unseen document to a topic. The latter exploits the indices of generic search engines to discover documents relevant to a topic. Local site-search has long been a simple and convenient feature that lets human users quickly locate desired information within a site that hosts a tremendous number of documents, yet it has been ignored by techniques for automatic topical resource discovery. A typical local site-search returns a list of titles, hyperlinks, and snippets of relevant documents, which can be used to estimate the relevance of those documents to the topic before actually fetching them. We propose an operational model that makes use of this simple feature, and address how the model can be realized. Experiments show that this simple but efficient approach can provide much more precise estimations than a sophisticated intelligent topical crawler.
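
The idea of estimating relevance from site-search results before fetching can be illustrated with a minimal sketch. It is not the operational model proposed in the paper, whose details are not given in this abstract. The `SearchResult` structure, the term-overlap score, and the 0.5 threshold are all assumptions chosen for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class SearchResult:
    """One entry of a site-search results page (hypothetical structure)."""
    title: str
    url: str
    snippet: str


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def estimate_relevance(result: SearchResult, topic_terms: set) -> float:
    """Fraction of topic terms found in the title or snippet.

    This simple overlap score is an assumed stand-in for whatever
    estimator a real system would use.
    """
    if not topic_terms:
        return 0.0
    seen = tokenize(result.title) | tokenize(result.snippet)
    return len(topic_terms & seen) / len(topic_terms)


def select_for_fetch(results, topic_terms, threshold=0.5):
    """Return URLs whose estimated relevance meets the threshold, best first."""
    scored = [(estimate_relevance(r, topic_terms), r.url) for r in results]
    kept = [(score, url) for score, url in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in kept]
```

Only the URLs returned by `select_for_fetch` would then be downloaded, so documents whose titles and snippets show little overlap with the topic are never fetched.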

 

Published

2013-01-25

How to Cite

LIU, Y., & AGAH, A. (2013). TOPICAL CRAWLING ON THE WEB THROUGH LOCAL SITE-SEARCHES. Journal of Web Engineering, 12(3-4), 203–214. Retrieved from https://journals.riverpublishers.com/index.php/JWE/article/view/4153

Section

Articles