Web searching, topical resource discovery, topical crawling, local site-search
Abstract
In this paper, we investigate the feasibility of discovering topical resources by combining Web searches with local site-searches. Existing techniques for topical resource discovery either crawl the Web or search the Web. Crawling typically analyzes the link structure among Web pages to estimate the relevance of an unseen document to a topic. Searching exploits the indices of generic search engines to discover documents relevant to a topic. Local site-search has long been a simple and convenient way for human users to quickly locate information within a site that hosts a very large number of documents, yet techniques for automatic topical resource discovery have ignored it. A typical local site-search returns a list of titles, hyperlinks, and snippets of relevant documents, which can be used to estimate the relevance of those documents to the topic before they are actually fetched. We propose an operational model that makes use of this feature and describe how the model can be realized. Experiments show that this simple but efficient approach can provide much more precise relevance estimates than a sophisticated intelligent topical crawler.
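The idea of estimating relevance from a site-search result list before fetching can be illustrated with a minimal sketch. It is not the operational model proposed in the paper: the `SearchResult` type, the term-overlap score, the `threshold` parameter, and the example URLs are all assumptions chosen only to make the idea concrete.

```python
import re
from dataclasses import dataclass


@dataclass
class SearchResult:
    """One entry of a site-search result list (hypothetical structure)."""
    title: str
    url: str
    snippet: str


def estimate_relevance(result, topic_terms):
    """Fraction of topic terms found in the title or snippet.

    This term-overlap score is an illustrative stand-in, not the
    estimator used in the paper.
    """
    text = f"{result.title} {result.snippet}".lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    terms = {t.lower() for t in topic_terms}
    if not terms:
        return 0.0
    return len(terms & words) / len(terms)


def select_for_fetch(results, topic_terms, threshold=0.5):
    """Return URLs whose estimated relevance meets the threshold, best first.

    Only these URLs would be fetched; the rest are skipped without
    downloading the documents.
    """
    scored = [(estimate_relevance(r, topic_terms), r.url) for r in results]
    kept = [(score, url) for score, url in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in kept]


if __name__ == "__main__":
    results = [
        SearchResult("Focused crawling tutorial", "http://example.org/a",
                     "A guide to topical crawling of the Web."),
        SearchResult("Cafeteria menu", "http://example.org/b",
                     "Lunch specials this week."),
    ]
    print(select_for_fetch(results, ["crawling", "topical"]))
```

The point of the sketch is that the decision to fetch is made from the title and snippet alone, so documents judged irrelevant cost no download. A real system would likely replace the overlap score with a trained classifier or a weighted similarity measure.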