CRAWLING THE INFINITE WEB
Keywords:User sessions, User navigation analysis
Many publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. Web sites that are built with dynamic pages can create, in principle, a very large amount of Web pages. This poses a problem for the crawlers of Web search engines, as the network and storage resources required for indexing Web pages are neither infinite nor free. In this article, several probabilistic models for user browsing in “infinite” Web sites are proposed and studied. These models aim at predicting how deep users go while exploring Web sites. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 “clicks” away from the start page, to reach 90% of the pages that users actually visit.
E. Adar and B. A. Huberman. The economics of web surfing. In Poster Proceedings of the Ninth
Conference on World Wide Web, Amsterdam, Netherlands, May 2000.
R. Baeza-Yates and C. Castillo. Balancing volume, quality and freshness in web crawling. In
Soft Computing Systems - Design, Management and Applications, pages 565–572, Santiago, Chile,
IOS Press Amsterdam.
R. Baeza-Yates and C. Castillo. Crawling the infinite web: five levels are enough. In Proceedings of
the third Workshop on Web Graphs (WAW), volume 3243 of Lecture Notes in Computer Science,
pages 156–167, Rome, Italy, 2004. Springer.
R. Baeza-Yates, C. Castillo, and E. Efthimiadis. Characterization of national web domains. To
appear in ACM TOIT, 2007.
M. K. Bergman. The deep web: Surfacing hidden value. Journal of Electronic Publishing, 7(1),
L. Catledge and J. Pitkow. Characterizing browsing behaviors on the world wide web. Computer
Networks and ISDN Systems, 6(27), 1995.
S. Chakrabarti. Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan
Kaufmann, August 2002.
J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In Proceedings
of ACM International Conference on Management of Data (SIGMOD), pages 117–128, Dallas,
Texas, USA, 2000.
J. Cho and H. Garcia-Molina. Parallel crawlers. In Proceedings of the eleventh international
conference on World Wide Web, pages 124–135, Honolulu, Hawaii, USA, 2002. ACM Press.
R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing
patterns. Knowledge and Information Systems, 1(1):5–32, 1999.
R. Deraison. Nessus: remote security scanner. http://www.nessus.org/, 2004.
M. Diligenti, M. Gori, and M. Maggini. A unified probabilistis framework for Web page scoring
systems. IEEE Transactions on Knowledge and Data Engineering, 16(1):4–16, 2004.
N. Eiron, K. S. Curley, and J. A. Tomlin. Ranking the web frontier. In Proceedings of the 13th
international conference on World Wide Web, pages 309–318, New York, NY, USA, 2004. ACM
T. M. Ghanem and W. G. Aref. Databases deepen the web. Computer, 37(1), 2004.
Y. Hafri and C. Djeraba. High performance crawling system. In Proceedings of the 6th ACM
SIGMM international workshop on Multimedia information retrieval, pages 299–306, New York,
NY, USA, 2004. ACM Press.
S. Haigh and J. Megarity. Measuring web site usage: Log file analysis. Network Notes, 57, 1998.
M. Henzinger. Hyperlink analysis for the web. IEEE Internet Computing, 5(1):45–50, 2001.
M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near–uniform url sampling. In
Proceedings of the Ninth Conference on World Wide Web, pages 295–308, Amsterdam, Netherlands,
May 2000. Elsevier Science.
A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web
Conference, 2(4):219–229, April 1999.
B. A. Huberman, P. L. T. Pirolli, J. E. Pitkow, and R. M. Lukose. Strong regularities in world
wide web surfing. Science, 280(5360):95–97, April 1998.
S. Lawrence and L. C. Giles. Searching the World Wide Web. Science, 280(5360):98–100, 1998.
M. Levene, J. Borges, and G. Loizou. Zipf’s law for web surfers. Knowledge and Information
Systems, 3(1):120–129, 2001.
J. Liu, S. Zhang, and J. Yang. Characterizing web usage regularities with information foraging
agents. IEEE Transactions on Knowledge and Data Engineering, 16(5), 2004.
R. M. Lukose and B. A. Huberman. Surfing as a real option. In Proceedings of the first international
conference on Information and computation economies, pages 45–51, Charleston, South Carolina,
USA, 1998. ACM Press.
S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer, London, 1993.
M. Najork and J. L. Wiener. Breadth-first crawling yields high-quality pages. In Proceedings
of the Tenth Conference on World Wide Web, pages 114–118, Hong Kong, May 2001. Elsevier
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order
to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.
S. Raghavan and H. Garcia-Molina. Crawling the hidden web. In Proceedings of the Twentyseventh
International Conference on Very Large Databases (VLDB), pages 129–138, Rome, Italy,
P. N. Tan and V. Kumar. Discovery of web robots session based on their navigational patterns.
Data Mining and Knowledge discovery, 6(1):9–35, 2002.
D. Tanasa and B. Trousse. Advanced data preprocessing for intersites Web usage mining. IEEE
Intelligent Systems, 19(2):59–65, 2004.
L. Tauscher and S. Greenberg. Revisitation patterns in world wide web navigation. In Proceedings
of the Conference on Human Factors in Computing Systems CHI’97, 1997.