CRAWLING THE INFINITE WEB

Authors

  • RICARDO BAEZA-YATES Center for Web Research, Dept. of Computer Science, University of Chilea Blanco Encalada 2120, Santiago 8370459, Chile
  • CARLOS CASTILLO Universitat Pompeu Fabrab Pg. de Circumval·laci´o 8, Barcelona 08003, Catalunya-Spain

Keywords:

User sessions, User navigation analysis

Abstract

Many publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. Web sites that are built with dynamic pages can create, in principle, a very large amount of Web pages. This poses a problem for the crawlers of Web search engines, as the network and storage resources required for indexing Web pages are neither infinite nor free. In this article, several probabilistic models for user browsing in “infinite” Web sites are proposed and studied. These models aim at predicting how deep users go while exploring Web sites. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 “clicks” away from the start page, to reach 90% of the pages that users actually visit.

 

Downloads

Download data is not yet available.

References

E. Adar and B. A. Huberman. The economics of web surfing. In Poster Proceedings of the Ninth

Conference on World Wide Web, Amsterdam, Netherlands, May 2000.

R. Baeza-Yates and C. Castillo. Balancing volume, quality and freshness in web crawling. In

Soft Computing Systems - Design, Management and Applications, pages 565–572, Santiago, Chile,

IOS Press Amsterdam.

R. Baeza-Yates and C. Castillo. Crawling the infinite web: five levels are enough. In Proceedings of

the third Workshop on Web Graphs (WAW), volume 3243 of Lecture Notes in Computer Science,

pages 156–167, Rome, Italy, 2004. Springer.

R. Baeza-Yates, C. Castillo, and E. Efthimiadis. Characterization of national web domains. To

appear in ACM TOIT, 2007.

M. K. Bergman. The deep web: Surfacing hidden value. Journal of Electronic Publishing, 7(1),

L. Catledge and J. Pitkow. Characterizing browsing behaviors on the world wide web. Computer

Networks and ISDN Systems, 6(27), 1995.

S. Chakrabarti. Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan

Kaufmann, August 2002.

J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In Proceedings

of ACM International Conference on Management of Data (SIGMOD), pages 117–128, Dallas,

Texas, USA, 2000.

J. Cho and H. Garcia-Molina. Parallel crawlers. In Proceedings of the eleventh international

conference on World Wide Web, pages 124–135, Honolulu, Hawaii, USA, 2002. ACM Press.

R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing

patterns. Knowledge and Information Systems, 1(1):5–32, 1999.

R. Deraison. Nessus: remote security scanner. http://www.nessus.org/, 2004.

M. Diligenti, M. Gori, and M. Maggini. A unified probabilistis framework for Web page scoring

systems. IEEE Transactions on Knowledge and Data Engineering, 16(1):4–16, 2004.

N. Eiron, K. S. Curley, and J. A. Tomlin. Ranking the web frontier. In Proceedings of the 13th

international conference on World Wide Web, pages 309–318, New York, NY, USA, 2004. ACM

Press.

T. M. Ghanem and W. G. Aref. Databases deepen the web. Computer, 37(1), 2004.

Y. Hafri and C. Djeraba. High performance crawling system. In Proceedings of the 6th ACM

SIGMM international workshop on Multimedia information retrieval, pages 299–306, New York,

NY, USA, 2004. ACM Press.

S. Haigh and J. Megarity. Measuring web site usage: Log file analysis. Network Notes, 57, 1998.

M. Henzinger. Hyperlink analysis for the web. IEEE Internet Computing, 5(1):45–50, 2001.

M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near–uniform url sampling. In

Proceedings of the Ninth Conference on World Wide Web, pages 295–308, Amsterdam, Netherlands,

May 2000. Elsevier Science.

A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web

Conference, 2(4):219–229, April 1999.

B. A. Huberman, P. L. T. Pirolli, J. E. Pitkow, and R. M. Lukose. Strong regularities in world

wide web surfing. Science, 280(5360):95–97, April 1998.

S. Lawrence and L. C. Giles. Searching the World Wide Web. Science, 280(5360):98–100, 1998.

M. Levene, J. Borges, and G. Loizou. Zipf’s law for web surfers. Knowledge and Information

Systems, 3(1):120–129, 2001.

J. Liu, S. Zhang, and J. Yang. Characterizing web usage regularities with information foraging

agents. IEEE Transactions on Knowledge and Data Engineering, 16(5), 2004.

R. M. Lukose and B. A. Huberman. Surfing as a real option. In Proceedings of the first international

conference on Information and computation economies, pages 45–51, Charleston, South Carolina,

USA, 1998. ACM Press.

S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer, London, 1993.

M. Najork and J. L. Wiener. Breadth-first crawling yields high-quality pages. In Proceedings

of the Tenth Conference on World Wide Web, pages 114–118, Hong Kong, May 2001. Elsevier

Science.

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order

to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.

S. Raghavan and H. Garcia-Molina. Crawling the hidden web. In Proceedings of the Twentyseventh

International Conference on Very Large Databases (VLDB), pages 129–138, Rome, Italy,

Morgan Kaufmann.

P. N. Tan and V. Kumar. Discovery of web robots session based on their navigational patterns.

Data Mining and Knowledge discovery, 6(1):9–35, 2002.

D. Tanasa and B. Trousse. Advanced data preprocessing for intersites Web usage mining. IEEE

Intelligent Systems, 19(2):59–65, 2004.

L. Tauscher and S. Greenberg. Revisitation patterns in world wide web navigation. In Proceedings

of the Conference on Human Factors in Computing Systems CHI’97, 1997.

Downloads

Published

2007-10-30

How to Cite

BAEZA-YATES, R. ., & CASTILLO, C. . (2007). CRAWLING THE INFINITE WEB. Journal of Web Engineering, 6(1), 049–072. Retrieved from https://journals.riverpublishers.com/index.php/JWE/article/view/4135

Issue

Section

Articles