On The Evolution of Clusters of Near-Duplicate Web Pages

Authors

  • D. Fetterly
  • M. Manasse
  • M. Najork

Keywords:

Web characterization, web evolution, clusters, mirrors, mirror detection

Abstract

This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over a span of 11 weeks. We then determined which of these pages are near duplicates of one another and tracked how clusters of near-duplicate documents evolved over time.

 

Published

2004-06-11

How to Cite

Fetterly, D., Manasse, M., & Najork, M. (2004). On The Evolution of Clusters of Near-Duplicate Web Pages. Journal of Web Engineering, 2(4), 228–246. Retrieved from https://journals.riverpublishers.com/index.php/JWE/article/view/4355

Section

Articles