An Empirical Study of Web Page Structural Properties




web sites, empirical study, statistical analysis, DOM


The paper reports the results of an empirical study of the structural properties of HTML markup in websites. A first large-scale survey is conducted on 708 contemporary (2019–2020) websites, in order to measure various features related to their size and structure: DOM tree size, maximum degree, depth, diversity of element types and CSS classes, among others. The second part of the study leverages archived pages from the Internet Archive, in order to retrace the evolution of these features over a span of 25 years. The goal of this research is to serve as a reference point for studies that include an empirical evaluation on samples of web pages.
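As an illustration of the kind of features listed in the abstract, the following minimal sketch computes tree size, maximum degree, depth, and the number of distinct element types and CSS classes for a single HTML string. It is not the authors' tooling: the names `DomStats` and `measure` are hypothetical, the parser is Python's standard `html.parser`, and the sketch assumes reasonably well-nested markup (a browser's DOM would also repair omitted or misplaced tags, which this code does not).

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they never become an open parent.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DomStats(HTMLParser):
    """Hypothetical helper: gathers structural metrics during parsing."""

    def __init__(self):
        super().__init__()
        # One child counter per open element; index 0 is a virtual root.
        self.child_counts = [0]
        self.size = 0
        self.max_depth = 0
        self.max_degree = 0
        self.tags = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        self.size += 1
        self.tags.add(tag)
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())
        # The new element is one more child of the innermost open element.
        self.child_counts[-1] += 1
        self.max_degree = max(self.max_degree, self.child_counts[-1])
        # The virtual root sits at depth 0, so <html> is at depth 1.
        self.max_depth = max(self.max_depth, len(self.child_counts))
        if tag not in VOID:
            self.child_counts.append(0)

    def handle_endtag(self, tag):
        # Assumption: tags are well nested, so any end tag closes the
        # innermost open element.
        if tag not in VOID and len(self.child_counts) > 1:
            self.child_counts.pop()


def measure(html):
    """Return a dict of structural metrics for one HTML document."""
    stats = DomStats()
    stats.feed(html)
    stats.close()
    return {
        "size": stats.size,
        "max_degree": stats.max_degree,
        "depth": stats.max_depth,
        "element_types": len(stats.tags),
        "css_classes": len(stats.classes),
    }


if __name__ == "__main__":
    page = ('<html><body><div class="a b"><p>x</p>'
            '<p class="a">y</p><br></div></body></html>')
    print(measure(page))
```

Applied to a collection of pages, a function of this shape would yield one row of metrics per page, which is the form of data that a survey like the one described could aggregate.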



Author Biographies

Xavier Chamberland-Thibeault, Laboratoire d’informatique formelle, Université du Québec à Chicoutimi, Canada

Xavier Chamberland-Thibeault is an M.Sc. candidate at Université du Québec à Chicoutimi, Canada, and a lecturer of Computer Science at Cégep de Jonquière, Canada. In parallel to his work on web site profiling, Xavier has been involved in the development of auto-repair features in Cornipickle, a declarative website testing tool. His work has been published at the International Conference on Web Engineering in 2020 and 2021.

Sylvain Hallé, Laboratoire d’informatique formelle, Université du Québec à Chicoutimi, Canada

Sylvain Hallé is the Canada Research Chair in Software Specification, Testing and Verification and a Full Professor of Computer Science at Université du Québec à Chicoutimi, Canada. He started working at UQAC in 2010, after completing a PhD at Université du Québec à Montréal and working as a postdoctoral researcher at the University of California, Santa Barbara. He is the lead developer of the Cornipickle declarative web testing tool, and the author of more than 100 scientific publications. Prof. Hallé has earned several best paper awards for his work on the application of formal methods to various types of software systems.








ICWE 2020