An Empirical Study of Web Page Structural Properties
DOI:
https://doi.org/10.13052/jwe1540-9589.2044
Keywords:
web sites, empirical study, statistical analysis, DOM
Abstract
This paper reports the results of an empirical study of the structural properties of HTML markup in websites. The first part is a large-scale survey of 708 contemporary (2019–2020) websites, which measures various features related to their size and structure: DOM tree size, maximum degree, depth, and diversity of element types and CSS classes, among others. The second part of the study leverages archived pages from the Internet Archive to retrace the evolution of these features over a span of 25 years. The goal of this research is to serve as a reference point for studies that include an empirical evaluation on samples of web pages.
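To make the measured features concrete, the following is a minimal sketch of how DOM tree size, maximum degree, depth, and the diversity of element types and CSS classes could be computed for a single page. It is not the tooling used in the study, which the abstract does not describe. It assumes Python's standard `html.parser` module, and the names `StructureProfiler` and `profile` are hypothetical. Unlike a browser, this parser does not insert implied elements or execute scripts, so its counts may differ from those of a rendered DOM.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are counted but not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class StructureProfiler(HTMLParser):
    """Hypothetical helper: collects structural metrics from an HTML string."""

    def __init__(self):
        super().__init__()
        self.size = 0          # number of element nodes
        self.depth = 0         # deepest nesting level (root element = 1)
        self.max_degree = 0    # largest number of child elements of any element
        self.tags = set()      # distinct element types
        self.classes = set()   # distinct CSS class names
        self._stack = []       # open elements as [tag, child_count]

    def handle_starttag(self, tag, attrs):
        self.size += 1
        self.tags.add(tag)
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())
        if self._stack:
            parent = self._stack[-1]
            parent[1] += 1
            self.max_degree = max(self.max_degree, parent[1])
        self.depth = max(self.depth, len(self._stack) + 1)
        if tag not in VOID:
            self._stack.append([tag, 0])

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def profile(html):
    """Return the structural metrics of one HTML document as a dict."""
    parser = StructureProfiler()
    parser.feed(html)
    parser.close()
    return {
        "size": parser.size,
        "max_degree": parser.max_degree,
        "depth": parser.depth,
        "element_types": len(parser.tags),
        "css_classes": len(parser.classes),
    }


sample = ('<html><body><div class="a b"><p class="a">x</p><br>'
          '<img src="i.png"></div><ul><li>1</li><li>2</li></ul></body></html>')
result = profile(sample)
print(result)
```

Only element nodes are counted here; text nodes and comments are ignored, which is one of several possible conventions for defining tree size.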