A New Semantic Approach to Improve Webpage Segmentation





Webpage analysis, webpage segmentation, semantic text similarity, Gestalt Law of grouping


Webpage analysis is carried out for various purposes, such as webpage segmentation. The goal of webpage segmentation is to divide a page into blocks that have similar elements. A fusion approach that combines different analyses is required to obtain high segmentation accuracy. In this paper, we propose a new fusion model for webpage segmentation, in which we (1) merge webpage content into basic blocks by simulating human perception, and (2) identify similar blocks using semantic text similarity and regroup these similar blocks as fusion blocks. The approach is applied to three public datasets and evaluated against state-of-the-art algorithms. The results show that our proposed approach outperforms existing webpage segmentation methods in terms of accuracy.


Saeedeh Sadat Sajjadi Ghaemmaghami, University of Alberta, Canada

Saeedeh Sadat Sajjadi Ghaemmaghami received her BS and MS degrees in computer engineering from QIAU, Iran, in 2007 and 2012, respectively. She is currently working toward a PhD degree in the Department of Electrical and Computer Engineering at the University of Alberta. Her research interests include webpage analysis, machine learning, natural language processing, and image processing.

James Miller, University of Alberta, Canada

James Miller, P.Eng (Alberta), has been a full professor in the Department of Electrical and Computer Engineering at the University of Alberta since 2000. Previously, he was a professor at the University of Strathclyde (U.K.) and a principal research scientist at the National Electronics Research Initiative (U.K.). He has been an active researcher for more than thirty years across a wide range of topics, including Computer Vision, Pattern Recognition, Embedded System Design, Software Engineering, Web Engineering, and Proactive Analytics. He has published more than 100 articles in peer-reviewed journals, including many IEEE and ACM venues.


Ghaemmaghami, S. S. S., & Miller, J. (2021). A New Semantic Approach to Improve Webpage Segmentation. Journal of Web Engineering, 20(4), 963–992. https://doi.org/10.13052/jwe1540-9589.2042


