Integrated-Block: A New Combination Model to Improve Web Page Segmentation

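Authors

Saeedeh Sadat Sajjadi Ghaemmaghami, University of Alberta, Canada

James Miller, University of Alberta, Canada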

DOI:

https://doi.org/10.13052/jwe1540-9589.2146

Keywords:

Webpage analysis, webpage segmentation, semantic text similarity, Gestalt Law of grouping

Abstract

Context: Web page segmentation methods are used for purposes such as web page classification and content analysis. These methods partition a web page into blocks, where each block contains similar components.

Objective: The goal of this paper is to propose a new segmentation approach that semantically segments web pages into integrated blocks and obtains high segmentation accuracy.

Method: We propose a new segmentation model that semantically segments web pages into integrated blocks in two steps: (1) it merges web page content into basic-blocks by simulating human perception using the Gestalt laws of grouping; and (2) it uses semantic text similarity to identify similar basic-blocks and regroup them as integrated blocks.
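The regrouping step (2) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each basic-block has already been mapped to a text embedding vector (toy vectors stand in for a real sentence-embedding model), it uses cosine similarity with a hypothetical threshold of 0.8, and it merges similar blocks transitively with a union-find structure. The function names `cosine_similarity` and `regroup_blocks` are illustrative only.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def regroup_blocks(embeddings, threshold=0.8):
    """Group basic-block indices whose embeddings are pairwise similar.

    Assumption: blocks are merged transitively (if A~B and B~C, all three
    end up together). The threshold value is hypothetical.
    """
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy embeddings for four basic-blocks: blocks 0 and 1 point in nearly the
# same direction, as do blocks 2 and 3.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
integrated_blocks = regroup_blocks(toy_embeddings)
```

In practice the embeddings would come from a sentence-level text representation model, and the threshold would need to be tuned on annotated pages.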

Results: To verify the accuracy of our approach, we (1) applied it to three datasets and (2) compared it with five existing state-of-the-art algorithms. The results show that our approach outperforms all five comparison methods in terms of precision, recall, F1 score, and adjusted Rand index (ARI).
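For readers unfamiliar with ARI, the following sketch computes it from the standard pair-counting formula of Hubert and Arabie. It assumes each page element carries one block label in the ground-truth segmentation and one in the predicted segmentation; how elements are mapped to blocks in the paper's evaluation is not specified here, and the function name `adjusted_rand_index` is illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same elements."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same elements")
    n = len(labels_true)
    if n < 2:
        return 1.0  # No pairs to disagree on.

    # Contingency counts: how many elements share each (true, predicted) pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # Agreement expected by chance.
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # Degenerate case, e.g. both labelings are a single block.
    return (sum_cells - expected) / (max_index - expected)


# Same partition under different label names scores 1.0.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that splits every true block scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

ARI is 1.0 for identical partitions and close to 0 for agreement at chance level, which makes it independent of how block labels are named.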

Conclusion: In this paper, we propose a new segmentation model and apply it to three datasets to (1) generate basic-blocks by simulating human perception to segment a web page, (2) identify semantically related blocks and regroup them as an integrated block, and (3) address limitations found in existing approaches.

Author Biographies

Saeedeh Sadat Sajjadi Ghaemmaghami, University of Alberta, Canada

Saeedeh Sadat Sajjadi Ghaemmaghami received the Ph.D. degree in computer engineering from the University of Alberta in 2021. Her research interests include web page analysis, machine learning, natural language processing, pattern recognition, and data mining.

James Miller, University of Alberta, Canada

James Miller, P.Eng (Alberta), has been a full professor with the Dept. of Electrical and Computer Engineering at the University of Alberta since 2000. Previously, he was a professor at the University of Strathclyde (U.K.) and a principal research scientist at the National Electronics Research Initiative (U.K.). He has been an active researcher for more than thirty years across a wide range of topics, including Computer Vision, Pattern Recognition, Embedded System Design, Software Engineering, Web Engineering, and Proactive Analytics. He has published more than 100 articles in peer-reviewed journals, including many IEEE and ACM venues.

Published

2022-04-16

How to Cite

Ghaemmaghami, S. S. S., & Miller, J. (2022). Integrated-Block: A New Combination Model to Improve Web Page Segmentation. Journal of Web Engineering, 21(04), 1103–1144. https://doi.org/10.13052/jwe1540-9589.2146

Section

Articles