A New Semantic Approach to Improve Webpage Segmentation
Webpage analysis is carried out for various purposes, one of which is webpage segmentation. The goal of webpage segmentation is to divide a page into blocks of similar elements. Achieving high segmentation accuracy requires a fusion approach that combines different analyses. In this paper, we propose a new fusion model for webpage segmentation, in which we (1) merge webpage content into basic blocks by simulating human perception; and (2) identify similar blocks using semantic text similarity and regroup them as fusion blocks. We apply this approach to three public datasets and evaluate it against state-of-the-art algorithms. The results show that the proposed approach outperforms existing webpage segmentation methods in terms of accuracy.
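Step (2) can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the basic blocks have already been extracted as plain-text strings, and it swaps in a bag-of-words cosine similarity as a simple stand-in for the semantic text similarity the abstract mentions. The function names (`bag_of_words`, `cosine`, `fuse_blocks`) and the threshold of 0.5 are hypothetical choices made for illustration only.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse_blocks(block_texts, threshold=0.5):
    """Group block indices whose pairwise text similarity reaches the threshold.

    The threshold is an assumed, illustrative value. Grouping is transitive:
    if A matches B and B matches C, all three land in one fusion block.
    """
    vectors = [bag_of_words(t) for t in block_texts]
    parent = list(range(len(block_texts)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(block_texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


blocks = [
    "Latest sports news and scores",
    "Sports scores and latest news today",
    "Contact us by email or phone",
]
fusion_blocks = fuse_blocks(blocks)
```

Here the first two blocks share most of their vocabulary and are fused, while the third stays on its own. An embedding-based similarity could replace `cosine` without changing the grouping logic.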