Knowledge Based Deep Inception Model for Web Page Classification

Authors

  • Amit Gupta Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Chandigarh, India https://orcid.org/0000-0002-2875-5216
  • Rajesh Bhatia Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Chandigarh, India

DOI:

https://doi.org/10.13052/jwe1540-9589.2075

Abstract

Web page classification is central to information retrieval and management tasks and plays an important role in natural language processing (NLP) problems in web engineering. Traditional machine learning algorithms extract the desired features from web pages, whereas deep learning algorithms learn features as the network goes deeper. Pre-trained models such as BERT have achieved remarkable results in text classification and continue to show state-of-the-art performance. Knowledge graphs can provide rich, structured factual information for better language modelling and representation. In this study, we propose an ensemble Knowledge Based Deep Inception (KBDI) approach for web page classification. It learns bidirectional contextual representations using pre-trained BERT combined with knowledge graph embeddings, and fine-tunes on the target task with a Deep Inception network that exploits parallel multi-scale semantics. The proposed ensemble evaluates the efficacy of fusing domain-specific knowledge embeddings with the pre-trained BERT model. Experimental results show that the proposed BERT-fused KBDI model outperforms benchmark baselines and performs better than other conventional approaches evaluated on web page classification datasets.
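The idea of fusing contextual text vectors with knowledge graph embeddings and passing them through parallel multi-scale convolutions can be illustrated with a small sketch. This is not the authors' implementation: the random arrays stand in for BERT outputs and knowledge graph embeddings, the per-token concatenation used for fusion is an assumption, and the kernel widths (1, 3, 5), dimensions, and function names are hypothetical choices made only for illustration.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1-D convolution over the token axis with zero padding and ReLU.
    x: (seq_len, d_in), w: (k, d_in, d_out) with odd k, b: (d_out,)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([
        np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
        for t in range(x.shape[0])
    ])
    return np.maximum(out + b, 0.0)

def inception_block(x, branches):
    """Run parallel convolutions of different widths, max-pool each
    branch over the token axis, and concatenate the pooled vectors."""
    pooled = [conv1d_same(x, w, b).max(axis=0) for w, b in branches]
    return np.concatenate(pooled)

def classify(token_vecs, kg_vecs, branches, w_out, b_out):
    """Fuse text and knowledge vectors, extract multi-scale features,
    and return class probabilities from a softmax layer."""
    # Assumption: fusion is a simple per-token concatenation.
    fused = np.concatenate([token_vecs, kg_vecs], axis=1)
    features = inception_block(fused, branches)
    logits = features @ w_out + b_out
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Hypothetical sizes; random values stand in for real embeddings.
rng = np.random.default_rng(0)
seq_len, d_text, d_kg, d_branch, n_classes = 12, 16, 8, 4, 3
token_vecs = rng.normal(size=(seq_len, d_text))  # placeholder for BERT output
kg_vecs = rng.normal(size=(seq_len, d_kg))       # placeholder for KG embeddings
branches = [
    (rng.normal(scale=0.1, size=(k, d_text + d_kg, d_branch)), np.zeros(d_branch))
    for k in (1, 3, 5)
]
w_out = rng.normal(scale=0.1, size=(len(branches) * d_branch, n_classes))
b_out = np.zeros(n_classes)

probs = classify(token_vecs, kg_vecs, branches, w_out, b_out)
print(probs.shape, float(probs.sum()))
```

Each branch sees the same fused sequence at a different window width, so the concatenated feature vector mixes single-token, short-phrase, and longer-phrase evidence before classification. In a trained model the weights would be learned rather than sampled.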

Published

2021-11-16

How to Cite

Gupta, A., & Bhatia, R. (2021). Knowledge Based Deep Inception Model for Web Page Classification. Journal of Web Engineering, 20(07), 2131–2168. https://doi.org/10.13052/jwe1540-9589.2075

Issue

Section

SPECIAL ISSUE: ADVANCED PRACTICES IN WEB ENGINEERING 2021