Improving Ranking Using Hybrid Custom Embedding Models on Persian Web
DOI: https://doi.org/10.13052/jwe1540-9589.2253

Keywords: Word embedding, Word2Vec, BERT, semantic vector, query, ranking

Abstract
Ranking plays a crucial role in information retrieval systems, especially in web search engines. This article presents a new ranking approach that uses semantic vectors and embedding models to improve the accuracy of web document ranking, particularly for languages with complex structures such as Persian. Two real-world datasets are used: one obtained by crawling the web to collect a large-scale Persian corpus, and one consisting of real user queries and web documents labeled with relevancy scores. These datasets are used to train embedding models that combine the static Word2Vec algorithm with the contextual (dynamic) BERT model. The proposed hybrid ranking formula, called HybridMaxSim, incorporates both kinds of semantic vectors into a novel document-ranking approach. Experiments indicate that HybridMaxSim improves the precision of web document ranking, reaching an nDCG of up to 0.87.
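As a rough illustration only, the sketch below shows one plausible reading of a hybrid max-similarity score, alongside the standard nDCG measure used for evaluation. The abstract does not spell out the HybridMaxSim formula, so everything here is an assumption: the function names, the blend weight `alpha`, and the per-term vector inputs are hypothetical, not the paper's actual implementation.

```python
# Minimal sketch, NOT the paper's implementation: one plausible reading of a
# hybrid max-similarity score. Each query term is matched to its most similar
# document term under both a static (Word2Vec-style) and a contextual
# (BERT-style) embedding, and the two similarities are blended.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def hybrid_max_sim(query_static, query_ctx, doc_static, doc_ctx, alpha=0.5):
    """Average over query terms of the best blended similarity to any doc term.

    query_static / doc_static: lists of static (Word2Vec-style) term vectors.
    query_ctx / doc_ctx:       lists of contextual (BERT-style) term vectors.
    alpha:                     hypothetical blend weight between the two models.
    """
    per_term = []
    for qs, qc in zip(query_static, query_ctx):
        best = max(
            alpha * cosine(qs, ds) + (1 - alpha) * cosine(qc, dc)
            for ds, dc in zip(doc_static, doc_ctx)
        )
        per_term.append(best)
    return sum(per_term) / len(per_term)

def ndcg(relevances, k=None):
    """nDCG over a ranked list of graded relevance labels (linear gain)."""
    rel = np.asarray(relevances, dtype=float)
    k = rel.size if k is None else min(k, rel.size)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))   # 1/log2(rank+1)
    dcg = float((rel[:k] * discounts).sum())
    ideal = np.sort(rel)[::-1][:k]                   # best possible ordering
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

if __name__ == "__main__":
    # Toy vectors standing in for trained Word2Vec and BERT term embeddings.
    rng = np.random.default_rng(0)
    q_s = [rng.normal(size=100) for _ in range(3)]
    q_c = [rng.normal(size=768) for _ in range(3)]
    d_s = [rng.normal(size=100) for _ in range(20)]
    d_c = [rng.normal(size=768) for _ in range(20)]
    print("HybridMaxSim-style score:", hybrid_max_sim(q_s, q_c, d_s, d_c))
    print("nDCG@5:", ndcg([3, 2, 3, 0, 1, 2], k=5))
```

In this reading, the static vectors capture corpus-wide term semantics while the contextual vectors disambiguate terms within the query and document; the paper's actual weighting and aggregation may differ.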