Improving Ranking Using Hybrid Custom Embedding Models on Persian Web
DOI: https://doi.org/10.13052/jwe1540-9589.2253

Keywords: Word embedding, Word2Vec, BERT, semantic vector, query, ranking

Abstract
Ranking plays a crucial role in information retrieval systems, especially in web search engines. This article presents a new ranking approach that uses semantic vectors and embedding models to improve the accuracy of web document ranking, particularly for languages with complex structures such as Persian. Two real-world datasets are used: one obtained by crawling the web to collect a large-scale Persian corpus, and one consisting of real user queries and web documents labeled with relevancy scores. These datasets are used to train embedding models that combine the static Word2Vec algorithm with the contextual (dynamic) BERT model. The proposed hybrid ranking formula, called HybridMaxSim, incorporates both kinds of semantic vectors into a novel document-ranking approach. Experiments indicate that HybridMaxSim improves the precision of web document ranking, reaching an nDCG of up to 0.87.
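As a rough illustration only, the sketch below shows one plausible reading of a hybrid max-similarity score, alongside the standard nDCG measure used for evaluation. The abstract does not spell out the HybridMaxSim formula, so everything here is an assumption: the function names, the blend weight `alpha`, and the per-term vector inputs are hypothetical, not the paper's actual implementation.

```python
# Minimal sketch, NOT the paper's implementation: one plausible reading of a
# hybrid max-similarity score. Each query term is matched to its most similar
# document term under both a static (Word2Vec-style) and a contextual
# (BERT-style) embedding, and the two similarities are blended.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def hybrid_max_sim(query_static, query_ctx, doc_static, doc_ctx, alpha=0.5):
    """Average over query terms of the best blended similarity to any doc term.

    query_static / doc_static: lists of static (Word2Vec-style) term vectors.
    query_ctx / doc_ctx:       lists of contextual (BERT-style) term vectors.
    alpha:                     hypothetical blend weight between the two models.
    """
    per_term = []
    for qs, qc in zip(query_static, query_ctx):
        best = max(
            alpha * cosine(qs, ds) + (1 - alpha) * cosine(qc, dc)
            for ds, dc in zip(doc_static, doc_ctx)
        )
        per_term.append(best)
    return sum(per_term) / len(per_term)

def ndcg(relevances, k=None):
    """nDCG over a ranked list of graded relevance labels (linear gain)."""
    rel = np.asarray(relevances, dtype=float)
    k = rel.size if k is None else min(k, rel.size)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))   # 1/log2(rank+1)
    dcg = float((rel[:k] * discounts).sum())
    ideal = np.sort(rel)[::-1][:k]                   # best possible ordering
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

if __name__ == "__main__":
    # Toy vectors standing in for trained Word2Vec and BERT term embeddings.
    rng = np.random.default_rng(0)
    q_s = [rng.normal(size=100) for _ in range(3)]
    q_c = [rng.normal(size=768) for _ in range(3)]
    d_s = [rng.normal(size=100) for _ in range(20)]
    d_c = [rng.normal(size=768) for _ in range(20)]
    print("HybridMaxSim-style score:", hybrid_max_sim(q_s, q_c, d_s, d_c))
    print("nDCG@5:", ndcg([3, 2, 3, 0, 1, 2], k=5))
```

In this reading, the static vectors capture corpus-wide term semantics while the contextual vectors disambiguate terms within the query and document; the paper's actual weighting and aggregation may differ.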