A Comparative Analysis of Sentence Embedding Techniques for Document Ranking
DOI:
https://doi.org/10.13052/jwe1540-9589.2177

Keywords:
BERT, cosine similarity, document ranking, information retrieval, sentence embedding

Abstract
Due to the exponential growth of information on the web, extracting relevant documents for users in a reasonable time has become a cumbersome task. Moreover, when user feedback is scarce or unavailable, content-based approaches for extracting and ranking relevant documents become critical, yet they face the problem of determining the semantic similarity between the text of user queries and that of documents. Various sentence embedding models now exist that acquire deep semantic representations through training on large corpora, with the goal of providing transfer learning to a broad range of natural language processing tasks such as document similarity, text summarization, text classification, and sentiment analysis. In this paper, a comparative analysis of six pre-trained sentence embedding techniques is therefore carried out to identify the model best suited for document ranking in information retrieval (IR) systems: SentenceBERT, Universal Sentence Encoder, InferSent, ELMo, XLNet, and Doc2Vec. Four standard datasets, CACM, CISI, ADI, and Medline, are used for all experiments. Universal Sentence Encoder and SentenceBERT are found to outperform the other techniques on all four datasets in terms of MAP, recall, F-measure, and NDCG. This comparative analysis offers a synthesis of existing work as a single point of entry for practitioners who seek to use pre-trained sentence embedding models for document ranking, and for scholars who wish to undertake work in a similar domain. The work can be extended in many directions; for instance, these strategies could be combined to build hybrid document ranking or query reformulation systems in IR.
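The ranking pipeline this kind of study evaluates is conceptually simple: encode the query and each document with a pre-trained sentence encoder, rank documents by cosine similarity, and score the ranking with standard IR metrics. The following is a minimal sketch, assuming the sentence-transformers and scikit-learn packages are installed; the model name "all-MiniLM-L6-v2", the toy corpus, and the relevance judgments are illustrative placeholders, not the paper's actual configuration.

from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import ndcg_score

# Any pre-trained sentence encoder could be substituted here (e.g. a
# SentenceBERT checkpoint); this particular model is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Latent semantic analysis indexes documents by underlying concepts.",
    "Probabilistic models estimate the relevance of a document to a query.",
    "Recipes for baking sourdough bread at home.",
]
query = "semantic indexing of documents"

# Encode the query and the documents into dense vectors.
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
ranking = sorted(zip(documents, scores.tolist()), key=lambda p: p[1], reverse=True)
for doc, score in ranking:
    print(f"{score:.3f}  {doc}")

# Score the ranking with NDCG against hypothetical binary relevance
# judgments (1 = relevant), one of the metrics reported in the paper.
relevance = [[1, 1, 0]]
print(f"NDCG: {ndcg_score(relevance, [scores.tolist()]):.3f}")

Ranking over fixed pre-trained embeddings requires no relevance feedback, which is what makes the approach attractive when user feedback is unavailable; document embeddings can also be computed once, indexed, and reused across queries.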