A Semantic Similarity Measure for Scholarly Document Based on the Study of n-gram


  • Yannick-Ulrich Tchantchou Samen, Department of Mathematics and Computer Science, Faculty of Science, University of Maroua, P.O. Box 814, Maroua, Cameroon




Keywords: Semantic Similarity, n-gram, Natural Language Processing, Scholarly Document, Similarity Measure


The performance of information retrieval systems depends closely on how accurately similarity measures can determine the similarity between two documents, or between a query and a document. This paper addresses the issue of similarity measures in the context of scholarly documents. A semantic similarity measure is proposed that exploits the metadata contained in scientific articles as well as the important n-grams identified in them. To evaluate the accuracy of the proposed measure, a dataset of articles is built, together with similarity values for these articles manually estimated by human experts. Experiments performed on this dataset using Pearson correlation show that the similarity values obtained with the proposed measure are very close to those estimated by the human experts.
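The evaluation idea described above (score document pairs with an n-gram-based measure, then correlate the scores with expert judgments) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not define the proposed measure, so a plain Jaccard overlap of word n-gram sets stands in for it, metadata is ignored, and the text pairs and expert scores are made-up toy values rather than the paper's dataset.

```python
import math


def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) of a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a, b, n=2):
    """Jaccard overlap of n-gram sets (a stand-in, NOT the paper's measure)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


# Hypothetical toy data: three text pairs and made-up expert similarity scores.
pairs = [
    ("semantic similarity measure for documents",
     "semantic similarity measure for articles"),
    ("n-gram models for text retrieval",
     "error correcting codes over finite fields"),
    ("information retrieval with similarity measures",
     "web information retrieval with similarity measures"),
]
expert_scores = [0.8, 0.1, 0.7]

computed_scores = [ngram_jaccard(a, b) for a, b in pairs]
r = pearson(computed_scores, expert_scores)
print(f"Pearson r between computed and expert scores: {r:.3f}")
```

A coefficient close to 1 would indicate that the computed scores rank and scale the pairs much as the experts do, which is the sense in which the abstract calls the two sets of values "very close".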



Author Biography

Yannick-Ulrich Tchantchou Samen, Department of Mathematics and Computer Science, Faculty of Science, University of Maroua, P.O. Box 814, Maroua, Cameroon

Yannick-Ulrich Tchantchou Samen received a BSc in Pure Mathematics and an MSc in Error-Correcting Codes from the Dept. of Mathematics, Faculty of Science, University of Yaounde 1, Cameroon, in 2011 and 2013, respectively. He received a PhD in Semantic Web from the Institute of Mathematics and Physical Sciences, University of Abomey-Calavi, Benin, in 2017. He has been a Researcher with the Laboratory of Research in Computer Science and Applications (LRSIA) since 2017. Since 2021, he has been a Lecturer in Computer Science in the Dept. of Mathematics and Computer Science at the University of Maroua. His current research areas include Semantic Web, Information Filtering, Natural Language Processing, and Web Mining.






