An Analysis on Semantic Interpretation of Tamil Literary Texts
DOI:
https://doi.org/10.13052/jmm1550-4646.1839
Keywords:
Discourse Parsing, Tamil literature, Text classification, Discourse-based Clustering, Information Retrieval, Mobile Application, Multimedia, Natural Language Processing
Abstract
Natural Language Processing (NLP) is concerned with the interaction between computers and human (natural) language. Its ultimate goal is to make natural language text understandable to machines, which in turn requires capturing its meaning. Text can be analyzed at several levels: lexical, syntactic, semantic, discourse, and pragmatic. NLP tasks likewise operate on units of different sizes: word, phrase, sentence, paragraph, and document. Discourse analysis is text analysis that goes beyond the sentence level. Discourse analysis is currently performed on expository (essay-type) texts, and no state-of-the-art NLP application handles Tamil literary texts at the discourse level. Tamil classical literature is rich in ethical, moral, and philosophical values that should be explored for the benefit of society. This paper proposes an automatic semantic interpretation framework for Tamil literary texts based on discourse parsing, and presents existing work on discourse parsing, text classification, discourse-based clustering, information retrieval, and Tamil language and literature. The framework can be developed into a smart mobile application with multimedia components. The paper also discusses how processing Tamil literary texts differs from processing essay-type texts.