Enhanced Authorship Verification for Textual Similarity with Siamese Deep Learning

Rebeh Imane Aouchiche; Fatima Boumahdi; Mohamed Abdelkarim Remmide; Karim Hemina; Amina Guendouz

doi:10.13052/jmm1550-4646.2043

Authors

Rebeh Imane Aouchiche LRDSI Laboratory, Department of Computer Science, Faculty of Sciences, University of Blida 1, Blida, Algeria
Fatima Boumahdi LRDSI Laboratory, Department of Computer Science, Faculty of Sciences, University of Blida 1, Blida, Algeria
Mohamed Abdelkarim Remmide LRDSI Laboratory, Department of Computer Science, Faculty of Sciences, University of Blida 1, Blida, Algeria
Karim Hemina LRDSI Laboratory, Department of Computer Science, Faculty of Sciences, University of Blida 1, Blida, Algeria
Amina Guendouz LRDSI Laboratory, Faculty of Technology, University of Blida 1, Blida, Algeria

DOI:

https://doi.org/10.13052/jmm1550-4646.2043

Keywords:

Authorship verification, similarity learning, siamese neural network, LSTM, CNN, BERT, natural language processing, deep learning

Abstract

The internet is filled with documents written under false names or without revealing the author’s identity. Identifying the authors of these documents can help reduce the success of potential criminals and hold them to financial or legal consequences. Most previous research on authorship verification has focused on general text, but social media texts such as tweets are more challenging because they are short, poorly structured, and cover a wide range of subjects. This paper proposes a new approach to determining the textual similarity between such messages. Inspired by the popularity of Siamese networks for measuring input similarity, four deep learning models based on this architecture were developed: a long short-term memory (LSTM) network, a convolutional neural network (CNN), a combination of the two, and a BERT model. These models were evaluated on a Twitter-based dataset, and the results show that the Siamese CNN-LSTM similarity model achieved the best performance, with an accuracy of 0.97.
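
As a rough illustration of the Siamese idea mentioned in the abstract, the sketch below passes two short texts through one shared encoder and compares the resulting representations. It is not the paper's method: the encoder here is a stand-in that counts character trigrams rather than a trained LSTM, CNN, or BERT model, and the names `encode`, `cosine`, `same_author`, and the 0.5 threshold are hypothetical choices made only for this example.

```python
import math
from collections import Counter


def encode(text: str, n: int = 3) -> Counter:
    """Stand-in shared encoder (assumption): character n-gram counts.

    A real Siamese model would use a trained network here instead.
    """
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_author(text_a: str, text_b: str, threshold: float = 0.5) -> bool:
    """Verification decision; the threshold is an arbitrary illustrative value."""
    # Both inputs go through the SAME encoder: this sharing is what
    # makes the comparison "Siamese".
    return cosine(encode(text_a), encode(text_b)) >= threshold
```

In a trained Siamese network the shared encoder's weights would be learned from labelled same-author and different-author pairs, so that the similarity score reflects writing style rather than surface overlap.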

Author Biographies

Rebeh Imane Aouchiche, LRDSI Laboratory, Department of Computer Science, Faculty of Sciences, University of Blida 1, Blida, Algeria

Rebeh Imane Ammar Aouchiche is an assistant professor in the Department of Computer Science at Saad Dahlab University, Blida, Algeria. She is also currently pursuing her Ph.D. at the same university. Her research interests include Deep Learning, Natural Language Processing, cybersecurity, and Social Networks, with a particular focus on Authorship Analysis using machine and deep learning techniques. She has contributed to publications in these fields.

Fatima Boumahdi, LRDSI Laboratory, Department of Computer Science, Faculty of Sciences, University of Blida 1, Blida, Algeria

Fatima Boumahdi received a Ph.D. degree in computer science from the National School of Computer Science (ESI), Algiers, Algeria, in 2015. She is currently an associate professor in the Faculty of Sciences at Saad Dahlab University, Blida, Algeria. She has published numerous papers in the areas of Decision Support Systems, Web Information Systems, and Service Oriented Architecture. Her current research interests are mainly Deep Learning, Natural Language Processing, Sentiment Analysis, cybersecurity, Trending Topics, and Social Networks.

Mohamed Abdelkarim Remmide, LRDSI Laboratory, Department of Computer Science, Faculty of Sciences, University of Blida 1, Blida, Algeria

Mohamed Abdelkarim Remmide is a Ph.D. student in computer science at the University of Saad Dahlab Blida 1, Algiers, Algeria. He is currently a part-time teacher at the same university. His research interest is the application of deep learning in cybersecurity, currently focusing on the detection of social engineering attacks as well as case-based reasoning systems.

Karim Hemina, LRDSI Laboratory, Department of Computer Science, Faculty of Sciences, University of Blida 1, Blida, Algeria

Karim Hemina is a software engineer and AI Ph.D. student at Saad Dahlab University of Blida (USDB), Algeria. His research focuses on natural language processing (NLP), mainly on the use of machine learning techniques for fake news detection on social networks. He works as a software projects manager and is a part-time teacher at USDB, where he teaches labs on artificial intelligence and natural language processing.

Amina Guendouz, LRDSI Laboratory, Faculty of Technology, University of Blida 1, Blida, Algeria

Amina Guendouz is currently a lecturer in the Faculty of Science and Technology at Saad Dahlab Blida 1 University, Blida, Algeria. She earned a Ph.D. degree in computer science from Saad Dahlab University in 2021. Her research interests include Deep Learning, Natural Language Processing, Trending Topics, and Social Networks.
