Curated Hinglish Dataset for Deep Learning-Based Misogyny Detection

Deepti Negi; Himani Maheshwari; Chandrakala Arya; Umesh Chandra; Gaurav Shukla

doi:10.13052/jrss0974-8024.1918

Authors

Deepti Negi, School of Computing, Graphic Era Hill University, Dehradun, 248001, India
Himani Maheshwari, School of Computing, Graphic Era Hill University, Dehradun, 248001, India
Chandrakala Arya, School of Computing, Graphic Era Hill University, Dehradun, 248001, India
Umesh Chandra, Department of Statistics & Computer Science, Banda University of Agriculture & Technology, Banda, 210001, India
Gaurav Shukla, Department of Statistics & Computer Science, Banda University of Agriculture & Technology, Banda, 210001, India

DOI:

https://doi.org/10.13052/jrss0974-8024.1918

Keywords:

Misogyny detection, Hindi English code-mixed text, deep learning algorithm, BERT, offensive language, social media platform

Abstract

Social networking sites serve as an influential medium for sharing information and communicating; however, their largely unregulated and open frameworks have also turned them into fertile ground for the dissemination of offensive content. The ease of sharing content, coupled with user anonymity and vast reach, facilitates the swift circulation of offensive, abusive, and discriminatory remarks. Engagement-driven algorithms may unintentionally promote such harmful content, increasing its visibility and impact. Consequently, offensive content on platforms like Twitter, YouTube, Facebook, and Reddit frequently gains traction, fuelling online hostility, social division, and tangible real-world effects. Offensive content about women is prevalent on social media platforms. Instances of misogyny are disproportionately represented there, and misogyny is a substantial societal concern that needs to be addressed.

While extensive research has been done on offensive language detection in monolingual settings, misogyny detection in code-mixed text is relatively underexplored, and there is a lack of studies that tackle misogyny detection in under-resourced languages. One of the major causes is the unavailability of an appropriate Hindi-English code-mixed dataset. Therefore, in an attempt to bridge this research gap, our study focuses on developing a high-quality curated dataset of Hindi-English code-mixed comments from multiple social media platforms and applying deep learning techniques to it. The dataset contains 17,234 comments, manually annotated as misogynistic or non-misogynistic based on their content. Our study also presents a detailed comparison of baseline machine learning, deep learning, and transformer-based approaches on this curated Hinglish dataset. The results indicate that fine-tuned BERT outperformed the deep learning algorithms, achieving the highest accuracy of 0.92.
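As an illustration of how a score such as the reported accuracy is computed for a two-class task, the following is a minimal sketch in plain Python. It is not the authors' evaluation code: the function name `binary_metrics`, the label strings, and the toy label lists are all assumptions made for this example. Precision, recall, and F1 are included only because they come from the same confusion counts; the abstract itself reports accuracy.

```python
from collections import Counter


def binary_metrics(y_true, y_pred, positive="misogynistic"):
    """Return accuracy, precision, recall and F1 for one positive class.

    Hypothetical helper for illustration; label names are assumptions.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("label lists must not be empty")

    # Tally the four confusion-matrix cells.
    counts = Counter()
    for gold, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if gold == positive else "fp"] += 1
        else:
            counts["fn" if gold == positive else "tn"] += 1

    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    accuracy = (tp + tn) / len(y_true)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for this example only (not drawn from the dataset).
gold = ["misogynistic", "non-misogynistic", "misogynistic",
        "non-misogynistic", "non-misogynistic"]
pred = ["misogynistic", "non-misogynistic", "non-misogynistic",
        "non-misogynistic", "misogynistic"]

scores = binary_metrics(gold, pred)
print(scores)
```

On these toy labels three of five predictions match, so the accuracy is 0.6. An accuracy of 0.92 corresponds to 92 percent of evaluated comments receiving the same label as the manual annotation.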

Author Biographies

Deepti Negi, School of Computing, Graphic Era Hill University, Dehradun, 248001, India

Deepti Negi received her master’s degree in computer application from Hemavati Nandan Bahuguna University in 2008. She is currently working as an Assistant Professor at the School of Computing, Graphic Era Hill University. Her research areas include natural language processing, deep learning, and social network analysis.

Himani Maheshwari, School of Computing, Graphic Era Hill University, Dehradun, 248001, India

Himani Maheshwari is currently working as an Assistant Professor in the School of Computing, Graphic Era Hill University, Dehradun. She completed her graduation at MJP Rohilkhand University, Bareilly, and her post-graduation at Uttarakhand Technical University, Dehradun. She received her Ph.D. from IIT Roorkee. She has qualified UGC NET and GATE. She has published 45 research papers in reputed national and international journals and 10 book chapters, and holds 8 copyrights. Her areas of specialization are Artificial Intelligence, Big Data Analysis, and Machine Learning.

Chandrakala Arya, School of Computing, Graphic Era Hill University, Dehradun, 248001, India

Chandrakala Arya is currently working as an Assistant Professor in the School of Computing, Graphic Era Hill University, Dehradun. She completed her graduation at Kumaun University, Nainital, and her post-graduation at Uttarakhand Technical University, Dehradun. She received her Ph.D. from Babasaheb Bhimrao Ambedkar University (Central University), Lucknow. She has qualified UGC NET and GATE. She has published 30 research papers in reputed national and international conferences and journals. Her areas of specialization are Artificial Intelligence and Machine Learning.

Umesh Chandra, Department of Statistics & Computer Science, Banda University of Agriculture & Technology, Banda, 210001, India

Umesh Chandra is currently working as an Assistant Professor in the Department of Statistics & Computer Science, College of Agriculture, Banda University of Agriculture & Technology, Banda. He completed his graduation at Kumaun University, Nainital, and his post-graduation at Uttarakhand Technical University, Dehradun. He received his Ph.D. from IIT Roorkee. He has published 42 research papers in reputed national and international journals and 8 book chapters, and holds 7 copyrights. He is the author of one edited book. His areas of specialization are GIS, Artificial Intelligence, Big Data Analysis, and Machine Learning.

Gaurav Shukla, Department of Statistics & Computer Science, Banda University of Agriculture & Technology, Banda, 210001, India

Gaurav Shukla is currently working as an Assistant Professor in the Department of Statistics & Computer Science, College of Agriculture, Banda University of Agriculture & Technology, Banda. He completed his graduation, post-graduation, and Ph.D. at MJP Rohilkhand University, Bareilly. He has published 48 research papers in reputed national and international journals and 7 book chapters, and holds 1 copyright. He is the author of one book. His areas of specialization are Life Testing Models, Applied Statistics, and Machine Learning.
