Original scientific article

A ROBUST MACHINE LEARNING-BASED ENSEMBLE LEARNING FRAMEWORK FOR HATE SPEECH DETECTION IN LOW-RESOURCE SOCIAL MEDIA TEXT

By
Husnain Saleem

Gomal University, Dera Ismail Khan, Pakistan

Muhammad Javed

Gomal University, Dera Ismail Khan, Pakistan

Kiran Hanif

Gomal University, Dera Ismail Khan, Pakistan

Asad Ullah

Gomal University, Dera Ismail Khan, Pakistan

Muhammad Usman Ghani

Gomal University, Dera Ismail Khan, Pakistan

Muhammad Waqas

Gomal University, Dera Ismail Khan, Pakistan

Muhammad Ali Khan

Gomal University, Dera Ismail Khan, Pakistan

Sheraz Ali Hassan

Gomal University, Dera Ismail Khan, Pakistan

Abstract

This study identifies hate speech in low-resource social media text, specifically Urdu tweets, using a machine learning-based stacking ensemble. The dataset consisted of 8,800 tweets, half labeled Hateful and half labeled No-Hate. Preprocessing accounted for the characteristics of Urdu by normalizing characters, removing frequent words, and filtering punctuation. Features were extracted with TF-IDF over unigrams and bigrams, with the vocabulary restricted to 5,000 terms. Logistic Regression, Multinomial Naive Bayes, and a Support Vector Classifier served as the base learners, and Logistic Regression was used again as the meta-learner in the final layer of the ensemble. The data was split 80% for training and 20% for testing. Compared with baseline classifiers and ensemble approaches, including Random Forest, Gradient Boosting, AdaBoost, Bagging, Soft Voting, and Hard Voting, the proposed stacking ensemble achieved an accuracy of 86.53%, a precision of 85.45%, a recall of 86.96%, and an F1-score of 86.20%. The results indicate that a machine learning-based stacking ensemble is effective for identifying hate speech in Urdu tweets.
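The pipeline described in the abstract can be approximated with scikit-learn as in the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function names, the linear SVC kernel, the number of cross-validation folds, the random seed, and the label encoding (1 = Hateful, 0 = No-Hate) are all hypothetical, and the Urdu-specific preprocessing (character normalization, frequent-word removal, punctuation filtering) is assumed to have been applied to the input texts beforehand.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_stacking_pipeline(max_features=5000, cv=5, seed=42):
    """TF-IDF (unigrams + bigrams) followed by a stacking ensemble.

    Hyperparameters other than the n-gram range and the 5,000-term cap
    are assumptions; the abstract does not state them.
    """
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=max_features)
    base_learners = [
        ("lr", LogisticRegression(max_iter=1000)),
        ("mnb", MultinomialNB()),
        ("svc", SVC(kernel="linear", random_state=seed)),  # kernel is assumed
    ]
    # The meta-learner is trained on out-of-fold predictions of the base learners.
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=cv,
    )
    return make_pipeline(vectorizer, stack)


def evaluate(model, texts, labels, test_size=0.2, seed=42):
    """Fit on a stratified 80/20 split and report the four metrics."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary", pos_label=1, zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Calling `evaluate(build_stacking_pipeline(), texts, labels)` on a list of preprocessed tweets and their 0/1 labels returns the four scores as fractions. As a consistency check on the reported figures, the harmonic mean of the stated precision and recall, 2 × 0.8545 × 0.8696 / (0.8545 + 0.8696), is approximately 0.8620, which agrees with the reported F1-score of 86.20%.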


This is an open access article distributed under the Creative Commons Attribution Non-Commercial (CC BY-NC) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The statements, opinions and data contained in the journal are solely those of the individual authors and contributors and not of the publisher and the editor(s). We stay neutral with regard to jurisdictional claims in published maps and institutional affiliations.