dc.description.abstract |
Hate speech is offensive communication aimed at a particular group. Social media refers to
online mass media in which people share information, ideas, messages, and
other content. Social media platforms and online forums enable user interaction with
user-generated material, making them indispensable in everyday life. Individuals
must be protected from harmful behavior by enhanced surveillance and effective
policies. Hate speech is commonly characterized as "a deliberate act of assault
directed against a specific group to harm them due to specific characteristics of their
identity". This study addresses three research gaps. First, existing techniques for
identifying and classifying hate speech are insufficient, which highlights the need
for improved methods to address the evolving nature of hate speech. Second, existing
techniques have limited adaptability. Finally, established models struggle with complex
social media terminology. This study seeks to enhance English hate speech detection
using advanced deep learning techniques, building models with deep
neural networks and word embeddings. Our approach uses transformer-based models
with hyperparameter tuning and generative configurations to enhance precision and
efficacy. GPU acceleration is used for efficient model training and execution. This
research proposes methods such as replacing emojis with text descriptions and
removing special characters while retaining emojis to improve interpretation and
context preservation. Data is acquired from social media via APIs and data providers
before being preprocessed for noise removal, deduplication, normalization,
tokenization, stop word removal, and lemmatization. With text features designed for
analysis, the data is separated into training, validation, and testing sets. Numerical
representations are constructed utilizing TF-IDF and word embeddings, such as
Word2Vec and GloVe. Convolutional Neural Networks (CNNs) are used to detect
specific sentences and Long Short-Term Memory networks (LSTMs) to capture the
context. Both models are trained and optimized, and their performance is measured
using accuracy, precision, recall, and F1-score. NLTK (Natural Language Toolkit) is
a Python library well suited to developing NLP baseline models. The baseline model, a
Gradient Boosting Classifier, achieved an accuracy of 0.93, demonstrating strong
performance for a traditional machine learning technique. Future work will extend
this approach to hate speech identification in code-mixed languages. |
en_US |