Towards the Importance of the Type of Deep Neural Network and Employment of Pre-trained Word Vectors for Toxicity Detection: An Experimental Study

Abdullah Talha Kabakus

Department of Computer Engineering, Faculty of Engineering, Duzce University, Turkey

E-mail: talhakabakus@duzce.edu.tr

Received 18 February 2021; Accepted 30 July 2021; Publication 27 October 2021

Abstract

As a natural consequence of offering many advantages to their users, social media platforms have become a part of their daily lives. Recent studies emphasize the necessity of an automated way of detecting offensive posts in social media since these ‘toxic’ posts have become pervasive. To this end, a novel toxic post detection approach based on Deep Neural Networks was proposed within this study. Given that several word embedding methods exist, we shed light on which word embedding method produces better results when employed with the five most common types of deep neural networks, namely, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), GRU (Gated Recurrent Unit), Bidirectional Long Short-Term Memory (BiLSTM), and a combination of CNN and BiLSTM. To this end, the word vectors for the given comments were obtained through four different methods, namely, ( $i$ ) $W o r d 2 v e c$ , ( $i i$ ) $f a s t T e x t$ , ( $i i i$ ) $G l o V e$ , and ( $i v$ ) the $E m b e d d i n g$ layer of deep neural networks. Eventually, a total of twenty benchmark models were proposed and both trained and evaluated on a gold standard dataset which consists of $16 K$ tweets. According to the experimental result, the best $F 1 - s c o r e$ , $84.844 %$ , was obtained on the proposed CNN model without employing pre-trained word vectors which outperformed the state-of-the-art works and implies the effective embedding ability of CNNs. Other key findings obtained through the conducted experiments are that the models, that constructed word embeddings through the $E m b e d d i n g$ layers, obtained higher $F 1 - s c o r e$ s and converged much faster than the models that utilized pre-trained word vectors.

Keywords: Word embedding, word vector, deep neural network, convolutional neural network, recurrent neural network, toxic comment detection.

1 Introduction

Social media platforms have become the essential reason for using the web in the last decade as a recent report by the cooperation of We Are Social and Hootsuite highlights that almost $85 %$ of the web users ( $3.8$ billion of $4.5$ billion) use social media platforms and almost $60 %$ of the world’s population uses the web [1]. Social media platforms let users share their opinions on anything they want. Alongside sharing their own opinions, people tend to use these platforms to seek others’ opinions about anything they are interested in. During this bi-directional opinion sharing on these platforms, debates happen due to the ability to add comments. Cyberbullying has been defined by many national crime prevention councils as most studies reported that the prevalence rates of cyberbullying range from $10 %$ to $40 %$ of internet users [2]. The effects of cyberbullying vary from temporary anxiety to suicide [3]. When one of these sides uses ‘toxic comments’ as a result of using offensive language, unfortunately, virtual fights do happen which can lead people both to stop expressing themselves and to stop seeking others’ opinions out of fear of abuse through social media platforms [4] which eventually means less number of visitors and less money when the subscription models and ads are considered [5]. Also, legal reasons might require these platforms to propose countermeasures against the toxic comments such as actively monitoring all the available content on the platform and deleting when such comments are detected within $24$ hours [5]. Therefore, a mechanism for the detection of these toxic comments is a key necessity for social media platforms to let people use these platforms without any concerns. This detection mechanism needs to be ‘automated’ to not be time-consuming and labor-intensive when the fact that an enormous volume of content posted daily on the social media platforms was considered. Toxic comment detection is a challenging task. First of all, what makes a comment ‘toxic’ is quite subjective. Not all comments that contain swear words convey negative meanings as expected. One example of this case is that $78 %$ of the swearing posts in the Formspring,¹ a question-and-answer-based social network, are actually non-toxic comments [6]. Similarly, it is reported that $82 %$ of the toxic tweets in a Twitter toxic comment (tweet) dataset do not use any swear words [6]. Therefore, the presence of swear words is neither necessary nor sufficient to regard a comment as ‘toxic’ [6]. A toxic comment is generally classified into five classes as follows [5]: ( $i$ ) obscene language, ( $i i$ ) insult, ( $i i i$ ) threat, ( $i v$ ) identity hate, and ( $v$ ) severe toxic.

The first step of detecting toxic comments is understanding the content of these comments. Word embeddings are the vectorial representations of words that each vector consists of real numbers. This mapping lets creating more compact and expressive word presentations [7] that can be utilized with many machine learning algorithms. Word embedding has been widely used in Natural Language Processing (NLP) applications such as machine translation [8], conversational dialog systems, and sentiment classification [9]. Words can be represented as vectors or bag-of-words. The biggest advantage of utilizing vector representations, that are constructed through word embeddings, instead of the bag-of-words representation, is that vector representations are capable of capturing the semantic meaning of the text [10]. Various types of deep neural networks have demonstrated the state-of-the-art in many domains. In addition to various deep neural networks, there exist various widely used state-of-the-art word embedding methods in the literature. Therefore, the main motivation of this study is to shed light on which type of deep neural network and which word embedding method provide the best performance for toxic detection. To this end, in this study, a total of twenty models based on the five most commonly used types of deep neural networks, namely, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), GRU (Gated Recurrent Unit), Bidirectional LSTM (BiLSTM), and a combination of $C N N$ and $B i L S T M$ , were proposed with utilizing three state-of-the-art word embedding methods, namely, $W o r d 2 v e c$ [11, 12], $f a s t T e x t$ [13], and GloVe (Global Vectors for Word Representation) [14] for toxic detection on a gold standard dataset. The main contributions of this study are given as follows:

• A total of twenty models based on the five types of widely-used deep neural networks, namely, $C N N$ , $L S T M$ , $G R U$ , $B i L S T M$ , and a combination of $C N N$ and $B i L S T M$ , were experimented in order to reveal the performance differences of these deep neural networks for the toxic detection task.

• The effects of various word embedding methods on the proposed deep neural network models were investigated in order to shed light on which embedding method provides a better result for the toxic comment detection task when it is deployed with a deep neural network.

• Instead of utilizing handcrafted features, which are not robust against the variations of writing style, the features of the proposed models were automatically extracted thanks to the proposed deep neural network models.

• Instead of training the deep neural networks for a fixed number of epochs, the training process was dynamically managed thanks to an employed callback. To the best of our knowledge, this is the first study in the literature on toxic comment detection that employed this technique.

• The learning rate of each proposed deep neural network was adapted to the performance of the network thanks to an employed callback. To the best of our knowledge, this is the first study in the literature on toxic comment detection that employed this technique.

2 Related Work

Davidson et al. [15] employed a variety of traditional machine learning algorithms, namely, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, and Linear SVM, for hate speech detection. To this end, they utilized the Twitter API to construct their own dataset to evaluate the employed models. According to the experimental result, the best performing model, which was based on Logistic Regression with L2 regularization, achieved an overall $p r e c i s i o n$ of $0.91$ , a $r e c a l l$ of $0.90$ , and an $F 1 - s c o r e$ of $0.90$ on the constructed dataset. They also reported that both Logistic Regression and Linear SVM significantly better performance than the other models.

Zhang et al. [16] proposed an approach for toxic detection that was based on a deep neural network that was a combination of Convolutional and GRU networks. According to the conducted experiments, the best $F 1 - s c o r e$ , $0.94$ , of the proposed model was obtained on the dataset constructed by Davidson et al. [15]. Another finding of this study was that the proposed model outperformed the other SVM and CNN models. Similar to our model, they implemented their model by using Keras. But unlike our implementation, they employed Theano as the backend of Keras instead of TensorFlow.

Badjatiya et al. [17] investigated hate speech detection on Twitter. To this end, they employed three methods, namely, ( $i$ ) CNN, ( $i i$ ) LSTM, and ( $i i i$ ) fastText. When it comes to word embedding, they utilized two methods to initialize the word embeddings, namely, ( $i$ ) random embeddings, and ( $i i$ ) GloVe embeddings. According to the experimental result, the proposed LSTM model with Gradient Boosted Decision Trees on the randomly initialized embeddings achieved the best performance on a Twitter dataset [18] with providing a $p r e c i s i o n$ , $r e c a l l$ , and $F 1 - s c o r e$ of $0.73$ . It is worth mentioning that, as the authors mentioned in the paper, the random embedding method surprisingly yielded a better performance for their model. The methodological issue in this study is that they extracted the features by utilizing the complete labeled dataset which leads to an artificial increase in the performance of the model while underestimating the generalization error as a result of overfitting [19]. This issue was discussed in detail by Arango et al. [19] and they reported that replication of this model obtained a macro averaged F1-score of $73.1 %$ when this methodological issue was taken into account. They trained the models for a fixed number ( $10$ ) of epochs. Unlike this training preference, the number of epochs the model was trained for was dynamically set for the proposed model which means that the training was stopped when the model’s performance has started decreasing. Another difference in terms of model training is that the learning rate of the proposed model was dynamically adapted through the performance of the model in the last epochs.

Zampieri et al. [20] proposed an approach to detect both the toxic language and the type of toxic in social media. To this end, they experimented on three methods, namely, ( $i$ ) SVM, ( $i i$ ) BiLSTM, and ( $i i i$ ) CNN. In order to evaluate the proposed models based on the aforementioned methods, they complied a new Twitter dataset, namely, OLID (Offensive Language Identification Dataset) which consisted of a total of $14, 100$ tweets. According to the experimental result, the best $M a c r o - F 1$ $s c o r e$ , $0.80$ , was obtained by the proposed CNN model for the toxic language detection task.

Park and Fung [21] proposed three methods based on CNN to classify sexist and racist abusive language, namely, ( $i$ ) CharCNN which is a character-level CNN, ( $i i$ ) WordCNN which segments the input sentences into words and converted into $300$ -dimensional embeddings, and ( $i i i$ ) HybridCNN which is a combination of CharCNN, and WordCNN that utilizes both character and word inputs. They utilized the Twitter datasets provided by Waseem and Hovy [18] and concatenated them into a new one. Then they divided this concatenated dataset into three datasets as follows: ( $i$ ) “One-step” dataset that aims multi-class segmentation for the labels “ $n o n e$ ”, “ $r a c i s m$ ”, and “ $s e x i s m$ ”, ( $i i$ ) “Two-step-1” dataset that merges the “ $s e x i s m$ ”, and “ $r a c i s m$ ” labels into one abusive label and aims abusive language detection, and ( $i i i$ ) “Two-step-2” dataset that consists of the labels “ $n o n e$ ”, and “ $a b u s i v e$ ” in order to experiment a classifier to distinguish “ $s e x i s m$ ” and “ $r a c i s m$ ”. According to the experimental result, the proposed HybridCNN achieved an $F 1 - s c o r e$ of $0.827$ on the “One-step” dataset and the Logistic Regression achieved an $F 1 - s c o r e$ of $0.824$ on the “Two-step” dataset.

Twitter is not the only social media platform that has been investigated for the toxic detection task. Van Hee et al. [22] utilized the ASKfm, an online question-and-answer platform, to construct a dataset. The constructed dataset comprised $113, 698$ manually annotated posts written in English. They proposed a model based on Linear SVM and evaluated its performance on their own dataset. The proposed model was tuned thanks to the Grid Search, a widely-used tuning method to reveal the optimum values of hyper-parameters. According to the experimental result, the proposed model achieved an $F 1 - s c o r e$ of $0.6426$ . In addition to the ASKfm, Wikipedia was also utilized for toxic detection task in the literature. Chakrabarty [23] proposed an approach based on a $6$ -headed machine learning TF-IDF (Term Frequency-Inverse Document Frequency) model. This proposed model was evaluated on a labeled Wikipedia comment dataset and the model achieved a $M a c r o - F 1 s c o r e$ of $0.9817$ . Similar to the ASKfm, Formspring is another online question-answer platform that was utilized for the toxic detection task. Reynolds et al. [24] constructed a dataset by crowd-sourcing the Formspring platform which consisted of $3, 915$ posts. In order to label these posts, the authors utilized the Amazon Mechanical Turk service, an online service to set labels for the given datasets. They employed a C4.5 decision tree and an instance-based learner for toxic detection on the constructed dataset. According to the experimental result, both learners achieved an $a c c u r a c y$ of $78.5 %$ for the true positives. Agrawal and Awekar [6] proposed deep neural networks, which were similar to the models proposed by Badjatiya et al. [17], for detecting toxic comments on Twitter, Formspring, and Wikipedia. They experimented with four types of deep neural networks, namely, ( $i$ ) CNN, ( $i i$ ) LSTM, ( $i i i$ ) BiLSTM, and ( $i v$ ) BiLSTM with an attention mechanism. They evaluated the proposed models on the same dataset of Badjatiya et al. [17] but they oversampled the data from the “ $h a t e$ ” class after observing that only a few posts were marked as “ $h a t e$ ” for their training set. The methodological issue here is that they oversampled the dataset before the train-test split which is a reason for the ‘overfitting’ problem [19]. This issue was discussed in detail by Arango et al. [19] and they reported that replication of this model obtained a macro averaged F1-score of $79.6 %$ when this methodological issue was taken into account. Similar to Badjatiya et al. [17], the models within this study were trained for $10$ epochs.

3 Material and Method

In this section, first, the employed word embedding models are briefly described. Second, the proposed deep neural network models for toxic detection are described. Finally, the gold standard dataset that was used by the proposed models is described.

3.1 Word Embedding Methods

In the following subsections, the three employed word embedding methods, namely, ( $i$ ) $W o r d 2 V e c$ , ( $i i$ ) $f a s t T e x t$ , and ( $i i i$ ) $G l o V e$ , are briefly described. The word embedding models based on these methods were implemented using a widely used Python library, namely, gensim [25].

3.1.1 Word2Vec

$W o r d 2 v e c$ is a state-of-the-art technique to efficiently create word embeddings from a given text, which is generally a large corpus of text. $W o r d 2 v e c$ was proposed in 2013 by a team whose members work at Google.

3.1.2 GloVe

$G l o V e$ is a global log-bilinear regression model for unsupervised learning of word representations that aims to represent words as numerical sequences that are trained on aggregated global word-word co-occurrence statistics from a corpus. $G l o V e$ was proposed in 2014 and was developed as an open-source project at Stanford University.

3.1.3 fastText

$f a s t T e x t$ is another widely used word embedding method that is based on $W o r d 2 v e c$ and developed by Facebook AI Research (FAIR) lab. The initial version of $f a s t T e x t$ was released in 2015. The advantages of $f a s t T e x t$ are listed as follows: ( $i$ ) It is robust for rare words, ( $i i$ ) it is able to group inflected words, and ( $i i i$ ) it is fast to compute [26].

An overview of the employed word embedding methods within this study is listed in Table 1. In order to compare the performance of the word embedding methods, their pre-trained models, which were trained on the same corpus, namely, Common Crawl, were employed. Alongside these methods, as an additional embedding method, the pre-trained word vectors were not given to the proposed deep neural networks in order to reveal the efficiency of deep neural networks to create word vectors from the given textual data.

Table 1 An overview of the employed word embedding methods

Model	Embedding Method	Vector Size	Vocabulary Size	Corpus
$E_{1}$	$W o r d 2 v e c$	$300$	$2.8 M$	Common Crawl
$E_{2}$	$f a s t T e x t$	$300$	$1 M$	Common Crawl
$E_{3}$	$G l o V e$	$300$	$1.2 M$	Common Crawl

3.2 Proposed Deep Neural Network Models

A total of twenty models, which were based on the five of the most popular deep neural networks, namely, ( $i$ ) $C N N$ , ( $i i$ ) $L S T M$ , ( $i i i$ ) $G R U$ , ( $i v$ ) $B i L S T M$ [27], and ( $v$ ) a combination of $C N N$ and $B i L S T M$ , were proposed to reveal the efficiency of employing the pre-trained word vectors instead of the word embedding ability of deep neural networks. In the following subsections, the proposed deep neural network models are described. The proposed models were implemented using Keras [28], which is a Python library and acts as a wrapper for various backends. TensorFlow [29], an open-source machine learning library developed by Google, was employed as the backend of Keras since it is the recommended one [30] by the developer of Keras, who also works as a software engineer at Google. The proposed models were experimentally designed as there are no clear rules that can be applied to any kind of deep neural networks except the best practices, which were strictly followed. An overview of the proposed model is presented in Figure 1. The pseudocode of the proposed model is listed in Table 2.

Figure 1 An overview of the proposed model.

Table 2 The pseudocode of the proposed model

Preprocess data

Load all tokens into a list and assign to

t

Calculate maximum sequence length of

t

and assign to

m a x_l e n

Construct word embeddings according to

m a x_l e n

Construct deep neural network model

Define patience and monitored criterion as

p

, and

c r i t

Train deep neural network

Set learning rate to 0.001 and assign to

l r

Factor

l r

by 0.1 when

c r i t

is not improved

Stop training when

c r i t

is not improved for

p

Evaluate deep neural network

Figure 2 A plot of the distribution of the lengths of the tweets ( $d o c_l e n$ ).

3.2.1 Proposed CNN model

The proposed $C N N$ model consisted of $6$ layers. The model starts with an $E m b e d d i n g$ layer to utilize or construct the word embeddings for the given comments. The input length of the $E m b e d d i n g$ layer (denoted with $m a x_l e n$ ), $22$ , was determined by the sum of the mean of the length of each tweet (denoted with $d o c_l e n$ ) with the standard deviation (denoted with $s t d$ ) of the $d o c_l e n$ as the corresponding equation is given in Equation (1). A plot of the distribution of the lengths of tweets ( $d o c_l e n$ ) is presented in Figure 2. It is worth mentioning that each $E m b e d d i n g$ layer in the proposed benchmark models was configured similarly.

m a x_l e n = m e a n (d o c_l e n) + s t d (d o c_l e n)

(1)

After the $E m b e d d i n g$ layer, a convolutional layer (denoted with $C o n v$ ) was employed to perform convolution operations on the input. After that, a $M a x P o o l i n g$ layer, which is responsible for reducing the dimensionality of the input by applying the maximum function on it, was employed. After the $C o n v$ layer, a $G l o b a l M a x P o o l i n g$ layer was employed which sets the pool size to the input size to apply the maximum function to the entire input. Following this, a $D r o p o u t$ layer, which is a widely used regularization technique that randomly drops neurons from the neural network to prevent the ‘overfitting’ problem [31], one of the biggest challenges of deep neural networks as a result of increased depth and complexity of deep neural networks [32, 33], was employed. Finally, a $D e n s e$ layer, which is a fully-connected neural network component that is responsible for classification, was employed. This $D e n s e$ layer was configured with $3$ outputs units since the aim of the model is a trinary classification. The $R e c t i f i e d L i n e a r U n i t (R e L U)$ [34] was employed as the activation function of the $C o n v$ layer. The activation function of the $D e n s e$ layer was set to the $s o f t m a x$ since the aim of the model is to make a trinary prediction for each given tweet. $L 2 r e g u l a r i z a t i o n$ was employed with this $C N N$ layer to the kernel weights matrix, bias vector, and the output of the layer to further prevent the overfitting problem in addition to the employed $D r o p o u t$ layer [35]. The layers of the proposed $C N N$ model with their configuration parameters are listed in Table 3.

Table 3 The layers of the proposed $C N N$ model with their configuration parameters

Layer	Configuration Parameters
$E m b e d d i n g$	– $i n p u t d i m e n s i o n$ : Total number of words in the vocabulary – $o u t p u t d i m e n s i o n$ : $200$ – $i n p u t l e n g t h$ : Maximum length of the comments
$D r o p o u t$	– $d r o p o u t_r a t e$ : $0.8$
$C o n v$	– $n u m b e r o f f i l t e r s$ : $64$ – $k e r n e l s i z e$ : $5$ – $a c t i v a t i o n f u n c t i o n$ : $R e L U$ – $p a d d i n g$ : $` s a m e^{'}$
$G l o b a l M a x P o o l i n g$	$-$
$D r o p o u t$	– $d r o p o u t_r a t e$ : $0.5$
$D e n s e$	– $n u m b e r o f u n i t s$ : $3$ – $a c t i v a t i o n f u n c t i o n$ : $s o f t m a x$

3.2.2 Proposed LSTM Model

An $L S T M$ network is a special type of Recurrent Neural Network (RNN) that is capable of learning long-term dependencies thanks to having a long-term memory, and forget gates which lets the model overcome the general problem of gradient descent [36]. The proposed $L S T M$ model consisted of $5$ layers. Similar to the proposed $C N N$ model, the $L S T M$ model starts with an $E m b e d d i n g$ layer to utilize the word embeddings for the given comments. After that, a $D r o p o u t$ layer was employed to prevent the overfitting. Then, an $L S T M$ layer with $50$ output units was employed. After the $L S T M$ layer, another $D r o p o u t$ layer was employed. Finally, a $D e n s e$ layer was employed for the classification. Similar to the proposed $C N N$ model, the final $D e n s e$ layer was configured with $3$ output units. The layers of the proposed $L S T M$ model with their configuration parameters are listed in Table 4.

Table 4 The layers of the proposed $L S T M$ model with their configuration parameters

Layer	Configuration Parameters
$E m b e d d i n g$	– $i n p u t d i m e n s i o n$ : Total number of words in the vocabulary – $o u t p u t d i m e n s i o n$ : $200$ – $i n p u t l e n g t h$ : Maximum length of the comments
$D r o p o u t$	– $d r o p o u t_r a t e$ : $0.25$
$L S T M$	– $n u m b e r o f u n i t s$ : $50$
$D r o p o u t$	– $d r o p o u t_r a t e$ : $0.5$
$D e n s e$	– $n u m b e r o f u n i t s$ : $3$ – $a c t i v a t i o n f u n c t i o n$ : $s o f t m a x$

3.2.3 Proposed GRU model

A $G R U$ is also another special type of $R N N$ . The key difference between a $G R U$ and an $L S T M$ is that while an $L S T M$ has three gates, namely, ( $i$ ) input, ( $i i$ ) output, and ( $i i i$ ) forget gates, a $G R U$ has two gates, namely, ( $i$ ) reset, and ( $i i$ ) update gates. This architectural difference makes $G R U$ simpler, faster to train, and better on generalization on small data compared to $L S T M$ [16]. So, a new benchmark model was constructed solely based on this architectural difference from the constructed $L S T M$ model. Therefore, the $L S T M$ layer in the previous $L S T M$ model was replaced with a $G R U$ layer for this new benchmark model. The layers of the proposed $G R U$ model with their configuration parameters are listed in Table 5.

Table 5 The layers of the proposed $G R U$ model with their configuration parameters

Layer	Configuration Parameters
$E m b e d d i n g$	– $i n p u t d i m e n s i o n$ : Total number of words in the vocabulary – $o u t p u t d i m e n s i o n$ : $200$ – $i n p u t l e n g t h$ : Maximum length of the comments
$D r o p o u t$	– $d r o p o u t_r a t e$ : $0.25$
$G R U$	– $n u m b e r o f u n i t s$ : $50$
$D r o p o u t$	– $d r o p o u t_r a t e$ : $0.5$
$D e n s e$	– $n u m b e r o f u n i t s$ : $3$ – $a c t i v a t i o n f u n c t i o n$ : $s o f t m a x$

3.2.4 Proposed BiLSTM model

The proposed $B i d i r e c t i o n a l L S T M (B i L S T M)$ model is very similar to the proposed $L S T M$ model as it consisted of $5$ layers. The only difference between these two models is that, unlike the proposed $L S T M$ model, the $L S T M$ layers in the $B i L S T M$ model are bidirectional which comes from the idea of presenting each training sequence forward and backward to two separate $R N N$ s, both of which are connected to the same output layer [37]. In theory, this architectural modification lets capturing richer contextual information instead of using high-order factorization [38]. The layers of the proposed $B i L S T M$ model with their configuration parameters are listed in Table 6.

Table 6 The layers of the proposed $B i L S T M$ model with their configuration parameters

Layer	Configuration Parameters
$E m b e d d i n g$	– $i n p u t d i m e n s i o n$ : Total number of words in the vocabulary – $o u t p u t d i m e n s i o n$ : $200$ – $i n p u t l e n g t h$ : Maximum length of the comments
$D r o p o u t$	– $d r o p o u t_r a t e$ : $0.25$
$B i L S T M$	– $n u m b e r o f u n i t s$ : $50$
$D r o p o u t$	– $d r o p o u t_r a t e$ : $0.5$
$D e n s e$	– $n u m b e r o f u n i t s$ : $3$ – $a c t i v a t i o n f u n c t i o n$ : $s o f t m a x$

3.2.5 Proposed CNN $+$ BiLSTM model

The proposed $C N N + B i L S T M$ model is a sequential model that consists of a $C N N$ and a $B i L S T M$ for encoding, and decoding, respectively. This model is a typical ‘encoder-decoder’ network. The layers of the proposed $C N N + B i L S T M$ model with their configuration parameters are listed in Table 7.

Table 7 The layers of the proposed $C N N + B i L S T M$ model with their configuration parameters

Layer	Configuration Parameters
$E m b e d d i n g$	– $i n p u t d i m e n s i o n$ : Total number of words in the vocabulary – $o u t p u t d i m e n s i o n$ : $200$ – $i n p u t l e n g t h$ : Maximum length of the comments
$D r o p o u t$	– $d r o p o u t_r a t e$ : $0.8$
$C o n v$	– $n u m b e r o f f i l t e r s$ : $64$ – $k e r n e l s i z e$ : $5$ – $a c t i v a t i o n f u n c t i o n$ : $R e L U$ – $p a d d i n g$ : $` s a m e^{'}$
$B i L S T M$	– $n u m b e r o f u n i t s$ : $50$
$D r o p o u t$	– $d r o p o u t_r a t e$ : $0.5$
$D e n s e$	– $n u m b e r o f u n i t s$ : $3$ – $a c t i v a t i o n f u n c t i o n$ : $s o f t m a x$

3.3 Training Configuration

Since one of the aims of the proposed study is to shed light on the performance differences between the types of widely-used deep neural networks, all of the proposed deep neural network models were trained under the same configuration as follows: The Adaptive Moment Estimation (Adam) [39], which was proposed as an extension to the Stochastic Gradient Descent (SGD) [40], was employed as the optimization algorithm, which is responsible for updating the weights of the layers on the basis of the calculated loss. Another optimization algorithm, the Root Mean Square Propagation (RMSprop) has achieved a slightly worse performance than $A d a m$ . The Categorical Cross-Entropy function was employed as the loss function since our task is a multi-class classification problem. A loss function in a deep neural network is responsible for calculating the loss of the model for each iteration to let the model reduce the loss on the next iteration. The metric that was evaluated by the model was defined as $a c c u r a c y$ . $20 %$ of the training set was used as the validation set. The $b a t c h s i z e$ , which is the number of samples utilized in one iteration, was set to $128$ . The training was started with the $E a r l y S t o p p i n g$ callback, which is responsible for stopping the training when the model’s monitored metric has stopped improving for the pre-defined number of epochs (a.k.a. patience). As a result of this mechanism, the $E a r l y S t o p p i n g$ callback helps to prevent the overfitting problem [41–43]. The monitored metric and patience of the employed $E a r l y S t o p p i n g$ callback were defined as the $v a l i d a t i o n l o s s$ (the loss calculated for the validation set), and $5$ , respectively. Another callback that was employed is the callback of reducing the learning rate of the model when the monitored metric has stopped improving. For this callback, similar to the $E a r l y S t o p p i n g$ callback, the $v a l i d a t i o n l o s s$ was monitored and the learning rate of the model was factored by $0.1$ when the $v a l i d a t i o n l o s s$ has not decreased for $2$ epochs. The lower bound of the learning rate was defined as $e^{- 6}$ . To the best of our knowledge, this is the first study in the toxic comment detection literature that employs these callbacks during the training of the deep neural network. The hyper-parameters of the employed $A d a m$ algorithm were set after the model optimization (described in Section 3.6) and are listed in Table 8.

Table 8 The hyper-parameters of the employed $A d a m$ algorithm

Hyper-parameter	Value
Learning rate	$0.001$
Beta 1	$0.9$
Beta 2	$0.999$
Epsilon	$e^{- 8}$
Decay rate	$0.00001$

3.4 Data Preparation

The proposed models were both trained and evaluated on a gold standard toxic tweets dataset which was proposed by Waseem and Hovy [18] and consisted of $16 K$ tweets. The dataset contains a label that defines the toxicity type alongside the Id (Identifier) of the tweet defined by Twitter for each tweet. Since the Twitter terms prohibit sharing the texts of tweets, the dataset contains the Ids of tweets instead of their texts. Thanks to the Tweepy [44], an open-source Python library for accessing the Twitter API, the texts of the tweets were retrieved through their Ids. This dataset was annotated by crowd-sourcing the tweets of more than $600$ Twitter users. The labels that were defined for the tweets in the dataset are “ $r a c i s m$ ”, “ $s e x i s m$ ”, and “ $n e i t h e r$ ”. The proposed model within this study is a multi-class model that aims to classify the given samples into the given three labels, namely, ( $i$ ) “ $r a c i s m$ ”, ( $i i$ ) “ $s e x i s m$ ”, and ( $i i i$ ) the “ $n e i t h e r$ ”. The preprocessing phase, whose effect on the performance of neural networks is proven [45], contains the following operations to prepare data to be ready to yield into the proposed deep neural network:

• The tweets were lowercased to regard any forms of a word in terms of the case (e.g., uppercase, lowercase, sentence case, etc.) the same.

• The meaningless parts of tweets in terms of toxicity detection, namely, weblinks, usernames, and mentions were cleared from the tweets.

• The punctuation marks and whitespace characters were cleared from the tweets.

• The prefix of declaring that a tweet is a retweet of another one, the “ $R T$ ”, was cleared from the tweets.

• The stop words, which are the most common words in a language, were cleared from the tweets in order to focus on more important, meaningful words for the aim of the model. The stop words of English were obtained through a widely used Python library for NLP, namely, Natural Language Toolkit (NLTK) [46].

• The $173$ tweets, whose texts were completely removed (became empty Strings) after the aforementioned preprocessing operations, were dropped from the dataset.

The utilized $T w i t t e r$ dataset ( $D$ ) does not provide separated training and test sets. Therefore, the dataset was split into subsets as follows: $70 %$ of it ( $11, 834$ tweets), and $30 %$ of it ( $5, 073$ tweets) were utilized as the training set, and the test set, respectively. An overview of the utilized dataset is given in Table 9.

Table 9 An overview of the utilized $T w i t t e r$ dataset

Dataset	Number of Samples Labeled with “Racism”	Number of Samples Labeled with “Sexism”	Number of Samples Labeled with “Neither”
$T w i t t e r$ ( $D$ )	$1, 976$	$3, 430$	$11, 501$

3.5 Model Optimization

There are no standard guidelines for the construction of deep neural networks. Deep neural networks contain many hyper-parameters, which are parameters whose optimal values are revealed empirically based on the performance evaluation of deep neural networks. This process is known as the ‘hyper-parameter optimization’ task and is very determinative on the performance of the trained network since hyper-parameters of deep neural networks significantly affect the training process. Therefore, a wide range of hyper-parameters was empirically experimented with in an automated manner to reveal their optimal values as they are listed in Table 10. To this end, an open-source Python library that was specifically designed for Keras and was recommended by the developer of Keras [30] for the hyper-parameter optimization task, namely, Hyperas [47], was employed. The proposed models, which are described in the previous subsections, were optimized according to the experimental result of hyper-parameter optimization.

Table 10 The experimented values of hyper-parameters during the hyper-parameter optimization of the proposed models

Hyper-parameter	Experimented Values/Algorithms
Activation function	$R e L U$ , $t a n h$ , $s i g m o i d$ , $s o f t m a x$
Optimization algorithm	$A d a m$ , $S G D$ , $R M S p r o p$ , $A d a d e l t a$
Loss function	$B i n a r y C r o s s - E n t r o p y$ , $C a t e g o r i c a l C r o s s - E n t r o p y$ , $S p a r s e C a t e g o r i c a l C r o s s - E n t r o p y$ , $P o i s s o n$ , $K L D i v e r g e n c e$
Learning rate	$e^{- 6}, e^{- 5}, e^{- 4}, e^{- 3}, e^{- 2}$
Momentum	$0.0$ , $0.9$ , $0.99$
Decay rate	$0.00001$ , $0.0001$ , $0.001$ , $0.01$ , $0.1$
Dropout rate	$0.25$ , $0.5$ , $0.6$ , $0.7$ , $0.8$
Kernel size	$3$ , $5$ , $7$
Number of filters	$8$ , $16$ , $32$ , $64$ , $128$ , $256$
Number of units in $L S T M$ layers	$8$ , $16$ , $32$ , $50$ , $64$ , $100$ , $128$ , $256$
Number of units in the $D e n s e$ layers (except for the last $D e n s e$ layer which was fixed to $3$ )	$32$ , $64$ , $128$ , $256$
Batch size	$64$ , $128$ , $256$
Number of $C o n v$ layers	$1$ , $2$ , $3$
Number of $L S T M$ / $G R U$ / $B i L S T M$ layers	$1$ , $2$ , $3$

4 Experimental Result and Discussion

All the experiments that were conducted within this study were carried out on the Colaboratory (a.k.a. Google Colab) [48], which is a platform maintained by Google that provides free powerful GPUs alongside permanent (through the provided integration with the Google Drive service) and temporary storages. Google Colab comes with many useful Python data science libraries including but not limited to TensorFlow, Keras, scikit-learn, NumPy, and Pandas on a Linux server operating system. NLTK is one of these built-in libraries of Google Colab and it was utilized during the data preprocessing phase as is described in detail in the previous section. In addition to these advantages, it is possible to execute the Linux Command-Line Interface (CLI) commands on the provided machine which lets us download the necessary resources (e.g., the pre-trained word vectors for the utilized word embedding methods, the utilized dataset, the NLTK resources for English, etc.).

The motivations behind the conducted experiments were twofold: First, we shed light on the performance differences between the types of the most commonly used deep neural networks. Second, we shed light on the performance differences between various widely-used word embedding methods. To this end, a total of twenty benchmark models were constructed. The proposed models were both trained and evaluated on the utilized $T w i t t e r$ dataset ( $D$ ), which is described in the previous section. The confusion matrix is the de facto standard of measuring the efficiency of classifiers and defines the $a c c u r a c y$ , $p r e c i s i o n$ , $r e c a l l$ , and $F 1 - s c o r e$ evaluation metrics. Since the detection of the toxicity level is a classification problem, the confusion matrix was employed to measure the efficiency of the proposed model. While $` P o s i t i v e^{'} (P)$ corresponds to the target class of classifiers, $` N e g a t i v e^{'} (N)$ corresponds to the opposite class of the $P$ . $P r e c i s i o n$ is the ratio of correctly classified $p o s i t i v e s$ ( $T P$ ) to all $p o s i t i v e$ predictions ( $P$ ). $R e c a l l$ (a.k.a. $T r u e P o s i t i v e R a t e$ ) is the ratio of correctly classified $p o s i t i v e s$ ( $T P$ ) to all (actual) $p o s i t i v e$ samples. $A c c u r a c y$ is the ratio of correct predictions to all predictions. $P r e c i s i o n$ and $r e c a l l$ are often in tension since the improvement of $p r e c i s i o n$ typically reduces the $r e c a l l$ . $F 1 - s c o r e$ is the harmonic mean of $p r e c i s i o n$ and $r e c a l l$ and takes both $F a l s e P o s i t i v e (F P)$ and $F a l s e N e g a t i v e (F N)$ scores into account. The higher the $F 1 - s c o r e$ , the better is the result. $` T r u e^{'} (T)$ , and $` F a l s e^{'} (F)$ correspond to correctly classified samples, and incorrectly classified samples, respectively. The formulas of these evaluation metrics are given in the following equations:

$Accuracy = (TP + TN) / (T + F)$	(2)
$Precision = TP / (TP + FP)$	(3)
$Recall = TP / (TP + FN)$	(4)
$F1 - score = (2 * Precision * Recall) / (Precision + Recall)$	(5)

Each experiment within this study was repeated $10$ times and the final scores are the averages of the scores obtained in these $10$ trials. According to the experimental result of the evaluation of the constructed benchmark models on the $T w i t t e r$ dataset ( $D$ ) as it is listed in Table 11, the benchmark model $M_{4}$ , which was based on the proposed $C N N$ model without utilizing pre-trained word vectors, achieved a weighted $F 1 - s c o r e$ of $84.844 %$ , which was the highest weighted $F 1 - s c o r e$ achieved amongst the constructed twenty benchmark models and an accuracy of $84.861 %$ . The most mismatched predictions were observed as follows: $24.42 %$ of the tweets labeled as “ $s e x i s m$ ” were predicted as “ $n o n e$ ”, and $21.84 %$ of the tweets labeled as “ $r a c i s m$ ” were predicted as “ $n o n e$ ”. The most accurately predicted class was found as $` ` n o n e^{''}$ as the precision of this class was calculated as high as $89.05 %$ . The confusion matrix of this evaluation is presented in Figure 3. Other key findings obtained through the conducted experiments are that the models, that constructed word embeddings through the $E m b e d d i n g$ layers, obtained higher $F 1 - s c o r e$ s and converged much faster than the models that utilized pre-trained word vectors.

Table 11 The performance comparison of the constructed benchmark models on the $T w i t t e r$ dataset ( $D$ ) (best in bold)

	Type of Deep	Pre-trained	Number	Weighted
Model	Neural Network	Word Vectors	of Epochs	F1-score (%)
$M_{1}$	$C N N$	$E_{1}$	$84$	$71.742$
$M_{2}$	$C N N$	$E_{2}$	$76$	$72.755$
$M_{3}$	$C N N$	$E_{3}$	$49$	$66.717$
$M_{4}$	$C N N$	$-$	$50$	$84.844$
$M_{5}$	$L S T M$	$E_{1}$	$153$	$71.109$
$M_{6}$	$L S T M$	$E_{2}$	$285$	$68.289$
$M_{7}$	$L S T M$	$E_{3}$	$89$	$68.883$
$M_{8}$	$L S T M$	$-$	$17$	$83.121$
$M_{9}$	$G R U$	$E_{1}$	$104$	$74.140$
$M_{10}$	$G R U$	$E_{2}$	$108$	$75.413$
$M_{11}$	$G R U$	$E_{3}$	$145$	$75.230$
$M_{12}$	$G R U$	$-$	$18$	$81.295$
$M_{13}$	$B i L S T M$	$E_{1}$	$68$	$73.204$
$M_{14}$	$B i L S T M$	$E_{2}$	$389$	$71.347$
$M_{15}$	$B i L S T M$	$E_{3}$	$162$	$74.035$
$M_{16}$	$B i L S T M$	$-$	$14$	$82.621$
$M_{17}$	$C N N + B i L S T M$	$E_{1}$	$31$	$59.271$
$M_{18}$	$C N N + B i L S T M$	$E_{2}$	$11$	$55.312$
$M_{19}$	$C N N + B i L S T M$	$E_{3}$	$33$	$58.693$
$M_{20}$	$C N N + B i L S T M$	$-$	$24$	$83.984$

Figure 3 The confusion matrix of the evaluation of the best benchmark model, namely, $M_{4}$ .

As some of them are described in Section 2, there exist various toxic detection studies that utilize different datasets based on different platforms. To benchmark the performance of the proposed model, the related works, that utilized the same dataset as the proposed model, were included in the comparison. As the comparison is listed in Table 12, the proposed model outperformed the related state-of-the-art studies. The plot of the comparison of the related work in terms of obtained classification score is presented in Figure 4.

5 Conclusion

Table 12 The comparison of the proposed work with the related work that utilized the same dataset ( $D$ ) as the proposed model

Related Work	Target Classes	Evaluation Metric	Metric Score
Badjatiya et al. [17]	$r a c i s m$ , $s e x i s m$ , and $n o n e$	Macro averaged $F 1 - s c o r e$	0.7310²
Park and Fung [21]	$r a c i s m$ , $s e x i s m$ , and $n o n e$	$F 1 - s c o r e$	$0.8270$
	A combination of $a b u s i v e - n o n e$ , and $r a c i s m - s e x i s m$		$0.8240$
Zhang and Luo [49]	$r a c i s m$ , $s e x i s m$ , and $n o n e$	$F 1 - s c o r e$	$0.8300$
Waseem and Hovy [18]	$r a c i s m$ , $s e x i s m$ , and $n o n e$	$F 1 - s c o r e$	$0.7393$
Agrawal and Awekar [6]	$r a c i s m$ , $s e x i s m$ , and $n o n e$	Macro averaged $F 1 - s c o r e$	$0.7960$ $^{2}$
Proposed model	$r a c i s m, s e x i s m$ , $a n d n o n e$	Weighted $F1 - s c o r e$	0.8484

Figure 4 The plot of the comparison of the related work in terms of obtained classification score.

In this study, a novel toxic detection approach based on DNNs was proposed and its efficiency was proven thanks to the conducted experiments. To reveal the most efficient deep neural network model, five widely-used DNNs architectures, namely, ( $i$ ) $C N N$ , ( $i i$ ) $L S T M$ , ( $i i i$ ) $G R U$ , ( $i v$ ) $B i L S T M$ , and ( $v$ ) a combination of $C N N$ and $B i L S T M$ , were investigated. Word embeddings play a critical role in the performance of machine learning models. Therefore, in addition to these DNNs, three widely-used word embedding methods, namely, ( $i$ ) $W o r d 2 V e c$ , ( $i i$ ) $f a s t T e x t$ , and ( $i i i$ ) GloVe, were integrated into the proposed DNNs. As a result of this experimental setup, a total of twenty benchmark models were constructed. These models were evaluated on a gold standard dataset to reveal their efficiency in terms of the detection of toxic posts on social media. According to the experimental result, the proposed $C N N$ model without employing pre-trained word vectors obtained an $F 1 - s c o r e$ of $84.844 %$ , which outperformed the related state-of-the-art studies. Other key findings obtained through the conducted experiments are that the models, that constructed word embeddings through the $E m b e d d i n g$ layers, obtained higher $F 1 - s c o r e$ s and converged much faster than the models that utilized pre-trained word vectors. As future work, we would like to cover fine-grained toxicity categories. Also, the proposed novel toxic post detection approach can be applied to posts in other languages. Finally, the networks of the senders of the posts in social media can be investigated to reveal its effect on toxicity detection.

Notes

¹Formspring was later rebranded as Spring.me.

²According to the paper of Arango et al. [19] which takes the methodological issue of this study into account as is briefly discussed in Section 2. Please refer to their paper for more detail regarding this issue.

Acknowledgment

The author would like to thank Waseem and Hovy for sharing their valuable dataset, and Google for providing free resources through the Colab platform.

Funding Information

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

[1] S. Kemp, “Digital 2020,” We Are Social & Hootsuite, 2020. https://wearesocial.com/digital-2020 (accessed Jan. 29, 2021).

[2] E. Whittaker and R. M. Kowalski, “Cyberbullying Via Social Media,” J. Sch. Violence, vol. 14, no. 1, pp. 11–29, 2015, doi: 10.1080/15388220.2014.949377.

[3] S. Hinduja and J. W. Patchin, “Bullying, Cyberbullying, and Suicide,” Arch. Suicide Res., vol. 14, no. 3, pp. 206–221, 2010, doi: 10.1080/ 13811118.2010.494133.

[4] K. Khieu and N. Narwal, “CS224N: Detecting and Classifying Toxic Comments.”

[5] J. Risch and R. Krestel, “Toxic Comment Detection in Online Discussions,” in Deep Learning-Based Approaches for Sentiment Analysis, Springer, 2020, pp. 1–27.

[6] S. Agrawal and A. Awekar, “Deep Learning for Detecting Cyberbullying Across Multiple Social Media Platforms,” in Proceedings of the Advances in Information Retrieval – 40th European Conference on IR Research (ECIR 2018), 2018, pp. 141–153, doi: 10.1007/978-3-319-76941-7_11.

[7] N. Nandakumar, B. Salehi, and T. Baldwin, “A Comparative Study of Embedding Models in Predicting the Compositionality of Multiword Expressions,” in Proceedings of the Australasian Language Technology Association Workshop 2018 (ALTA 2018), 2018, pp. 71–76.

[8] S. Liu, N. Yang, M. Li, and M. Zhou, “A Recursive Recurrent Neural Network for Statistical Machine Translation,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), 2014, pp. 1491–1500, doi: 10.3115/v1/p14-1140.

[9] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, “Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), 2014, pp. 1555–1565, doi: 10.3115/v1/p14-1146.

[10] H. Li, X. Li, D. Caragea, and C. Caragea, “Comparison of Word Embeddings and Sentence Encodings as Generalized Representations for Crisis Tweet Classification Tasks,” in Proceedings of the ISCRAM Asian Pacific 2018 Conference, 2018, pp. 1–13.

[11] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, 2013, pp. 3111–3119.

[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation ofWord Representations in Vector Space,” in Proceedings of the International Conference on Learning Representations (ICLR 2013), 2013, pp. 1–12.

[13] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin, “Advances in Pre-Training Distributed Word Representations,” in Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), 2018, pp. 52–55.

[14] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 2014, pp. 1532–1543.

[15] T. Davidson, D. Warmsley, M. Macy, and I. Weber, “Automated Hate Speech Detection and the Problem of Offensive Language,” in Proceedings of the 11th International Conference on Web and Social Media (ICWSM 2017), 2017, pp. 1–4.

[16] Z. Zhang, D. Robinson, and J. Tepper, “Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network,” in Proceedings of the 15th International Conference (ESWC 2018), 2018, pp. 1–15, doi: 10.1007/978-3-319-93417-4_48.

[17] P. Badjatiya, S. Gupta, M. Gupta, and V. Varma, “Deep Learning for Hate Speech Detection in Tweets,” in Proceedings of the 26th International World Wide Web Conference 2017 (WWW ’17 Companion), 2017, pp. 759–760, doi: 10.1145/3041021.3054223.

[18] Z. Waseem and D. Hovy, “Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016), 2016, pp. 88–93, doi: 10.18653/v1/n16-2013.

[19] A. Arango, J. Pérez, and B. Poblete, “Hate Speech Detection is Not as Easy as You May Think: A Closer Look at Model Validation,” in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’19), 2019, pp. 45–54, doi: 10.1016/j.is.2020.101584.

[20] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar, “Predicting the Type and Target of Offensive Posts in Social Media,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1415–1420, doi: 10.18653/v1/n19-1144.

[21] J. H. Park and P. Fung, “One-step and Two-step Classification for Abusive Language Detection on Twitter,” in Proceedings of the 1st Workshop on Abusive Language Online to be held at the annual meeting of the Association of Computational Linguistics (ACL) 2017 (ALW1), 2017, pp. 41–45, doi: 10.18653/v1/w17-3006.

[22] C. Van Hee et al., “Automatic detection of cyberbullying in social media text,” PLoS One, pp. 1–22, 2018, doi: 10.17605/OSF.IO/RGQW8.

[23] N. Chakrabarty, “A Machine Learning Approach to Comment Toxicity Classification,” in Proceedings of the 1st International Conference on Computational Intelligence in Pattern Recognition (CIPR 2019), 2019, pp. 1–10.

[24] K. Reynolds, A. Kontostathis, and L. Edwards, “Using machine learning to detect cyberbullying,” in Proceedings of the 2011 10th International Conference on Machine Learning and Applications and Workshops (ICMLA 2011), 2011, pp. 241–244, doi: 10.1109/ICMLA.2011.152.

[25] R. Rehurek and P. Sojka, “Software Framework for Topic Modelling with Large Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45–50.

[26] Z. Bairong, W. Wenbo, L. Zhiyu, Z. Chonghui, and T. Shinozaki, “Comparative Analysis of Word Embedding Methods for DSTC6 End-to-End Conversation Modeling Track,” 2017.

[27] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997, doi: 10.1162/neco.1997.9. 8.1735.

[28] F. Chollet, “Keras: the Python deep learning API,” 2021. https://keras.io (accessed Jan. 29, 2021).

[29] M. Abadi et al., “TensorFlow: A System for Large-Scale Machine Learning,” in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), 2016, pp. 265–283.

[30] F. Chollet, Deep Learning with Python. Manning Publications, 2017.

[31] T. Zhang, W. Zheng, Z. Cui, Y. Zong, J. Yan, and K. Yan, “A Deep Neural Network-Driven Feature Learning Method for Multi-view Facial Expression Recognition,” IEEE Trans. Multimed., vol. 18, no. 12, pp. 2528–2536, 2016, doi: 10.1109/TMM.2016.2598092.

[32] A. Mollahosseini, D. Chan, and M. H. Mahoor, “Going deeper in facial expression recognition using deep neural networks,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV 2016), 2016, pp. 1–10, doi: 10.1109/WACV.2016.7477450.

[33] Y. Liu, J. A. Starzyk, and Z. Zhu, “Optimized Approximation Algorithm in Neural Networks Without Overfitting,” IEEE Trans. Neural Networks, vol. 19, no. 6, pp. 983–995, 2008, doi: 10.1109/TNN.2007.915114.

[34] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in Proceedings, of the 27th International Conference on Machine Learning (ICML 2010), 2010, pp. 807–814.

[35] J. Xiong, K. Zhang, and H. Zhang, “A Vibrating Mechanism to Prevent Neural Networks from Overfitting,” in Proceedings of the 2019 15th International Wireless Communications and Mobile Computing Conference (IWCMC 2019), 2019, pp. 1737–1742, doi: 10.1109/IWCMC.2019.8766500.

[36] H. Wang, Y. Zhang, and X. Yu, “An Overview of Image Caption Generation Methods,” Comput. Intell. Neurosci., vol. 2020, pp. 1–13, 2020, doi: 10.1155/2020/3062706.

[37] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 2005 Special Issue, pp. 602–610, 2005, doi: 10.1016/j.neunet.2005.06.042.

[38] W. Wang and B. Chang, “Graph-based Dependency Parsing with Bidirectional LSTM,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), 2016, pp. 2306–2315, doi: 10.18653/v1/p16-1218.

[39] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), 2015, pp. 1–15.

[40] H. Robbins and S. Monro, “A Stochastic Approximation Method,” Ann. Math. Stat., vol. 22, no. 3, pp. 400–407, 1951, doi: 10.1214/aoms/1177729586.

[41] B. Wu, Z. Liu, Z. Yuan, G. Sun, and C. Wu, “Reducing Overfitting in Deep Convolutional Neural Networks Using Redundancy Regularizer,” in 26th International Conference on Artificial Neural Networks (ICANN 2017), 2017, pp. 49–55, doi: 10.1007/978-3-319-68612-7_6.

[42] X. Ying, “An Overview of Overfitting and its Solutions,” in Proceedings of the International Conference on Computer Information Science and Application Technology (CISAT 2018), 2018, pp. 1–6, doi: 10.1088/1742-6596/1168/2/022022.

[43] R. F. Liao, H. Wen, J. Wu, H. Song, F. Pan, and L. Dong, “The Rayleigh Fading Channel Prediction via Deep Learning,” Wirel. Commun. Mob. Comput., vol. 2018, pp. 1–11, 2018, doi: 10.1155/2018/6497340.

[44] “Tweepy,” 2021. https://www.tweepy.org (accessed Jan. 29, 2021).

[45] J. Camacho-Collados and M. T. Pilehvar, “On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis,” in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018, pp. 40–46, doi: 10.18653/v1/w18-5406.

[46] S. Bird, E. Klein, and E. Loper, Natural Language Processing With Python, 1st ed. O’Reilly Media, 2009.

[47] M. Pumperla, “Hyperas by maxpumperla,” 2021. http://maxpumperla.com/hyperas/ (accessed Jan. 29, 2021).

[48] “Colaboratory,” Google, 2021. https://colab.research.google.com (accessed Jan. 29, 2021).

[49] Z. Zhang and L. Luo, “Hate Speech Detection: A Solved Problem? The Challenging Case of Long Tail on Twitter,” Semant. Web, vol. 1, no. 0, pp. 1–21, 2018.

Biography

Abdullah Kabakus received the bachelor’s degree in computer engineering from Cankaya University in 2010, the master’s degree in computer engineering from Gazi University in 2014, and the philosophy of doctorate degree in Electrical-Electronics & Computer Engineering from Duzce University in 2017, respectively. He is currently working as an Associate Professor at the Department of Computer Engineering, Faculty of Engineering, Duzce University. His research areas include mobile security, deep learning, and social network analysis. He has been serving as a reviewer for many highly-respected journals.