Toxic Comments Classification
Text data is available everywhere, but because of its unstructured nature it is difficult and time consuming to extract information from. Social media platforms are one such source of text data, where people constantly post their opinions on different topics. However, the growing use of social media has also increased cases of online abuse, threats, and harassment. This has become a major problem for social media platforms, sometimes forcing them to limit or completely shut down user comments. Therefore, there has been a constant effort to identify the toxicity of comments made on such platforms, to facilitate more productive and respectful discussions.
In this post, the focus is on solving the Toxic Comment Classification Challenge posted on Kaggle. The challenge poses a multi-label classification problem: classifying comments into six toxicity labels. Multi-label learning differs from multi-class learning in that, in the multi-label setting, an example can be associated with multiple class labels simultaneously, while in a multi-class problem an example belongs to exactly one class. As a multi-label problem, a comment here can carry no label, one label, or several labels at once. The six toxicity labels in the data are "Toxic", "Severe Toxic", "Obscene", "Threat", "Insult", and "Identity Hate".
The full code used in this post can be accessed on GitHub.
Data
As stated above, the data for the problem comes from the Kaggle competition. The input is the raw text of comments posted on social media, and the output is the six labels, each represented as a binary value (1/0) indicating whether the label applies to the comment.
```python
# read the training data
import pandas as pd

df = pd.read_csv("data/train.csv")
df.head()
```
Length of the Comments
We can first visualize the lengths of the comments. Counting the number of words in each comment gives us a rough distribution of comment lengths.
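A minimal sketch of this step, assuming the comment text lives in the `comment_text` column as in the Kaggle data:

```python
import matplotlib.pyplot as plt

# count words per comment by splitting on whitespace
df["word_count"] = df["comment_text"].str.split().str.len()

# histogram of comment lengths
df["word_count"].plot(kind="hist", bins=50)
plt.xlabel("Number of words")
plt.ylabel("Number of comments")
plt.show()
```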
The resulting distribution shows that the majority of comments are up to roughly 200 words long, while a few comments are very long.
Distribution of labels
First, we can count the comments for each of the toxic labels. This gives us an understanding of which toxic labels are most frequently assigned to comments, as computed below.
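One way to compute these counts; the label column names below are those used in the Kaggle data:

```python
import matplotlib.pyplot as plt

# binary label columns in the Kaggle training data
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# number of comments tagged with each label
label_counts = df[labels].sum().sort_values(ascending=False)
print(label_counts)

label_counts.plot(kind="bar")
plt.ylabel("Number of comments")
plt.show()
```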
More comments are labeled as 'toxic', 'obscene' and 'insult', while the other labels are comparatively rare. We also see that the number of comments carrying any label at all is small relative to the total number of comments in the training data. Looking further at the data, we find that almost 90% of the comments have no toxic label, meaning they are clean comments. This also makes the problem an imbalanced classification problem.
Number of comments with no labels (clean comments): 143346
Percentage of comments with no labels: 89.923%
Wordcloud
To quickly visualize the frequent words in the text, a word cloud can be used; it gives an understanding of what kinds of words appear in the comments.
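A quick sketch using the `wordcloud` package; sampling the comments here is an optional choice to keep the generation fast:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# build the cloud from a sample of comments to keep it fast
text = " ".join(df["comment_text"].sample(10000, random_state=42))
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```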
The frequent words in the word cloud do not yet make any meaningful distinction, but they are clearly words mostly related to online discussions.
To check how often the labels appear together in a comment, we can look at the correlation between the labels, as computed below.
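For example, with pandas and seaborn, using the same label list as above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# pairwise correlation between the binary label columns
sns.heatmap(df[labels].corr(), annot=True, fmt=".2f", cmap="Blues")
plt.show()
```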
The correlation plot shows a strong relationship between the label pairs toxic and insult, toxic and obscene, and obscene and insult. So, when one label of such a pair is present in a comment, it is likely that the comment also has the other label.
Modeling
Data Partition
For modeling, we will partition the data into train and test splits. After the partition, we will train the models on the train split and evaluate them on the test split.
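A sketch of the split with scikit-learn; `test_size=0.3` roughly reproduces the shapes printed below:

```python
from sklearn.model_selection import train_test_split

# hold out 30% of the comments for testing
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
print("Train:", train_df.shape)
print("Test:", test_df.shape)
```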
Train: (111699, 8)
Test: (47872, 8)
Before we go into modeling, we will define the metrics for evaluating model performance. Since this is a multi-label classification problem with imbalanced data, we have to be careful to select appropriate evaluation metrics.
Evaluation Metrics
As per the Kaggle competition, the evaluation metric is the mean column-wise ROC AUC: the average of the individual AUC scores of each predicted column (each label is represented as a column). We will use it as our main evaluation metric. Using the same column-wise averaging, we will also calculate accuracy and log loss.
ROC AUC: AUC stands for area under the ROC curve, where ROC is the Receiver Operating Characteristic curve, a graphical representation of the performance of a binary classifier. It is created by plotting the True Positive Rate against the False Positive Rate at different threshold values. AUC is a single numerical score that summarizes the ROC curve. Its value ranges between 0 and 1, where 1 is a perfect score and 0.5 indicates a model no better than random guessing. As stated above, we will calculate the AUC score for each label individually and average the scores.
Accuracy: Accuracy represents the fraction of correct classifications: the number of correct predictions divided by the total number of samples. For multi-label classification, accuracy can instead be calculated as subset accuracy, also referred to as the Exact Match Ratio. This is considered a harsh metric, since the predicted set of labels for an example must exactly match the actual set of labels. However, here we will simply follow the Kaggle approach, calculate the individual accuracy score for each label, and take their average as the final score.
Log loss: Log loss, also known as cross entropy, is a classification metric based on probabilities. It evaluates classification performance by comparing the actual labels with the predicted probabilities, penalizing predictions that are far from the actual labels: its value increases when the predicted probabilities are far from the labels, and a lower value indicates better performance. Mathematically, for binary classification it is calculated as:
logloss = −(y log(p) + (1 − y) log(1 − p))
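Putting the three metrics together, here is a minimal helper using scikit-learn; the 0.5 decision threshold used for accuracy is an assumption:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, log_loss

def evaluate(y_true, y_prob, threshold=0.5):
    """Mean column-wise accuracy, ROC AUC and log loss.

    y_true: (n_samples, n_labels) binary matrix
    y_prob: (n_samples, n_labels) predicted probabilities
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    aucs, accs, losses = [], [], []
    for i in range(y_true.shape[1]):
        aucs.append(roc_auc_score(y_true[:, i], y_prob[:, i]))
        accs.append(accuracy_score(y_true[:, i], y_pred[:, i]))
        losses.append(log_loss(y_true[:, i], y_prob[:, i]))
    return np.mean(accs), np.mean(aucs), np.mean(losses)
```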
Bag of Words Representation
As an initial approach, we will use a bag-of-words representation of the text, built with TF-IDF vectorization. Bag of words is a commonly used method for text and document classification: it extracts features (the unique words occurring across all documents) and represents each document by the presence (or count) of each word in it. With TF-IDF vectorization, instead of a simple word count, each feature gets a score reflecting how relevant the word is to a document relative to the whole collection.
Before converting the text to bag of words, we clean it: we remove punctuation, digits, and stop-words, and apply stemming.
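A possible cleaning function using NLTK's stop-word list and Porter stemmer; this is one way to implement the steps above, and details such as the regex are choices made here, not the post's exact code:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # one-time download of the stop-word list

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_text(text):
    # lowercase, then keep letters only (drops punctuation and digits)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # remove stop-words and stem what remains
    return " ".join(stemmer.stem(w) for w in text.split() if w not in stop_words)

# clean both splits
train_df["clean_text"] = train_df["comment_text"].apply(clean_text)
test_df["clean_text"] = test_df["comment_text"].apply(clean_text)
```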
An example of before and after applying the text processing is given below.
Original Text:
" Happy Birthday! Hey, . Just stopping by to wish you a Happy Birthday from the Wikipedia Birthday Committee! Have a great day! "
Clean Text:
happi birthday hey stop wish happi birthday wikipedia birthday committe great day
Now, applying the TF-IDF vectorizer, we can create the bag of words. Here we will use n-grams of size 1 and 2, i.e. features consisting of single words and pairs of consecutive words, and restrict the maximum number of features (the vocabulary size) to 1000.
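With scikit-learn this could look as follows, assuming the `clean_text` column from the cleaning step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# unigrams and bigrams, vocabulary capped at 1000 features
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=1000)
X_train = vectorizer.fit_transform(train_df["clean_text"])
X_test = vectorizer.transform(test_df["clean_text"])
```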
With the bag of words created, we can now move on to modeling. We will apply Binary Relevance and Classifier Chain with Logistic Regression and Naive Bayes models.
Binary Relevance
Binary Relevance is one of the simplest approaches to solving a multi-label classification problem. Similar to the one-vs-rest approach for multi-class problems, it transforms the multi-label task into a number of independent binary classification tasks, one per label, where predicting each label is treated as a separate problem. Its weakness is that it ignores correlations between the labels. An example of how a multi-label problem is transformed under Binary Relevance is shown in the figure below.
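One way to implement Binary Relevance is scikit-learn's OneVsRestClassifier, which, given a binary label matrix, fits one independent binary classifier per label column; scikit-multilearn's BinaryRelevance would work similarly:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier

# label matrices, using the labels list defined earlier
y_train = train_df[labels].values
y_test = test_df[labels].values

# one independent binary classifier per label
br_lr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
br_nb = OneVsRestClassifier(MultinomialNB()).fit(X_train, y_train)

# mean column-wise scores via the evaluate() helper defined earlier
print("LR:", evaluate(y_test, br_lr.predict_proba(X_test)))
print("NB:", evaluate(y_test, br_nb.predict_proba(X_test)))
```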
The results of Binary Relevance with Logistic Regression and Naive Bayes show Logistic Regression performing better.
Model                Accuracy   AUC     Log loss
Logistic Regression  0.979      0.960   0.064
MultinomialNB        0.976      0.955   0.071
Classifier Chain
Similar to Binary Relevance, Classifier Chain is a problem transformation method for multi-label classification. It tries to improve on Binary Relevance by taking advantage of label associations, which Binary Relevance ignores. The method builds a chain of binary classifiers, where each classifier receives, as additional input features, the outputs of the previous classifiers in the chain. This allows the classifiers to take label associations into account, which can improve classification performance. The multi-label problem transformation with a classifier chain is shown below.
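A sketch using scikit-learn's ClassifierChain; the chain order here is simply the column order of the labels, an arbitrary choice:

```python
from sklearn.multioutput import ClassifierChain
from sklearn.linear_model import LogisticRegression

# each classifier in the chain sees the previous labels' predictions as extra features
chain = ClassifierChain(LogisticRegression(max_iter=1000), random_state=42)
chain.fit(X_train, y_train)

print(evaluate(y_test, chain.predict_proba(X_test)))
```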
With the classifier chain, model performance did not improve, and Logistic Regression still performs better than Naive Bayes. The chained classifiers may not have helped because the correlations we saw earlier were limited to a few label pairs, while classifier chains tend to pay off when there are strong associations across the labels.
Model                Accuracy   AUC     Log loss
Logistic Regression  0.979      0.950   0.069
MultinomialNB        0.968      0.946   0.124
Word Embeddings
Word embeddings are a type of word representation in which each word is mapped to a real-valued vector of a predefined length. These vectors are mostly learned with deep learning methods. With word embeddings, words with similar meanings get similar vector representations, which allows the representation to capture word meaning. We can train our own word embeddings or use pretrained embeddings that have already been trained on a large text corpus. Here, we will use Google's pretrained word2vec model to extract the word embeddings.
To represent a sentence with word embeddings (a sentence embedding), we take the average of the embeddings of the words in the sentence.
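A sketch using gensim's downloader API. Applying it to the raw comment text, rather than the stemmed text, is a choice made here, since stemmed tokens would often fall outside the word2vec vocabulary; the zero-vector fallback is also an assumption:

```python
import numpy as np
import gensim.downloader as api

# pretrained Google News word2vec (300-dimensional vectors, large download)
wv = api.load("word2vec-google-news-300")

def sentence_embedding(text):
    # average the vectors of the words found in the vocabulary;
    # fall back to a zero vector when no word is known
    vectors = [wv[w] for w in text.split() if w in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

X_train_emb = np.vstack([sentence_embedding(t) for t in train_df["comment_text"]])
X_test_emb = np.vstack([sentence_embedding(t) for t in test_df["comment_text"]])
```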
Above, we saw that Binary Relevance with Logistic Regression performed best. Here, we will again model with Binary Relevance and Logistic Regression, but using the word embeddings instead of the bag of words.
The results with word embeddings still do not give any improvement; they are very similar to the bag-of-words results. We can try hyperparameter tuning for Logistic Regression to see whether any further improvement is possible.
Model                Accuracy   AUC     Log loss
Logistic Regression  0.975      0.965   0.071
Deep Learning
Modeling with LSTM
LSTM (long short-term memory) is a type of recurrent neural network (RNN). It is mostly used for sequence prediction problems, as it can handle order dependence in a sequence. LSTMs are widely used for text classification, because maintaining the order of words in the text helps preserve its semantic meaning.
The first step of data preparation for the LSTM is to tokenize the words and represent them as integers, where each unique word gets an integer value. We also need to specify the vocabulary size, which determines how many of the most frequent words are used in modeling.
After tokenizing the text, we perform padding, which ensures all text sequences have the same length. Given a target sequence length, padding adds 0s at the beginning (or the end) of sequences shorter than that length, and truncates sequences that are longer.
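A sketch with the Keras tokenizer and padding utilities; VOCAB_SIZE and MAX_LEN are assumed values, with MAX_LEN guided by the comment-length plot earlier:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 20000  # assumed vocabulary size (most frequent words kept)
MAX_LEN = 200       # assumed sequence length

# map each unique word to an integer id, fit on the training text only
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(train_df["comment_text"])

# pad with 0s (at the start, by default) and truncate longer sequences
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(train_df["comment_text"]), maxlen=MAX_LEN)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(test_df["comment_text"]), maxlen=MAX_LEN)
```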
Now we are ready to define our model and train it.
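A minimal architecture for this task might look as follows; the layer sizes, epochs, and batch size are assumptions, not the post's exact configuration. Six sigmoid outputs with binary cross entropy treat each label as its own binary decision:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(6, activation="sigmoid"),  # one sigmoid output per label
])

# binary cross entropy treats each of the six labels as an independent binary task
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])

model.fit(X_train_seq, y_train, validation_split=0.1, epochs=2, batch_size=128)
model.evaluate(X_test_seq, y_test)
```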
The results from the LSTM improve on the Logistic Regression model for accuracy and log loss, while the AUC score is very similar. With improvements on two of the three metrics, we can still consider the LSTM to be performing better than the Logistic Regression model.
loss: 0.0527 accuracy: 0.9816 auc: 0.9634
Transfer Learning with BERT
Transfer learning is a popular approach in deep learning where a pretrained model is used as the starting point for training a model on a similar task. It has become popular in computer vision and natural language processing because it saves the huge computational and time resources such tasks otherwise require. In addition, the knowledge contained in the pretrained model helps with generalization and performance.
With advances in natural language processing research, many sophisticated techniques have emerged. BERT (Bidirectional Encoder Representations from Transformers) is one of them, producing state-of-the-art results on many natural language processing problems. It is a Transformer-based machine learning technique developed by Google and is available as a pretrained model. A good introduction to BERT by Jay Alammar is available here.
TensorFlow Hub provides an extensive list of such pretrained models, including BERT, along with easy-to-use code examples for fine-tuning them. We will therefore use TensorFlow Hub to fine-tune a BERT model for our classification problem.
Since BERT is a pretrained model, we have to process our data the same way it was processed during pretraining. TensorFlow Hub provides the matching preprocessing model for each BERT encoder, and we will simply make use of it.
Our data is now ready, and we have specified the handles to the BERT model and its text preprocessing model. We can now create our model and train it.
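A sketch of the fine-tuning setup, following the TensorFlow Hub BERT tutorial pattern; the specific model handles (a small BERT variant here), the learning rate, and the `train_ds`/`val_ds` tf.data datasets of (comment text, labels) batches are assumptions:

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the ops the preprocessing model needs

# model handles as used in the TensorFlow Hub BERT tutorial (a small BERT variant)
PREPROCESS_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_HANDLE = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1"

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
encoder_inputs = hub.KerasLayer(PREPROCESS_HANDLE, name="preprocessing")(text_input)
encoder_outputs = hub.KerasLayer(ENCODER_HANDLE, trainable=True, name="BERT_encoder")(encoder_inputs)

# classify from the pooled [CLS] representation, one sigmoid per label
net = tf.keras.layers.Dropout(0.1)(encoder_outputs["pooled_output"])
output = tf.keras.layers.Dense(6, activation="sigmoid", name="classifier")(net)
model = tf.keras.Model(text_input, output)

model.compile(optimizer=tf.keras.optimizers.Adam(3e-5),  # small LR for fine-tuning
              loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])

bert_history = model.fit(train_ds, validation_data=val_ds, epochs=2)
```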
```python
# evaluate the model on the test split
loss, accuracy, auc = bert_history.model.evaluate(test_ds)
```

Accuracy: 0.9836 AUC: 0.9798 Logloss: 0.042
As we see, with transfer learning using BERT, we achieved a higher performance compared to the previous models.
Conclusion
We have covered different modeling approaches for this multi-label classification problem. So far, the BERT model performs best, with a mean column-wise ROC AUC of about 0.98 on our test split. For the final evaluation, we can create submissions from the models trained so far and submit them to the Kaggle page to get the official scores and see whether the BERT model really performs best.
After submitting the results to Kaggle, we get the following scores, confirming that the BERT model indeed performs best. However, our baseline, the simple Binary Relevance with Logistic Regression, also gives good results.
# Kaggle Scores
Logistic Regression : 0.95875
LSTM Model : 0.97128
BERT : 0.98172
References
https://machinelearningmastery.com/what-are-word-embeddings/
https://www.tensorflow.org/tutorials/text/classify_text_with_bert