Detecting Constructiveness in Online Article Comments

source link: https://mc.ai/detecting-constructiveness-in-online-article-comments/

Promoting constructiveness in online comment sections is an essential step toward making the internet a more productive place. Instead of feedback that merely points out mistakes or aims to hurt, constructive comments use argumentation and respectful discourse to turn those mistakes into future improvements.

Similarly to sentiment analysis or toxicity detection, deep learning techniques can be used to model constructiveness and classify article comments. Current state-of-the-art models use the transformer architecture, which, unlike the usual recurrent cells (LSTM, GRU…), processes input sequences in parallel. In this post we will be using DistilBERT (Sanh et al., 2019), created by HuggingFace. This model is a distilled version of BERT (Devlin et al., 2018), essentially a much lighter model that comes close to BERT's performance. We will use the ktrain Python library (Maiya, 2020), which makes it very easy to implement state-of-the-art models in TensorFlow Keras. For more information about the ktrain implementation, this tutorial and the official doc can help you!

Labeled data for constructiveness is rather scarce, so we will use the largest and most recent dataset available, the Constructiveness Comments Corpus (C3), available on Kaggle and detailed in Kolhatkar et al. (2020). The dataset is composed of 12,000 news article comments and contains several tags for constructiveness and toxicity, but the only one we will use is constructive_binary.

Data processing

As usual, the first step is to import all the libraries we need for the project. You will need to install ktrain, which can be tricky depending on your setup, and then import it. We also import the basic trio pandas, numpy and matplotlib, as well as some metric and splitting tools from scikit-learn.
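A minimal import cell covering everything used below might look like this (install ktrain first, e.g. with `pip install ktrain`; the exact set of scikit-learn tools is an assumption based on the steps that follow):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

import ktrain
from ktrain import text
```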

Let’s start by reading the dataset file into a pandas DataFrame. We should also take a look at the columns that interest us, comment_text and constructive_binary.
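A sketch of that step; the CSV file name here is an assumption, so point it at whatever your Kaggle download is called:

```python
import pandas as pd

# File name is an assumption -- use your local copy of the C3 dataset.
df = pd.read_csv("C3_anonymized.csv")

# Peek at the two columns we care about
print(df[["comment_text", "constructive_binary"]].head())
```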

DataFrame head of C3

We will soon need to decide what maximum input length to allow, so it is useful to get some insight into comment length. Run the code below to display a description of the DataFrame comment column.
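Something like the following; since the dataset isn't bundled here, a tiny toy frame stands in for df, but with the real C3 frame the last two lines are identical:

```python
import pandas as pd

# Toy stand-in for the C3 DataFrame loaded earlier
df = pd.DataFrame({"comment_text": [
    "Great point, well argued.",
    "This is nonsense.",
    "I agree with the author because the data supports it.",
]})

# Whitespace token count per comment, then summary statistics
lengths = df["comment_text"].str.split().str.len()
print(lengths.describe())
```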

The results show a mean length of 71 tokens, and that the 90th and 99th percentiles correspond to 157 and 362 tokens, respectively, so setting the MAXLEN between 150 and 200 tokens seems like a good idea. Let’s say 150 to save some memory space.

Next, we should define a few global variables: the max input length, the path to the dataset, the path where you will save your trained model, the target label names, and the HuggingFace model we want to use, in our case distilbert-base-uncased, which works on lowercased inputs.
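For instance (the paths and label names below are placeholders of my choosing, not fixed by the dataset; adjust them to your setup):

```python
MAXLEN = 150  # max tokens per comment, chosen from the length analysis above
DATA_PATH = "data/C3_anonymized.csv"              # hypothetical path to your copy of C3
MODEL_PATH = "models/constructiveness_predictor"  # hypothetical save location
LABELS = ["non_constructive", "constructive"]     # class names for labels 0 and 1
MODEL_NAME = "distilbert-base-uncased"            # HuggingFace model id
```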

Now let’s take a look at the output class distribution to check if the dataset is balanced or not:
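One way to do that with pandas; a toy target column stands in for the real frame here, but the value_counts call is the same:

```python
import pandas as pd

# Toy stand-in for the constructive_binary column of C3
df = pd.DataFrame({"constructive_binary": [1, 1, 1, 0, 0]})

# Proportion of each target class
counts = df["constructive_binary"].value_counts(normalize=True)
print(counts)
# counts.plot.bar()  # quick visual check with matplotlib
```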

Target class distribution

There are more constructive comments than non-constructive ones, so the dataset is slightly imbalanced. During the validation and testing steps, we want the output classes to remain well represented, meaning that the target class distribution should be the same across the train/val/test sets. For that, we will use stratified splitting as implemented in scikit-learn. Let’s start by separating the training set (which we call the intermediate set) from the test set by setting 20% of the full set aside. The index of the DataFrames can be reset afterwards for more convenience.
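A sketch of that split; a small toy frame stands in for the full dataset, and the random_state is arbitrary:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for C3: 10 rows with a 60/40 class balance
df = pd.DataFrame({"comment_text": [f"comment {i}" for i in range(10)],
                   "constructive_binary": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]})

# Hold out 20% for test, stratified on the target to preserve class ratios
df_inter, df_test = train_test_split(
    df, test_size=0.2, stratify=df["constructive_binary"], random_state=42)

# Reset indexes for convenience
df_inter = df_inter.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
print(len(df_inter), len(df_test))
```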

It is good practice to set aside another 10 to 20% of the training data for validation purposes, and ktrain lets you pass a validation set directly, so we again use stratified splitting and set 10% aside. Once again, reset the indexes of each DataFrame.
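Same pattern as before, applied to the intermediate set (again with a toy stand-in frame and an arbitrary random_state):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the intermediate set from the previous step
df_inter = pd.DataFrame({"comment_text": [f"comment {i}" for i in range(20)],
                         "constructive_binary": [1] * 12 + [0] * 8})

# Keep 10% of the intermediate set aside for validation, stratified again
df_train, df_val = train_test_split(
    df_inter, test_size=0.1, stratify=df_inter["constructive_binary"], random_state=42)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
print(len(df_train), len(df_val))
```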

Perfect! We now have our three datasets ready for use, so let’s split inputs and outputs to feed our machine learning model. You can read the input column (X) into a numpy array, and cast the output (y) to a small int.
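Shown here for the training split (the same two lines apply to the validation and test splits; the toy frame is again a stand-in):

```python
import numpy as np
import pandas as pd

# Toy stand-in for one of the splits built above
df_train = pd.DataFrame({"comment_text": ["a comment", "another comment"],
                         "constructive_binary": [1.0, 0.0]})

X_train = df_train["comment_text"].values                         # numpy array of strings
y_train = df_train["constructive_binary"].astype(np.int8).values  # cast target to small int
print(X_train, y_train)
```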

Classification

It is now time to initialize the ktrain module by loading a Transformer object, which takes the MODEL_NAME, MAXLEN and LABELS defined earlier. State-of-the-art architectures do not necessarily require special preprocessing such as stopword or punctuation removal. Instead, they use a special unsupervised tokenization method called WordPiece, which minimizes the number of out-of-vocabulary (OOV) tokens. Thanks to this, and because our dataset is not very noisy, we simply use the ktrain preprocessor. Finally, you can get a classifier object, and then a learner object that takes a batch_size hyperparameter. This hyperparameter is tunable to your convenience (just like MAXLEN), but watch out for memory issues when increasing it.

The following code does all of that for you:
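A sketch of that setup, assuming the MODEL_NAME, MAXLEN and LABELS variables and the X/y splits from the previous steps are in scope; the batch size of 6 is just a starting point:

```python
import ktrain
from ktrain import text

# Wrap the HuggingFace model; ktrain handles WordPiece tokenization internally
t = text.Transformer(MODEL_NAME, maxlen=MAXLEN, class_names=LABELS)

# Preprocess the raw text splits into model-ready datasets
trn = t.preprocess_train(X_train, y_train)
val = t.preprocess_test(X_val, y_val)

# Build the classifier and wrap it in a learner with a tunable batch size
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
```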

Right before training the model, you can tune the learning rate with a learning rate finder (Smith, 2018), which basically trains the model for a short time while exponentially increasing the learning rate:
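In ktrain this is a one-liner on the learner built above (max_epochs=2 here is just to keep the range test short):

```python
# Train briefly while the learning rate grows exponentially,
# then plot loss against learning rate
learner.lr_find(show_plot=True, max_epochs=2)
```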

Learning rate finder graph

A good learning rate to select is located on the first significant decreasing slope, a bit before the first flat area, where the red arrow points. In this case, a decent value is 0.00001. Let’s now finally train the model. We use a one-cycle training policy (Smith, 2017) to successively increase and decrease the learning rate, but other policies are available to you if you want to change! 4 epochs is definitely enough for the model to converge, but this is also tunable to your convenience (training such large models takes a lot of time and resources!).
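With the learner from the setup step, that training run boils down to:

```python
# One-cycle policy: ramp the learning rate up to 1e-5 and back down over 4 epochs
learner.fit_onecycle(1e-5, 4)
```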

Training verbose

Once training is over, you can validate with the learner object’s validate method, but we’ll skip this part to keep it short.

The predictor object from ktrain lets you make predictions on new data. Running the following code preprocesses and classifies the whole test set. It also prints several usual metrics to help you interpret the results.
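A sketch of that evaluation, assuming the learner, preprocessor t, LABELS and the test split from earlier; mapping predicted class names back to indices is my own convenience step:

```python
from sklearn.metrics import accuracy_score, classification_report

# Bundle the trained model with its preprocessor
predictor = ktrain.get_predictor(learner.model, preproc=t)

# Classify the whole test set (predict() returns class names)
y_pred = [LABELS.index(p) for p in predictor.predict(list(X_test))]

print(classification_report(y_test, y_pred, target_names=LABELS))
print("accuracy:", accuracy_score(y_test, y_pred))
```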

Results for constructiveness classification on test set

We reach 0.94 weighted F1 and 0.94 accuracy, which is pretty good. Great job! Your DistilBERT model is now fit to detect constructive news article comments. You can save the model with the following line of code, but beware, it’s rather heavy (~300 MB).
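Assuming the predictor and the MODEL_PATH variable from earlier:

```python
# Saves both the model weights and the preprocessor (~300 MB on disk)
predictor.save(MODEL_PATH)

# Later, reload it with:
# predictor = ktrain.load_predictor(MODEL_PATH)
```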

It is now up to you to tune the hyperparameters, use even more advanced machine learning methods or heavier models (BERT, XLNet, etc.) to try and achieve better results!

