
Media Bias Detection using Deep Learning Libraries in Python



Photo by Pablo García on Unsplash


More than once I have encountered News Stories whose political inclination you can tell right away. This is because News Outlets are almost always biased¹ ². If we can identify these patterns by eye, I wonder, then surely we can build an algorithm that uses information from the text to identify written Media Bias. In this report I describe what I did to achieve exactly this. I used a Python environment with TensorFlow and Keras to build a Neural Network capable of identifying, with very good performance, whether News Stories are Left- or Right-leaning. I went further and tried to identify not only the bias, but also the Outlet (or source) of the Story. I will describe in detail the methods I used to build and train the Network, as well as those I used to visualize the results and performance. Here we go!

All data was acquired from the All the News dataset created by Andrew Thompson. It is freely available and you can download it anytime. It is separated into three large CSV files, each containing a table that looks like this:


News Outlets raw table

Because we are interested only in the content and the outlet name, we will focus on two columns: column 3 contains the publication (outlet) name, while column 9 contains the content. We need to extract this information and store it accordingly so we can proceed with the analysis. But first, let’s import all required modules (adapt the code for recent releases if required, e.g. TensorFlow 2):

Modules for the pipeline

Each of the files described above contains around 50,000 entries, so to make the analysis faster we can work with a portion of the data. To save time across my entire pipeline, I decided to randomly select ~40% of the articles using this simple trick (you can change the fraction, of course):

p = 0.4
df = pd.read_csv('articles.csv', header=None, skiprows=lambda i: i > 0 and random.random() > p)

This will take from articles.csv a random sample of roughly a fraction p of the rows (each row after the first is kept with probability p).

The next step is probably the most subjective of the entire pipeline: we have to assign each News Outlet either a Left or a Right inclination. For simplicity, I decided to use only two outlets from each side and relied on allsides.com and mediabiasfactcheck.com to assign their bias. Based on information extracted from these websites and other sources¹ ², I assigned The Atlantic and The New York Times a Left bias and The New York Post and Breitbart a Right bias. I then filtered from the original files all rows containing these outlets with:
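The filtering gist is missing from this copy; a minimal sketch of the step, using a tiny stand-in frame in place of the full CSV (the variable names follow the concatenation code below):

```python
import pandas as pd

# A tiny stand-in for the frame loaded from the CSVs: column 3 holds
# the publication name, column 9 the article content (header=None).
df = pd.DataFrame([
    ['x'] * 3 + ['Breitbart']      + ['x'] * 5 + ['story one'],
    ['x'] * 3 + ['New York Post']  + ['x'] * 5 + ['story two'],
    ['x'] * 3 + ['Atlantic']       + ['x'] * 5 + ['story three'],
    ['x'] * 3 + ['New York Times'] + ['x'] * 5 + ['story four'],
])

# One frame per outlet, matched on the publication column.
n_s_b = df[df.iloc[:, 3] == 'Breitbart']
n_s_p = df[df.iloc[:, 3] == 'New York Post']
n_s_a = df[df.iloc[:, 3] == 'Atlantic']
n_s_n = df[df.iloc[:, 3] == 'New York Times']
```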

and created one big array of stories with:

n_s = list(n_s_b.iloc[:,9].values) + list(n_s_p.iloc[:,9].values) \
 + list(n_s_a.iloc[:,9].values) + list(n_s_n.iloc[:,9].values)

Note that n_s is an array containing only the content of all stories, ordered by outlet as in the code above, so Breitbart and New York Post stories come first, followed by The Atlantic and The New York Times.

Great! What’s next? An important pre-processing step, especially because we are dealing with Natural Language Processing, is to delete words that can add noise to the analysis. I decided to delete the name of each outlet, which is usually mentioned within the story, as it could add “bias” of its own to our analysis. This is simply done with:

n_s = [word.replace('New York Post','') for word in n_s]
n_s = [word.replace('Breitbart','') for word in n_s]
n_s = [word.replace('New York Times','') for word in n_s]
n_s = [word.replace('Atlantic','') for word in n_s]

The next step is to create the class arrays. We know how many articles each outlet has and we know their political bias, so we can create two arrays, one for an Outlet classifier and one for a Bias classifier, with:
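The class-array gist is also missing here; a minimal sketch consistent with the description below (the outlet counts are placeholders for len(n_s_b), len(n_s_p), len(n_s_a), len(n_s_n)):

```python
import numpy as np

# Stories per outlet, in the same order as n_s (Breitbart, New York
# Post, Atlantic, New York Times). Placeholder counts for the sketch.
counts = [4, 3, 2, 3]

# Outlet labels 1..4, one block of identical integers per outlet.
classes_All = np.concatenate(
    [np.full(c, i + 1) for i, c in enumerate(counts)])

# Bias labels: 1 = Right (Breitbart, Post), 2 = Left (Atlantic, Times).
bias_of_outlet = {1: 1, 2: 1, 3: 2, 4: 2}
classes_Bias = np.array([bias_of_outlet[o] for o in classes_All])
```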

If you are following the methods, you can see that classes_All is an array with length equal to n_s that contains integers from 1 to 4, each corresponding to one of the four outlets, while classes_Bias contains a 1 for outlets leaning Right and a 2 for those leaning Left (see the previous code to understand this further). n_s, which has been cleaned, is thus our feature array, as it contains the list of stories, and these two arrays are our class arrays. This means we are almost done with pre-processing.

A crucial final step is to transform the stories (the actual News) into something a Neural Network can understand. To do this, I used the amazing Universal Sentence Encoder from TensorFlow Hub, which transforms any given sentence (in our case, a News Story) into an embedding vector of length 512, so in the end we will have an array of size number_of_stories × 512. This was done with:

NOTE: I used a similar approach before to classify Literary Movements; you can check it here if you want.

To compute the embedding matrix we simply have to run the function we just defined:

e_All = similarity_matrix(n_s)

Finally, we are done with pre-processing! e_All is our feature array, while classes_All and classes_Bias are our class arrays. With this we are ready to build a classifier with Keras.

I don’t want to spend too much time explaining how to build a Neural Network; there are many hands-on articles published in Towards Data Science and many other sources with very good tutorials³ ⁴ ⁵. Here I will simply present an architecture that yielded good results. It is one of many previous iterations and one that I personally found to work. With that said, let’s dig into it! The classifier I built looks like this:


News Bias classifier. Image rendered with ann-visualizer.

It has an input layer of 512 neurons (one per embedding value) and two hidden layers of 40 neurons, each followed by a dropout layer with a fixed dropout fraction of 0.25. It also has an output layer with a Softmax activation function: four neurons to classify Media Outlets, or two neurons to classify Media Bias (not shown here). In code it looks like this:

Media Bias Classifier
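The classifier gist is missing from this copy; a sketch that matches the architecture in the figure (the ReLU activations and sparse categorical cross-entropy loss are my assumptions, since the article does not state them):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

def build_classifier(n_classes, learning_rate):
    """512-input network, two hidden layers of 40 neurons, each
    followed by dropout of 0.25, and a softmax output layer."""
    model = Sequential([
        Dense(40, activation='relu', input_shape=(512,)),
        Dropout(0.25),
        Dense(40, activation='relu'),
        Dropout(0.25),
        Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer=Adam(learning_rate=learning_rate),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Four output neurons for Outlet classification (learning rate 0.00015);
# two neurons and learning rate 0.0005 for Bias classification.
# Note: the sparse loss expects 0-based labels, so shift the 1-based
# class arrays (classes_All - 1, classes_Bias - 1) before fitting.
news_DNN = build_classifier(n_classes=4, learning_rate=0.00015)
```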

For both scenarios (outlets and bias) all parameters are the same except for the optimizer’s learning rate and, of course, the number of neurons in the output layer (either four or two). Intuitively it makes sense that simpler problems (binary, in the case of bias) should be optimized faster. For this reason I decided to use a learning rate of 0.00015 for Outlet classification and 0.0005 for Bias classification.

Finally, before checking results, we need to split the data into training and test sets. For this we will use a Stratified Shuffle Split, which looks like this:
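The splitting gist is missing here; a self-contained sketch, with toy stand-ins for the real e_All and classes_Bias arrays:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-ins for the real arrays; in the pipeline these are
# e_All (embeddings) and classes_Bias (labels, 1 = Right, 2 = Left).
e_All = np.random.rand(100, 512)
classes_Bias = np.array([1] * 60 + [2] * 40)

# One stratified 80/20 split that preserves the class proportions.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(e_All, classes_Bias))
X_train, X_test = e_All[train_idx], e_All[test_idx]
y_train, y_test = classes_Bias[train_idx], classes_Bias[test_idx]
```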

With this done, we only have to train the network. We can do this with:
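The training gist is also missing; a self-contained sketch (toy data and a reduced epoch count stand in for the real pipeline):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Toy stand-ins so the sketch runs on its own; in the pipeline these
# are the split embeddings and the network built above.
X_train, y_train = np.random.rand(80, 512), np.random.randint(0, 2, 80)
X_test, y_test = np.random.rand(20, 512), np.random.randint(0, 2, 20)

model = Sequential([Dense(40, activation='relu', input_shape=(512,)),
                    Dropout(0.25),
                    Dense(2, activation='softmax')])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# fit() returns a History object; its .history dict holds the accuracy
# and loss curves plotted in the next section. A real run uses many
# more epochs than this two-epoch demonstration.
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=2, batch_size=64, verbose=0)
```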

And now we are ready to check the results!

Let’s look at Outlet classification first. After training (about 15 min on an 8-core CPU machine, no GPU needed), the validation and training accuracy curves look like this (code for the visualizations is provided later in the text):

[Figure: training and validation accuracy curves for Outlet classification]

What does it mean? Well, first off, the accuracy on the validation set is almost 0.80, which can be considered good. Remember, this is accuracy over four Media Outlets! It means this Neural Network can identify, with relatively good performance, the source of the News being reported based only on written semantic content. Very cool! Let’s now look at the Loss to check whether our model is properly fitted:

[Figure: training and validation loss curves for Outlet classification]

Since both curves “look” the same (the difference between them is minimal), we can say that our Neural Network has learned rules general enough to avoid overfitting, yet specific enough to yield good performance on the validation set, a quality that is always desirable in Deep Learning⁶. With this in mind, we can now look at the confusion matrix to go deeper into the results:

[Figure: confusion matrix for Outlet classification]

It is clear that News Stories from Breitbart and The New York Post were classified with higher accuracy, but this could simply be due to the higher representation of these two classes. We could try resampling methods to balance the classes, but for now I will leave it this way and come back to resampling for the binary classification task described in the next paragraphs.

Moving forward, we can explore the results for Media Bias classification. Remember, this is a binary classifier (either inclined to the Left or to the Right). First, accuracy and loss curves:

[Figure: accuracy and loss curves for Bias classification]

Very impressive, if you ask me! We got an accuracy of almost 0.9, which means we are effectively identifying, with very good performance, whether a News Story is biased towards the Left or the Right. What about the confusion matrix?

[Figure: confusion matrix for Bias classification]

Looks good, but it is clear that Left-leaning reports are underrepresented, so the accuracy is not fully reflecting what is going on here; accuracy alone can be misleading. Another metric that is helpful for uneven class distributions is the F1 score⁷. We can very easily compute it with sklearn’s built-in function, so simply doing this does the trick:

from sklearn.metrics import f1_score
f1_score(y_test, news_DNN.predict_classes(X_test))

(If you are on a recent TensorFlow release where predict_classes has been removed, use news_DNN.predict(X_test).argmax(axis=1) instead.)

Which throws:

>>> 0.78

Good, but not perfect. We can try balancing the classes in the training set only⁸ with an oversampler to get better results. I will use SMOTE from the amazing Python library imblearn; simply add this line just before fitting:

After training with balanced data, the confusion matrix now looks like this:

[Figure: confusion matrix for Bias classification after SMOTE oversampling]

More balanced. What about F1 and Accuracy? 0.80 and 0.88 respectively. Very good!

NOTE: All visualization was done using seaborn. Here’s the code I wrote to create all the figures:

Code to visualize results
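The visualization gist did not survive either; a sketch of plotting helpers that could produce figures like the ones above (the function names are mine):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_history(history):
    """Training vs. validation accuracy and loss curves from the
    History object returned by model.fit()."""
    for metric in ('accuracy', 'loss'):
        plt.figure()
        plt.plot(history.history[metric], label='train')
        plt.plot(history.history['val_' + metric], label='validation')
        plt.xlabel('epoch')
        plt.ylabel(metric)
        plt.legend()

def plot_confusion(y_true, y_pred, labels):
    """Annotated heatmap of the confusion matrix."""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure()
    sns.heatmap(cm, annot=True, fmt='d',
                xticklabels=labels, yticklabels=labels)
    plt.xlabel('predicted')
    plt.ylabel('true')
```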


Photo by AbsolutVision on Unsplash

In this text, I walked you through the steps I took to build, train and visualize a News Outlet/Bias classifier, which yielded very good results. We were able to get fair accuracy scores for Outlet classification, which suggests that the text alone might contain information about where a given Story was written. It is difficult to draw precise conclusions with a sample this size, but the idea is there: we can use Keras, TensorFlow and visualization libraries in Python to classify News Stories. More remarkable is the fact that Bias can be identified with very good performance, meaning that, to some extent, semantic rules in written Media implicitly carry a political Bias. In a way we knew this from the beginning, but it is nonetheless interesting to note that this information is there and that, with the proper tools, it can be automatically identified. What other problems do you think this approach could solve?

Now that you have the tools, you can copy the code and use it in your own projects.

Thanks for reading!

References:

[1] Media Bias Chart, https://www.adfontesmedia.com/

[2] Budak, C., Goel, S., & Rao, J. M. Fair and balanced? quantifying media bias through crowdsourced content analysis. (2016) Public Opinion Quarterly , 80 (S1), 250–271.

[3] Introduction to Deep Learning with Keras, https://towardsdatascience.com/introduction-to-deep-learning-with-keras-17c09e4f0eb2

[4] Practical Machine Learning with Keras, https://towardsdatascience.com/practical-machine-learning-with-keras-19d0e5b2558

[5] Building A Deep Learning Model using Keras, https://towardsdatascience.com/building-a-deep-learning-model-using-keras-1548ca149d37

[6] How to Choose Loss Functions When Training Deep Learning Neural Networks, https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/

[7] Accuracy, Precision, Recall or F1? https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

[8] https://beckernick.github.io/oversampling-modeling/

