
Predicting Interest Rate with Classification Models — Part 1

source link: https://towardsdatascience.com/predicting-interest-rate-with-classification-models-part-1-c7d6f82b739a?gi=d32eab178ad1


A practical look at predicting Fed Funds Effective Rate up movements with Logistic Regression


Aug 3 · 8 min read


Photo by Mark Boss on Unsplash

A couple of years ago, I started working for a quant company called M2X Investments, and my first challenge was to create a model that could predict interest rate movements.

After a couple of days working solely on cleaning and preparing the data, I took the following approach: build a simple model and then reverse engineer it to make it better (optimizing and selecting features). Then, if the results weren’t good enough, I would change the model and repeat the process, and so forth.

Therefore, the objective of this series of posts is to apply different classification models to predict upward movements of the interest rate, provide a brief intuition of each model (there are plenty of posts covering the models' mathematics and concepts), and compare their results. By focusing only on upward movements, we simplify the problem.

Note: from here on, the data set I will use is fictitious and for educational purposes only.

The data set used in this post is from Quandl, specifically from the Commodity Indices, Merrill Lynch, and US Federal Reserve databases. The idea was to use agriculture, metals, and energy indices, along with corporate bond yields, to classify the up movements of the Federal funds effective rate.

A brief introduction to Logistic Regression

Logistic Regression is a binary classification method. It is a type of Generalized Linear Model that predicts the probability of occurrence of a binary or categorical variable using a logit link function. It relies on a kind of function called the sigmoid, which maps the input to a value between 0 and 1.

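For reference, the sigmoid function can be written as

\sigma(z) = \frac{1}{1 + e^{-z}}

which squashes any real-valued input z into the interval (0, 1), making it a natural way to express a probability.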

When building the regression model with the sigmoid function, we end up with an equation, shown below, that gives us the probability of occurrence (p) of the dependent variable.

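Plugging the linear combination of the independent variables into the sigmoid, the binary logistic model gives

p = P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}

or, equivalently, on the logit scale,

\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n

so the log-odds of the event are linear in the predictors.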

The model is estimated using Maximum Likelihood Estimation (MLE), and there are basically three types of Logistic Regression models: Binary, Multinomial, and Ordinal. In this post, we are going to work with the Binary model.
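Concretely, for the binary case, MLE picks the coefficients \beta that maximize the log-likelihood of the observed labels,

\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]

where p_i is the predicted probability for observation i. There is no closed-form solution, so the optimum is found numerically, which is what statsmodels and scikit-learn will do for us below.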

The code

First, we import the libraries we are going to use and include Quandl’s API key to download the variables we need.

import numpy as np
import pandas as pd
import quandl as qdl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# API key from Quandl (free but not necessary)
qdl.ApiConfig.api_key = "JsDf-rbjTsUCP8TzomaW"

# get commodity index data from Quandl (agriculture, metals, energy)
data = pd.DataFrame()
meta_data = ['RICIA','RICIM','RICIE']
for code in meta_data:
    df = qdl.get('RICI/'+code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

# get corporate bond yield data from Quandl (Merrill Lynch)
meta_data = ['EMHYY','AAAEY','USEY']
for code in meta_data:
    df = qdl.get('ML/'+code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

An essential part of the process is dealing with NaN values. The methods we use to fill or drop them depend on the problem at hand. Unfortunately, that is not the purpose of this post, so I will take a basic approach and replace them with the average value of each variable. This is sometimes a naive solution, but for our purposes it is just fine.

# dealing with possible empty values (not much attention to this part, but it is very important)
data.fillna(data.mean(), inplace=True)
print(data.head())
print("\nData shape:\n",data.shape)
First rows and shape of the data set (Image by Author)
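Since these are time series, mean-filling leaks information from the future into the past. A minimal alternative sketch, keeping the same data frame, would be a forward fill (carry the last observed value forward), with a backward fill only for leading gaps:

# alternative: forward-fill gaps with the last observed value,
# then back-fill any NaNs left at the very start of the series
data = data.ffill().bfill()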

Let’s remember our variables in more detail. RICIA is the Euronext Rogers International Agriculture Commodity Index, RICIM is the Euronext Rogers International Metals Commodity Index, RICIE is the Euronext Rogers International Energy Commodity Index, EMHYY is the Emerging Markets High Yield Corporate Bond Index Yield, AAAEY is the US AAA-rated Bond Index (yield) and, finally, USEY is the US Corporate Bond Index Yield.

Back to the code! Now we are going to look at our data and see if we can find characteristics that will help us improve our future model.

# histograms of each variable
data.hist()
plt.suptitle('Histograms')  # figure-level title (plt.title would only label the last subplot)
plt.show()

Histograms of the data set variables (Image by Author)

The first thing we can notice is that they vary a lot in scale from each other. We can deal with that by applying Min-Max scaling.
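Min-Max scaling maps each variable x to the [0, 1] interval via

x' = \frac{x - \min(x)}{\max(x) - \min(x)}

so the relative shape of each series is preserved while the scales become comparable.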

# scaling values to make them vary between 0 and 1
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data.values), columns=data.columns, index=data.index)

I don’t want to get overextended on this matter, so let’s imagine that this was all we were able to figure out. Next, we will move to our dependent variable, RIFSPFF_N_D (more commonly known as the Federal funds effective rate).

# pulling the dependent variable from Quandl (Fed funds effective rate)
par_yield = qdl.get('FED/RIFSPFF_N_D', start_date="2005-01-03", end_date="2020-07-01")
par_yield.columns = ['FED/RIFSPFF_N_D']

# create an empty df with the same index as the variables and fill it with our dependent var values
# (I think this is unnecessary with this data set... =))
par_data = pd.DataFrame(index=data_scaled.index, columns=['FED/RIFSPFF_N_D'])
par_data.update(par_yield['FED/RIFSPFF_N_D'])

# get the variation and binarize it
par_data = par_data.pct_change()
par_data.fillna(0, inplace=True)
par_data = par_data.apply(lambda x: [0 if y <= 0 else 1 for y in x])
print("Number of 0s and 1s:\n", par_data.value_counts())

# plot number of 0s and 1s
sns.countplot(x='FED/RIFSPFF_N_D', data=par_data, palette='Blues')
plt.title('0s and 1s')
plt.savefig('0s and 1s')

We downloaded our dependent variable, took its percentage variation, and transformed it into 0s (when ≤ 0) and 1s (when > 0). Here is what we got: 3143 zeros and 909 ones.

It is important to note that by binarizing the data this way, we are concerned with the up movements only, labeling downward moves and no movement the same.

Count of 0s and 1s in the target variable (Image by Author)

Well, that’s not a good ratio of 0s to 1s, right? To deal with this issue, we can use methods for oversampling the data. We are going to use the ADASYN method. The fundamental difference between ADASYN and SMOTE is that the former uses a density distribution for the minority points, while the latter uses uniform weights. Don't worry, now is the moment to have faith and believe that this is a suitable method!
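As a side note, if you ever want to sanity-check that choice, swapping ADASYN for SMOTE is a one-line change (a sketch only, not something we run here):

# hypothetical swap: SMOTE uses uniform weights over the minority samples
from imblearn.over_sampling import SMOTE
sampler = SMOTE(random_state=13)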

# over-sampling with the ADASYN method
sampler = ADASYN(random_state=13)
X_os, y_os = sampler.fit_resample(data_scaled, par_data.values.ravel())  # fit_resample replaces the deprecated fit_sample
columns = data_scaled.columns
data_scaled = pd.DataFrame(data=X_os, columns=columns)
par_data = pd.DataFrame(data=y_os, columns=['FED/RIFSPFF_N_D'])

print("\nProportion of 0s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==0])/len(data_scaled))
print("\nProportion of 1s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==1])/len(data_scaled))
Proportions of 0s and 1s after oversampling (Image by Author)

Now that we have our data well balanced, let’s split it into train and test sets and run a logit regression to analyze the p-values. The purpose of this step is to filter the independent variables.

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data_scaled, par_data, test_size=0.2, random_state=13)

# just to make it easier to write y
y = y_train['FED/RIFSPFF_N_D']

# logit model to analyze p-values and filter the remaining variables
logit_model = sm.Logit(y, X_train)
result = logit_model.fit()
print('\nComplete logit regression:\n', result.summary2())

Logit regression summary (Image by Author)

OK, all variables seem to show a p-value < 0.05, so we are going to stick with them and fire up our model!

# logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y)
y_pred = logreg.predict(X_test)
print('\nAccuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n', confusion_matrix)
print('\nClassification report:\n', metrics.classification_report(y_test, y_pred))

# plot confusion matrix (in newer scikit-learn versions, use metrics.ConfusionMatrixDisplay.from_estimator)
disp = metrics.plot_confusion_matrix(logreg, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')
Model output: accuracy, confusion matrix, and classification report (Image by Author)

Confusion matrix plot (Image by Author)

So there it is! The attempt to solve the problem using Logistic Regression gave us an accuracy of 66%, predicting 810 labels correctly. We know that accuracy by itself is not that informative, so let's look at the classification report and the ROC curve.

# roc curve (beautiful code from Susan Li) 
logit_roc_auc = metrics.roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - Logistic Regression')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
ROC curve for the Logistic Regression model (Image by Author)

The classification report gives us Precision, Recall, and F1-score. Precision tells us, out of the observations predicted positive, how many are actually positive. Recall tells us how many of the true positives the model captures by classifying them as positive. The F1-score takes both Precision and Recall into consideration and is useful when the data is unbalanced. It seems that our metrics are well balanced, despite their low values.
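In terms of the confusion matrix counts (TP, FP, FN), these metrics are

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}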

Classification report (Image by Author)

The objective of analyzing the ROC curve is to see how far the model is from the red line, which represents a purely random classifier. The closer the curve gets to the top-left corner, the better; in other words, the bigger the area under the curve, the better. We got an area of 0.65, so it is noticeable that we still have a long way to go… In the next post (Part 2), we are going to tackle the problem with the Naive Bayes method.

This article was written in conjunction with Guilherme Bezerra Pujades Magalhães .

References and great links

[1] J. Starmer, StatQuest with Josh Starmer on Logistic Regression , YouTube.

[2] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique (2002), Journal of Artificial Intelligence Research, Volume 16, pages 321–357.

[3] H. He, Y. Bai, E. A. Garcia, and S. Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning (2008), IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, pp. 1322–1328.

