
Deep Learning on Dataframes with PyTorch

 4 years ago
source link: https://www.tuicool.com/articles/MfMVFzq


The goal of this post is to lay out a framework that could get you up and running with deep learning predictions on any dataframe using PyTorch and Pandas. By any dataframe I mean any combination of: categorical features, continuous features, datetime features, regression, binary classification, or multi-class classification.

I may touch upon some of the technical aspects of what is going on behind the scenes, but mostly this is meant to be a framework discussion rather than a technical discussion. If you want to dig in further I suggest the fast.ai courses in deep learning. If you simply want predictions up and running without looking under the hood, the fast.ai library is a great place to get these models running quickly and effectively with little dev time.

import pandas as pd
import numpy as np
import re
from pandas.api.types import is_string_dtype, is_numeric_dtype
import warnings
from pdb import set_trace
import torch  # torch.cat, torch.sigmoid and torch.no_grad are used below
from torch import nn, optim, as_tensor
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from torch.nn.init import kaiming_normal_
import sklearn
from sklearn_pandas import DataFrameMapper
# Imputer is not imported: it was removed from recent scikit-learn releases and is unused here
from sklearn.preprocessing import LabelEncoder, StandardScaler

We will just use a made-up dataframe that has categorical features, continuous features, and one datetime feature.

rng = pd.date_range('2015-02-24', periods=500, freq='D')
df = pd.DataFrame({
    'date': rng,
    'cont1': np.random.randn(len(rng)),
    'cat1': [np.random.choice(['cat', 'dog', 'mouse', 'cow'])
             for _ in range(len(rng))],
    'cont2': 0.5 * np.random.randn(len(rng)) + 5,
    'cat2': [np.random.choice(['laundry', 'trash', 'mop', 'sweep'])
             for _ in range(len(rng))],
    'targ': np.random.randint(low=1, high=10, size=len(rng)),
})
Sample dataframe

I’m just going to assume we want to use all of our data to train and then predict the very last day of the dataset and check how we did. For your case, this may be predicting the last week, month, year of data — but here we will just use the last day.

max_date = max(df.date).strftime(format='%Y-%m-%d')
test_human_readable = df.loc[df.date ==
 pd.to_datetime(max_date, 
 format='%Y-%m-%d'),:].copy()

I call this dataframe (which is just one row of data) test_human_readable because we are about to apply transformations that make the dataset almost impossible to read by eye. I like to extract my test set now; later, when I predict, I append the prediction to this dataframe so I can see all of the features as they were at the start, plus the prediction and the actual value.

Now we will establish some helper functions for the pre-processing of the data.

def add_datepart(df, fldname, drop=True, time=False, errors="raise"):
    "Create many new columns based on datetime column."
    fld = df[fldname]
    fld_dtype = fld.dtype
    if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
        fld_dtype = np.datetime64
    if not np.issubdtype(fld_dtype, np.datetime64):
        # infer_datetime_format is deprecated in pandas 2.x, so it is omitted here
        df[fldname] = fld = pd.to_datetime(fld, errors=errors)
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
            'Is_month_end', 'Is_month_start', 'Is_quarter_end',
            'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    if time: attr = attr + ['Hour', 'Minute', 'Second']
    for n in attr:
        if n == 'Week' and hasattr(fld.dt, 'isocalendar'):
            # Series.dt.week was removed in pandas 2.0; use the ISO calendar week
            df[targ_pre + n] = fld.dt.isocalendar().week.astype(fld.dt.day.dtype)
        else:
            df[targ_pre + n] = getattr(fld.dt, n.lower())
    # seconds since the Unix epoch
    df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
    if drop: df.drop(fldname, axis=1, inplace=True)

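def train_cats(df, cat_vars):
    # numericalize/categoricalize: replace each class with its integer code,
    # shifted by 1 so that 0 is left for missing values (code -1)
    for name, col in df.items():
        if name in cat_vars:
            df[name] = col.cat.codes + 1
    df = pd.get_dummies(df, dummy_na=True)
    return df
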
def scale_vars(df, mapper):
    warnings.filterwarnings('ignore',
                            category=sklearn.exceptions.DataConversionWarning)
    if mapper is None:
        # fit one StandardScaler per numeric column
        map_f = [([n], StandardScaler()) for n in df.columns
                 if is_numeric_dtype(df[n])]
        mapper = DataFrameMapper(map_f).fit(df)
    df[mapper.transformed_names_] = mapper.transform(df)
    return mapper

def proc_df(df, cat_vars, cont_vars, y_fld=None, do_scale=False,
            mapper=None, na_dict=None):
    """Preprocess the train, valid, test sets to numericalize,
    fill missing values, and normalize."""
    ignore_flds = []
    skip_flds = []
    # set the dependent variable name and concatenate the cat and
    # cont
    dep_var = y_fld
    df = df[cat_vars + cont_vars + [dep_var]].copy()
    df[dep_var] = df[dep_var].astype(int)
    df = df.copy()
    ignored_flds = df.loc[:, ignore_flds]
    y = df[y_fld].values
    # deal with skip fields
    skip_flds += [y_fld]
    df.drop(skip_flds, axis=1, inplace=True)
    # initialize the na dictionary
    if na_dict is None: na_dict = {}
    else: na_dict = na_dict.copy()
    na_dict_initial = na_dict.copy()
    # fill missing; iterate over a snapshot because '_na' columns are
    # added inside the loop
    for name, col in list(df.items()):
        if is_numeric_dtype(col):
            if pd.isnull(col).sum():
                df[name + '_na'] = pd.isnull(col)
                filler = col.median()
                df[name] = col.fillna(filler)
                na_dict[name] = filler
    # keep track of which entries are missing and possibly use them
    # in the model
    if len(na_dict_initial.keys()) > 0:
        df.drop([a + '_na' for a in list(set(na_dict.keys()) -
                 set(na_dict_initial.keys()))], axis=1, inplace=True)
    # normalize
    if do_scale: mapper = scale_vars(df, mapper)
    res = [df, y, na_dict]
    # keep track of how things were normalized
    if do_scale: res = res + [mapper]
    return res

Great. Now we want to add the new datetime features to our dataframe, normalize the continuous data, and categoricalize the categorical features (replace each class with an integer code).

add_datepart(df, 'date', drop=False)

add_datepart is an in-place operation, so our dataframe now has many more columns representing different aspects of the date column. I do not drop the date column yet because I want to use it soon to create my train, valid, and test dataframes.

columns of df
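
To make the expansion concrete, here is a minimal sketch of the kind of columns add_datepart derives, using the pandas `.dt` accessor directly. The two dates are made up for illustration, and only a handful of the attributes are shown.

```python
import pandas as pd

# Two hypothetical dates, chosen only for illustration.
dates = pd.Series(pd.to_datetime(['2015-02-28', '2016-01-01']))

# Each attribute of the .dt accessor becomes its own column.
parts = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Dayofweek': dates.dt.dayofweek,          # Monday=0 ... Sunday=6
    'Is_month_end': dates.dt.is_month_end,
    'Is_year_start': dates.dt.is_year_start,
})
print(parts)
```

A single datetime column turns into several small integer or boolean columns, which is what lets them be treated as categorical features later on.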

Let’s now define which columns are categorical and which are continuous.

cat_vars = ['cat1', 'cat2', 'Year', 'Month', 'Week', 'Day',
            'Dayofweek', 'Dayofyear', 'Is_month_end',
            'Is_month_start', 'Is_quarter_end', 'Is_quarter_start',
            'Is_year_end', 'Is_year_start', 'Elapsed']
cont_vars = ['cont1', 'cont2']

I want to categoricalize all of my cat features, but I also want to make sure that they are encoded the same way in my train, valid, and test dataframes. This means that if cow gets mapped to 2 in my training data, I don't want cow to be mapped to something else in my valid or test data. So I will do that operation now and then split up my dataset.

for v in cat_vars: df[v] = df[v].astype('category').cat.as_ordered()
df = train_cats(df, cat_vars)
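
A small sketch of why the order matters, using a made-up five-row column rather than the dataframe above: categorizing once and then splitting keeps the codes stable, while categorizing a split on its own renumbers whatever classes it happens to contain.

```python
import pandas as pd

# Hypothetical column, for illustration only.
animals = pd.Series(['cow', 'dog', 'cat', 'mouse', 'dog'])

# Categorize once on the full column, then split: codes stay consistent.
full = animals.astype('category').cat.as_ordered()
codes = full.cat.codes + 1            # same +1 shift as train_cats
train_codes, test_codes = codes[:3], codes[3:]

# Categorizing the test rows on their own renumbers the classes.
test_alone = animals[3:].astype('category').cat.codes + 1

print(codes.tolist())       # cat=1, cow=2, dog=3, mouse=4
print(test_codes.tolist())  # mouse and dog keep their codes
print(test_alone.tolist())  # mouse and dog are renumbered
```

In the first case mouse and dog keep the codes 4 and 3 in the test rows; in the second they become 2 and 1, which would point the model at the wrong embedding rows.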

The model will treat the categorical features with embeddings, so we need to pre-calculate our embedding sizes so that we can initialize them later in our model. Jeremy Howard of fast.ai suggests using the minimum of 50 and half the cardinality of the class.

This is, again, an operation that should be done before you break up the dataset into train, valid, test.

for v in cat_vars: df[v] = df[v].astype('category').cat.as_ordered()
cat_sz = [(c, len(df[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

If you break up the dataset first and then check embedding sizes you might get the error: RuntimeError: index out of range: Tried to access index 12 out of table with 11 rows. This is because the calculation of the embedding sizes did not take into account some of the classes if they were left out of the training data by chance.
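
As a rough illustration of both points, the sketch below applies the sizing rule to a few made-up cardinalities and then reproduces the out-of-range failure on a deliberately undersized table. The helper name emb_size is an assumption for this example, not part of the code above, and the exact exception type and message depend on the PyTorch version.

```python
import torch
from torch import nn

def emb_size(cardinality, max_size=50):
    # Rule quoted above: min(50, half the cardinality), rounding up.
    return min(max_size, (cardinality + 1) // 2)

# Hypothetical cardinalities, for illustration only.
sizes = {c: emb_size(c) for c in (5, 13, 367, 501)}
print(sizes)

# An embedding table with 11 rows only accepts indices 0..10.
emb = nn.Embedding(num_embeddings=11, embedding_dim=emb_size(11))
ok = emb(torch.tensor([0, 10]))       # both indices are inside the table

try:
    emb(torch.tensor([12]))           # a class the table was not sized for
    failed = False
except (IndexError, RuntimeError):    # exception type varies by PyTorch version
    failed = True
print(failed)
```

Computing the cardinalities on the full dataframe guarantees that every code seen in valid or test has a row in its table.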

train = df.loc[df.date < pd.to_datetime('2016-01-01',
                                        format='%Y-%m-%d'), :].copy()
valid = df.loc[(df.date >= pd.to_datetime('2016-01-01',
                                          format='%Y-%m-%d')) &
               (df.date < pd.to_datetime(max_date,
                                         format='%Y-%m-%d')), :].copy()
test = df.loc[df.date == pd.to_datetime(max_date,
                                        format='%Y-%m-%d'), :].copy()
train = train.drop(columns='date')
valid = valid.drop(columns='date')
test = test.drop(columns='date')

So I fairly arbitrarily chose my validation set. This is not good practice and in your production environment you should test out different sets and try to map them closely to the test set, but in this case I just took everything after January 2016.

for v in cat_vars:
    train[v] = train[v].astype('category').cat.as_ordered()
for v in cont_vars: train[v] = train[v].astype('float32')
for v in cat_vars:
    valid[v] = valid[v].astype('category').cat.as_ordered()
for v in cont_vars: valid[v] = valid[v].astype('float32')
for v in cat_vars:
    test[v] = test[v].astype('category').cat.as_ordered()
for v in cont_vars: test[v] = test[v].astype('float32')

We want to pass our model the categorical features separately from the continuous features so the cats can be passed through the embeddings first and then through the linear, relu, batchnorm, dropout along with the conts.
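
The data flow described here can be sketched in isolation. The batch, the cardinalities, and the layer sizes below are made up for illustration; the point is only the order of operations: look up each categorical column in its own embedding table, concatenate the results with the continuous columns, then feed the combined tensor to the dense layers.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical batch: 4 rows, 2 categorical columns, 3 continuous columns.
x_cat = torch.tensor([[1, 2], [3, 1], [2, 4], [4, 3]])    # integer codes
x_cont = torch.randn(4, 3)

# One embedding table per categorical column: (number of rows, embedding dim).
embs = nn.ModuleList([nn.Embedding(5, 3), nn.Embedding(5, 2)])

# Look up each column in its own table, then join the embeddings side by side.
emb_out = torch.cat([e(x_cat[:, i]) for i, e in enumerate(embs)], dim=1)

# Append the continuous features and pass everything through one hidden block.
x = torch.cat([emb_out, x_cont], dim=1)
block = nn.Sequential(nn.Linear(x.shape[1], 8), nn.ReLU(),
                      nn.BatchNorm1d(8), nn.Dropout(0.1))
out = block(x)

print(emb_out.shape, x.shape, out.shape)
```

The embedding widths (3 and 2) add up to 5 columns, and the 3 continuous columns bring the input of the first linear layer to 8.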

class ColumnarDataset(Dataset):
    """Dataset class for column dataset.
    Args:
        df (DataFrame): Features, categorical and continuous columns
            together.
        cat_flds (list of str): Names of the columns that contain
            categorical variables. All other columns are treated as
            continuous.
        y (array, optional): Target variables.
        is_reg (bool): If the task is regression, set ``True``,
            otherwise (classification) ``False``.
        is_multi (bool): If the task is multi-label classification,
            set ``True``.
    """
    def __init__(self, df, cat_flds, y, is_reg, is_multi):
        df_cat = df[cat_flds]
        df_cont = df.drop(cat_flds, axis=1)

        cats = [c.values for n, c in df_cat.items()]
        conts = [c.values for n, c in df_cont.items()]

        n = len(cats[0]) if cats else len(conts[0])
        self.cats = (np.stack(cats, 1).astype(np.int64)
                     if cats else np.zeros((n, 1)))
        self.conts = (np.stack(conts, 1).astype(np.float32)
                      if conts else np.zeros((n, 1)))
        self.y = np.zeros((n, 1)) if y is None else y
        # regression targets get shape (n, 1) to match the model output
        if is_reg: self.y = self.y[:, None]
        self.is_reg = is_reg
        self.is_multi = is_multi

    def __len__(self): return len(self.y)

    def __getitem__(self, idx):
        return [self.cats[idx], self.conts[idx], self.y[idx]]

As you can see in this class the __getitem__ is retrieving a list of the cats, conts, and target for that idx value.
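
To see what a DataLoader does with that three-item list, here is a small sketch with a hypothetical stand-in class (TinyColumnar is not part of the code above) that returns items in the same shape. The default collate function stacks each of the three pieces across the batch and converts them to tensors.

```python
import numpy as np
from torch.utils.data import Dataset, DataLoader

class TinyColumnar(Dataset):
    """Hypothetical stand-in with the same __getitem__ shape as ColumnarDataset."""
    def __init__(self, cats, conts, y):
        self.cats, self.conts, self.y = cats, conts, y
    def __len__(self):
        return len(self.y)
    def __getitem__(self, idx):
        return [self.cats[idx], self.conts[idx], self.y[idx]]

n = 10
ds = TinyColumnar(
    cats=np.ones((n, 2), dtype=np.int64),         # 2 categorical columns
    conts=np.zeros((n, 3), dtype=np.float32),     # 3 continuous columns
    y=np.arange(n, dtype=np.float32)[:, None],    # regression target, shape (n, 1)
)

# One batch of 4 rows: three tensors, one per item returned by __getitem__.
x_cat, x_cont, y = next(iter(DataLoader(ds, batch_size=4, shuffle=False)))
print(x_cat.shape, x_cont.shape, y.shape)
```

This is why the training loop can unpack each batch directly into the categorical inputs, the continuous inputs, and the target.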

Normalize and pre-process each dataset.

dep_var = 'targ'
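df, y, nas, mapper = proc_df(train, cat_vars, cont_vars, dep_var,
                             do_scale=True)
df_val, y_val, nas, mapper = proc_df(valid, cat_vars, cont_vars,
                                     dep_var, do_scale=True,
                                     mapper=mapper, na_dict=nas)
# Reuse the scaler fitted on the training set: refitting on a
# one-row test set would turn every continuous feature into 0
df_test, y_test, nas, mapper = proc_df(test, cat_vars, cont_vars,
                                       dep_var, do_scale=True,
                                       mapper=mapper, na_dict=nas)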
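
The reason for carrying the fitted mapper from the training call into later calls can be shown with StandardScaler alone. The numbers below are made up; the sketch assumes a single continuous column and a one-row test set, as in this post.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for one continuous column.
train_col = np.array([[1.0], [2.0], [3.0]])
test_col = np.array([[4.0]])

# Statistics come from the training data only and are reused on new data.
scaler = StandardScaler().fit(train_col)
test_scaled = scaler.transform(test_col)          # (4 - mean(train)) / std(train)

# Fitting a fresh scaler on a single row throws the information away:
# the row is its own mean, so it is always scaled to 0.
refit = StandardScaler().fit_transform(test_col)

print(test_scaled, refit)
```

With the training statistics, the test value lands well above the training mean; with a refit, it collapses to 0 no matter what it was.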

Initialize each dataset object and make dataloader objects.

trn_ds = ColumnarDataset(df, cat_vars, y, is_reg=True, is_multi=False)
val_ds = ColumnarDataset(df_val, cat_vars, y_val,
                         is_reg=True, is_multi=False)
test_ds = ColumnarDataset(df_test, cat_vars, y_test,
                          is_reg=True, is_multi=False)
bs = 64
train_dl = DataLoader(trn_ds, bs, shuffle=True)
val_dl = DataLoader(val_ds, bs, shuffle=False)
test_dl = DataLoader(test_ds, len(df_test), shuffle=False)

Define the model.

class MixedInputModel(nn.Module):
    """Model able to handle inputs consisting of both categorical and continuous variables.
    Args:
        emb_szs (list of (int, int)): (cardinality, embedding size)
            pair for each categorical variable
        n_cont (int): Number of continuous variables in inputs
        emb_drop (float): Dropout applied to the output of embedding
        out_sz (int): Size of model's output.
        szs (list of int): List of hidden variables sizes
        drops (list of float): List of dropout applied to hidden
            variables
        y_range (list of float): Min and max of `y`.
            y_range[0] = min, y_range[1] = max.
        use_bn (bool): If use BatchNorm, set ``True``
        is_reg (bool): If regression, set ``True``
        is_multi (bool): If multi-label classification, set ``True``
    """
    def __init__(self, emb_szs, n_cont, emb_drop, out_sz, szs,
                 drops, y_range=None, use_bn=False, is_reg=True,
                 is_multi=False):
        super().__init__()
        for i, (c, s) in enumerate(emb_szs):
            assert c > 1, \
                f"cardinality must be >=2, got emb_szs[{i}]: ({c},{s})"
        if not is_reg and not is_multi:
            assert out_sz >= 2, \
                "For classification with out_sz=1, use is_multi=True"
        self.embs = nn.ModuleList([nn.Embedding(c, s)
                                   for c, s in emb_szs])
        for emb in self.embs: emb_init(emb)
        n_emb = sum(e.embedding_dim for e in self.embs)
        self.n_emb, self.n_cont = n_emb, n_cont

        # the first layer takes the embeddings and the continuous inputs
        szs = [n_emb + n_cont] + szs
        self.lins = nn.ModuleList([
            nn.Linear(szs[i], szs[i+1]) for i in range(len(szs)-1)])
        self.bns = nn.ModuleList([
            nn.BatchNorm1d(sz) for sz in szs[1:]])
        for o in self.lins: kaiming_normal_(o.weight.data)
        self.outp = nn.Linear(szs[-1], out_sz)
        kaiming_normal_(self.outp.weight.data)
        self.emb_drop = nn.Dropout(emb_drop)
        self.drops = nn.ModuleList([nn.Dropout(drop)
                                    for drop in drops])
        self.bn = nn.BatchNorm1d(n_cont)
        self.use_bn, self.y_range = use_bn, y_range
        self.is_reg = is_reg
        self.is_multi = is_multi

    def forward(self, x_cat, x_cont):
        if self.n_emb != 0:
            # one embedding lookup per categorical column
            x = [e(x_cat[:, i]) for i, e in enumerate(self.embs)]
            x = torch.cat(x, 1)
            x = self.emb_drop(x)
        if self.n_cont != 0:
            x2 = self.bn(x_cont)
            x = torch.cat([x, x2], 1) if self.n_emb != 0 else x2
        for l, d, b in zip(self.lins, self.drops, self.bns):
            x = F.relu(l(x))
            if self.use_bn: x = b(x)
            x = d(x)
        x = self.outp(x)
        if not self.is_reg:
            if self.is_multi:
                x = torch.sigmoid(x)
            else:
                x = F.log_softmax(x, dim=1)
        elif self.y_range:
            # squash the regression output into (y_range[0], y_range[1])
            x = torch.sigmoid(x)
            x = x * (self.y_range[1] - self.y_range[0])
            x = x + self.y_range[0]
        return x

def emb_init(x):
    x = x.weight.data
    sc = 2 / (x.size(1) + 1)
    x.uniform_(-sc, sc)
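
The y_range branch at the end of forward is worth a closer look. A minimal sketch of the same arithmetic, with a hypothetical helper name (scale_to_range) and a made-up range of 0 to 9:

```python
import torch

def scale_to_range(x, y_min, y_max):
    # sigmoid maps any real number into (0, 1); stretch and shift it
    # so the result lies in (y_min, y_max)
    return torch.sigmoid(x) * (y_max - y_min) + y_min

raw = torch.tensor([-10.0, 0.0, 10.0])    # unbounded outputs of the last layer
scaled = scale_to_range(raw, 0.0, 9.0)
print(scaled)
```

A raw output of 0 lands in the middle of the range, and very large or very small outputs approach the bounds without ever reaching them, so the model cannot predict values outside the range it was given.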

Initialize model. We are doing a regression task on the targ column.

model = MixedInputModel(emb_szs,
                        n_cont=len(df.columns) - len(cat_vars),
                        emb_drop=0.04, out_sz=1,
                        szs=[1000, 500], drops=[0.001, 0.01],
                        y_range=(0, np.max(y)), use_bn=True,
                        is_reg=True, is_multi=False)

And we are now ready to train the model.

def train_model(model, train_dl, val_dl, n_epochs=1, lr=5e-2):
    "Run training loops."
    epochs = n_epochs
    opt = optim.SGD(model.parameters(), lr=lr)
    loss_func = nn.MSELoss()
    try:
        for epoch in range(epochs):
            model.train()
            for xb1, xb2, yb in train_dl:
                preds = model(xb1, xb2)
                loss = loss_func(preds, yb.float())

                loss.backward()
                opt.step()
                opt.zero_grad()

            model.eval()
            with torch.no_grad():
                loss_val = sum(loss_func(model(xv1, xv2),
                                         yv.float())
                               for xv1, xv2, yv in val_dl)
            # mean validation loss per batch
            print(epoch, loss_val / len(val_dl))
    except Exception as e:
        exception = e
        raise

Finally we can train and predict over the test set.

train_model(model, train_dl, val_dl, n_epochs=500, lr=5e-2)

def predict_test(model, test_dl):
    "Returns predictions over test_df."
    model.eval()
    with torch.no_grad():
        # test_dl holds the whole test set in a single batch
        preds = [model(xv1, xv2) for xv1, xv2, _ in test_dl][0]
    # Regression: the model output is already the prediction. An argmax
    # over dim=1 only applies to classification; with out_sz=1 it
    # would always return 0.
    preds = preds.numpy().ravel()
    test_human_readable['targ_pred'] = preds
    return preds, test_human_readable

preds, df = predict_test(model, test_dl)
Test set with prediction

So ideally you could walk up with any dataframe in pandas and run this code and get a decent output of predictions. But this hopefully allows you to dissect the process a bit more and try out some modeling variations or whatever piques your interest.

Have fun!

