


Binary classification: text classification practice based on BERT, with complete code
Datawhale
Author: Gao Baoli, outstanding Datawhale learner
Message: Bert is naturally suited to classification tasks. There are many methods for text classification, such as FastText and TextCNN, but they all pale in comparison with Bert.
Recommendation-reason display means selecting one of a shop's many user reviews and showing it as the reason to recommend that shop, in the hope that more people will click into the shop.
This is rather like a recommendation system, because suitable reviews have to be recommended to different users. For example, for the same Cantonese restaurant, user A has high expectations of the environment: if the displayed reason is "nice environment", A will click in. User B cares more about the taste of the dishes and less about the environment, so a reason like "delicious food" makes B more likely to click in. In other words, for the same shop, different people see different recommendation reasons according to their preferences.
This task is a typical short-text (at most 20 characters) binary classification problem, solved here with pre-trained Bert. Below it is explained in three parts: task description, solution idea, and code implementation.
Task description
Background description
The goal of this recommendation-reason display task is to mine, from real user reviews, short sentences that are suitable to show as recommendation reasons. A recommendation reason displayed in a review app should have the following three characteristics:
- Limited length
- Highly relevant content
- Strong textual appeal
Some real reasons for recommendation are shown in the blue box below:
Data set
This is a binary classification task, so the positive-to-negative sample ratio matters. The training set has a total of 16,000 samples, with a positive-to-negative ratio of roughly 1:2; there is some imbalance, but overall it is not severe.
Data link: pan.baidu.com/s/1z_SJ5KhH...
Alternatively, reply with the keyword "recommendation data" in the Datawhale official account to obtain it.
Problem-solving ideas
The premise of ML/DL
Whether in machine learning or deep learning, the working premise is that the training set and the test set are independent and identically distributed; only when this premise holds can the model be expected to perform well. Start with a simple analysis of text length: if the training set consisted of short texts and the test set of long texts, the model would not perform well.
The results of data analysis are as follows:
Regarding comment length, two characteristics stand out:
- The quantiles of the training set and the test set are almost identical.
- The mean and standard deviation of the two sets are also roughly the same:

|  | Mean | Standard deviation |
| --- | --- | --- |
| Training set | 8.67 | 3.18 |
| Test set | 8.63 | 3.11 |
Therefore, in terms of comment length, the training set and the test set are independent and identically distributed; the lengths under label 0 and label 1 also differ little, so text length as a feature has little effect on classification. It also follows that if our model performs well on the training set, there is reason to believe it will perform well on the test set.
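For reference, the length statistics above can be reproduced with a few lines of pandas. This is a minimal sketch that assumes local copies of the two CSV files and the content column used in the complete code below:

```python
import pandas as pd

# Hypothetical local paths; the 'content' column name matches the complete code below.
train_df = pd.read_csv('union_train.csv')
test_df = pd.read_csv('test.csv')

for name, df in [('train', train_df), ('test', test_df)]:
    lengths = df['content'].astype(str).str.len()
    print(name,
          'mean = %.2f' % lengths.mean(),
          'std = %.2f' % lengths.std(),
          'quantiles =', lengths.quantile([0.25, 0.5, 0.75, 0.95]).round(1).tolist())
```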
The main idea
There are many methods for text classification, such as FastText, TextCNN, or RNN-based models, but they all pale in comparison with Bert, which is naturally suited to classification tasks.
The official approach is to take the hidden state corresponding to [CLS] and pass it through a fully connected layer to obtain the classification result. To make fuller use of the information in all time steps, the last layer of Bert is taken out and a few simple operations are applied to it:
- Run Bert and obtain the hidden representation of every time step; the number of time steps equals the sentence length.
- Summarize the time-step representations in three ways: global average pooling, global max pooling, and the self-attention output taken at the [CLS] position (attention between [CLS] and the other positions in the sequence).
- Concatenate the summaries and feed them into a fully connected layer for classification.
Model training
Five-fold cross-validation is used: the training set is split into five parts, each part in turn serves as the validation set while the remaining four parts are used for training, which yields five models. The union of the five validation sets is exactly the training set. The predictions of the five models on the test set are averaged to obtain the final prediction.
Because the Bert model has a large number of parameters and the training set only has 16,000 samples, early stopping is used to prevent overfitting. A toy illustration of the fold-averaging scheme is sketched below.
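This is a self-contained toy illustration of the five-fold averaging scheme, assuming a stand-in LogisticRegression on random features in place of the Bert model (the real loop is run_cv in the complete code below):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(2020)
X, y = rng.randn(1000, 8), rng.randint(0, 2, 1000)   # toy "training set"
X_test = rng.randn(200, 8)                           # toy "test set"

test_pred = np.zeros(len(X_test))
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=2020).split(X):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])   # train on 4/5 of the data
    print('fold validation accuracy:', clf.score(X[valid_idx], y[valid_idx]))
    test_pred += clf.predict_proba(X_test)[:, 1]                 # accumulate this fold's test predictions

test_pred /= 5   # final prediction = average of the five models
```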
The Keras implementation is as follows:

    from keras_bert import load_trained_model_from_checkpoint, Tokenizer
    from keras_self_attention import SeqSelfAttention

    def build_bert(nclass, selfloss, lr, is_train):
        """
        nclass: number of nodes in the output layer
        selfloss: loss function
        lr: learning rate
        is_train: whether to fine-tune the Bert layers
        """
        bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)

        for l in bert_model.layers:
            l.trainable = is_train

        x1_in = Input(shape=(None,))   # token ids
        x2_in = Input(shape=(None,))   # segment ids

        x = bert_model([x1_in, x2_in])        # last-layer hidden states, shape (batch, seq_len, 768)
        x = Lambda(lambda x: x[:, :])(x)

        avg_pool_3 = GlobalAveragePooling1D()(x)                           # global average pooling
        max_pool_3 = GlobalMaxPooling1D()(x)                               # global max pooling
        attention_3 = SeqSelfAttention(attention_activation='softmax')(x)  # self-attention over the sequence
        attention_3 = Lambda(lambda x: x[:, 0])(attention_3)               # take the [CLS] position

        x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3])
        p = Dense(nclass, activation='sigmoid')(x)

        model = Model([x1_in, x2_in], p)
        model.compile(loss=selfloss, optimizer=Adam(lr), metrics=['acc'])
        print(model.summary())
        return model
I also tried some more complex operations (for example, adding a CNN or a layer of GRU on top), and tried taking out the features of the last three layers and combining them. These variants did not improve the score, but they were not much worse either; one of them is sketched below.
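As an illustration of one such variant (not the model used for the final result), here is a hedged sketch in which a bidirectional GRU summarizes the Bert sequence output instead of the pooling-plus-attention block; the function name build_bert_gru, the 128 GRU units and the fixed binary cross-entropy loss are illustrative choices:

```python
from keras.layers import Input, Bidirectional, GRU, Dense
from keras.models import Model
from keras.optimizers import Adam
from keras_bert import load_trained_model_from_checkpoint

def build_bert_gru(nclass, lr, config_path, checkpoint_path, is_train=False):
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)
    for layer in bert_model.layers:
        layer.trainable = is_train          # optionally freeze the Bert layers

    x1_in = Input(shape=(None,))            # token ids
    x2_in = Input(shape=(None,))            # segment ids
    x = bert_model([x1_in, x2_in])          # (batch, seq_len, 768)
    x = Bidirectional(GRU(128))(x)          # summarize the whole sequence with a BiGRU
    p = Dense(nclass, activation='sigmoid')(x)

    model = Model([x1_in, x2_in], p)
    model.compile(loss='binary_crossentropy', optimizer=Adam(lr), metrics=['acc'])
    return model
```

Such a variant can be trained with the same data_generator and five-fold loop as the main model.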
Optimization and improvement
The ratio of positive to negative samples in the training set is 1:2; the imbalance is not severe, but it is there. The usual loss function is cross-entropy, yet cross-entropy is not strictly monotonically related to AUC: a decrease in cross-entropy does not necessarily bring an increase in AUC. The best approach would be to optimize AUC directly, but AUC is hard to optimize.
When the classes are balanced, AUC, F1 and accuracy behave similarly; when they are imbalanced, accuracy is no longer a suitable metric and F1 or AUC should be used instead. Since both AUC and F1 are tied to precision and recall, I chose to optimize F1 directly. F1 itself is non-differentiable, but there is a way around this: see Su Jianlin's article on function smoothing, i.e. differentiable approximations of non-differentiable functions. The smoothed F1 is used directly as the loss function (f1_loss), as sketched below.
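Written out, the smoothing idea is only a few lines; this sketch mirrors the f1_loss in the complete code below, with the smoothing constant K.epsilon() placed in the denominator:

```python
import tensorflow as tf
import keras.backend as K

def soft_f1_loss(y_true, y_pred):
    # Soft counts: TP = sum(y_true * y_pred) and 2*TP + FP + FN = sum(y_true + y_pred),
    # so F1 = 2*TP / (2*TP + FP + FN) becomes a differentiable function of y_pred.
    soft_f1 = 2 * tf.reduce_sum(y_true * y_pred) / (tf.reduce_sum(y_true + y_pred) + K.epsilon())
    return -soft_f1   # minimizing -F1 maximizes F1
```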
Result analysis
Model 1: batch size 16, cross-entropy loss, learning rate 1e-5, with the Bert layers fine-tuned.
Model 2: loads Model 1, freezes the Bert layers and fine-tunes only the fully connected head with the F1 loss; the batch size is still 16 and the learning rate is 1e-7.
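In terms of the flags used by build_bert and run_cv in the complete code below, the two stages correspond roughly to the following settings (a sketch only; the checkpoint file name is illustrative):

```python
# Stage 1 (Model 1): fine-tune everything, Bert included, with cross-entropy.
model1 = build_bert(nclass=1, selfloss='binary_crossentropy', lr=1e-5, is_train=True)
# ... run the five-fold training loop, saving one checkpoint per fold (e.g. '0.hdf5') ...

# Stage 2 (Model 2): rebuild the same architecture with Bert frozen, reload the
# stage-1 weights of the fold, and fine-tune only the head with the F1 loss.
model2 = build_bert(nclass=1, selfloss=f1_loss, lr=1e-7, is_train=False)
model2.load_weights('0.hdf5')   # illustrative path to the fold-0 checkpoint from stage 1
```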
The offline (five-fold cross-validation) and online results of the two models are compared below:

| Model | Offline AUC | Online score |
| --- | --- | --- |
| Model 1 (cross entropy, Bert fine-tuned) | 0.9553 | 0.96668 |
| Model 2 (F1 loss, Bert frozen) | 0.9604 | 0.97010 |
Complete code
It runs for about one hour on a GPU; it can also run on a CPU, but that may take four or five hours.

    import codecs
    import gc
    import os
    import time

    import numpy as np
    import pandas as pd
    import tensorflow as tf

    import keras
    import keras.backend as K
    from keras.utils import to_categorical
    from keras.layers import *
    from keras.callbacks import *
    from keras.models import Model
    from keras.optimizers import Adam
    from keras.utils.training_utils import multi_gpu_model
    from keras.backend.tensorflow_backend import set_session

    from sklearn.model_selection import KFold
    from sklearn.metrics import roc_auc_score

    from keras_bert import load_trained_model_from_checkpoint, Tokenizer
    from keras_self_attention import SeqSelfAttention

    # Results:
    # cross entropy, batch=16, lr=1e-5, Bert fine-tuned:          offline 0.9552568091358987, online 0.96668
    # f1_loss on top of the previous step, Bert frozen, lr=1e-7:  offline 0.9603767202619631, online 0.97010

    class OurTokenizer(Tokenizer):
        def _tokenize(self, text):
            R = []
            for c in text:
                if c in self._token_dict:
                    R.append(c)
                elif self._is_space(c):
                    R.append('[unused1]')   # represent spaces with the untrained [unused1] token
                else:
                    R.append('[UNK]')       # all remaining characters become [UNK]
            return R

    def f1_loss(y_true, y_pred):
        # y_true: true label (0 or 1); y_pred: predicted probability of the positive class
        # smoothed F1: 2*TP / (2*TP + FP + FN) computed on soft counts
        loss = 2 * tf.reduce_sum(y_true * y_pred) / (tf.reduce_sum(y_true + y_pred) + K.epsilon())
        return -loss

    def seq_padding(X, padding=0):
        # pad every sequence in the batch to the length of the longest one
        L = [len(x) for x in X]
        ML = max(L)
        return np.array([
            np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x
            for x in X
        ])

    class data_generator:
        def __init__(self, data, batch_size=8, shuffle=True):
            self.data = data
            self.batch_size = batch_size
            self.shuffle = shuffle
            self.steps = len(self.data) // self.batch_size
            if len(self.data) % self.batch_size != 0:
                self.steps += 1

        def __len__(self):
            return self.steps

        def __iter__(self):
            while True:
                idxs = list(range(len(self.data)))

                if self.shuffle:
                    np.random.shuffle(idxs)

                X1, X2, Y = [], [], []
                for i in idxs:
                    d = self.data[i]
                    text = d[0][:maxlen]
                    # indices, segments = tokenizer.encode(first='unaffable', second='steel', max_len=10)
                    x1, x2 = tokenizer.encode(first=text)
                    y = np.float32(d[1])
                    X1.append(x1)
                    X2.append(x2)
                    Y.append([y])
                    if len(X1) == self.batch_size or i == idxs[-1]:
                        X1 = seq_padding(X1)
                        X2 = seq_padding(X2)
                        Y = seq_padding(Y)
                        yield [X1, X2], Y[:, 0]
                        [X1, X2, Y] = [], [], []

    def build_bert(nclass, selfloss, lr, is_train):
        # nclass: output nodes; selfloss: loss function; lr: learning rate; is_train: fine-tune Bert or not
        bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)

        for l in bert_model.layers:
            l.trainable = is_train

        x1_in = Input(shape=(None,))
        x2_in = Input(shape=(None,))

        x = bert_model([x1_in, x2_in])
        x = Lambda(lambda x: x[:, :])(x)

        # Documentation: https://www.cnpython.com/pypi/keras-self-attention
        # Source: https://github.com/CyberZHG/keras-self-attention/blob/master/keras_self_attention/seq_self_attention.py
        avg_pool_3 = GlobalAveragePooling1D()(x)
        max_pool_3 = GlobalMaxPooling1D()(x)
        attention_3 = SeqSelfAttention(attention_activation='softmax')(x)
        attention_3 = Lambda(lambda x: x[:, 0])(attention_3)

        x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3], name="fc")
        p = Dense(nclass, activation='sigmoid')(x)

        model = Model([x1_in, x2_in], p)
        model.compile(loss=selfloss, optimizer=Adam(lr), metrics=['acc'])
        print(model.summary())
        return model

    def run_cv(nfold, data, data_test):
        kf = KFold(n_splits=nfold, shuffle=True, random_state=2020).split(data)
        train_model_pred = np.zeros((len(data), 1))
        test_model_pred = np.zeros((len(data_test), 1))

        lr = 1e-7           # 1e-5 for the first stage
        selfloss = f1_loss  # alternative: 'binary_crossentropy'
        is_train = False    # True: fine-tune Bert (stage 1); False: freeze Bert (stage 2)

        for i, (train_fold, test_fold) in enumerate(kf):
            print('*************** fold %d ***************' % i)
            t = time.time()
            X_train, X_valid = data[train_fold, :], data[test_fold, :]

            model = build_bert(1, selfloss, lr, is_train)
            early_stopping = EarlyStopping(monitor='val_acc', patience=3)
            plateau = ReduceLROnPlateau(monitor='val_acc', verbose=1, mode='max', factor=0.5, patience=2)
            checkpoint = ModelCheckpoint('/home/codes/news_classify/comment_classify/expriments/' + str(i) + '_2.hdf5',
                                         monitor='val_acc', verbose=2, save_best_only=True,
                                         mode='max', save_weights_only=False)

            batch_size = 16
            train_D = data_generator(X_train, batch_size=batch_size, shuffle=True)
            valid_D = data_generator(X_valid, batch_size=batch_size, shuffle=False)
            test_D = data_generator(data_test, batch_size=batch_size, shuffle=False)

            # load the stage-1 weights of this fold before fine-tuning the head
            model.load_weights('/home/codes/news_classify/comment_classify/expriments/' + str(i) + '.hdf5')

            model.fit_generator(
                train_D.__iter__(),
                steps_per_epoch=len(train_D),
                epochs=8,
                validation_data=valid_D.__iter__(),
                validation_steps=len(valid_D),
                callbacks=[early_stopping, plateau, checkpoint],
            )

            train_model_pred[test_fold] = model.predict_generator(valid_D.__iter__(), steps=len(valid_D), verbose=1)
            test_model_pred += model.predict_generator(test_D.__iter__(), steps=len(test_D), verbose=1)

            del model
            gc.collect()
            K.clear_session()

            print('time:', time.time() - t)

        return train_model_pred, test_model_pred

    if __name__ == '__main__':
        config = tf.ConfigProto()
        config.gpu_options.per_process_gpu_memory_fraction = 0.8   # cap GPU memory usage
        config.gpu_options.allow_growth = True                     # allocate on demand
        set_session(tf.Session(config=config))

        t = time.time()
        maxlen = 20   # the maximum length in the data set is 19

        config_path = '/home/codes/news_classify/chinese_L-12_H-768_A-12/bert_config.json'
        checkpoint_path = '/home/codes/news_classify/chinese_L-12_H-768_A-12/bert_model.ckpt'
        dict_path = '/home/codes/news_classify/chinese_L-12_H-768_A-12/vocab.txt'

        token_dict = {}
        with codecs.open(dict_path, 'r', 'utf8') as reader:
            for line in reader:
                token = line.strip()
                token_dict[token] = len(token_dict)

        tokenizer = OurTokenizer(token_dict)

        data_dir = '/home/codes/news_classify/comment_classify/'
        train_df = pd.read_csv(os.path.join(data_dir, 'union_train.csv'))
        test_df = pd.read_csv(os.path.join(data_dir, 'test.csv'))

        print(len(train_df), len(test_df))

        DATA_LIST = []
        for data_row in train_df.iloc[:].itertuples():
            DATA_LIST.append((data_row.content, data_row.label))
        DATA_LIST = np.array(DATA_LIST)

        DATA_LIST_TEST = []
        for data_row in test_df.iloc[:].itertuples():
            DATA_LIST_TEST.append((data_row.content, 0))
        DATA_LIST_TEST = np.array(DATA_LIST_TEST)

        n_cv = 5
        train_model_pred, test_model_pred = run_cv(n_cv, DATA_LIST, DATA_LIST_TEST)

        train_df['Prediction'] = train_model_pred
        test_df['Prediction'] = test_model_pred / n_cv

        train_df.to_csv(os.path.join(data_dir, 'train_union_submit2.csv'), index=False)

        test_df['ID'] = test_df.index
        test_df[['ID', 'Prediction']].to_csv(os.path.join(data_dir, 'submit2.csv'), index=False)

        auc = roc_auc_score(np.array(train_df['label']), np.array(train_df['Prediction']))
        print('auc', auc)

        print('time:', time.time() - t)   # about 2853 s
Reference
1. How to Fine-Tune BERT for Text Classification? (Sun et al., 2019)
2. Su Jianlin, "A Talk on Function Smoothing: Differentiable Approximations of Non-differentiable Functions"