Checkpointing Deep Learning Models in Keras

Learn how to save deep learning models using checkpoints and how to reload them

Checkpoints are one of several methods for saving and loading a deep learning model.

In this article, you will learn how to checkpoint a deep learning model built using Keras, and then reinstate the model architecture and trained weights in a new model or resume training from where you left off.

Usage of Checkpoints

Allow us to use a pre-trained model for inference without having to retrain the model
Resume the training process from where we left off in case it was interrupted or for fine-tuning the model

It acts like an autosave for your model in case training is interrupted for any reason.

Steps for saving and loading a model and its weights using checkpoints

Create the model
Specify the path where we want to save the checkpoint files
Create the callback function to save the model
Apply the callback function during the training
Evaluate the model on test data
Load the pre-trained weights into a new model using load_weights(), or restore the weights from the latest checkpoint

Create the base model architecture with the loss function, metrics, and optimizer

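We create a multi-class classification model for the Fashion MNIST dataset. The code in this article assumes that train_images, train_labels, test_images, and test_labels already hold that dataset, with the images shaped (28, 28, 1) to match the model's input_shape.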

import tensorflow as tf

# Define the model architecture
def create_model():
    model = tf.keras.Sequential()
    # Must define the input shape in the first layer of the neural network
    model.add(tf.keras.layers.Conv2D(filters=64, kernel_size=2, padding='same', activation='relu', input_shape=(28, 28, 1)))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=2))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=2))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(256, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))

    # Compile the model
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model

# Create the model
model_ckpt = create_model()

Specify the path where the checkpoint files will be stored

checkpoint_path = "train_ckpt/cp.ckpt"

Create the callback function to save the model.

Callback functions are applied at different stages of training to give a view of the internal state of training.

We create a callback function to save the model weights using ModelCheckpoint.

If we set save_weights_only to True, then only the weights will be saved. The model architecture, loss, and optimizer will not be saved.

We can also specify whether to save the model at every epoch or at some other interval.

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,save_best_only=True, save_weights_only=True, verbose=1)

The ModelCheckpoint callback class has the following arguments:

filepath : the path or filename where we want to save the model
monitor : the metric that we want to monitor, such as val_loss or val_accuracy. The default is val_loss
verbose : 0 for silent mode and 1 to print a message each time the callback takes an action
save_weights_only : If set to True, then only the model weights will be saved. Otherwise the full model is saved, including the model architecture, weights, loss function, and optimizer.
save_best_only : If set to True, then only the best model will be saved, based on the quantity we are monitoring. If we are monitoring accuracy and save_best_only is set to True, then the model will be saved every time the accuracy is higher than the best accuracy seen so far.
mode : It has three options: auto, min, or max. If we are monitoring accuracy, set it to max, and if we are monitoring loss, set it to min. If we set the mode to auto, the direction is inferred automatically from the name of the quantity being monitored
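save_freq : set it to 'epoch' or an integer. When it is set to 'epoch', the model is saved after each epoch. An integer counts batches, not epochs, so to save every five epochs we pass five times the number of batches per epoch, as shown in the code below. The older period argument counted epochs but is deprecated

import math

# Create a callback that saves the model's weights every 5 epochs.
# An integer save_freq counts batches, so convert epochs to batches first.
batch_size = 64  # must match the batch_size passed to fit()
steps_per_epoch = math.ceil(len(train_images) / batch_size)
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
    save_freq=5 * steps_per_epoch)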

Apply the callback during the training process

# Train the model with the new callback
# Pass callback to training
model_ckpt.fit(train_images, 
 train_labels, 
 batch_size=64,
 epochs=10,
 validation_data=(test_images,test_labels),
 callbacks=[cp_callback])

When the callback is created with save_best_only=True, the training log shows that the weights are not saved if val_loss does not improve. Whenever val_loss drops below the best value so far, the weights are saved to the checkpoint file.
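The save-on-improvement behaviour can be illustrated without Keras. The sketch below is a simplified model of the comparison that save_best_only performs, not the library's actual implementation. The function name epochs_that_save and the sample loss values are made up for illustration.

```python
def epochs_that_save(monitored_values, mode="min"):
    """Return the 1-based epochs at which a checkpoint would be written.

    Simplified illustration: a checkpoint is written only when the monitored
    value is strictly better than the best value seen so far.
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")

    best = None
    saved_epochs = []
    for epoch, value in enumerate(monitored_values, start=1):
        improved = (
            best is None
            or (mode == "min" and value < best)
            or (mode == "max" and value > best)
        )
        if improved:
            best = value
            saved_epochs.append(epoch)
    return saved_epochs


# Hypothetical val_loss values for five epochs
val_loss = [0.52, 0.41, 0.43, 0.38, 0.40]
print(epochs_that_save(val_loss, mode="min"))  # [1, 2, 4]
```

Epochs 3 and 5 are skipped because their loss is higher than the best value recorded before them.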

Evaluating the model on test images

loss,acc = model_ckpt.evaluate(test_images, test_labels, verbose=2)

Checkpoint files

A checkpoint stores the trained weights in a collection of checkpoint-formatted binary files.

Saving in the TensorFlow checkpoint format produces three kinds of files: a checkpoint file, an index file, and data files. The graph structure is stored separately from the variable values.

checkpoint file: contains prefixes for both an index file and one or more data files

index file: indicates which weights are stored in which shard. Since I trained the model on one machine, the data files are cp.ckpt.data-00000-of-00002 and cp.ckpt.data-00001-of-00002

data file: stores the values of all the variables, without the structure. There can be one or more data files
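As a small illustration of the naming scheme, the sketch below labels file names from a checkpoint directory by the roles described above. It only inspects the names and does not read the files. The helper classify_checkpoint_file is hypothetical and not part of TensorFlow.

```python
import re

# Data shards are named <prefix>.data-XXXXX-of-YYYYY
_DATA_SHARD = re.compile(r"\.data-(\d{5})-of-(\d{5})$")


def classify_checkpoint_file(filename):
    """Label a file from a checkpoint directory by its role."""
    if filename == "checkpoint":
        return "checkpoint state file"
    if filename.endswith(".index"):
        return "index file"
    match = _DATA_SHARD.search(filename)
    if match:
        shard, total = int(match.group(1)), int(match.group(2))
        return f"data file (shard {shard + 1} of {total})"
    return "other"


for name in ["checkpoint", "cp.ckpt.index",
             "cp.ckpt.data-00000-of-00002", "cp.ckpt.data-00001-of-00002"]:
    print(f"{name}: {classify_checkpoint_file(name)}")
```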


Loading the pre-trained weights

Reasons for loading the pre-trained weights

Continue from where we left off or
Resume after an interruption or
Load the pre-trained weight for inference

We create a new model to load the pre-trained weights.

When loading a new model with the pre-trained weights, the new model should have the same architecture as the original model.

# Create a basic model instance
model_ckpt2 = create_model()

We load the pre-trained weights into our new model using load_weights().

model_ckpt2.load_weights(checkpoint_path)

We can make inferences using the new model on the test images

loss,acc = model_ckpt2.evaluate(test_images, test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))

For comparison, an untrained model performs at chance level, which is about 10% accuracy for the ten classes.

To resume the training where we left off

model_ckpt2.fit(train_images, 
 train_labels, 
 batch_size=64,
 epochs=10,
 validation_data=(test_images,test_labels),
 callbacks=[cp_callback])

We see that the accuracy has changed after the additional training.

loss,acc = model_ckpt2.evaluate(test_images, test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))

Loading weights from the latest checkpoints

tf.train.latest_checkpoint() finds the filename of the latest saved checkpoint file.

import os

# Get the latest checkpoint file
checkpoint_dir = os.path.dirname(checkpoint_path)
latest = tf.train.latest_checkpoint(checkpoint_dir)
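To see roughly what this lookup involves, it helps to know that the checkpoint file in the directory is a small text file with a model_checkpoint_path entry naming the most recent checkpoint prefix. The sketch below reads that entry with plain Python. It is a simplified illustration under that assumption, not TensorFlow's implementation, and the helper name read_latest_checkpoint is hypothetical. Real code should keep using tf.train.latest_checkpoint.

```python
import os
import re


def read_latest_checkpoint(checkpoint_dir):
    """Return the latest checkpoint prefix recorded in the `checkpoint` file, or None."""
    state_file = os.path.join(checkpoint_dir, "checkpoint")
    if not os.path.exists(state_file):
        return None

    with open(state_file, encoding="utf-8") as f:
        for line in f:
            # The entry looks like: model_checkpoint_path: "cp.ckpt"
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"\s*$', line)
            if match:
                prefix = match.group(1)
                # Relative prefixes are resolved against the checkpoint directory
                if not os.path.isabs(prefix):
                    prefix = os.path.join(checkpoint_dir, prefix)
                return prefix
    return None


# Returns None when the directory holds no `checkpoint` file
print(read_latest_checkpoint("train_ckpt"))
```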

We create a new model, load the weights from the latest checkpoint and make inferences

# Create a new model instance
model_latest_checkpoint = create_model()

# Load the previously saved weights
model_latest_checkpoint.load_weights(latest)

# Re-evaluate the model
loss, acc = model_latest_checkpoint.evaluate(test_images, test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))

Including epoch number in the filename

# Include the epoch in the file name (uses `str.format`)
checkpoint_path = "training2/cp-{epoch:04d}.ckpt"
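The placeholder is filled in with ordinary str.format, so the resulting file names can be previewed without training anything. The sketch below formats two names by hand and then recovers the epoch number from a name. The helper epoch_from_path is hypothetical and shown only for illustration.

```python
import re

checkpoint_path = "training2/cp-{epoch:04d}.ckpt"

# The epoch number is zero-padded to four digits
print(checkpoint_path.format(epoch=5))   # training2/cp-0005.ckpt
print(checkpoint_path.format(epoch=10))  # training2/cp-0010.ckpt


def epoch_from_path(path):
    """Recover the epoch number from a name such as 'training2/cp-0005.ckpt'."""
    match = re.search(r"cp-(\d+)\.ckpt", path)
    return int(match.group(1)) if match else None


print(epoch_from_path("training2/cp-0010.ckpt"))  # 10
```

When resuming training, the recovered number could be passed to fit() as initial_epoch so that the epoch count continues instead of restarting.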


Conclusion:

We now understand how to create a callback function using the ModelCheckpoint class, which checkpoint files get created, and how to restore the pre-trained weights.

References:

https://www.tensorflow.org/tutorials/keras/save_and_load
