Facial Re-Identification Using Siamese Nets

Facial re-identification (FRID) is the problem of verifying a person's identity based on their face. For example, a government agency could do away with ID cards and simply use a FRID system to handle attendance, without worrying about a stranger stealing someone's ID to enter the system.

At its core, this is a classification problem: the model has to classify an image of a face as one of the people in a database. Because of this, it may be tempting to reach for one of the many state-of-the-art image classifiers with nearly 100 percent test accuracy. Case closed, right? Just fine-tune a ResNet50, or even a ResNet101, to predict the person's name from an image of their face. Well, there are some fundamental questions that arise when doing this:

  • Where do I get the data? At most there will be one or two images available for each person you want to classify, and even then the image will likely be a passport-style photo that looks very different from the images being classified at inference time. Image classifiers typically need at least 50 images per class to achieve high confidence on test images, which is 49 more than are available.
  • How do I ensure high confidence across thousands of classes? Common image-classification datasets have on the order of 300 to 1,000 classes. Corporate organizations, however, have anywhere from 100 to 50,000 people, and therefore 100 to 50,000 classes.
  • What do I do when someone new is hired? The very nature of deep learning classifiers is that there is no way to simply add a new class to a trained model and have it do well. Adding a new class means retraining the entire network with the added class. This wastes both computational resources and time, neither of which anyone has in excess, especially data scientists.

All of these questions can be answered by a new style of model: Siamese Networks.

Intuition

Let's first begin by introducing Siamese Networks. A majority of deep learning advances have occurred when people build human intuition into the model, such as when the residual block was first used in ResNet, or when style-based losses became popular as people tried to incorporate human abstraction into computers. Siamese Networks are another one of these advances that is heavily dependent on human intuition about how a computer should solve the problem.

The core intuition behind Siamese Networks is to learn a representation of the face. This representation is similar to the information humans store in their minds about the characteristics of a face, such as the size of the facial features, skin color, eye color, and so on. When a person sees another face with similar features, they understand there is a high chance that the new face belongs to the same person. On the other hand, if the new face does not match any of the faces they have previously seen, they form a representation of the new face and store it in memory.

This is exactly how Siamese Networks function. A learned function transforms the input image of a face into a vector that encodes the face's features. We then want this vector to be similar to the vectors of other images of the same face, and very different from the vectors of different faces.

In a nutshell, the model learns how to extract the important features of a face that allow it to be distinguished from other faces. Once this feature embedding is obtained, it can be compared against the embeddings of other faces in a database to find a match.
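As a rough sketch of that matching step (the identify function, database dictionary, and threshold below are illustrative choices, not taken from the original post), the stored embeddings could be compared with a simple nearest-neighbour search:

```python
import numpy as np

def identify(query_embedding, database, threshold=0.7):
    """Match a face embedding against a database of known embeddings.

    `database` maps a person's name to their stored embedding vector; the
    threshold is illustrative and would have to be tuned in practice.
    """
    best_name, best_dist = None, float("inf")
    for name, stored_embedding in database.items():
        dist = np.linalg.norm(query_embedding - stored_embedding)  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    # Accept the match only if the closest stored face is close enough.
    return best_name if best_dist < threshold else None
```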

Triplet Loss and Other Details

The most important part of any neural network is how it is trained. Training is the way a large, seemingly useless function is tuned to complete its task. Siamese Networks were originally trained on pairs of images, mirroring the model's use case: the distance between two vectors of the same face was minimized, and the distance between two vectors of different faces was maximized.

This approach, however, complicated the training process and was inefficient in terms of computation. Training with this Contrastive Loss, as it is called, meant that for each training sample the model had to compute four vectors and then perform two optimization passes, one for each pair of images. Another issue was that the training pairs were highly imbalanced, since there are far fewer pairs in which the faces are the same than pairs in which the faces are different.
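For reference, a standard formulation of this contrastive loss for a pair of images $(x_1, x_2)$ with label $y$ (1 if the two faces match, 0 otherwise), embedding function $f$, distance $d$, and margin $m$ is:

$$
\mathcal{L}(x_1, x_2, y) = y \, d\big(f(x_1), f(x_2)\big)^2 + (1 - y)\,\max\big(m - d(f(x_1), f(x_2)),\, 0\big)^2
$$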

To decrease the computational requirements of Contrastive Loss, and also to partially solve the data issue, Triplet Loss was created. Triplet loss works in four steps:

  1. Sample three images from the dataset: the anchor, the positive, and the negative. The anchor is any face, the positive is another image of the same face as the anchor, and the negative is an image of a different face.
  2. Compute the embedding vector of each image.
  3. Use a distance function to compute the distance from the anchor to the positive image and from the anchor to the negative image.
  4. Compute the loss value as the difference between the anchor-positive distance and the anchor-negative distance.

The following equation shows the mathematical representation of Triplet Loss, along with some important subtleties.
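With $f$ denoting the embedding network, $d$ the distance function, $a$, $p$, $n$ the anchor, positive, and negative images, and $m$ the margin discussed below, the standard form is:

$$
\mathcal{L}(a, p, n) = \max\big(d(f(a), f(p)) - d(f(a), f(n)) + m,\; 0\big)
$$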

The first difference between the loss described above and the equation is the max() function taken between the difference and 0.0. This makes sure the loss function doesn't drop below zero.

The second, more important difference is the addition of a margin, denoted by m. The margin makes sure that the model doesn't learn to output the same vector for all faces, regardless of whether the faces match. Notice that if the distance between the anchor and positive were equal to the distance between the anchor and negative, the loss would already be zero without a margin term; with the margin, the anchor-negative distance has to exceed the anchor-positive distance by at least m before the loss reaches zero.

So how does this solve the problem?

  • The model needs at most two images per identity to train, which closely matches the data available in the real world, since there is at least one reference image of the person being identified and one real-time image captured at verification.
  • Since the model does not depend on classifying identities directly, it does not have to deal with potentially thousands of classes; it only needs to perform an accurate comparison.
  • For the same reason, a newly hired employee does not mean retraining the network; the model simply computes the new employee's face embedding and adds it to the database of face embeddings, as sketched below.
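A hypothetical sketch of that enrollment step (embedding_model is assumed to be a single-branch model mapping one image to its embedding, and database is the same dictionary used for matching above):

```python
# Enroll a new employee without retraining: one forward pass through the
# embedding network, then store the resulting vector in the database.
new_embedding = embedding_model.predict(new_employee_image[None, ...])[0]
database["Jane Doe"] = new_embedding  # illustrative key
```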

Implementation

For the implementation in this blog post we will use tensorflow==2.1 and a GPU if available, since training such a network is computationally heavy. I will be training my model on an Nvidia GTX 1080 Ti, so I will be using tensorflow-gpu==2.1 and CUDA==10.2.

Since the development of powerful feature-extractor networks, there is rarely a need to design a custom convolutional neural network from scratch. We will use the industry-standard ResNet50 model to extract our features. Because of this, the only addition needed is a custom loss function in the tf.keras API to train our model with.

There are two ways to incorporate a custom loss function in tf.keras: create a function that computes the loss and pass it to the model.compile() method, or create a loss layer and use the Layer.add_loss() method.

The only difference between the two methods is that in the first, the loss function must have the signature loss_fn(y_true, y_pred), while the second method can take any form.

Since Triplet Loss does not fit naturally into a y_true, y_pred format, the second method will be used.
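A minimal sketch of such a loss function and layer, assuming squared Euclidean distance and an illustrative margin of 0.2, could look like this:

```python
import tensorflow as tf

@tf.function
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss over batches of embedding vectors.

    Squared Euclidean distance and the 0.2 margin are assumptions; any
    distance function and margin could be substituted.
    """
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))

class TripletLoss(tf.keras.layers.Layer):
    """Layer that attaches the triplet loss to the model via add_loss()."""

    def __init__(self, margin=0.2, **kwargs):
        super().__init__(**kwargs)
        self.margin = margin

    def call(self, inputs):
        anchor, positive, negative = inputs
        self.add_loss(triplet_loss(anchor, positive, negative, self.margin))
        # Pass the inputs through unchanged.
        return inputs
```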

In the above code, triplet_loss is the loss function, which takes in the embeddings of the anchor, positive, and negative images. The @tf.function annotation causes the function to be compiled into a TensorFlow graph via AutoGraph, which allows for faster computation.

The TripletLoss layer inherits from tf.keras.layers.Layer, and the triplet loss value is added to the layer's losses using the add_loss method. The layer then returns its inputs unchanged.

Using the add_loss method not only adds the loss tensor to the layer's losses; when training with model.fit(), all layer losses are automatically optimized as well, which allows us to compile the model without passing any overall loss function, as in the following code:
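The sketch below shows how the pieces could fit together; the input size, embedding dimension, optimizer, and training call are illustrative assumptions rather than values confirmed by the post:

```python
# A shared ResNet50 backbone maps each of the three inputs to an embedding,
# and the TripletLoss layer attaches the loss via add_loss().
backbone = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                           weights="imagenet")
embed = tf.keras.layers.Dense(128)  # embedding size of 128 is an assumption
l2_norm = tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=-1))

def embedding_branch(image):
    return l2_norm(embed(backbone(image)))

anchor_in = tf.keras.Input(shape=(224, 224, 3), name="anchor")
positive_in = tf.keras.Input(shape=(224, 224, 3), name="positive")
negative_in = tf.keras.Input(shape=(224, 224, 3), name="negative")

embeddings = [embedding_branch(x) for x in (anchor_in, positive_in, negative_in)]
outputs = TripletLoss(margin=0.2)(embeddings)

model = tf.keras.Model(inputs=[anchor_in, positive_in, negative_in],
                       outputs=outputs)
# No loss argument: the loss added by the TripletLoss layer is what gets optimized.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4))
# model.fit(triplet_dataset, epochs=10)  # dataset yields (anchor, positive, negative) batches
```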

The code above sketches the full model and how it can be trained.

Results

Trained on the LFW (Labeled Faces in the Wild) dataset at http://vis-www.cs.umass.edu/lfw/, the model was able to correctly "predict" matching faces with 90 percent accuracy, and to tell that two faces were different with 95 percent accuracy.

Multiple papers released in the past few years have made significant improvements over vanilla Triplet Loss, often evaluated on much harder datasets such as Duke MTMC at https://megapixels.cc/duke_mtmc/.

In the next few posts I will be breaking down recent advancements in Re-Identification.

