A Basic Introduction to TensorFlow Lite

An introduction to the TensorFlow Lite Converter, quantized optimization, and the Interpreter to run TensorFlow Lite models at the Edge

In this article, we will understand the features required to deploy a deep learning model at the Edge, what TensorFlow Lite is, and how the different components of TensorFlow Lite can be used to make an inference at the Edge.

Imagine you need to deploy a deep learning model in an area without a reliable network connection, but you still need the model to deliver excellent performance.

TensorFlow Lite can be used in exactly this scenario.

Features of a Deep Learning model to make inference at the Edge

  1. Light-weight: Edge devices have limited storage and compute capacity, while deep learning models are resource-intensive. The models we deploy on Edge devices should therefore be light-weight, with small binary sizes.
  2. Low latency: Deep learning models at the Edge should make fast inferences irrespective of network connectivity. Because inferences are made on the Edge device itself, the round trip from device to server is eliminated, making inference faster.
  3. Secure: The model is deployed on the Edge device and inferences are made on the device, so no data leaves the device or is shared across the network, and there is no data-privacy concern.
  4. Optimal power consumption: Network communication needs a lot of power, and Edge devices may not be connected to the network, so the model should run with low power consumption.
  5. Pre-trained: Models can be trained on-prem or in the cloud for different deep learning tasks like image classification, object detection, speech recognition, etc., and then easily deployed to make inferences at the Edge.

TensorFlow Lite offers all the features required for making inferences at the Edge.

But what is TensorFlow Lite?

TensorFlow Lite is an open-source, production-ready, cross-platform deep learning framework that converts a pre-trained TensorFlow model into a special format that can be optimized for speed or storage.

The special format model can be deployed on Edge devices like mobile phones running Android or iOS, Linux-based embedded devices like the Raspberry Pi, or microcontrollers to make inferences at the Edge.

How does TensorFlow Lite (TF Lite) work?

Select and Train a Model

Let's say you want to perform an image classification task. The first thing is to decide on a model for the task. Your options are:

  • Create a custom model
  • Use a pre-trained model like InceptionNet, MobileNet, NASNetLarge, etc.
  • Apply Transfer Learning on a pre-trained model (a quick sketch of this option follows the list)
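
As a rough illustration of the third option, the sketch below reuses a pre-trained MobileNetV2 base from tf.keras.applications as a frozen feature extractor. The 10-class output layer and the training data are placeholders assumed for the example; adapt them to your own task.

import tensorflow as tf

# Load a MobileNetV2 base pre-trained on ImageNet, without its classification head
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights='imagenet')
base_model.trainable = False  # freeze the pre-trained feature extractor

# Add a small classification head for a hypothetical 10-class task
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=5)  # fine-tune on your dataset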

Convert the Model using the Converter

After the model is trained, you convert it to the TensorFlow Lite format. A TF Lite model is a special-format model that preserves accuracy while being light-weight and occupying less space. These features make TF Lite models the right fit for mobile and embedded devices.

TensorFlow Lite conversion Process

During the conversion from a TensorFlow model to a TensorFlow Lite model, the file size is reduced. We can choose to reduce the file size even further, at a trade-off with the execution speed of the model.

The TensorFlow Lite Converter converts a TensorFlow model into a TensorFlow Lite FlatBuffer file (.tflite).

The TensorFlow Lite FlatBuffer file is then deployed to the client, which in our case can be a mobile device running iOS or Android, or an embedded device.

How can we convert a TensorFlow model to a TF Lite model?

After you have trained the model, you need to save it.

Saving the model serializes its architecture, weights and biases, and training configuration into a single file, which can then be easily shared or deployed.

The Converter supports models saved using:

  • tf.keras.Model: Create and compile a model using Keras, then convert it using the TF Lite Converter.
import tensorflow as tf

# Save the Keras model after compiling
model.save('model_keras.h5')
model_keras = tf.keras.models.load_model('model_keras.h5')
# Convert the tf.keras model to a TensorFlow Lite model
converter = tf.lite.TFLiteConverter.from_keras_model(model_keras)
tflite_model = converter.convert()
  • SavedModel: A SavedModel contains a complete TensorFlow program, including weights and computation.
# Save your model in the SavedModel format
export_dir = 'saved_model/1'
tf.saved_model.save(model, export_dir)
# Convert the SavedModel to a TensorFlow Lite model
converter = tf.lite.TFLiteConverter.from_saved_model(export_dir)
tflite_model = converter.convert()

export_dir follows a convention where the last path component is the version number of the Model.

The SavedModel is a meta graph saved in export_dir, which is converted to a TF Lite model using tf.lite.TFLiteConverter.

  • Concrete Functions: TF 2.0 has eager execution enabled by default, which impacts performance and deployability. To overcome the performance issue, we can use tf.function to create graphs. The graphs contain the model structure with all the computational operations, variables, and weights of the model.

Export the model as a concrete function and then convert the concrete function to a TF Lite model:

# Export the model as a concrete function
func = tf.function(model).get_concrete_function(
    tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))
# Return a serialized GraphDef representation of the concrete function
func.graph.as_graph_def()
# Convert the concrete function to a TF Lite model
converter = tf.lite.TFLiteConverter.from_concrete_functions([func])
tflite_model = converter.convert()

Optimize the Model

Why optimize the model?

Models at the Edge need to be light-weight so that they:

  • Take up less space on the Edge device
  • Download faster on networks with lower bandwidth
  • Occupy less memory, so that inferences are made faster

Models at the Edge should also have low latency when running inferences. A light-weight, low-latency model can be achieved by reducing the amount of computation required to make a prediction.

Optimization reduces the size of the model or improves its latency, but there is a trade-off between the size of the model and its accuracy.

How is optimization achieved in TensorFlow Lite?

TensorFlow Lite achieves optimization using:

  • Quantization
  • Weight Pruning

Quantization

When we save a TensorFlow model, it is stored as a graph containing the computational operations, activation functions, weights, and biases. The activation functions, weights, and biases are 32-bit floating-point values.

Quantization reduces the precision of the numbers used to represent the different parameters of the TensorFlow model, and this makes the model light-weight.

Quantization can be applied to weights and activations.

  • Weights in 32-bit floating point can be converted to 16-bit floats, 8-bit floats, or integers, which reduces the size of the model.
  • Both weights and activations can be quantized to integers, which gives low latency, smaller size, and reduced power consumption (a post-training quantization sketch follows below).
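
As a minimal sketch, post-training quantization can be enabled on the converter by setting its optimizations flag. The SavedModel directory export_dir from the conversion step above is assumed, and representative_data_gen is a hypothetical generator needed only for full integer quantization.

import tensorflow as tf

# Post-training quantization sketch: quantize the weights of a SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model(export_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

# For full integer quantization of both weights and activations, a representative
# dataset is also required (representative_data_gen is a hypothetical generator):
# converter.representative_dataset = representative_data_gen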

Weight Pruning

Just as we prune plants, removing the non-productive parts so that they become healthier and bear more fruit, we can prune the weights of a model.

Weight pruning trims parameters within the model that have very little impact on its performance.

Weight pruning introduces sparsity into the model, and sparse models compress more efficiently. Pruned models have the same size and run-time latency, but the better compression enables faster download times at the Edge.
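
A rough weight-pruning sketch using the TensorFlow Model Optimization Toolkit (the tensorflow-model-optimization package) is shown below. The compiled tf.keras model named model, the fine-tuning data, and the 50% sparsity schedule are assumptions made for illustration.

import tensorflow_model_optimization as tfmot

# Prune the model with a schedule that ramps sparsity from 0% to 50%
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,
        final_sparsity=0.5,
        begin_step=0,
        end_step=1000)
}
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
model_for_pruning.compile(optimizer='adam',
                          loss='sparse_categorical_crossentropy',
                          metrics=['accuracy'])

# Fine-tune with the pruning callback (training data assumed), then strip the
# pruning wrappers before converting the pruned model to TF Lite
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# model_for_pruning.fit(train_images, train_labels, epochs=2, callbacks=callbacks)
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)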

Deploying the TF Lite model and making an Inference

A TF Lite model can be deployed on mobile devices running Android or iOS, and on Edge devices like the Raspberry Pi and microcontrollers.

To make an inference on an Edge device, you need to:

  • Initialize the interpreter and load the interpreter with the Model
  • Allocate the tensor and get the input and output tensors
  • Preprocess the image by reading it into a tensor
  • Make the inference on the input tensor using the interpreter by invoking it.
  • Obtain the result for the image by reading the output tensor of the inference
import cv2
import numpy as np
import tensorflow as tf

# Load the TFLite model and allocate tensors
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
# Get the input and output tensor details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Read the image and resize it to the input shape expected by the model
input_shape = input_details[0]['shape']  # [1, HEIGHT, WIDTH, channels]
img = cv2.imread(image_path)
img = cv2.resize(img, (input_shape[2], input_shape[1]))
# Preprocess the image: add a batch dimension and cast to float32
input_tensor = np.array(np.expand_dims(img, 0), dtype=np.float32)
# Point the input tensor at the data to be inferred
interpreter.set_tensor(input_details[0]['index'], input_tensor)
# Run the inference and read the output tensor
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])

Is there any other way to improve latency?

TensorFlow Lite uses delegates to improve the performance of a TF Lite model at the Edge. A TF Lite delegate is a way to hand over parts of graph execution to another hardware accelerator such as a GPU or DSP (Digital Signal Processor).

TF Lite can use several hardware accelerators for speed, accuracy, and optimized power consumption, which are important features for running inferences at the Edge.
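
As a rough sketch, a delegate can be passed to the Python Interpreter when it is created. The delegate library name below (libedgetpu.so.1, used for the Coral Edge TPU) and the model_path are assumed example values; substitute the delegate that matches your accelerator.

import tensorflow as tf

# Load a hardware delegate and hand it to the interpreter
# ('libedgetpu.so.1' and 'model.tflite' are assumed example names)
delegate = tf.lite.experimental.load_delegate('libedgetpu.so.1')
interpreter = tf.lite.Interpreter(model_path='model.tflite',
                                  experimental_delegates=[delegate])
interpreter.allocate_tensors()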

Conclusion: TF Lite models are light-weight models that can be deployed for low-latency inference on Edge devices like mobiles, the Raspberry Pi, and microcontrollers. TF Lite delegates can further improve speed, accuracy, and power consumption when paired with hardware accelerators.


