Here we look at how a pre-trained network can be altered through quantization. We also discuss how, if the model isn't compatible with 8-bit quantization, the network can be converted to use 16-bit floating point instead. Finally, we take a quick look at network pruning.
In the previous part of this series we finished building a TensorFlow Lite-based application that performs object recognition using a network model from the ONNX Model Zoo. Let's now consider the ways in which that network could be further optimized.
With neural network models, one of the challenges is striking the right balance between available resources and accuracy. Generally speaking, making a model more complex can make it more accurate, but it will also consume more storage space, take longer to execute, and use more network bandwidth when downloaded. Keep in mind that not all optimizations will run on all hardware.
Post-Optimizing Pre-Trained Networks
If you are using a network that came from a third party, efforts to improve performance can start with post-training optimization.
A pre-trained network can be altered through quantization. Quantization reduces the precision of the network's parameters, which are usually 32-bit floating point numbers. When quantization is applied, the floating point operations are replaced with 8-bit integer or 16-bit floating point operations, which run faster at the cost of slightly lower accuracy. In an earlier part of this series, I used Python code to convert a model from TensorFlow to TensorFlow Lite. With a few modifications, the converter will also perform 8-bit integer quantization while converting:
import tensorflow as tf

# Directory containing the saved TensorFlow model
saved_model_dir = '/dev/projects/models'

# Create a converter for the saved model and enable the default optimizations
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Restrict the converted model to 8-bit integer operations and use
# unsigned 8-bit integers for the input and output tensors
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

# Convert the model and write it out as a .tflite file
tf_lite_model = converter.convert()
open('output.tflite', 'wb').write(tf_lite_model)
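Note that full 8-bit integer quantization generally requires a representative dataset so the converter can calibrate the ranges of the activations. A minimal sketch of supplying one is shown below; sample_images is a placeholder for a small set of typical inputs from your own data:

import numpy as np

# sample_images is a placeholder: a few typical inputs for the model
def representative_data_gen():
    for image in sample_images:
        # Yield one calibration sample at a time, batched and as float32
        yield [np.expand_dims(image, axis=0).astype(np.float32)]

converter.representative_dataset = representative_data_gen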
Not all models can be transformed this way. If the model isn't compatible with 8-bit quantization, the above code will throw an error. One might wonder what other values could be passed to converter.optimizations other than DEFAULT. In previous versions of TensorFlow it was possible to pass values that optimized the network for size or for latency, but those values are now deprecated and have no effect on the result.
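If you would rather have the conversion script fall back automatically when a model can't be fully quantized to 8-bit integers, one option (a sketch of the idea, not something the converter does for you) is to catch the error and retry with only the default optimization:

try:
    tf_lite_model = converter.convert()
except Exception as err:
    # Fall back to the default (dynamic range) optimization if full
    # 8-bit integer quantization is not supported for this model
    print('8-bit quantization failed:', err)
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tf_lite_model = converter.convert()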
If 32-bit floating point to 8-bit integer quantization is too extreme, the network can be converted to use 16-bit floating-point numbers instead:
import tensorflow as tf
saved_model_dir = '/dev/projects/models'

# Create a converter for the saved model and enable the default optimizations
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Allow the converter to store the weights as 16-bit floating point numbers
converter.target_spec.supported_types = [tf.float16]

# Convert the model and write it out as a .tflite file
tf_lite_model = converter.convert()
open('output.tflite', 'wb').write(tf_lite_model)
This quantization only provides a benefit when the model is executed on hardware that supports 16-bit floating point numbers (such as a GPU). If executed on a CPU, the network weights are expanded back to 32-bit floating point numbers before execution, so on a CPU you would see a slight loss of accuracy compared with the equivalent unoptimized network and no performance increase.
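Whether it ultimately runs on a GPU or a CPU, it can be useful to sanity-check a converted model on the desktop before bundling it into the app. This is a minimal sketch, assuming the output.tflite file written above and using random placeholder data in place of a real image:

import numpy as np
import tensorflow as tf

# Load the converted model produced above and prepare its tensors
interpreter = tf.lite.Interpreter(model_path='output.tflite')
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a random tensor with the expected input shape and dtype (placeholder data)
dummy_input = np.random.random_sample(input_details[0]['shape']).astype(input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], dummy_input)
interpreter.invoke()

print(interpreter.get_tensor(output_details[0]['index']))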
Pruning a Network
To optimize a network for size, consider network pruning. In pruning, weights that make less significant contributions to the network's accuracy are zeroed out, and the resulting network compresses more tightly. To perform pruning there is an additional TensorFlow library to install:
pip install -q tensorflow-model-optimization
Within the Python code, after building a network model, the TensorFlow Model Optimization package provides a method called prune_low_magnitude that performs these modifications on the network.
import tensorflow_model_optimization as tfmot

# Wrap the existing Keras model so low-magnitude weights are pruned during training
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(source_model)
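Calling prune_low_magnitude only wraps the model's layers; the weights are actually zeroed out during a short fine-tuning pass, and the wrappers need to be stripped before the model is saved or converted. A rough sketch, where the optimizer, loss, and train_images/train_labels are placeholders for your own training setup:

import tensorflow_model_optimization as tfmot

# Fine-tune so the pruning schedule can zero out low-magnitude weights;
# UpdatePruningStep advances the pruning schedule each training step.
pruned_model.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])
pruned_model.fit(train_images, train_labels, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before saving or converting the model
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)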
Next Steps
Now that we’ve learned a bit more about using TensorFlow Lite effectively, we’re ready to dive in deeper. In the next article, we’ll learn how to train our own neural network for use in an Android app.