A quick overview of compression methods for Convolutional Neural Networks

Namely: Hyperparameter Search, Convolution Variants, Network-in-Network, Weight Sharing, Pruning, Quantization

Theodoros Ntakouris
6 min read · Dec 10, 2020

The following is a very, very brief and non-technical summary of a chapter in the author’s dissertation.

Motives

A very deep neural network designed to solve hard problems can take a few seconds to run even on modern computers with hardware accelerators for linear algebra operations. Smaller neural networks that run faster may still fail to meet real-time constraints. Hardware resources and execution-time constraints are what drive the research community to investigate different methods of neural network compression. The next few sections present common compression methods.

Hyperparameter Search

In this setting we optimize not only for accuracy (or the metric of your choice), but also for efficiency. Typical techniques for hyperparameter search include Random Search, Grid Search, Bayesian Optimization and Bandit/Tournament based methods.

Hyperparameter Search Visualization
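To make this concrete, here is a minimal sketch of random search under a latency budget. The search space, the latency threshold and the `train_and_evaluate` function are hypothetical placeholders, not part of any specific framework:

```python
import random

# Hypothetical search space: each hyperparameter maps to its candidate values.
SEARCH_SPACE = {
    "num_filters": [16, 32, 64],
    "depth": [3, 5, 8],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def train_and_evaluate(config):
    """Placeholder: train a model built from `config`, return (accuracy, latency_ms)."""
    raise NotImplementedError

def random_search(num_trials=20, latency_budget_ms=30.0):
    best_config, best_accuracy = None, float("-inf")
    for _ in range(num_trials):
        config = {name: random.choice(values) for name, values in SEARCH_SPACE.items()}
        accuracy, latency_ms = train_and_evaluate(config)
        if latency_ms > latency_budget_ms:  # reject configs that miss the efficiency target
            continue
        if accuracy > best_accuracy:
            best_config, best_accuracy = config, accuracy
    return best_config, best_accuracy
```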

For CNNs, one such method is EfficientNet (at the top of the leaderboards on Papers With Code). Given memory and FLOPS constraints, it scales your typical convolutional network along several dimensions: depth (number of layers), width (number of filters per layer) and input resolution, all tied together by a single compound coefficient.

EfficientNet Scaling Dimensions
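The compound scaling rule itself fits in a few lines. The alpha, beta, gamma constants below are the ones reported in the EfficientNet paper; the baseline depth/width/resolution values in the example are made-up placeholders:

```python
# EfficientNet-style compound scaling: a single coefficient phi grows depth,
# width and input resolution together. alpha, beta, gamma are the constants
# reported in the EfficientNet paper (alpha * beta**2 * gamma**2 ~= 2).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(base_depth, base_width, base_resolution, phi):
    depth = round(base_depth * ALPHA ** phi)            # number of layers
    width = round(base_width * BETA ** phi)             # filters per layer
    resolution = round(base_resolution * GAMMA ** phi)  # input image size
    return depth, width, resolution

# Made-up baseline numbers; each +1 in phi costs roughly 2x the FLOPS.
print(compound_scale(base_depth=18, base_width=32, base_resolution=224, phi=2))
```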

Convolutional Variants

Besides hyperparameter search methods, there are hand-designed computational schemes that rely on mathematical tricks to reduce the parameter count of each layer while proportionally retaining a larger share of its computational capacity.

Group Convolution

Split the convolutional filters of each layer into equally sized groups, and apply each group to its own equally sized slice of the input channels. Be careful: you can end up with a more computationally demanding network if you choose group and filter sizes that are too large.

Group Convolution
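In PyTorch, group convolution is just the `groups` argument of `nn.Conv2d`. A quick sketch of the parameter savings:

```python
import torch.nn as nn

# A regular 3x3 convolution: 256 * 256 * 3 * 3 = 589,824 weights.
dense = nn.Conv2d(256, 256, kernel_size=3, padding=1)

# The same layer split into 8 groups: each group maps 32 input channels to
# 32 output channels, so 8 * 32 * 32 * 3 * 3 = 73,728 weights.
grouped = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=8)

print(sum(p.numel() for p in dense.parameters()))    # 590,080 (with biases)
print(sum(p.numel() for p in grouped.parameters()))  # 73,984 (with biases)
```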

It’s not in one’s benefit to keep the output groups aligned as-is; this is a topological constraint that has no reason to exist. Treat it like ShuffleNets do, and mix the output channel groups.

Output Channel Group Shuffling
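The shuffle itself is a cheap reshape, transpose and flatten, along the lines of the ShuffleNet operation. A sketch:

```python
import torch

def channel_shuffle(x, groups):
    """Mix channels across groups: reshape, transpose, flatten back (as in ShuffleNet)."""
    batch, channels, height, width = x.shape
    assert channels % groups == 0
    x = x.view(batch, groups, channels // groups, height, width)
    x = x.transpose(1, 2).contiguous()
    return x.view(batch, channels, height, width)

x = torch.randn(1, 8, 4, 4)
print(channel_shuffle(x, groups=2).shape)  # torch.Size([1, 8, 4, 4])
```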

Network-In-Network

Hyperparameter search? Special convolution ‘variants’? Empirical network design? This technique actually sits right in the middle. Let’s focus on the Inception-v{1,2,3,4} model family. These models replace the typical convolutional layer stacks of classic convolutional networks like VGG and ResNets with layer groups designed to have greater knowledge capacity relative to their slight increase in parameters and computational cost.

Inception-v1

Naive Inception v1 Block
Inception v1 Block with fewer input channels
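Here is a rough PyTorch sketch of the ‘reduced’ block: (1, 1) convolutions shrink the channel count before the expensive (3, 3) and (5, 5) branches, and all branches are concatenated along the channel axis. The channel counts are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class InceptionV1Block(nn.Module):
    """Sketch of an Inception-v1 block with 1x1 bottlenecks; channel counts are arbitrary."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=1),           # reduce channels first
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)  # concatenate along the channel axis

print(InceptionV1Block(128)(torch.randn(1, 128, 32, 32)).shape)  # [1, 192, 32, 32]
```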

Inception v{2,3}

The larger convolutional kernels of size (5, 5) are replaced with a cheaper alternative: two layers with kernels of size (3, 3). Then, layers containing kernels of size (n, n) with n > 1 are factorized into two layers, the first with filters of size (n, 1) and the second with filters of size (1, n). In order to limit the depth of the total network (you typically stack tens of these layer groups together), the (n, 1) and (1, n) filters can also be computed in parallel branches instead of in sequence.

Inception v2 Group

The main difference in Inception v3 is that it applies the same factorization to larger convolutional kernels of size (7, 7).
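Both factorizations are straightforward to express in code. Here is a sketch of the (n, 1) followed by (1, n) decomposition, using n = 7 as in the larger kernels mentioned above:

```python
import torch
import torch.nn as nn

def factorized_conv(in_ch, out_ch, n):
    """Replace a single (n, n) convolution with an (n, 1) followed by a (1, n) convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(n, 1), padding=(n // 2, 0)),
        nn.Conv2d(out_ch, out_ch, kernel_size=(1, n), padding=(0, n // 2)),
    )

# A (7, 7) kernel costs in_ch * out_ch * 49 weights; the factorized pair costs
# roughly in_ch * out_ch * 7 + out_ch * out_ch * 7.
block = factorized_conv(64, 64, n=7)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```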

Inception v4

This revision of the Inception family investigated many filter size and arrangement configurations in order to be more efficient. Special steps were taken to reduce computational overhead where less knowledge capacity is needed. In brief, dedicated downsizing (reduction) blocks were created in order to work better with residual networks.

Weight Sharing

We can store a smaller number of real weights than the network nominally requires and project them into the virtual weight matrices used to perform the forward pass. Then, on the backward pass, the gradient for each real weight is simply accumulated (or averaged) over the virtual positions it maps to. The mapping is typically achieved with some sort of hash function.

Weight Sharing Visualization
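Here is a minimal sketch of the idea, along the lines of HashedNets: a small pool of real weights is indexed by a fixed pseudo-random mapping, and gradients from all virtual positions accumulate into their shared buckets. The layer and pool sizes are arbitrary:

```python
import torch
import torch.nn as nn

class HashedLinear(nn.Module):
    """Sketch of a hashed fully connected layer: a small pool of real weights
    is expanded into a virtual (out_features x in_features) weight matrix."""
    def __init__(self, in_features, out_features, num_real_weights):
        super().__init__()
        self.real_weights = nn.Parameter(torch.randn(num_real_weights) * 0.01)
        # Fixed pseudo-random mapping from every virtual position to a real weight.
        generator = torch.Generator().manual_seed(0)
        index = torch.randint(0, num_real_weights, (out_features, in_features),
                              generator=generator)
        self.register_buffer("index", index)

    def forward(self, x):
        virtual_weight = self.real_weights[self.index]  # gather; gradients accumulate per bucket
        return x @ virtual_weight.t()

layer = HashedLinear(512, 256, num_real_weights=4096)  # ~32x fewer stored weights
print(layer(torch.randn(8, 512)).shape)  # torch.Size([8, 256])
```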

Sparsity and Pruning

Not every neuron is created (or trained) equal. Some contribute practically zero information to the output of the network (due to very small weight values). These can be pruned away, leading to a sparse (lots of zeros) network. Special sparse matrix storage formats (CSC, CSR) and linear algebra subroutines can then be used to accelerate computation and reduce storage space by a huge margin; being able to fit the whole network in on-chip SRAM is a huge advantage. There are many techniques for doing this.

Keep in mind that a high percentage of sparsity does not always translate into good hardware utilization. You can get matrices with more than 50% sparsity whose nonzero values are spread out in a way that cannot be multiplied efficiently, even with special hardware. The easiest way to take advantage of pruning is to remove whole filters (structured pruning) whenever possible.
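As a small illustration, here is a sketch of unstructured magnitude pruning followed by CSR storage with NumPy/SciPy. The 90% pruning ratio is an arbitrary choice:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical trained weight matrix.
weights = np.random.randn(256, 256).astype(np.float32)

# Magnitude pruning: zero out the 90% of weights with the smallest absolute value.
threshold = np.percentile(np.abs(weights), 90)
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)

# Store what survives in compressed sparse row (CSR) format.
sparse_weights = csr_matrix(pruned)
print(sparse_weights.nnz / pruned.size)  # ~0.10 of the entries remain

# Sparse matrix-vector product using the compressed representation.
activation = np.random.randn(256).astype(np.float32)
output = sparse_weights @ activation
```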

Quantization

You don’t need 32 bits to represent weights. You might not even need 16! Google and NVIDIA already use 16-bit formats in their neural network accelerators. Weights are typically small and mostly need precision close to zero and one, which can be represented with fewer bits. Sometimes even a single bit is enough (XNOR-Net)!

IEEE 754
Google TPUs bfloat16
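For the 16-bit case, most frameworks make this a one-line cast. A quick PyTorch illustration:

```python
import torch

weights = torch.randn(1024, 1024)   # float32: 4 bytes per weight
half = weights.to(torch.float16)    # FP16, as used on NVIDIA tensor cores
brain = weights.to(torch.bfloat16)  # bfloat16, as used on Google TPUs

print(weights.element_size(), half.element_size(), brain.element_size())  # 4 2 2
print((weights - brain.float()).abs().max())  # bfloat16 trades mantissa bits for range
```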

Quantizing less than 16 bits

It’s not efficient to stick with the typical sign-exponent-mantissa representation at very low bit widths (≤ 8 bits). The usual ‘trick’ is to first project the weights to the [0, 1] range using tanh(.), then quantize (map each value to the nearest representable number), and then re-project back to [-1, 1]. In the 1-bit scenario you expect the values to end up as either -1 or 1.
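Here is a quick sketch of that tanh-and-round scheme (it follows the weight quantization used in DoReFa-Net):

```python
import torch

def quantize_weights(w, bits):
    """Squash to [0, 1] with tanh, round to 2**bits levels, re-project to [-1, 1]."""
    levels = 2 ** bits - 1
    squashed = torch.tanh(w) / (2 * torch.tanh(w).abs().max()) + 0.5  # now in [0, 1]
    quantized = torch.round(squashed * levels) / levels               # nearest level
    return 2 * quantized - 1                                          # back to [-1, 1]

w = torch.randn(5)
print(quantize_weights(w, bits=2))  # values drawn from {-1, -1/3, 1/3, 1}
print(quantize_weights(w, bits=1))  # values are either -1 or 1
```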

Due to the non-differentiable nature of quantizers (thresholding operations), we need to somehow calculate a gradient on the backward pass. The Straight-Through Estimator simply sets the gradient to 1 inside the used quantization intervals and 0 otherwise. For training, we keep full-precision shadow/virtual weights, which are the ones that get updated during backpropagation, while quantization is performed on the fly as the network executes. Then, when training is complete and we have reached convergence, we can simply quantize and store the weights.
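A minimal PyTorch sketch of the straight-through estimator, with the shadow weights as the actual trainable parameters (the scaling by 3 is just an arbitrary quantization grid for illustration):

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round on the forward pass; pass the gradient straight through on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # pretend the rounding was the identity

# Shadow (full precision) weights are the parameters being optimized; the
# quantized values are recomputed on the fly at every forward pass.
shadow = torch.randn(4, requires_grad=True)
quantized = RoundSTE.apply(shadow * 3) / 3  # quantize onto a grid of multiples of 1/3
loss = (quantized ** 2).sum()
loss.backward()
print(shadow.grad)  # gradients reach the shadow weights despite the rounding
```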

You can use the methodology of knowledge distillation to train networks with multiple bit modes jointly.

Learnable Quantization

You can also learn the quantizer parameters along with the network’s weights via stochastic gradient descent! Specific constraints and design considerations can be put in place to ensure the output format remains compatible with bit operations. Modern CPUs (even without a signal processing or graphics accelerator module) support these operations in a vectorized manner, processing 64 or more values at once!
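As an illustration of the idea, loosely in the spirit of learned step size quantization (LSQ) but heavily simplified, the quantization step can be an ordinary trainable parameter, with the straight-through trick handling the rounding:

```python
import torch
import torch.nn as nn

class LearnableQuantizer(nn.Module):
    """Simplified sketch: the quantization step size is itself a trainable parameter."""
    def __init__(self, bits=4, initial_step=0.1):
        super().__init__()
        self.levels = 2 ** (bits - 1) - 1
        self.step = nn.Parameter(torch.tensor(initial_step))

    def forward(self, x):
        scaled = torch.clamp(x / self.step, -self.levels, self.levels)
        # Straight-through rounding: round in the forward pass, identity gradient backward.
        rounded = scaled + (torch.round(scaled) - scaled).detach()
        return rounded * self.step

quantizer = LearnableQuantizer(bits=4)
x = torch.randn(8, requires_grad=True)
quantizer(x).sum().backward()
print(quantizer.step.grad)  # the step size receives a gradient and can be learned
```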

I won’t dive deeper into this topic, because it would get way more technical than I originally intended.

I hope that this blog post was food for thought and provided enough resources for the interested reader to look into later.

Thanks for reading all the way to the end!
