A quick overview of compression methods for Convolutional Neural Networks

Namely: Hyperparameter Search, Convolution Variants, Network-in-Network, Weight Sharing, Pruning, Quantization

Picture by stevepb on Pixabay


A very deep neural network designed to solve hard problems may take a few seconds to run even on modern computers with hardware accelerators for linear algebra operations. Even smaller networks, which run much faster, may still fail to meet real-time constraints. Hardware resource and execution time constraints are what drive the research community to investigate different methods of neural network compression. In the next few sections, common compression methods are presented.

Hyperparameter Search

In this specific problem, we do not only optimize for accuracy (or the metric of your choice), but also for efficiency. Typical techniques for hyperparameter search include Random Search, Grid Search, Bayesian Optimization and Bandit/Tournament-based methods.

Hyperparameter Search Visualization
Efficient Net Scaling Dimensions
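As a minimal sketch of random search over both accuracy and efficiency, consider the toy objective below. The `objective` function and its parameters (`lr`, `width_mult`) are purely illustrative stand-ins; in practice you would train and evaluate a real network at each sampled point.

```python
import random

# Hypothetical objective combining accuracy and model size; a stand-in for
# actually training and evaluating a network at each configuration.
def objective(lr, width_mult):
    accuracy = 1.0 - abs(lr - 0.01) * 10 - abs(width_mult - 1.0) * 0.1
    params = 1e6 * width_mult ** 2          # rough parameter-count proxy
    return accuracy - 1e-7 * params         # trade accuracy off against size

random.seed(0)
best, best_score = None, float("-inf")
for _ in range(100):                        # random search: sample the space
    lr = 10 ** random.uniform(-4, -1)       # log-uniform learning rate
    width = random.uniform(0.5, 2.0)        # network width multiplier
    score = objective(lr, width)
    if score > best_score:
        best, best_score = (lr, width), score
print(best, best_score)
```

Grid search would replace the random sampling with a fixed lattice of configurations; Bayesian and bandit methods instead use past evaluations to decide where to sample next.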

Convolutional Variants

Besides hyperparameter search methods, there are hand-crafted computational schemes that rely on mathematical tricks to reduce the parameter count of each layer while retaining a proportionally larger share of its computational capacity.

Group Convolution

Split the convolutional filters of each layer into equal-sized groups, and apply each group of filters to its own equal-sized group of input channels. Note that you can end up with a more computationally demanding network if you choose group and filter sizes poorly.

Group Convolution
Output Channel Group Shuffling
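The parameter savings are easy to verify by counting weights: with g groups, each group only connects a 1/g slice of the input channels to a 1/g slice of the output channels, so the total shrinks by a factor of g. A quick sketch:

```python
def conv_params(c_in, c_out, k, groups=1):
    # each group maps c_in/groups input channels to c_out/groups outputs
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k

dense   = conv_params(256, 256, 3)             # standard convolution
grouped = conv_params(256, 256, 3, groups=4)   # 4 groups -> 4x fewer weights
print(dense, grouped)  # 589824 147456
```

Since each group sees only its own input channels, architectures like ShuffleNet add the channel shuffling shown above so that information can still mix across groups.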


Inception v1

Hyperparameter search? Special convolution ‘variants’? Empirical network design? This technique actually sits right in the middle. Let’s focus on the Inception-v{1,2,3,4} model family. These models replace the typical convolutional layer stack of classic convolutional networks like ResNets and VGG with layer groups designed to have greater knowledge capacity relative to their slight increase in parameters and computational cost.


Naive Inception v1 Block
Inception v1 Block with less input channels

Inception v{2,3}

The larger convolutional kernels of size (5, 5) are replaced by a cheaper alternative: 2 stacked layers with kernels of size (3, 3). Then, layers with kernel sizes (n, n), n > 1, are factorized into 2 layers, the first with filter size (n, 1) and the second with filter size (1, n). To limit the total depth of the network (you typically stack tens of these layer groups together), the (1, n) and (n, 1) filters are computed in parallel.

Inception v2 Group
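Counting weights makes both factorizations concrete: two (3, 3) layers cost less than one (5, 5) layer while covering the same receptive field, and an (n, 1) plus (1, n) pair costs 2n weights per filter position instead of n². A sketch with illustrative channel counts:

```python
def conv_params(c_in, c_out, k_h, k_w):
    # weight count of a single convolutional layer (biases ignored)
    return c_in * c_out * k_h * k_w

c = 64                                       # illustrative channel count
five       = conv_params(c, c, 5, 5)         # one (5, 5) layer
two_threes = 2 * conv_params(c, c, 3, 3)     # two stacked (3, 3) layers
print(five, two_threes)                      # 102400 73728

full    = conv_params(c, c, 7, 7)            # one (7, 7) layer
spatial = conv_params(c, c, 7, 1) + conv_params(c, c, 1, 7)  # factorized
print(full, spatial)                         # 200704 57344
```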

Inception v4

This revision of the Inception family investigated many filter size and arrangement configurations in order to be more efficient. Special steps were taken to reduce computational overhead where less knowledge capacity is needed. In brief, dedicated downsizing (reduction) blocks were created in order to work better with residual networks.

Weight Sharing

We can keep a smaller number of real weights than the network nominally requires and project them into virtual weight matrices that are used to perform the forward pass. Then, on the backward pass, we just accumulate (or average) the gradient for each real weight from the virtual ones! The mapping from virtual to real weights is typically achieved with some sort of hash function.

Weight Sharing Visualization
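A minimal sketch of the hashing idea (in the spirit of HashedNets; all names here are illustrative): a small vector of K real weights backs an arbitrarily large virtual matrix, with a deterministic hash deciding which real weight sits at each virtual position.

```python
import hashlib

K = 16                                      # number of real (stored) weights
real_weights = [0.01 * i for i in range(K)]

def bucket(i, j):
    # deterministic hash of a virtual position (i, j) into one of K buckets
    h = hashlib.md5(f"{i},{j}".encode()).hexdigest()
    return int(h, 16) % K

def virtual_weight(i, j):
    # forward pass reads the virtual matrix through the hash
    return real_weights[bucket(i, j)]

# a gradient computed for virtual entry (i, j) would simply be accumulated
# into real_weights[bucket(i, j)] on the backward pass
row = [virtual_weight(0, j) for j in range(8)]
print(row)
```

The storage cost is K weights regardless of how large the virtual matrix is; collisions (several virtual entries sharing one real weight) are exactly the point.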

Sparsity and Pruning

Not every neuron is created (or trained) equal. Some may contribute practically zero information to the total output of the network (due to very small weight values). These can be pruned away, leading to a sparse (mostly zeros) network. Then, special sparse matrix storage formats (CSC, CSR) and linear algebra subroutines can be used to accelerate computation and reduce storage space by a huge margin; being able to load the whole network into on-chip SRAM is a huge advantage. There are many techniques for deciding what to prune, magnitude-based thresholding being the simplest.
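A minimal sketch of magnitude pruning followed by CSR storage (plain Python lists for clarity; real implementations use packed arrays):

```python
def prune(matrix, threshold):
    # zero out weights whose magnitude is below the threshold
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in matrix]

def to_csr(matrix):
    # compressed sparse row: non-zero values, their column indices, and
    # row pointers marking where each row starts in the value array
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

W = [[0.8, -0.02, 0.0], [0.01, 0.5, -0.9]]
sparse = prune(W, threshold=0.1)
print(to_csr(sparse))  # ([0.8, 0.5, -0.9], [0, 1, 2], [0, 1, 3])
```

Only the surviving values and their indices are stored, so the memory footprint scales with the number of non-zeros rather than the full matrix size.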


Quantization

You don’t need 32 bits to represent weights. You might not even need 16! Google and nVidia already use 16-bit precision formats in their neural accelerators. Weights are typically small and need high precision close to zero and one; fewer bits are enough for that. Sometimes even 1 (XNOR-Net)!

IEEE 754
Google TPUs bfloat16

Quantizing less than 16 bits

It’s not efficient to stick with the typical sign / exponent / mantissa representation for very low bit widths (≤ 8 bits). The usual ‘trick’ employed is to first project the weights to the [0, 1] range using tanh(.), then to quantize (map each value to the nearest representable level) and then to re-project back to [-1, 1]. In the 1-bit scenario you expect the values to be either -1 or 1.
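The project-quantize-reproject recipe above can be sketched as follows (a DoReFa-style scheme; here `w_max` stands in for the maximum absolute weight in the layer, which is an assumption of this sketch rather than something fixed by the text):

```python
import math

def quantize_weight(w, bits, w_max):
    # 1) tanh-normalize into [0, 1]; w_max is the layer's max |weight|
    x = math.tanh(w) / (2.0 * math.tanh(w_max)) + 0.5
    # 2) quantize: snap to the nearest of 2**bits levels
    levels = 2 ** bits - 1
    q = round(x * levels) / levels
    # 3) re-project back to [-1, 1]
    return 2.0 * q - 1.0

# with 1 bit the output collapses to -1 or +1, as described above
print(quantize_weight(0.3, bits=1, w_max=1.0))   # 1.0
print(quantize_weight(-0.3, bits=1, w_max=1.0))  # -1.0
```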

Learnable Quantization

You can also learn the quantizer parameters stochastically along with the network’s weights! Specific constraints/design considerations can be put in place in order to ensure a bit-operation-compatible output format. Modern CPUs (even without a signal processing or graphics accelerator module) support these operations in a vectorized manner, processing 64 or more values at once!
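To see why bit operations pay off, here is a sketch of a binary dot product in the XNOR-Net style: with weights and activations binarized to ±1 and packed into machine words, one XNOR plus one popcount replaces n multiply-adds. Python ints stand in for hardware registers here.

```python
def binary_dot(a_bits, b_bits, n):
    # a_bits, b_bits: n binarized values packed into ints,
    # where bit=1 encodes +1 and bit=0 encodes -1
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # 1 where the signs agree
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n                      # (#agreements) - (#disagreements)

# elements read lsb-first: a = [+1, +1, -1, +1], b = [+1, -1, +1, +1]
# dot = 1 - 1 - 1 + 1 = 0
a = 0b1011
b = 0b1101
print(binary_dot(a, b, 4))  # 0
```

On real hardware the same pattern runs over 64-bit registers (or wider SIMD lanes), which is where the "64 or more values at once" speedup comes from.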



