Learning Strides in Convolutional Neural networks | TransferLab

Strides are an important discrete hyperparameter in convolutional neural networks. Together with pooling layers they control the amount of downsampling happening in the data processing, and changes in them can significantly affect accuracy. Therefore, a costly hyperparameter search for the best strides is often necessary to reach optimal performance of a CNN.

What if we could learn the best strides directly instead of performing a discrete search? [Ria22L] introduces the DiffStride layer that does exactly that - a keras implementation is provided by the authors.

There are two main ideas underlying the new layer:

Instead of reducing the image size through strided convolution followed by pooling, it is done by cropping the discrete Fourier transform (DFT) of the image around its center and transforming back. This operation is a low pass filter, the DFT is differentiable and pooling of this type, called “Spectral Pooling” [Rip15S], has been shown to perform well in CNNs.
While cropping to a fixed size box is non-differentiable (in the size of the box), this cropping can be softened to multiplication with a parametrized, mask-like matrix which is differentiable in the parameter governing its size. This mechanism was put forward in “Adaptive attention span in transformers” [Suk19A] for learning the size of attention context in transformer layers. The soft mask from that paper is shown in the figure. The parameter $z$ determines its size, taking a derivative w.r.t. $z$ results in a non-trivial gradient because of the non-constant profile of the soft mask and thus $z$ can be learned through gradient descent. $R$ becomes a hyperparameter of the neural network.

The DiffStride layer is a combination of these ideas - a 2-dimensional soft mask, analogous to the one-dimensional discussed above, is multiplied with the DFT of the input. The height and width of the soft mask can be learned through gradient descent. The region in which the soft mask is zero is cropped (a non-differentiable operation which is performed with stop-grad) and the inverse DFT results in a smaller image. This processing is displayed in the architecture diagram below.

Experiments show that comparable or even slightly better performance to optimal strides (found through discrete search) in a normal Resnet can be reached starting out with random strides and using the new DiffStride layer. This paper is thus a valuable contribution for all practitioners who don’t want to run a hyperparameter search over strides when training a CNN.

Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or pooling layers, that progressively reduce the resolution of intermediate representations. This provides some shift-invariance while reducing the computational complexity of the whole architecture. A critical hyperparameter of such layers is their stride: the integer factor of …

Discrete Fourier transforms provide a significant speedup in the computation of convolutions in deep learning. In this work, we demonstrate that, beyond its advantages for efficient computation, the spectral domain also provides a powerful representation in which to model and train convolutional neural networks (CNNs).We employ spectral representations to introduce a number of innovations to CNN …

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to extend significantly the maximum context size used in Transformer, while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on the task of character level language modeling, where we achieve state-of-the-art performances on text8 …

References