Representation learning with BYOL and SimSiam

BYOL was the first work to show how useful low-dimensional representations can be when learned in an unsupervised way without negative sampling. It inspired a series of simpler architectures, with SimSiam among them.

Self-supervised learning (SSL) or representation learning is concerned with finding compressed and meaningful representations of inputs useful for downstream tasks. Due to the high dimensionality of images and the readily available transformations that do not change the semantics, SSL is particularly attractive in computer vision.

A standard goal of SSL in the visual domain is finding representations which are invariant under selected transformations, like random cropping, small rotations, horizontal flipping, etc. Thus, one may envision a loss that minimizes distances between predictions for different augmentations of the same image. Unfortunately, such a loss is prone to collapse - a model always predicting a constant output would give a perfect solution. For a long time it was thought that preventing such collapse required adding a term to the loss that penalizes small distances to negative samples (i.e. representations of different images), which led to the development of algorithms involving (hard) negative sampling.
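The collapse problem can be made concrete with a toy example. The following is a minimal sketch, not taken from either paper: `augment` is a hypothetical stand-in for an image augmentation (it just adds noise), and the two encoders are illustrative. It shows that an encoder ignoring its input reaches the minimum of a pure invariance loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, rng):
    """Hypothetical stand-in for an image augmentation: add small noise."""
    return x + 0.1 * rng.standard_normal(x.shape)

def invariance_loss(encoder, x, rng):
    """Mean squared distance between the embeddings of two augmented views."""
    z1 = encoder(augment(x, rng))
    z2 = encoder(augment(x, rng))
    return float(np.mean(np.sum((z1 - z2) ** 2, axis=1)))

# A random linear map, used here as a toy non-trivial encoder.
W = rng.standard_normal((32, 8))

def linear_encoder(x):
    return x @ W

def collapsed_encoder(x):
    # Ignores its input and always returns the same vector.
    return np.ones((x.shape[0], 8))

x = rng.standard_normal((16, 32))  # a batch of 16 toy "images"
loss_linear = invariance_loss(linear_encoder, x, rng)
loss_collapsed = invariance_loss(collapsed_encoder, x, rng)
print(f"linear encoder: {loss_linear:.4f}, collapsed encoder: {loss_collapsed:.4f}")
```

The collapsed encoder attains zero loss although its output carries no information about the input, which is why an invariance loss alone is not enough.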

Figure 1. [Gri20B] Performance of BYOL on ImageNet (linear evaluation) using multiple CNN architectures, in particular ResNet-50 and ResNet-200 ($2\times$), compared to other unsupervised and supervised (Sup.) baselines

In Bootstrap Your Own Latent (BYOL) [Gri20B], it was shown for the first time that SSL is viable without negative sampling. The algorithm was based on the rather curious observation that if a network is trained to mimic representations of augmented images from a different, randomly initialized target network, the resulting learned representations are much better than the random targets. On ImageNet, a linear classifier trained on the learned representations achieves 18.8% top-1 accuracy, vs. 1.4% on the random targets themselves.

This experimental result gives rise to the sketch of a representation learning algorithm: iteratively improve the target network together with the learned, online network. While a collapse is still theoretically possible with this strategy, it has not been observed experimentally. BYOL uses some engineering on top of this basic idea, which is highlighted in the architecture diagram. The target itself is a moving average of previous iterations of the online network. This kind of technique, where a “student” network is trained to reproduce the outputs of a “teacher” network, is sometimes called knowledge distillation or simply distillation. See e.g. this nice summary from an NYU class, where non-contrastive SSL algorithms are grouped under “distillation”.

Figure 2. [Gri20B] BYOL’s architecture. BYOL minimizes a similarity loss between $q_\theta(z_\theta)$ and $sg(z'_{\xi})$, where $\theta$ are the trained weights, $\xi$ are an exponential moving average of $\theta$ and $sg$ means stop-gradient. At the end of training, everything but $f_\theta$ is discarded, and $y_\theta$ is used as the image representation.
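The two ingredients named in the caption, the similarity loss and the moving-average target, can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the reference implementation: the function names, the dictionary layout of the parameters and the decay value `tau=0.99` are assumptions made here. The loss is the squared distance between L2-normalized vectors, which equals $2 - 2\cos(q, z')$. Since NumPy has no automatic differentiation, the stop-gradient is only indicated in a comment.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale each row to unit length (eps avoids division by zero)."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def byol_loss(online_prediction, target_projection):
    """Per-sample loss 2 - 2 * cosine similarity.

    In a real training loop the target projection would be wrapped in a
    stop-gradient, i.e. treated as a constant during backpropagation.
    """
    q = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * np.sum(q * z, axis=-1)

def ema_update(target_params, online_params, tau=0.99):
    """Move target weights towards online weights: xi <- tau*xi + (1-tau)*theta.

    tau=0.99 is an illustrative value, not a recommendation.
    """
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }

# Toy check: aligned vectors give a loss near 0, opposite vectors near 4.
q = np.array([[1.0, 0.0], [0.0, 2.0]])
z = np.array([[3.0, 0.0], [0.0, -1.0]])
losses = byol_loss(q, z)
print(losses)

# One EMA step on a hypothetical single-tensor parameter set.
target = {"w": np.zeros(2)}
online = {"w": np.full(2, 2.0)}
target = ema_update(target, online, tau=0.5)
print(target["w"])
```

Because both vectors are normalized, the loss depends only on the angle between prediction and target, not on their magnitudes.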

An experimental study in Exploring Simple Siamese Representation Learning (SimSiam) [Che21E] showed that, contrary to the assumption in BYOL, using previous iterations of the online network as target is not necessary to prevent collapse. One can thus use a simpler architecture, keeping only one network as both online and target, and not computing gradients with respect to the weights on the “target” side. The main improvement over BYOL here is a simpler implementation with comparable performance.

Figure 3. [Che21E] SimSiam architecture. Two augmented views of one image are processed by the same encoder network $f$ (a backbone plus a projection MLP). Then a prediction MLP $h$ is applied on one side, and a stop-gradient operation is applied on the other side. The model maximizes the similarity between both sides. It uses neither negative pairs nor a momentum encoder.
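The objective described in the caption can be written down compactly. The notation below is introduced here for illustration: $x_1, x_2$ are the two augmented views, $z_i = f(x_i)$ are the encoder outputs, $p_i = h(z_i)$ are the predictions, and $sg$ denotes stop-gradient as in Figure 2. With the negative cosine similarity

$$D(p, z) = -\frac{p}{\lVert p \rVert_2} \cdot \frac{z}{\lVert z \rVert_2},$$

a symmetrized loss of the kind used in [Che21E] reads

$$\mathcal{L} = \frac{1}{2} D\big(p_1, sg(z_2)\big) + \frac{1}{2} D\big(p_2, sg(z_1)\big).$$

Each view thus serves once as the input to the predictor and once as the constant target. Minimizing $\mathcal{L}$ maximizes the similarity between both sides, and its lowest possible value is $-1$.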

Both BYOL and SimSiam reach linear-classifier accuracies above 70% on ImageNet, often coming remarkably close to supervised training with comparable networks, as Figure 1 shows for BYOL. Multiple implementations of both BYOL and SimSiam are available on GitHub, see the references below.

Since then, other, by now very popular algorithms for self-supervised learning without negative pairs have emerged, like Barlow Twins [Zbo21B] and DINO [Car21E]. We plan to cover them in upcoming paper pills - stay tuned!