Self-supervised learning (SSL), or representation learning, is concerned with finding compressed and meaningful representations of inputs that are useful for downstream tasks. SSL is particularly attractive in computer vision because of the high dimensionality of images and the ready availability of transformations that leave their semantics unchanged.
A standard goal of SSL in the visual domain is to find representations that are invariant under selected transformations, such as random cropping, small rotations, horizontal flipping, etc. One may thus envision a loss that minimizes the distance between the representations of different augmentations of the same image. Unfortunately, such a loss is prone to collapse: a model that always predicts the same constant output achieves a perfect loss. For a long time it was thought that a penalty for small distances to negative samples (i.e. representations of different images) had to be added to the loss to prevent this collapse, which led to the development of algorithms involving (hard) negative sampling.
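To make this concrete, here is a minimal PyTorch-style sketch (function and variable names are illustrative and not taken from any of the papers): the first loss only pulls two views of the same image together and is trivially minimized by a constant encoder, while the second, contrastive variant uses the other images in the batch as negative samples.

```python
import torch
import torch.nn.functional as F


def naive_invariance_loss(f, x1, x2):
    """Pull the representations of two augmented views of the same images together.

    f:      any encoder network, e.g. a ResNet backbone (placeholder)
    x1, x2: two randomly augmented views of the same batch of images
    """
    z1, z2 = f(x1), f(x2)
    # Prone to collapse: an encoder that returns the same constant vector
    # for every input achieves zero loss.
    return F.mse_loss(z1, z2)


def contrastive_loss(f, x1, x2, temperature=0.1):
    """InfoNCE-style variant: the other images in the batch act as negative
    samples, which rules out the constant (collapsed) solution."""
    z1 = F.normalize(f(x1), dim=1)
    z2 = F.normalize(f(x2), dim=1)
    logits = z1 @ z2.t() / temperature                    # pairwise similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```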
In Bootstrap Your Own Latent (BYOL) [Gri20B], it was shown for the first time that SSL is viable without negative sampling. The algorithm is based on the rather curious observation that if a network is trained to mimic the representations of augmented images produced by a different, randomly initialized target network, the learned representations end up much better than the random targets. On ImageNet, a linear classifier achieves an 18.8% top-1 accuracy on them, vs. only 1.4% on the random targets themselves.
This experimental result gives rise to a sketch of a representation learning algorithm: iteratively improve the fixed target network together with the learned, online network. While collapse is still theoretically possible with this strategy, it has not been observed experimentally. BYOL adds some engineering on top of this basic idea, highlighted in the architecture diagram: the target is an exponential moving average of previous iterations of the online network. This kind of technique, where a “student” network is trained to reproduce the outputs of a “teacher” network, is sometimes called knowledge distillation or simply distillation. See e.g. this nice summary from an NYU class, where non-contrastive SSL algorithms are grouped under “distillation”.
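The following condensed sketch illustrates a BYOL-style training step. It is not the paper's exact architecture or hyperparameters: `backbone`, `projector` and `predictor` are assumed to be user-supplied modules (in the original paper the online branch carries a small predictor head that the target lacks), and the momentum value is only indicative.

```python
import copy
import torch
import torch.nn.functional as F


class BYOLSketch(torch.nn.Module):
    """Condensed BYOL-style training step (illustrative, not the official code)."""

    def __init__(self, backbone, projector, predictor, momentum=0.996):
        super().__init__()
        self.online = torch.nn.Sequential(backbone, projector)
        self.predictor = predictor
        # The target network starts as a copy of the online network; it is never
        # updated by gradients, only by an exponential moving average (EMA).
        self.target = copy.deepcopy(self.online)
        for p in self.target.parameters():
            p.requires_grad = False
        self.momentum = momentum

    def loss(self, x1, x2):
        # Online branch predicts the target's output for the other view.
        p1 = self.predictor(self.online(x1))
        p2 = self.predictor(self.online(x2))
        with torch.no_grad():  # stop-gradient on the target branch
            t1, t2 = self.target(x1), self.target(x2)
        # Symmetrized cosine-similarity loss between predictions and targets.
        return (2 - F.cosine_similarity(p1, t2, dim=1)
                  - F.cosine_similarity(p2, t1, dim=1)).mean()

    @torch.no_grad()
    def update_target(self):
        # The "teacher" slowly follows the "student": EMA of the online weights.
        for po, pt in zip(self.online.parameters(), self.target.parameters()):
            pt.mul_(self.momentum).add_((1 - self.momentum) * po)
```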
An experimental study in Exploring Simple Siamese Representation Learning (SimSiam) [Che20E] showed that, contrary to the assumption in BYOL, using previous iterations of the online network as the target is not necessary to prevent collapse. One can thus use a simpler architecture, keeping a single network as both online and target and simply not computing gradients w.r.t. the weights of the “target” branch. The main improvement over BYOL is a simpler implementation with comparable performance.
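Under the same illustrative assumptions as above, a SimSiam-style loss reduces to a few lines: the stop-gradient (here a `.detach()`) on the target branch replaces BYOL's moving-average teacher.

```python
import torch.nn.functional as F


def simsiam_loss(encoder, predictor, x1, x2):
    """SimSiam-style loss sketch (illustrative): one shared encoder serves as
    both online and target network; the stop-gradient on the target branch
    replaces BYOL's exponential-moving-average teacher."""
    z1, z2 = encoder(x1), encoder(x2)
    p1, p2 = predictor(z1), predictor(z2)
    # Negative cosine similarity, symmetrized over the two views;
    # gradients only flow through the prediction branch.
    return -(F.cosine_similarity(p1, z2.detach(), dim=1).mean()
             + F.cosine_similarity(p2, z1.detach(), dim=1).mean()) / 2
```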
Both BYOL and SimSiam reach linear-classifier accuracies above 70% on ImageNet, often coming remarkably close to supervised training with comparable networks; see the corresponding diagram. Multiple implementations of both BYOL and SimSiam are available on GitHub, see the references below.
Since then, other algorithms for self-supervised learning without negative pairs have emerged and become very popular, like Barlow Twins [Zbo21B] and DINO [Car21E]. We plan to cover them in upcoming paper pills - stay tuned!