“A Simple Framework for Contrastive Learning of Visual Representations” Summarized


https://arxiv.org/abs/2002.05709 (2020. 2. 13)

1. Representation Learning Approaches: Generative vs Discriminative

Generative approaches learn to generate or otherwise model pixels in the input space (e.g., autoencoders, GANs). However, the authors note that pixel-level generation is computationally expensive and may not be necessary for learning representations.

Discriminative approaches learn representations using objective functions similar to those used for supervised learning. There are two types.

  1. Learning with handcrafted pretext tasks

    Some works trained networks to perform pretext tasks where both the inputs and labels are derived from an unlabeled dataset. However, they relied on heuristics to design pretext tasks, which could limit the generality of the learned representations.

  2. Contrastive visual representation learning

    The authors say discriminative approaches based on contrastive learning have shown great promise. These approaches learn representations by contrasting positive pairs against negative pairs.

The authors propose a simple framework for contrastive learning of visual representations: SimCLR.

2. Algorithm

The objective is to maximize the similarity between differently augmented views of the same data example.

  1. Sample a batch of N examples.
  2. Apply two random augmentations to each example (now we have 2N examples).
  3. Put the images through the network and get projection vectors $z$.
  4. For each positive pair, calculate the loss.
  5. Average the loss within the batch.
  6. Backpropagate to update the weights.
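
A minimal PyTorch-style sketch of steps 1–3 (the encoder, projection head, and augmentation choices here are illustrative placeholders, not the paper's exact setup; the loss for steps 4–5 is sketched in the next section):

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Illustrative two-view augmentation (see Section 4 for a fuller pipeline).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Encoder f(.) (a ResNet-50 with its classifier removed) and projection head g(.).
encoder = models.resnet50()
encoder.fc = nn.Identity()              # h = f(x), 2048-d representation
projection_head = nn.Sequential(        # z = g(h), 128-d projection
    nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 128)
)

def two_views(pil_images):
    """Steps 1-2: two independent augmentations per example -> 2N images."""
    views = [augment(img) for img in pil_images] + [augment(img) for img in pil_images]
    return torch.stack(views)

def projections(batch_2n):
    """Step 3: run the 2N images through the network to get projections z."""
    h = encoder(batch_2n)               # representations used for downstream tasks
    return projection_head(h)           # projections used only for the contrastive loss
```

With this ordering, `z[i]` and `z[i+N]` form a positive pair, which the loss sketch below relies on.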

3. Loss

The loss is calculated only between positive pairs (inputs that were augmented from the same image):

\[l_{ij} = -\log \frac{\exp(sim(z_i, z_j)/\tau)}{\sum_{k=1}^{2N}\mathbb{1}_{[k\neq i]}\exp(sim(z_i, z_k)/\tau)}\]

$\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)$ is cosine similarity, and $\tau$ is a temperature parameter.

Minimizing this loss maximizes the similarity between positive pairs (the numerator) while minimizing the similarity with negative pairs (the denominator).

The authors term this loss NT-Xent (the normalized temperature-scaled cross entropy loss).

The authors observed that NT-Xent works better than alternative losses (logistic, margin triplet), because the cross-entropy formulation weighs negatives by their relative hardness, while the other losses do not.
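
A minimal PyTorch sketch of NT-Xent, assuming the 2N projections are ordered so that `z[i]` and `z[i+N]` are a positive pair (as in the sketch from Section 2):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent over 2N projections; z[i] and z[i+N] form the positive pairs."""
    n = z.shape[0] // 2
    z = F.normalize(z, dim=1)                 # unit norm -> dot product = cosine similarity
    sim = (z @ z.t()) / temperature           # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float('-inf'))         # drop the k = i terms from the denominator
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # index of each positive
    # Row-wise cross-entropy is exactly -log(exp(pos) / sum over the other 2N-1 terms),
    # averaged over all 2N anchors (i.e., over every positive pair in both orders).
    return F.cross_entropy(sim, targets)
```

Combined with the earlier sketch, `loss = nt_xent_loss(projections(two_views(pil_images)))` would implement steps 4–5 of the algorithm.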

4. Augmentations

Data augmentation defines predictive tasks

Many existing approaches define contrastive prediction tasks (global-to-local view prediction, neighboring view prediction) by changing the architecture. The authors show this complexity can be avoided by simple random cropping (with resizing).

Composition of data augmentation operations is crucial for learning good representations

To understand the effects of individual augmentations and their compositions, the authors first randomly crop images and resize them to the same resolution. They then apply the targeted transformation(s) to one branch only, so that the cropped/resized image and its augmented counterpart form a positive pair in the batch. The experimental results are illustrated below.

We can see that composing augmentations makes the contrastive prediction task harder, but the quality of the learned representations improves dramatically.

‘random cropping + random color distortion’ performed best. The authors note that crops of the same image share similar color distributions, so a network could match them using color histograms alone; adding color distortion removes this shortcut and forces the network to learn more generalizable features.

Contrastive learning needs stronger data augmentation than supervised learning

Stronger color augmentation substantially improves the linear evaluation of the learned unsupervised models.

Data augmentation that does not yield accuracy benefits for supervised learning can still help considerably with contrastive learning.
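
As a concrete illustration, here is a torchvision sketch of such an augmentation pipeline; the strength `s`, the probabilities, and the blur kernel size follow the paper's reported settings as I understand them, so treat the exact values as assumptions:

```python
from torchvision import transforms

s = 1.0  # color-distortion strength; stronger color jitter helps contrastive learning
color_jitter = transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)

simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                  # random crop + resize
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([color_jitter], p=0.8),      # strong color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5
    ),                                                  # kernel ~10% of image size
    transforms.ToTensor(),
])
```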

5. Network Architecture

Unsupervised contrastive learning benefits more from bigger models

A nonlinear projection head improves the representation quality of the layer before it

Why does using the representation before the nonlinear projection perform better? The authors conjecture it is due to loss of information induced by the contrastive loss. $z=g(h)$ is trained to be invariant to data transformation. Thus, $g$ can remove information that may be useful for the downstream task, such as the color or orientation of objects.
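
For reference, a sketch of the three projection-head variants compared in the paper; the dimensions assume the ResNet-50 setup with a 2048-d $h$ and 128-d $z$:

```python
import torch.nn as nn

dim_h, dim_z = 2048, 128   # encoder output h and projection z dimensions

identity_head = nn.Identity()                 # z = h (no projection)
linear_head = nn.Linear(dim_h, dim_z)         # linear projection
nonlinear_head = nn.Sequential(               # nonlinear projection, best in the paper
    nn.Linear(dim_h, dim_h), nn.ReLU(), nn.Linear(dim_h, dim_z)
)
# After pretraining, the head is discarded and downstream tasks use h, not z.
```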

6. Batch Size

Contrastive learning benefits more from larger batch sizes and longer training

This is because larger batch sizes and longer training provide more negative examples.

7. Comparison with State-of-the-art

1. Linear Classifiers

Freeze the pretrained encoder, add a linear layer on top, and train only that layer.
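
A minimal sketch of this protocol, assuming a SimCLR-pretrained ResNet-50 (the weight-loading step is omitted and the names are illustrative):

```python
import torch.nn as nn
from torchvision import models

encoder = models.resnet50()            # assume SimCLR-pretrained weights are loaded here
encoder.fc = nn.Identity()             # expose the 2048-d representation h
for p in encoder.parameters():
    p.requires_grad = False            # freeze the encoder

linear_classifier = nn.Linear(2048, 1000)   # the only trainable parameters
model = nn.Sequential(encoder, linear_classifier)
```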

2. Semi-Supervised Learning

Fine-tune the pretrained encoder on a small fraction of the labeled data (e.g., 1% or 10% of ImageNet labels).
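
In contrast to linear evaluation, the whole network is updated here. A minimal sketch, assuming SimCLR-pretrained weights and 1000 ImageNet classes (the optimizer settings are illustrative):

```python
import torch
import torch.nn as nn
from torchvision import models

encoder = models.resnet50()                 # assume SimCLR-pretrained weights are loaded
encoder.fc = nn.Linear(2048, 1000)          # new classifier head for the labeled subset
optimizer = torch.optim.SGD(encoder.parameters(),   # all parameters are fine-tuned
                            lr=0.05, momentum=0.9)
```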

3. Transfer Learning

Evaluate the pretrained encoder on other natural-image datasets, either as a fixed feature extractor (linear evaluation) or by fine-tuning.
