“ALBERT: A Lite BERT for Self-supervised Learning of Language Representations” Summarized

https://arxiv.org/abs/1909.11942 (2019/09/26)

1. Self-Attention Models Getting Larger?

After BERT, much subsequent research has shown that a large network is of crucial importance for achieving state-of-the-art performance on many NLP tasks. However, the memory limitations of available hardware are an obstacle to scaling these networks further. Moreover, in distributed training, communication overhead is directly proportional to the number of parameters in the model.

To address the memory limitation issue, previous works proposed “gradient checkpointing, which reduces the memory requirement to be sublinear at the cost of an extra forward pass”, “reconstructing each layer’s activations from the next layer so that intermediate activations do not need to be stored”, or “model parallelization”.

The authors instead alter the architecture itself to directly reduce the number of parameters. They propose A Lite BERT (ALBERT), an architecture with significantly fewer parameters than a traditional BERT architecture. The large version of ALBERT established new SOTA results on the GLUE, SQuAD, and RACE benchmarks.

ALBERT differs from BERT in three main ways:

  1. Factorized embedding parameterization
  2. Cross-layer parameter sharing
  3. Inter-sentence coherence loss

Let’s look at them one by one.

2. Factorized embedding parameterization

In previous transformer encoder architectures such as BERT, XLNet, and RoBERTa, the WordPiece embedding size $E$ is tied to the hidden layer size $H$ ($E = H$). This is suboptimal for both modeling and practical reasons.

  1. Modeling: WordPiece embeddings are meant to learn context-independent representations, whereas hidden-layer embeddings are meant to learn context-dependent representations. The power of BERT comes from the context-dependent representations, which implies that $H \gg E$ is ideal.
  2. Practical: Typically, vocabulary size $V$ is very large. If $E=H$, then increasing $H$ increases the size of the embedding matrix $V \times E$ significantly. This can easily result in a model with billions of parameters, most of which are only updated sparsely during training.

Instead of projecting the one-hot vocabulary vectors directly into the hidden space of size $H$, ALBERT first projects them into a lower-dimensional embedding space of size $E$ and then projects them into the hidden space. The embedding parameters are thus reduced from $O(V \times H)$ to $O(V \times E + E \times H)$, which is significant when $H \gg E$.
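
To make the factorization concrete, here is a minimal PyTorch sketch (my own illustration, not the paper’s code; the module names and the bias-free projection are assumptions). It uses the ALBERT-xxlarge sizes $V = 30000$, $H = 4096$, $E = 128$, for which the factorized embedding needs roughly 4.4M parameters instead of about 123M.

```python
# A minimal sketch of factorized embedding parameterization (illustrative only).
import torch
import torch.nn as nn

V, H, E = 30000, 4096, 128

# BERT-style: project one-hot vocabulary vectors straight into the hidden space.
tied_embedding = nn.Embedding(V, H)           # V * H ≈ 122.9M parameters

# ALBERT-style: embed into a small space of size E, then project up to H.
factorized_embedding = nn.Sequential(
    nn.Embedding(V, E),                       # V * E ≈ 3.8M parameters
    nn.Linear(E, H, bias=False),              # E * H ≈ 0.5M parameters
)

token_ids = torch.randint(0, V, (2, 512))     # (batch, sequence length)
hidden_in = factorized_embedding(token_ids)   # shape: (2, 512, H)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(tied_embedding), count(factorized_embedding))  # ~122.9M vs ~4.4M
```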

3. Cross-layer parameter sharing

There are multiple ways to share parameters, e.g., sharing only the feed-forward network (FFN) parameters across layers, or sharing only the attention parameters. ALBERT’s default decision is to share all parameters across layers, as in the sketch below.
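
As a rough illustration of what “share all parameters across layers” means in code, here is a minimal PyTorch sketch (my own, using `torch.nn.TransformerEncoderLayer` as a stand-in for ALBERT’s encoder layer; the sizes are illustrative, not ALBERT’s actual configuration).

```python
# A minimal sketch of cross-layer parameter sharing (all parameters shared).
import torch
import torch.nn as nn

H, num_heads, num_layers = 768, 12, 12

# BERT-style: every layer has its own parameters, so size grows linearly with depth.
unshared_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=H, nhead=num_heads, batch_first=True)
     for _ in range(num_layers)]
)

# ALBERT-style: one layer object whose weights are reused at every depth.
shared_layer = nn.TransformerEncoderLayer(d_model=H, nhead=num_heads, batch_first=True)

def shared_encoder(hidden_states):
    for _ in range(num_layers):               # the same weights applied 12 times
        hidden_states = shared_layer(hidden_states)
    return hidden_states

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(unshared_layers) / count(shared_layer))   # ~12x more parameters unshared
x = torch.randn(2, 512, H)
print(shared_encoder(x).shape)                        # torch.Size([2, 512, 768])
```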

Bai et al. show that their DQEs (deep equilibrium models), which also share parameters across layers, reach an equilibrium point at which the input and output embeddings of a certain layer stay the same. However, the authors’ measurements show that ALBERT’s embeddings oscillate rather than converge.

This shows that the solution space for ALBERT parameters is very different from the one found by DQE.

Also, we can observe that the transitions from layer to layer are much smoother for ALBERT than for BERT, which shows that weight-sharing has an effect on stabilizing network parameters.

4. Inter-sentence coherence loss

In addition to the masked language modeling (MLM) loss, BERT uses an additional loss called next-sentence prediction (NSP). However, subsequent studies found NSP’s impact unreliable. The authors conjecture that the main reason behind NSP’s ineffectiveness is its lack of difficulty as a task: NSP conflates topic prediction and coherence prediction, and because its negative examples come from different documents, the model can get the task right with the easier topic prediction alone.

The authors instead propose a loss based primarily on inter-sentence coherence: the sentence-order prediction (SOP) loss, which requires the model to predict not topic but coherence, a harder task. Their experiments show that a model trained with NSP cannot solve the SOP task at all, while a model trained with SOP can solve the NSP task to a reasonable degree.
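
To make the contrast concrete, here is a small sketch of how the two objectives build training pairs (my own illustrative code; it assumes documents are already split into consecutive segments, and the function names are hypothetical).

```python
# A minimal sketch contrasting NSP and SOP example construction.
import random

def nsp_pair(doc, corpus):
    """NSP: positive = two consecutive segments; negative = second segment drawn
    from a *different* document, so topic alone often gives the label away."""
    a, b = doc[0], doc[1]
    if random.random() < 0.5:
        return (a, b), 1                              # IsNext
    other = random.choice(corpus)
    return (a, random.choice(other)), 0               # NotNext (topic shift)

def sop_pair(doc):
    """SOP: both segments always come from the same document, so topic cues are
    useless; a negative example simply swaps the order of the two segments."""
    a, b = doc[0], doc[1]
    if random.random() < 0.5:
        return (a, b), 1                              # correct order
    return (b, a), 0                                  # swapped order

corpus = [["Sentence A1.", "Sentence A2."], ["Sentence B1.", "Sentence B2."]]
print(nsp_pair(corpus[0], corpus))
print(sop_pair(corpus[0]))
```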

5. Model / Training Details

  • Data: BookCorpus + English Wikipedia -> 16GB
  • Input: “[CLS] segment1 [SEP] segment2 [SEP]”
    • For SOP: swap segments 1 & 2 to create negative examples
    • For MLM: n-gram masking (see the sketch after this list)
    • Maximum input length: 512
  • Tokenizer: a vocabulary of 30,000 tokens, built with SentencePiece
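
On the n-gram masking: the paper samples each masked span’s length $n$ (up to 3) with probability proportional to $1/n$. The sketch below illustrates that length distribution; the surrounding masking loop is a simplification of my own (the real pipeline operates on SentencePiece tokens and follows BERT’s MLM setup more closely).

```python
# A minimal sketch of n-gram masking with span length n drawn ∝ 1/n, max length 3.
import random

MAX_N = 3
weights = [1.0 / n for n in range(1, MAX_N + 1)]
probs = [w / sum(weights) for w in weights]       # p(1)≈0.55, p(2)≈0.27, p(3)≈0.18

def ngram_mask(tokens, mask_rate=0.15, mask_token="[MASK]"):
    tokens = list(tokens)
    budget = max(1, int(len(tokens) * mask_rate))  # how many tokens to mask
    masked = 0
    while masked < budget:
        n = random.choices(range(1, MAX_N + 1), weights=probs)[0]
        start = random.randrange(0, max(1, len(tokens) - n + 1))
        for i in range(start, min(start + n, len(tokens))):
            tokens[i] = mask_token
        masked += n
    return tokens

print(ngram_mask("the quick brown fox jumps over the lazy dog".split()))
```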

6. Experiment Results

Note that ALBERT achieves a RACE test accuracy of 89.4%, which even tops DCMN+, an ensemble of multiple models specifically designed for such reading-comprehension tasks, by 5.3%.

What if we train for the same amount of time?

Additional training data

Removing dropout

It might be that dropout can hurt performance in large Transformer-based models.

7. Discussion

  • While ALBERT-xxlarge has fewer parameters than BERT-large and gets significantly better results, it is computationally more expensive due to its larger structure.
  • An important next step is to speed up the training and inference of ALBERT, through methods like “sparse attention” or “block attention”; orthogonal approaches such as “hard example mining” and “efficient language modeling training” could also help.
  • Although the SOP task is consistently useful, the authors hypothesize that there could be more dimensions not yet captured by the current self-supervised training losses, which could create additional representation power for the resulting representations.
