“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” Summarized

https://arxiv.org/abs/1810.04805 (2018.10.11)

1. Language Model Pre-training

Language model pre-training has been shown to be effective for improving many natural language processing tasks. There are two strategies for applying pre-trained models to specific tasks: feature-based and fine-tuning.

1. Feature-based

Feature-based approaches use task-specific architectures that include the pre-trained representations as additional features.

Previous works include:

  • Pre-train word embedding vectors by
    • left-to-right language modeling objectives
    • objectives to discriminate correct from incorrect words in left and right context
  • Pre-train sentence embeddings by
    • ranking candidate next sentences
    • left-to-right generation of next sentence words
    • denoising auto-encoder derived objectives
  • Pre-train paragraph embeddings
  • Learn context-sensitive word embeddings
    • ELMo extracts context-sensitive features. The representation of each token is the concatenation of the left-to-right and right-to-left representations.
    • Melamud et al. proposed learning contextual representations through the task of predicting a single word from both left and right context using LSTMs.

2. Fine-tuning

Fine-tuning approaches add minimal task-specific parameters to the pre-trained model and then train the whole network on the downstream task.

Left-to-right language modeling and auto-encoder objectives have been used for pre-training these models.

Before BERT, OpenAI GPT achieved state-of-the-art results on many tasks from the GLUE benchmark.

Additionally, some research has pre-trained on supervised data from tasks with large datasets, such as natural language inference and machine translation.

Proposal of BERT

ELMo and OpenAI GPT both rely on unidirectional language models to learn general language representations (ELMo only shallowly concatenates independently trained left-to-right and right-to-left models). The authors argue that this restricts the power of the pre-trained representations, so they propose BERT: Bidirectional Encoder Representations from Transformers.

BERT obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Below is an overall illustration of BERT during pre-training and fine-tuning.

2. BERT Architecture

A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

The model architecture is a multi-layer bidirectional Transformer encoder. There are two model sizes: BERT-base (110M parameters) and BERT-large (340M parameters).
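
The paper parameterizes these sizes by the number of Transformer layers (L), the hidden size (H), and the number of self-attention heads (A). A minimal sketch of the two configurations; the dictionary names here are just illustrative:

```python
# The two model sizes reported in the paper, expressed as illustrative config dicts.
# L = Transformer encoder layers, H = hidden size, A = self-attention heads.
BERT_BASE = {"num_layers": 12, "hidden_size": 768, "num_attention_heads": 12}    # ~110M parameters
BERT_LARGE = {"num_layers": 24, "hidden_size": 1024, "num_attention_heads": 16}  # ~340M parameters
```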

Input & Output

For the pre-training corpus, the authors used the BooksCorpus (800M words) and English Wikipedia (2,500M words). They used WordPiece embeddings with a 30,000-token vocabulary.

The first token of every sequence is always a special classification token ([CLS]). The final hidden state (C) corresponding to this token is used as the aggregate sequence representation for classification tasks.

Input sentence pairs are packed together into a single sequence. The authors differentiate the sentences in two ways.

  1. Separate them with a special token ([SEP])
  2. Add a learned embedding to every token indicating whether it belongs to sentence A or sentence B

Input representation = token embedding + segment embedding + position embedding
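
Below is a minimal PyTorch sketch of this sum, assuming learned position and segment embeddings; the class and variable names are illustrative, and the actual implementation additionally applies layer normalization and dropout on top of the sum.

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Token + segment + position embeddings, summed (illustrative sketch)."""
    def __init__(self, vocab_size=30000, hidden_size=768, max_len=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)
        self.segment = nn.Embedding(num_segments, hidden_size)  # sentence A vs. sentence B
        self.position = nn.Embedding(max_len, hidden_size)      # learned position embeddings

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))                     # broadcast over the batch

# Example: a packed pair "[CLS] sentence A [SEP] sentence B [SEP]" of length 7
token_ids = torch.randint(0, 30000, (1, 7))
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
embeddings = BertInputEmbeddings()(token_ids, segment_ids)      # shape (1, 7, 768)
```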

3. Pre-training BERT

Task 1: Masked LM

Previous standard conditional language models can only be trained unidirectionally (left-to-right or right-to-left), since bidirectional conditioning would allow each word to indirectly ‘see itself’.

To train a deep bidirectional representation, the authors simply mask 15% of the input tokens at random and have the model predict those masked tokens. They call this Cloze-style procedure the ‘masked LM’ (MLM).

A downside is that this creates a mismatch between pre-training and fine-tuning, since the [MASK] token never appears during fine-tuning. To mitigate this, the authors replace each chosen token with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged token 10% of the time.
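
A rough sketch of this corruption rule, assuming a list of token IDs and a placeholder mask ID; the function name and the -1 label convention are just for illustration:

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Apply the 80/10/10 corruption rule; label -1 means 'not predicted' (sketch)."""
    corrupted, labels = list(token_ids), [-1] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:          # ~85% of tokens: untouched, no target
            continue
        labels[i] = tok                           # the model must recover the original token
        r = random.random()
        if r < 0.8:                               # 80% of chosen tokens -> [MASK]
            corrupted[i] = mask_id
        elif r < 0.9:                             # 10% -> a random token
            corrupted[i] = random.randrange(vocab_size)
        # remaining 10%: keep the token unchanged
    return corrupted, labels
```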

To train the masked LM, the final hidden vectors of the Transformer encoder corresponding to the masked tokens are fed into an output softmax over the vocabulary.
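
A sketch of such a prediction head, computing the loss only at masked positions; the original implementation additionally uses a transform layer and ties the output weights to the input embeddings, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, vocab_size = 768, 30000
mlm_head = nn.Linear(hidden_size, vocab_size)             # produces the softmax over the vocabulary

hidden_states = torch.randn(1, 7, hidden_size)            # encoder output (batch, seq_len, hidden)
labels = torch.tensor([[-1, -1, 2748, -1, -1, 91, -1]])   # -1 marks positions that were not masked

logits = mlm_head(hidden_states)                          # (batch, seq_len, vocab)
loss = F.cross_entropy(logits.view(-1, vocab_size),       # loss only where labels != -1
                       labels.view(-1), ignore_index=-1)
```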

Task 2: Next Sentence Prediction (NSP)

When choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence of A, and 50% of the time it is a random sentence from the corpus.

The authors showed that pre-training towards this task is very beneficial to QA (Question Answering) and NLI (Natural Language Inference).

To train NSP, C is used to output the probability that the input consists of two consecutive sentences.
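
A sketch covering both sides of NSP: building (sentence A, sentence B, is_next) examples and a two-way classifier on top of C. The function name and the 768-dimensional hidden size are illustrative:

```python
import random
import torch
import torch.nn as nn

def make_nsp_example(doc_sentences, corpus_sentences, idx):
    """Return one (sentence A, sentence B, is_next) training example (sketch)."""
    a = doc_sentences[idx]
    if random.random() < 0.5:
        return a, doc_sentences[idx + 1], 1           # B really follows A
    return a, random.choice(corpus_sentences), 0      # B is a random corpus sentence

nsp_head = nn.Linear(768, 2)                          # operates on C, the final [CLS] hidden state
C = torch.randn(1, 768)
is_next_prob = torch.softmax(nsp_head(C), dim=-1)     # probability that the sentences are consecutive
```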

4. Fine-tuning BERT: Experiments

For applications involving text pairs, a common pattern was to independently encode text pairs before applying bidirectional cross attention. BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences.

1. GLUE (General Language Understanding Evaluation)

For GLUE tasks, we feed C into a linear classification layer to output class probabilities.
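
A sketch of that head: a single linear layer over C, with a softmax giving the class probabilities. The label count below is arbitrary and only for illustration:

```python
import torch
import torch.nn as nn

num_labels = 3                            # e.g. a three-way classification task; illustrative only
classifier = nn.Linear(768, num_labels)   # the only new parameters introduced for the GLUE task

C = torch.randn(8, 768)                   # a batch of [CLS] representations from the encoder
probs = torch.softmax(classifier(C), dim=-1)
```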

2. SQuAD (Stanford Question Answering Dataset) v1.1

Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span (start and end positions) in the passage.

The authors introduce only a start vector $S$ and an end vector $E$ during fine-tuning. The score of a candidate span from position $i$ to position $j$ is defined as $S \cdot T_i + E \cdot T_j$, where $T_i$ is the output vector of the encoder at position $i$. The maximum-scoring span with $j \geq i$ is used as the prediction.
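
A sketch of this span scoring with random stand-ins for the encoder outputs; the variable names follow the paper's $S$, $E$, and $T$ notation, and the masking trick for enforcing $j \geq i$ is just one way to do it:

```python
import torch

hidden_size, seq_len = 768, 20
T = torch.randn(seq_len, hidden_size)    # encoder output vectors T_i for the passage tokens
S = torch.randn(hidden_size)             # start vector, learned during fine-tuning
E = torch.randn(hidden_size)             # end vector, learned during fine-tuning

start_scores = T @ S                     # S . T_i for every position i
end_scores = T @ E                       # E . T_j for every position j

# score(i, j) = S . T_i + E . T_j, restricted to valid spans with j >= i
span_scores = start_scores[:, None] + end_scores[None, :]
valid = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool))
span_scores = span_scores.masked_fill(~valid, float("-inf"))

best = int(torch.argmax(span_scores))
start, end = divmod(best, seq_len)       # predicted answer span (start, end)
```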

3. SQuAD v2.0

The authors treated questions that do not have an answer as having an answer span with start and end at the [CLS] token.

For prediction, we compare the score of the no-answer span, $s_{null} = S \cdot C + E \cdot C$, to the score of the best non-null span; a non-null answer is predicted only when the best span score exceeds $s_{null}$ by a threshold $\tau$ chosen on the dev set to maximize F1.
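
A short sketch of that comparison, assuming the start/end vectors and best span score from the v1.1 procedure above; the threshold value here is a placeholder:

```python
import torch

hidden_size = 768
S, E = torch.randn(hidden_size), torch.randn(hidden_size)  # start / end vectors as in v1.1
C = torch.randn(hidden_size)                               # final hidden state of the [CLS] token

s_null = float(S @ C + E @ C)      # score of the no-answer span
s_best = 1.7                       # best non-null span score from the v1.1 scoring procedure
tau = 0.0                          # threshold chosen on the dev set to maximize F1
predict_answer = s_best > s_null + tau
```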

4. SWAG (Situations With Adversarial Generations)

Given a sentence, the task is to choose the most plausible continuation among four choices.

When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B).

A task-specific vector is introduced whose dot product with $C$ gives a score for each choice; the four scores are normalized with a softmax.
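
A sketch of this scoring with stand-in encoder outputs; the parameter and variable names are illustrative:

```python
import torch
import torch.nn as nn

hidden_size, num_choices = 768, 4
scoring_vector = nn.Parameter(torch.randn(hidden_size))  # the only new task-specific parameter

# C for each of the four packed (sentence A, candidate continuation) sequences;
# random tensors stand in for real encoder outputs here
C = torch.randn(num_choices, hidden_size)

scores = C @ scoring_vector                   # one score per candidate continuation
probs = torch.softmax(scores, dim=-1)         # normalized over the four choices
prediction = int(torch.argmax(probs))
```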

5. Ablation Studies

1. Effect of Bidirectionality & NSP

Removing NSP, or replacing the masked LM with a left-to-right model, hurts downstream performance; the deep bidirectional masked-LM objective accounts for the larger share of BERT's gains.

2. Effect of Model Size

Scaling to extreme model sizes leads to large improvements even on very small-scale tasks, provided that the model has been sufficiently pre-trained. This can be observed by comparing BERT-base and BERT-large across the various tasks.

3. Feature-based vs Fine-tuning

BERT is effective for both the fine-tuning and feature-based approaches, though fine-tuning performs slightly better.
