“Language Models are Few-Shot Learners” Summarized


https://arxiv.org/abs/2005.14165 (2020/07/28)

1. Fine-Tuning vs Few-Shot Learning

Limitations of fine-tuning

Recently, transformer language models that are pre-trained and then directly fine-tuned have attained SOTA on many NLP tasks. While the architecture is task-agnostic, this approach still requires task-specific datasets and task-specific fine-tuning. The authors argue that this limitation should be removed, for three reasons:

  1. From a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models.

  2. The potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. Thus the performance on specific benchmarks may be exaggerated.

  3. Humans do not require large supervised datasets to learn most language tasks; we only need a brief directive in natural language to perform a new task. We would someday like our NLP systems to have this same fluidity and generality.

Meta-learning

One potential route towards addressing these limitations of fine-tuning is meta-learning: the model develops a broad set of skills and pattern-recognition abilities at training time, then uses those abilities at inference time to rapidly adapt to or recognize the desired task. Meta-learning can be pictured as in the illustration below.

There is an outer loop, which is the standard gradient-update procedure for neural networks, and an inner loop, called ‘in-context learning’. In the inner loop, the model is conditioned on a natural-language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next. If only an instruction is given, this is called ‘zero-shot learning’; when an instruction and one demonstration are given, it is called ‘one-shot learning’; and when an instruction and multiple (10~100) demonstrations are given, it is called ‘few-shot learning’.

To summarize the concepts:

Meta-learning: learning a broad set of skills and pattern-recognition abilities at training time; composed of an outer loop and an inner loop

  • Outer loop: the standard weight-update procedure, which implicitly learns the ability to adapt to tasks defined at inference time
  • Inner loop (in-context learning): adapting to and answering a specific task given only instructions/examples at inference time; one of the three settings below (each fed to the model as a sequence of)
    • zero-shot learning: instruction, input
    • one-shot learning: instruction, 1 demonstration, input
    • few-shot learning: instruction, a few demonstrations, input

The image below gives example inputs for zero-/one-/few-shot learning.

Basically, we just feed a text instruction to the model as input, much like the instruction we would give a human when asking them to do a task. The pre-trained model then figures out what the task is and completes the sequence with an appropriate answer.
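To make the three settings concrete, here is a minimal sketch of what the raw text fed to the model might look like for an English-to-French translation task; the exact wording and example pairs are illustrative, not copied from the paper’s figure:

```python
# Illustrative raw-text inputs for the three in-context learning settings.
# The task wording and example pairs are assumptions for illustration;
# the model simply receives one of these strings and predicts what comes next.

zero_shot = (
    "Translate English to French:\n"
    "cheese =>"
)

one_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)

few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)
```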

The advantages of in-context learning over fine-tuning are:

  1. We do not need any task-specific data.
  2. The pre-trained model does not need to be fine-tuned. It can be used directly.
  3. It is less likely to learn an overly narrow distribution or spurious patterns from a narrow fine-tuning dataset.

The disadvantage is that results from in-context learning have so far been much worse than those of SOTA fine-tuned models.

GPT-3 for meta-learning

The authors trained ‘GPT-3’, which meta-learned on an enormous corpus drawn from the Internet. GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot setting it is sometimes competitive with, or even occasionally surpasses, the SOTA achieved with fine-tuning.

Additionally, the authors observed that for most tasks, performance scales relatively smoothly with model capacity (size) in the zero/one/few-shot settings, which suggests that larger models are more proficient meta-learners.

2. GPT-3: Architecture

GPT-3 has the same model and architecture as GPT-2, which is based on the transformer decoder, with the exception that it uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
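As a rough sketch of what “alternating dense and locally banded sparse attention” means, the masks below let every position attend to all earlier positions in one layer and only to a local window of recent positions in the next. The window size and the strict alternation are assumptions for illustration (the Sparse Transformer also uses strided patterns):

```python
import numpy as np

def causal_dense_mask(seq_len):
    """Every position may attend to itself and all earlier positions."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def causal_banded_mask(seq_len, window):
    """Every position may attend only to the `window` most recent positions (itself included)."""
    dense = causal_dense_mask(seq_len)
    too_far = np.tri(seq_len, seq_len, k=-window, dtype=bool)  # True where a position is `window` or more steps back
    return dense & ~too_far

# Alternate the two patterns across layers (the exact alternation is assumed for illustration).
seq_len, window, n_layers = 8, 3, 4
masks = [causal_dense_mask(seq_len) if layer % 2 == 0 else causal_banded_mask(seq_len, window)
         for layer in range(n_layers)]
print(masks[1].astype(int))  # banded layer: a narrow band of ones below the diagonal
```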

3. Dataset & Contamination

The authors used the Common Crawl dataset and took several steps to improve its quality:

  1. filtered the dataset based on similarity to a range of high-quality reference corpora (a minimal sketch of this idea follows the list)
  2. applied fuzzy deduplication at the document level
  3. added known high-quality reference corpora to the training mix
  4. searched for and attempted to remove any overlaps with the development and test sets of all benchmarks
  5. when training, sampled datasets in proportion to their quality
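A minimal sketch of the similarity-based filtering in step 1, assuming a simple tf-idf + logistic-regression classifier; the paper’s actual features, classifier, and probabilistic keep rule are more involved:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the two pools of documents (assumed for illustration).
high_quality_docs = ["a paragraph taken from a curated, high-quality reference corpus ..."]
common_crawl_docs = ["a raw web page scraped from Common Crawl, boilerplate and all ..."]

# Train a classifier to distinguish reference-corpus text from raw Common Crawl text.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(high_quality_docs + common_crawl_docs)
y = [1] * len(high_quality_docs) + [0] * len(common_crawl_docs)
clf = LogisticRegression().fit(X, y)

def keep(doc, threshold=0.5):
    """Keep a crawled document only if it 'looks like' the high-quality references."""
    score = clf.predict_proba(vectorizer.transform([doc]))[0, 1]
    return score >= threshold
```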

Contamination

Since the training dataset is sourced from the Internet, the model may have been trained on parts of the benchmark test sets, in which case test scores would be overestimated. This is called the data contamination problem.

The authors produced a ‘clean’ version of each benchmark test set that removes all potentially leaked examples, evaluated GPT-3 on these clean benchmarks, and compared the results to the original scores. They concluded that contamination had little effect on performance.
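A minimal sketch of how such leakage can be flagged, assuming a simple n-gram overlap check between benchmark examples and the training corpus; the paper’s actual procedure and n-gram length may differ:

```python
# Flag potentially contaminated benchmark examples by n-gram overlap with training data.
def ngrams(text, n=13):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_train_index(train_docs, n=13):
    index = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index

def is_potentially_leaked(example_text, train_index, n=13):
    """Flag a benchmark example if any of its n-grams also appears in the training data."""
    return bool(ngrams(example_text, n) & train_index)

def clean_benchmark(examples, train_index):
    """A 'clean' benchmark keeps only the examples that were not flagged."""
    return [ex for ex in examples if not is_potentially_leaked(ex, train_index)]
```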

4. Evaluation

  • For few-shot learning: the authors evaluated each example in the evaluation set by randomly drawing $K$ examples from that task’s training set as conditioning, delimited by 1 or 2 newlines depending on the task.
  • On tasks that involve choosing one correct completion from several options: the authors provided $K$ examples of context plus correct completion, followed by one example of context only, and compared the LM likelihood of each completion (a sketch of this procedure follows the list).
  • On tasks with free-form completion: the authors used beam search.
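A minimal sketch of this evaluation procedure, assuming a hypothetical `log_likelihood(prompt, completion)` scoring function (e.g. the sum of token log-probabilities of the completion given the prompt); delimiters and any likelihood normalization are simplified:

```python
import random

def build_kshot_prompt(train_examples, test_context, k, delimiter="\n\n"):
    """Randomly draw K (context, completion) training examples as conditioning,
    then append the test context to be completed."""
    demos = random.sample(train_examples, k)
    demo_text = delimiter.join(f"{ctx} {completion}" for ctx, completion in demos)
    return demo_text + delimiter + test_context

def pick_completion(prompt, options, log_likelihood):
    """Multiple-choice scoring: return the option the model assigns the highest likelihood to."""
    return max(options, key=lambda option: log_likelihood(prompt, option))
```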

5. Results

1. Language Modeling

Penn Tree Bank (PTB)

2. Completion Tasks

LAMBADA

“predict the last word of sentences”

It tests the modeling of long-range dependencies in text.

A standard language model assigns probability not only to the single correct ending word but also to other valid continuations of the paragraph. The few-shot setting, on the other hand, lets the language model infer from the examples that a completion of exactly one word is desired.
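Concretely, the paper formats LAMBADA as a fill-in-the-blank task in the few-shot setting, so the demonstrations themselves signal that exactly one word is wanted. A sketch of such a prompt (the final passage is made up for illustration):

```python
# Few-shot LAMBADA-style prompt: the blank-and-arrow format of the demonstrations
# tells the model that a single-word completion is expected.
prompt = (
    "Alice was friends with Bob. Alice went to visit her friend ____. -> Bob\n\n"
    "George bought some baseball equipment, a ball, a glove, and a ____. -> bat\n\n"
    "She lit the old lantern, and shadows danced on the walls of the ____. ->"
)
```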

HellaSwag

“pick the best ending to a story or set of instructions”

StoryCloze

“select the correct ending sentence for five-sentence-long stories”

3. Closed Book Question Answering

  • Open Book QA: an information retrieval system finds relevant text, and the model learns to generate an answer given the question and the retrieved text
  • Closed Book QA: the information is embedded in the pre-trained parameters; the model answers questions without conditioning on any auxiliary information
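A rough sketch of the two setups, with `retrieve` and `generate` as hypothetical stand-ins for a retrieval system and the language model:

```python
def open_book_qa(question, retrieve, generate):
    """Open-book: retrieve supporting text first, then condition the model on it."""
    passages = retrieve(question)  # hypothetical information-retrieval system
    prompt = "\n".join(passages) + f"\nQ: {question}\nA:"
    return generate(prompt)

def closed_book_qa(question, generate):
    """Closed-book: no retrieval; the answer must come from the model's own parameters."""
    return generate(f"Q: {question}\nA:")
```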

4. Translation

GPT-3 learns from a blend of training data that mixes many languages together in a natural way, combining them at the word, sentence, and document level, although the training data is 93% English.

GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction. This could be due to the byte-level BPE tokenizer, which was developed for an almost entirely English dataset.
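As a rough illustration of this hypothesis: GPT-3 reuses GPT-2’s byte-level BPE vocabulary, and that tokenizer (available, for example, through the `tiktoken` package) usually splits non-English text into noticeably more tokens per word, so non-English sentences are modeled less efficiently. Exact token counts depend on the text:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # byte-level BPE built from a mostly English corpus

english = "The quick brown fox jumps over the lazy dog."
french = "Le renard brun et rapide saute par-dessus le chien paresseux."

# The French sentence typically ends up as more BPE tokens than the English one,
# even though both say roughly the same thing.
print(len(enc.encode(english)), len(enc.encode(french)))
```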

5. Winograd-Style Tasks

“determine which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human”

6. Common Sense Reasoning

(physical or scientific reasoning)

7. Reading Comprehension

8. SuperGLUE

9. NLI (Natural Language Inference)

It tests the ability to understand the relationship between two sentences. SuperGLUE includes an NLI dataset, RTE.

ANLI is a difficult dataset employing a series of adversarially mined natural language inference questions in three rounds.

10. Arithmetic

11. Word Scrambling and Manipulation Tasks

“recover the original word from a word which is distorted by scrambling, addition, or deletion of characters”

Succeeding at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure.

12. SAT Analogies

The average score among college applicants was 57%.

13. News Article Generation

The authors employed GPT-3’s few-shot learning abilities by providing three previous news articles in the model’s context to condition it.

14. Learning and Using Novel Words

15. Correcting English Grammar

6. Limitations

  1. Text generation fluency

    GPT-3 generated samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs.

  2. Common sense physics

    GPT-3 seems to have special difficulty with ‘common sense physics’. It does not answer questions like “If I put cheese into the fridge, will it melt?” well.

  3. Comparison tasks

    GPT-3 doesn’t do well on ‘comparison’ tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another, as well as on a subset of reading comprehension tasks.

  4. Limits of the pretraining objective

    The current objective weights every token equally and lacks a notion of which tokens are most important to predict and which are less important. Useful language systems might be better thought of as taking goal-directed actions rather than just making predictions.

  5. Lack of a large amount of context about the world

    Promising future directions might include learning the objective function from humans, fine-tuning with reinforcement learning, or adding additional modalities such as images.

  6. Poor sample efficiency during pre-training

  7. Ambiguity of few-shot learning

    We do not know whether few-shot learning actually learns new tasks ‘from scratch’ at inference time, or whether it simply recognizes and identifies tasks that it has learned during training. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pre-training.

  8. Uninterpretability, higher variance of predictions on novel inputs than humans, and biases from the training data

7. Social Impacts

  1. Misuse of Language Models

    Language models that produce high-quality text could lower existing barriers to carrying out threat/misuse activities and increase their efficacy.

    Language models that are sufficiently consistent and steerable will be of greater interest to malicious actors.

  2. Bias

    Gender, race, and religion biases exist in GPT-3, since it was trained on an Internet corpus which includes such biases. Removing these biases should be approached in a holistic manner.

  3. Energy Usage

    Pre-training GPT-3 175B consumed several thousand petaflop/s-days of compute. Nevertheless, such models can be surprisingly efficient once trained: even with the full GPT-3 175B, generating 100 pages of content costs on the order of 0.4 kWh, or only a few cents in energy costs. Additionally, techniques like model distillation can bring the cost down further.
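    As a quick sanity check on the “few cents” figure (assuming a typical retail electricity price of roughly \$0.10–0.13 per kWh): $0.4\ \text{kWh} \times \$0.12/\text{kWh} \approx \$0.05$.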

8. Related Work

The paper situates GPT-3 among several lines of related work:

  • Increasing parameter count and/or computation in language models as a means to improve generative or task performance
  • Attempts to preserve strong performance in language models that are as small as possible
  • Studies of the effect of scale on language model performance, which found a smooth power-law trend in loss as autoregressive language models are scaled up
  • Constructing more difficult or open-ended NLP tasks, especially for question answering
  • Meta-learning
  • Few-shot learning with gradient descent and semi-supervised learning
  • The notion of presenting tasks in natural language
  • Multi-task learning to increase generality and transfer-learning capability in language
  • Other algorithmic/architectural innovations
