Sunday, December 4, 2022

[Paper Review - NLP] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019)

 Abstract

Introduces BERT (Bidirectional Encoder Representations from Transformers); bidirectional means jointly conditioning on both left and right context in all layers, which yields state-of-the-art models for a wide range of tasks

1. Introduction

i) Language model: Pre-training has been shown to be effective for improving many NLP tasks

ii) 2 strategies for applying pre-trained representations: feature-based (the pre-trained representations are used as additional input features to a task-specific model) or fine-tuning (all pre-trained parameters are updated on the downstream task)

ELMo: uses task-specific architectures that include the pre-trained representations as additional features

GPT: is trained on the downstream tasks by simply fine-tuning all pre-trained parameters

Both use the same objective function during pre-training, and both are unidirectional language models

Unidirectionality is the major limitation and is especially harmful for token-level tasks such as question answering, which need context from both directions!

Thus, a bidirectional model is needed - BERT!

i) It alleviates unidirectionality constraint by using a Masked Language Model (MLM) pre-training objective

ii) MLM: randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary ID of the masked word based only on its context

iii) Next Sentence Prediction (NSP): jointly pre-trains text-pair representations

Paper Contribution

i) demonstrates the importance of bidirectional pre-training for language representations, in contrast to GPT (unidirectional) and ELMo (a shallow concatenation of independently trained left-to-right and right-to-left LMs)

ii) BERT is the first fine-tuning-based representation model to outperform many task-specific architectures.

iii) SOTA for 11 NLP tasks

2. Related work

i) Unsupervised feature-based approaches: ELMo - generalizes traditional word embedding research along a different dimension; context-sensitive features are extracted from a left-to-right and a right-to-left language model

ii) Unsupervised fine-tuning approaches: OpenAI GPT - sentence or document encoders that produce contextual token representations are pre-trained on unlabeled text and then fine-tuned on supervised downstream tasks

iii) Transfer learning from supervised data: in computer vision, transfer from supervised tasks with large datasets (e.g., ImageNet pre-training) has proven effective

3. BERT

i) pre-training: training on unlabeled data over different pre-training tasks

ii) fine-tuning: the model is first initialized with the pre-trained parameters, and all parameters are then fine-tuned using labeled data from the downstream tasks

iii) Model architecture: Transformer encoder

BERT BASE (L = 12, H = 768, A = 12, total parameters = 110M) - chosen to have the same model size as GPT for comparison

BERT LARGE (L = 24, H = 1024, A = 16, total parameters = 340M)

Original Transformer (L = 6, H = 512, A = 8), for reference
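A quick back-of-the-envelope check of the parameter counts above (my own sketch, not from the paper), counting only the embedding matrices and the per-layer attention and feed-forward weights; biases, LayerNorms, and the pooler are ignored, which is why the totals land slightly below the official 110M/340M figures:

```python
# Rough parameter estimate from L (layers) and H (hidden size).
# Assumes a ~30K WordPiece vocabulary, 512 positions, 2 segment types,
# and a feed-forward inner size of 4H, as in the original Transformer.

def approx_params(L, H, vocab=30522, max_pos=512, segments=2):
    embeddings = (vocab + max_pos + segments) * H   # token + position + segment
    attention = 4 * H * H                           # Q, K, V and output projections
    feed_forward = 2 * H * (4 * H)                  # H -> 4H -> H
    return embeddings + L * (attention + feed_forward)

print(f"BERT-base  ~{approx_params(12, 768) / 1e6:.0f}M")   # ~109M
print(f"BERT-large ~{approx_params(24, 1024) / 1e6:.0f}M")  # ~334M
```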

iv) Input/output representations

a) input - a single sentence (an arbitrary span of contiguous text) or a sentence pair / WordPiece embeddings (30,000-token vocabulary)

b) special tokens: [CLS] - its final hidden state serves as the aggregate sequence representation for classification / [SEP] - separates the two sentences, which are additionally distinguished by a learned segment embedding
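A minimal sketch (my own illustration, not code from the paper) of how a single sentence or a sentence pair is packed into one input sequence with [CLS], [SEP], and segment IDs:

```python
def pack_pair(tokens_a, tokens_b=None):
    """Build the token sequence and segment IDs for one BERT input."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                 # sentence A -> segment 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)    # sentence B -> segment 1
    return tokens, segment_ids

tokens, segments = pack_pair(["the", "dog", "barked"], ["it", "was", "loud"])
# tokens:   [CLS] the dog barked [SEP] it was loud [SEP]
# segments:   0    0    0    0     0    1   1   1    1
```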

Pre-training BERT

Task 1: Masked LM

i) simply mask 15% of the input tokens at random, and then predict those masked tokens

ii) of the selected tokens: [MASK] token 80% + random token 10% + unchanged original 10% --> avoids the mismatch between pre-training and fine-tuning ([MASK] never appears during fine-tuning)
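A sketch of the 80/10/10 masking rule (illustrative only; the actual pre-training code also skips special tokens and caps the number of predictions per sequence):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:             # pick ~15% of positions
            labels[i] = tok                         # target: the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"                # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)    # 10%: random token
            # else 10%: keep the original token unchanged
    return inputs, labels
```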

Task 2: Next Sentence Prediction (NSP)

i) Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences

ii) 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext)
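A sketch of how IsNext/NotNext training pairs could be drawn from a corpus (my own illustration; `docs` is a hypothetical list of documents, each a list of sentences):

```python
import random

def make_nsp_example(docs):
    doc = random.choice(docs)                       # assumes >= 2 sentences per doc
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], "IsNext"         # 50%: the true next sentence
    sent_b = random.choice(random.choice(docs))     # 50%: a random sentence
    return sent_a, sent_b, "NotNext"
```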

Pre-training data: BooksCorpus (800M words) + English Wikipedia (2,500M words).

Fine-tuning BERT: simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end
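A minimal fine-tuning sketch in PyTorch (my own illustration): a single linear classifier on the final [CLS] hidden state, with `bert` standing in for any encoder that returns hidden states of shape (batch, seq_len, hidden_size); because the encoder is a submodule, all of its parameters are updated end-to-end:

```python
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, bert, hidden_size=768, num_labels=2):
        super().__init__()
        self.bert = bert                             # pre-trained encoder
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, segment_ids):
        hidden = self.bert(input_ids, segment_ids)   # (batch, seq_len, hidden)
        cls = hidden[:, 0]                           # [CLS] representation
        return self.classifier(cls)                  # task logits
```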

4. Experiments

GLUE (The General Language Understanding Evaluation benchmark)

a) a collection of diverse natural language understanding tasks - a model should perform well across many tasks, not just one

b) MNLI (Multi-Genre Natural Language Inference): entailment classification task

c) QQP (Quora Question Pairs): a task that checks whether question pairs are semantically similar to each other

d) QNLI (Question Natural Language Inference): a binary classification version of SQuAD - decide whether a paragraph contains the answer to a question

e) SST-2 (Stanford Sentiment Treebank): single-sentence binary sentiment classification on sentences extracted from movie reviews

f) CoLA(Corpus of Linguistic Acceptability): A binary classification task to check whether the English sentence is linguistically acceptable

g) STS-B (Semantic Textual Similarity Benchmark): How similar pairs of sentences are

h) MRPC (Microsoft Research Paraphrase Corpus): whether pairs of sentences are semantically equivalent

i) RTE(Recognizing Textual Entailment): similar to MNLI, but less data

j) WNLI(Winograd NLI): excluded in BERT models because of some evaluation-related issues

BERT > GPT, BERT LARGE > BERT BASE

SQuAD v1.1: The Stanford Question Answering Dataset

i) input question and passage as a single packed sequence, with the question using the A embedding and the passage using the B embedding

ii) training objective is the sum of the log-likelihoods of the correct start and end positions
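A sketch of that objective (my own illustration): learned start/end vectors S and E are dotted with every token representation, and the loss is the sum of the cross-entropies (negative log-likelihoods) of the true start and end positions:

```python
import torch.nn.functional as F

def span_loss(hidden, S, E, start_pos, end_pos):
    # hidden: (batch, seq_len, H); S, E: (H,); start_pos, end_pos: (batch,)
    start_logits = hidden @ S                        # (batch, seq_len)
    end_logits = hidden @ E
    return F.cross_entropy(start_logits, start_pos) + \
           F.cross_entropy(end_logits, end_pos)
```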

SQuAD v2.0

i) SQuAD v1.1 + more than 50,000 unanswerable questions

ii) the task is no longer limited to questions whose answers appear in the passage, which makes it more difficult

SWAG (The Situations With Adversarial Generations dataset)

i) evaluate grounded commonsense inference

ii) Given a sentence, the task is to choose the most plausible continuation among four choices

iii) four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B) ⇒ NSP-style sentence-pair input
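A scoring sketch (my own illustration): each of the four packed sequences yields a [CLS] vector, a task-specific vector V maps each to a scalar score, and a softmax over the four scores selects the continuation:

```python
import torch
import torch.nn.functional as F

def swag_choice(cls_vectors, V):
    # cls_vectors: (batch, 4, H) -- [CLS] states of the four packed sequences
    # V: (H,) -- scoring vector learned during fine-tuning
    logits = cls_vectors @ V                         # (batch, 4)
    return F.softmax(logits, dim=-1).argmax(dim=-1)  # index of the chosen ending
```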

5. Ablation Studies

Effect of pre-training tasks

i) No NSP ⇒ hurts performance on QNLI, MNLI, and SQuAD 1.1

ii) LTR (rather than MLM) & No NSP ⇒ directly comparable to GPT, but with the larger training dataset, input representation, and fine-tuning scheme used for BERT

--> LTR worse than MLM

--> For SQuAD, LTR performs poorly

--> adding a randomly initialized BiLSTM on top improves the LTR model's results on SQuAD, but hurts performance on the other (GLUE) tasks

--> it is possible to train separate LTR and RTL models and represent each token as the concatenation of the two, as ELMo does, BUT: 1) it is twice as expensive as a single bidirectional model, 2) it is non-intuitive for tasks like QA (the RTL model cannot condition the answer on the question), and 3) it is strictly less powerful than a deep bidirectional model, which can use both left and right context at every layer

Effect of model size: training a number of BERT models with a differing number of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training procedure

--> larger models lead to a strict accuracy improvement across all four datasets

--> scaling to extreme model sizes leads to large improvements even on very small-scale tasks (e.g., MRPC with only 3,600 labeled training examples), provided that the model has been sufficiently pre-trained

Feature-based approach with BERT: fixed features are extracted from the pre-trained model without fine-tuning any of its parameters

BERT is effective for both the fine-tuning and feature-based approaches!
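A feature-extraction sketch (my own illustration): with BERT frozen, the paper's best-performing variant concatenates the token representations from the top four hidden layers and feeds them to a randomly initialized BiLSTM before the classifier:

```python
import torch

def extract_features(all_layers):
    # all_layers: list of (batch, seq_len, H) tensors, one per encoder layer
    return torch.cat(all_layers[-4:], dim=-1)        # (batch, seq_len, 4H)
```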

Conclusion

i) empirical improvements due to transfer learning with language models --> rich, unsupervised pre-training is an integral part of many language understanding systems, enabling even low-resource tasks to benefit from deep unidirectional architectures

ii) the paper's major contribution is generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks
