[PAPER REVIEW 231113] ELECTRA(2020)
Stanford University, Google Brain
ICLR 2020 conference
0. MLM
1) mask out 15% of the input tokens
2) train the model to recover the original tokens (sketch below)
→ It can learn bidirectional representations
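*rough sketch (mine, not from the paper): PyTorch-style code for the MLM objective above; `model`, `mask_id`, and the random masking scheme are placeholder assumptions
```python
import torch
import torch.nn.functional as F

def mlm_step(model, input_ids, mask_id, mask_prob=0.15):
    """One MLM step: mask ~15% of the tokens, train to recover the originals."""
    mask = torch.rand(input_ids.shape) < mask_prob   # pick ~15% of positions
    corrupted = input_ids.clone()
    corrupted[mask] = mask_id                        # replace them with [MASK]
    logits = model(corrupted)                        # (batch, seq_len, vocab_size)
    # the loss covers only the masked positions -> learns from ~15% of the tokens
    return F.cross_entropy(logits[mask], input_ids[mask])
```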
1. A Problem of MLM
substantial compute cost
→ because the model only learns from the 15% of tokens that are masked in each example
2. Replaced token detection
1) replace some tokens of the input with plausible alternatives
→ solves BERT's pre-train/fine-tune mismatch (the model never sees [MASK] tokens at fine-tuning time)
2) train the model to predict, for every token, whether it is replaced or not (as a discriminator)
→ computationally efficient, because it learns from all input tokens (sketch below)
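*rough sketch of the replaced-token-detection loss, same placeholder style as above (`discriminator` is assumed to return one score per token)
```python
import torch
import torch.nn.functional as F

def rtd_step(discriminator, original_ids, corrupted_ids):
    """Replaced token detection: for EVERY token, predict whether it was replaced."""
    labels = (corrupted_ids != original_ids).float()   # 1 = replaced, 0 = original
    logits = discriminator(corrupted_ids)              # (batch, seq_len) per-token scores
    # binary loss over all positions -> the model learns from every input token
    return F.binary_cross_entropy_with_logits(logits, labels)
```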
3. ELECTRA
Efficiently Learning an Encoder that Classifies Token Replacements Accurately
- pre-train Transformer text encoders
- consists of 2 networks
1) a small generator network : a masked LM that corrupts the input by replacing some tokens with sampled alternatives (so the discriminator never sees [MASK])
2) a discriminator network : predicts, for each token, whether it was replaced by the generator or not
*after pre-training, we throw out the generator and only fine-tune the discriminator
*Why is the generator small? : to keep the extra compute cost low
- generator : softmax over the vocabulary / discriminator : per-token sigmoid
- minimize the combined loss: generator MLM loss + λ × discriminator loss (λ = 50; sketch after this list)
- 2 networks share embeddings (token + positional)
- works best when the generator is 1/4–1/2 the size of the discriminator
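*rough end-to-end sketch of one pre-training step, same placeholder names as above; the λ = 50 weight is the value used in the paper, everything else is my own pseudocode
```python
import torch
import torch.nn.functional as F

def electra_step(generator, discriminator, input_ids, mask_id,
                 mask_prob=0.15, disc_weight=50.0):
    """Combined ELECTRA objective: generator MLM loss + λ * discriminator loss."""
    mask = torch.rand(input_ids.shape) < mask_prob
    masked = input_ids.clone()
    masked[mask] = mask_id

    # small generator (softmax over the vocabulary), trained as a normal MLM
    gen_logits = generator(masked)                           # (B, S, vocab)
    mlm_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])

    # build the corrupted input by sampling replacements for the masked positions
    with torch.no_grad():                                    # no gradient flows through sampling
        probs = F.softmax(gen_logits[mask], dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
    corrupted = input_ids.clone()
    corrupted[mask] = sampled

    # discriminator (per-token sigmoid): was each token replaced or not?
    disc_labels = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted)                   # (B, S)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, disc_labels)

    # token + positional embeddings are shared between the two networks (not shown)
    return mlm_loss + disc_weight * disc_loss
```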
4. Differences from the discriminator of a GAN
1) If the generator happens to generate the correct (original) token, that token is labeled ‘real’ (tiny example below)
2) The generator is trained with maximum likelihood, not adversarially
→ because applying GANs to text is difficult (it is impossible to back-propagate through the discrete sampling of tokens)
3) We don’t supply the generator with a noise vector as input
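*tiny made-up example of point 1), reusing the labeling rule from the sketch above
```python
import torch

original  = torch.tensor([12, 55, 78, 91])
corrupted = torch.tensor([12, 40, 78, 91])   # generator replaced position 1 but happened to
                                             # sample the original token at the other positions
labels = (corrupted != original).float()     # tensor([0., 1., 0., 0.])
# a correctly sampled token gets label 0 ("real"), whereas a GAN discriminator
# would treat every generator output as "fake"
```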
5. Results
evaluated on GLUE and SQuAD
ELECTRA outperforms MLM-based models (BERT, XLNet) given the same model size, data, and compute
ELECTRA-Small outperforms a comparably small BERT and even the much larger GPT
ELECTRA-Large performs comparably to RoBERTa, XLNet and ALBERT
- while having fewer parameters
- and using less than 1/4 of their training compute
- with comparable compute it sets a new SOTA on SQuAD 2.0
→ It works well at both small and large scale
→ The method is more parameter-efficient and compute-efficient than MLM pre-training
6. Efficiency analysis
a large part of ELECTRA's improvement comes from learning from all input tokens
a smaller part comes from alleviating the pre-train/fine-tune mismatch caused by [MASK] tokens
7. Related Work
1) self-supervised learning : learn word representations and contextual representations
2) BERT : pre-trains a large Transformer
- There have been numerous extensions
- (MASS, UniLM, ERNIE, SpanBERT, TinyBERT, MobileBERT)
3) SpanBERT : masks out contiguous sequences of tokens
→ these extensions keep the MLM objective, whereas ELECTRA changes the pre-training task itself (replaced token detection)