
[PAPER REVIEW 231113] ELECTRA (2020)

Sungyeon Kim 2023. 12. 20. 19:04

Stanford University, Google Brain

ICLR 2020 conference

 

0. MLM

1) mask 15% of the input

2) train to recover the original input

→ It can learn bidirectional representations (see the sketch below)
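
A minimal sketch of this objective in PyTorch (a toy setup, not the actual BERT code; the vocabulary size, mask id, and the tiny stand-in encoder are all made up):

import torch
import torch.nn as nn

vocab_size, mask_id = 1000, 0                       # hypothetical vocabulary and [MASK] id
tokens = torch.randint(1, vocab_size, (8, 128))     # fake batch of token ids

mask = torch.rand(tokens.shape) < 0.15              # 1) mask 15% of the input
masked_input = tokens.masked_fill(mask, mask_id)

encoder = nn.Sequential(nn.Embedding(vocab_size, 64),
                        nn.Linear(64, vocab_size))  # stand-in for a Transformer encoder

logits = encoder(masked_input)                      # (batch, seq_len, vocab_size)
# 2) recover the original input; the loss only covers the ~15% masked positions
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])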

 

1. A Problem of MLM

substantial compute cost

→ because the model only learns from 15% of the tokens per example

 

2. Replaced token detection

1) replace some tokens of the input

   → solves BERT's pre-train / fine-tune mismatch ([MASK] tokens appear in pre-training but never in downstream fine-tuning)

 

2) train to predict, for every token, whether it is replaced or not (as a discriminator)

   computationally efficient (because it learns from all input tokens); see the sketch below
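
A rough sketch of the replaced-token-detection loss (a toy example that swaps in random tokens instead of generator samples; all sizes and names are hypothetical). Every position gets a binary label, so the loss gives a training signal for all input tokens, not just the ~15% that were corrupted:

import torch
import torch.nn as nn

original = torch.randint(1, 1000, (8, 128))                     # fake batch of token ids
replaced = torch.rand(original.shape) < 0.15                    # positions whose tokens get swapped
corrupted = torch.where(replaced,
                        torch.randint_like(original, 1, 1000),  # toy replacement tokens
                        original)

disc_logits = torch.randn(8, 128)                               # per-token scores (stand-in for the discriminator)
labels = (corrupted != original).float()                        # 1 = replaced, 0 = original
# binary loss over EVERY position, not only the corrupted ones
loss = nn.functional.binary_cross_entropy_with_logits(disc_logits, labels)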

 

3. ELECTRA

Efficiently Learning an Encoder that Classifies Token Replacements Accurately

- pre-train Transformer text encoders

- consists of 2 networks

   1) a small generator network : corrupts the input by replacing some tokens (rather than masking them)

   2) a discriminative network : predicts, for each token, whether it was replaced or not

      *after pre-training, we throw out the generator and only fine-tune the discriminator

      *Why is the generator small? : to lower the compute cost

- generator output : softmax over the vocabulary, discriminator output : sigmoid (one score per token)

- minimize the combined loss of both networks (see the sketch below)

- 2 networks share embeddings (token + positional)

- works best when the generator is 1/4-1/2 the size of the discriminator
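
A minimal end-to-end sketch of the two-network setup (illustrative only: real ELECTRA uses two Transformer encoders with tied token + positional embeddings, while the linear heads, sizes, and names below are stand-ins; the loss weight of 50 on the discriminator term follows the paper):

import torch
import torch.nn as nn

vocab_size, mask_id, d_model = 1000, 0, 64                 # hypothetical sizes
shared_emb = nn.Embedding(vocab_size, d_model)             # embeddings shared by both networks

generator = nn.Linear(d_model, vocab_size)                 # small generator head -> softmax over vocab
discriminator = nn.Linear(d_model, 1)                      # discriminator head -> sigmoid per token

tokens = torch.randint(1, vocab_size, (8, 128))
mask = torch.rand(tokens.shape) < 0.15
masked = tokens.masked_fill(mask, mask_id)

# generator: masked language modeling on the masked positions (softmax / cross-entropy)
gen_logits = generator(shared_emb(masked))
mlm_loss = nn.functional.cross_entropy(gen_logits[mask], tokens[mask])

# corrupt the input with samples from the generator (sampling is non-differentiable)
with torch.no_grad():
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
corrupted = torch.where(mask, sampled, tokens)

# discriminator: predicts for every token whether it was replaced (sigmoid / binary loss)
disc_logits = discriminator(shared_emb(corrupted)).squeeze(-1)
is_replaced = (corrupted != tokens).float()                # a correctly sampled token counts as original
disc_loss = nn.functional.binary_cross_entropy_with_logits(disc_logits, is_replaced)

# minimize the combined loss; after pre-training, the generator is thrown away
# and only the discriminator is fine-tuned
total_loss = mlm_loss + 50.0 * disc_loss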

 

4. Differences from the discriminator of a GAN

1) If the generator generates the correct token, that token is considered ‘real’ (see the toy example below)

2) The generator is trained with maximum likelihood (not adversarial)

   → because of the difficulty of applying GANs to text (it is impossible to back-propagate through the discrete sampling step)

3) We don’t supply the generator with a noise vector as input
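
A toy illustration of difference 1): a sampled token is labeled “replaced” only when it differs from the original, so a lucky correct guess counts as real:

import torch

original = torch.tensor([12, 7, 31])
sampled  = torch.tensor([12, 5, 31])    # the generator happened to guess positions 0 and 2 correctly
labels = (sampled != original).float()  # tensor([0., 1., 0.]); correct guesses are labeled as real/original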

 

5. Results

GLUE, SQuAD

 

ELECTRA outperforms MLM-based models (BERT, XLNet) given the same model size, data, and compute

ELECTRA-Small outperforms small BERT and GPT

ELECTRA-Large performs comparably to RoBERTa, XLNet, and ALBERT

  • while having fewer parameters
  • and using less than 1/4 of the compute for training
  • and sets a new SOTA (on SQuAD 2.0)

→ It works well at both small and large scale

→ This method is more parameter-efficient and compute-efficient

6. Efficiency analysis

a large amount of ELECTRA’s improvement comes from learning from all tokens

a smaller amount comes from alleviating the pre-train / fine-tune mismatch.

 

7. Related Work

1) self-supervised learning : used to learn word representations and contextual representations

2) BERT : pre-trains a large Transformer

  • There have been numerous extensions
  • (MASS, UniLM, ERNIE, SpanBERT, TinyBERT, MobileBERT)

3) SpanBERT : masks out contiguous sequences of tokens

→ this line of work leads to ELECTRA