[PAPER REVIEW 231113] ELECTRA(2020)
Stanford University, Google Brain
ICLR 2020 conference
0. MLM
1) mask out 15% of the input tokens
2) train the model to recover the original tokens (sketch below)
→ It can learn bidirectional representations
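*rough sketch (mine, not from the paper): PyTorch-style code for the MLM objective above; `model`, `mask_id`, and the random masking scheme are placeholder assumptions
```python
import torch
import torch.nn.functional as F

def mlm_step(model, input_ids, mask_id, mask_prob=0.15):
    """One MLM step: mask ~15% of the tokens, train to recover the originals."""
    mask = torch.rand(input_ids.shape) < mask_prob   # pick ~15% of positions
    corrupted = input_ids.clone()
    corrupted[mask] = mask_id                        # replace them with [MASK]
    logits = model(corrupted)                        # (batch, seq_len, vocab_size)
    # the loss covers only the masked positions -> learns from ~15% of the tokens
    return F.cross_entropy(logits[mask], input_ids[mask])
```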
1. A Problem of MLM
substantial compute cost
→ because the model only learns from the 15% of tokens that are masked in each example
2. Replaced token detection
1) replace some tokens of the input with plausible alternatives
→ solves BERT's pre-train/fine-tune mismatch (the model never sees [MASK] tokens at fine-tuning time)
2) train the model to predict, for every token, whether it is replaced or not (as a discriminator)
→ computationally efficient, because it learns from all input tokens (sketch below)
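*rough sketch of the replaced-token-detection loss, same placeholder style as above (`discriminator` is assumed to return one score per token)
```python
import torch
import torch.nn.functional as F

def rtd_step(discriminator, original_ids, corrupted_ids):
    """Replaced token detection: for EVERY token, predict whether it was replaced."""
    labels = (corrupted_ids != original_ids).float()   # 1 = replaced, 0 = original
    logits = discriminator(corrupted_ids)              # (batch, seq_len) per-token scores
    # binary loss over all positions -> the model learns from every input token
    return F.binary_cross_entropy_with_logits(logits, labels)
```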
3. ELECTRA
Efficiently Learning an Encoder that Classifies Token Replacements Accurately
- pre-train Transformer text encoders
- consists of 2 networks
1) a small generator network : a masked LM that corrupts the input by replacing some tokens with sampled alternatives (so the discriminator never sees [MASK])
2) a discriminator network : predicts, for each token, whether it was replaced by the generator or not
*after pre-training, we throw out the generator and only fine-tune the discriminator
*Why is the generator small? : to keep the extra compute cost low
- generator : softmax over the vocabulary / discriminator : per-token sigmoid
- minimize the combined loss: generator MLM loss + λ × discriminator loss (λ = 50; sketch after this list)
- 2 networks share embeddings (token + positional)
- works best when the generator is 1/4–1/2 the size of the discriminator
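*rough end-to-end sketch of one pre-training step, same placeholder names as above; the λ = 50 weight is the value used in the paper, everything else is my own pseudocode
```python
import torch
import torch.nn.functional as F

def electra_step(generator, discriminator, input_ids, mask_id,
                 mask_prob=0.15, disc_weight=50.0):
    """Combined ELECTRA objective: generator MLM loss + λ * discriminator loss."""
    mask = torch.rand(input_ids.shape) < mask_prob
    masked = input_ids.clone()
    masked[mask] = mask_id

    # small generator (softmax over the vocabulary), trained as a normal MLM
    gen_logits = generator(masked)                           # (B, S, vocab)
    mlm_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])

    # build the corrupted input by sampling replacements for the masked positions
    with torch.no_grad():                                    # no gradient flows through sampling
        probs = F.softmax(gen_logits[mask], dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
    corrupted = input_ids.clone()
    corrupted[mask] = sampled

    # discriminator (per-token sigmoid): was each token replaced or not?
    disc_labels = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted)                   # (B, S)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, disc_labels)

    # token + positional embeddings are shared between the two networks (not shown)
    return mlm_loss + disc_weight * disc_loss
```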
4. Differences from the discriminator of a GAN
1) If the generator happens to generate the correct (original) token, that token is labeled ‘real’ (tiny example below)
2) The generator is trained with maximum likelihood, not adversarially
→ because applying GANs to text is difficult (it is impossible to back-propagate through the discrete sampling of tokens)
3) We don’t supply the generator with a noise vector as input
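*tiny made-up example of point 1), reusing the labeling rule from the sketch above
```python
import torch

original  = torch.tensor([12, 55, 78, 91])
corrupted = torch.tensor([12, 40, 78, 91])   # generator replaced position 1 but happened to
                                             # sample the original token at the other positions
labels = (corrupted != original).float()     # tensor([0., 1., 0., 0.])
# a correctly sampled token gets label 0 ("real"), whereas a GAN discriminator
# would treat every generator output as "fake"
```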
5. Results
evaluated on GLUE and SQuAD
ELECTRA outperforms MLM-based models (BERT, XLNet) given the same model size, data, and compute
ELECTRA-Small outperforms a comparably small BERT and even the much larger GPT
ELECTRA-Large performs comparably to RoBERTa, XLNet and ALBERT
- while having fewer parameters
- and using less than 1/4 of their training compute
- with comparable compute it sets a new SOTA on SQuAD 2.0
→ It works well at both small and large scale
→ The method is more parameter-efficient and compute-efficient than MLM pre-training
6. Efficiency analysis
a large part of ELECTRA's improvement comes from learning from all input tokens
a smaller part comes from alleviating the pre-train/fine-tune mismatch caused by [MASK] tokens
7. Related Work
1) self-supervised learning : learn word representations and contextual representations
2) BERT : pre-trains a large Transformer
- There have been numerous extensions
- (MASS, UniLM, ERNIE, SpanBERT, TinyBERT, MobileBERT)
3) SpanBERT : masks out contiguous sequences of tokens
→ these extensions keep the MLM objective, whereas ELECTRA changes the pre-training task itself (replaced token detection)