1. Introduction
Learning universal sentence embeddings is a fundamental problem in natural language processing and has been studied extensively in the literature. In this work, we advance state-of-the-art sentence embedding methods and demonstrate that a contrastive objective can be extremely effective when coupled with pre-trained language models such as BERT or RoBERTa. We present SimCSE, a simple contrastive sentence embedding framework, which can produce superior sentence embeddings from either unlabeled or labeled data.
Our unsupervised SimCSE simply predicts the input sentence itself, with only dropout used as noise. In other words, we pass the same sentence to the pre-trained encoder twice: by applying the standard dropout twice, we obtain two different embeddings as "positive pairs". We then take the other sentences in the same mini-batch as "negatives", and the model predicts the positive one among them. Although it may appear strikingly simple, this approach outperforms training objectives such as next-sentence prediction and discrete data augmentation by a large margin, and even matches previous supervised methods. Through careful analysis, we find that dropout acts as minimal "data augmentation" of hidden representations, and that removing it leads to representation collapse.
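To make this objective concrete, here is a minimal PyTorch sketch, assuming a Hugging Face-style encoder whose output exposes `last_hidden_state`. The names `encoder`, `batch`, and `temperature` are illustrative, and pooling is simplified to the raw [CLS] vector (the paper's full training setup additionally uses an MLP pooler):

```python
import torch
import torch.nn.functional as F

def unsup_simcse_loss(encoder, batch, temperature=0.05):
    # Two forward passes over identical inputs; with the encoder in train
    # mode, dropout makes the two embeddings differ, forming a positive pair.
    z1 = encoder(**batch).last_hidden_state[:, 0]  # [CLS] vectors, (N, d)
    z2 = encoder(**batch).last_hidden_state[:, 0]

    # Pairwise cosine similarities between the two views: an (N, N) matrix.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature

    # Row i's positive is column i; all other columns are in-batch negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```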
Our supervised SimCSE builds upon the recent success of using natural language inference (NLI) datasets for sentence embeddings and incorporates annotated sentence pairs in contrastive learning. Unlike previous work that casts NLI as a 3-way classification task (entailment, neutral, and contradiction), we leverage the fact that entailment pairs can be naturally used as positive instances. We also find that adding the corresponding contradiction pairs as hard negatives further improves performance. This simple use of NLI datasets achieves a substantial improvement over prior methods using the same data. We also compare to other labeled sentence-pair datasets and find that NLI datasets are especially effective for learning sentence embeddings.
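Continuing the sketch above, the supervised variant with hard negatives might look like the following; the function and argument names are again illustrative, and each argument is assumed to be a tokenized batch of N aligned sentences:

```python
import torch
import torch.nn.functional as F

def sup_simcse_loss(encoder, premise, entailment, contradiction, temperature=0.05):
    # Encode anchors, entailment positives, and contradiction hard negatives.
    h = encoder(**premise).last_hidden_state[:, 0]            # (N, d)
    h_pos = encoder(**entailment).last_hidden_state[:, 0]     # (N, d)
    h_neg = encoder(**contradiction).last_hidden_state[:, 0]  # (N, d)

    # Score each anchor against all N positives and all N hard negatives,
    # giving an (N, 2N) logit matrix; column i is the true positive of row i,
    # and every other column serves as a negative.
    sim_pos = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1)
    sim_neg = F.cosine_similarity(h.unsqueeze(1), h_neg.unsqueeze(0), dim=-1)
    logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature

    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(logits, labels)
```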
To better understand the strong performance of SimCSE, we borrow the analysis tool from Wang and Isola (2020), which uses alignment between semantically related positive pairs and uniformity of the whole representation space to measure the quality of learned embeddings. Through empirical analysis, we find that our unsupervised SimCSE essentially improves uniformity while avoiding degenerated alignment via dropout noise, thus improving the expressiveness of the representations. The same analysis shows that the NLI training signal can further improve alignment between positive pairs and produce better sentence embeddings. We also draw a connection to the recent findings that pre-trained word embeddings suffer from anisotropy, and prove, from a spectrum perspective, that the contrastive learning objective "flattens" the singular value distribution of the sentence embedding space, hence improving uniformity.
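For reference, the two metrics follow directly from the definitions in Wang and Isola (2020); this short PyTorch transcription assumes L2-normalized embeddings, with `x[i]` and `y[i]` forming a positive pair:

```python
import torch

def align_loss(x, y, alpha=2):
    # Alignment: expected (squared, for alpha=2) distance between embeddings
    # of positive pairs; lower is better. x, y are L2-normalized (N, d) tensors.
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # Uniformity: log of the mean pairwise Gaussian potential over all
    # embedding pairs; lower means embeddings spread more uniformly
    # over the unit hypersphere.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```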
We conduct a comprehensive evaluation of SimCSE on seven standard semantic textual similarity (STS) tasks and seven transfer tasks. On the STS tasks, our unsupervised and supervised models achieve 76.3% and 81.6% averaged Spearman's correlation respectively using BERT-base, a 4.2% and 2.2% improvement over the previous best results. We also achieve competitive performance on the transfer tasks. Finally, we identify an incoherent evaluation issue in the literature and consolidate the results of different settings for future work on the evaluation of sentence embeddings.