Title - emoji2vec: Learning Emoji Representations from their Description
Authors - Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, Sebastian Riedel
Publication - EMNLP 2016
1. Research Background
1) Emoji use is increasing, and so is interest in social media text analysis
2) NLP often relies on pre-trained word embeddings (e.g., word2vec, GloVe)
3) Yet, neither resource contains Unicode emoji representations
4) So the authors released emoji2vec, embeddings for emoji Unicode symbols
2. Related Work
There has been little work on distributional embeddings of emojis
1) The first study (Dimson, 2015)
- released in an informal blog post by the Instagram Data Team
- trained on the entire corpus of Instagram posts
- similar to skip-gram-based vectors
2) The second study (Barbieri et al., 2016)
- trained on a large Twitter dataset of over 100 million English tweets
- used the skip-gram method
- Embeddings for frequently used emojis worked well, but those for less frequent emojis did not.
3. Characteristics
1) estimating the representation of emojis from their description
2) working with much less data than previous research
4. Limitations
1) training only on English-language data, and ignoring temporal definitions of emojis
2) Different cultural phenomena and languages may co-opt conventional emoji sentiment
3) emoji2vec doesn't capture the context of emojis
5. Method
1) emoji2vec's dimension = 300 (the same as word2vec embeddings)
2) vector representation of the description (v) = the sum of the individual word2vec vectors (w) in the description
3) defining a trainable vector x for every emoji
4) modeling the probability that an emoji matches a description as the sigmoid of the dot product of the 2 representations, sigmoid(x · v)
5) optimization: using stochastic gradient descent with Adam
6) randomly sampling descriptions for emojis as negative instances
- one positive example per negative example produced the best results
7) performing early-stopping on a held-out development set
8) 80 epochs of training
9) training takes less than 3 minutes on a 2013 MacBook Pro
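Steps 2) to 6) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: toy 5-dimensional one-hot word vectors stand in for 300-dimensional word2vec vectors, plain SGD on a logistic (log) loss replaces Adam, early stopping is omitted, and the names `describe` and `train`, the learning rate, and the initialization range are hypothetical choices made for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def describe(words, word_vecs):
    """Step 2: description vector v = sum of its word vectors."""
    dim = len(next(iter(word_vecs.values())))
    v = [0.0] * dim
    for w in words:
        if w in word_vecs:  # words without a vector are skipped (assumption)
            v = [a + b for a, b in zip(v, word_vecs[w])]
    return v


def train(descriptions, word_vecs, epochs=80, lr=0.1, seed=0):
    """Steps 3-6: learn one trainable vector x per emoji."""
    rng = random.Random(seed)
    dim = len(next(iter(word_vecs.values())))
    emojis = sorted(descriptions)
    v = {e: describe(descriptions[e], word_vecs) for e in emojis}
    x = {e: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for e in emojis}
    for _ in range(epochs):
        for e in emojis:
            # Step 6: one randomly sampled negative description per positive.
            neg = rng.choice([o for o in emojis if o != e])
            for target, label in ((v[e], 1.0), (v[neg], 0.0)):
                # Gradient of the log loss w.r.t. x is (sigmoid(x.v) - label) * v.
                err = sigmoid(dot(x[e], target)) - label
                x[e] = [xi - lr * err * ti for xi, ti in zip(x[e], target)]
    return x, v


# Toy data: one-hot "word vectors" and two emoji descriptions (illustrative only).
toy_word_vecs = {
    "heavy": [1.0, 0.0, 0.0, 0.0, 0.0],
    "black": [0.0, 1.0, 0.0, 0.0, 0.0],
    "heart": [0.0, 0.0, 1.0, 0.0, 0.0],
    "dog":   [0.0, 0.0, 0.0, 1.0, 0.0],
    "face":  [0.0, 0.0, 0.0, 0.0, 1.0],
}
toy_descriptions = {
    "\u2764": ["heavy", "black", "heart"],
    "\U0001F436": ["dog", "face"],
}
x, v = train(toy_descriptions, toy_word_vecs)
```

After training on this toy data, sigmoid(x · v) should be high for an emoji paired with its own description and low for the sampled negative, which is the behavior the objective in step 4) rewards.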
6. Evaluation
1) quantitative evaluation
(1) intrinsic (emoji-description classification): 85.5% accuracy (evaluated on manually labeled test data)
(2) extrinsic (Twitter sentiment analysis):
adding emoji embeddings to word embeddings improved sentiment classification performance
2) qualitative evaluation
visualizing the learned emoji embedding space (t-SNE visualization)
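The intrinsic evaluation in 1) (1) amounts to binary classification over emoji-description pairs. The sketch below assumes a decision threshold of 0.5 on sigmoid(x · v); the threshold, the function names `match_probability` and `accuracy`, and the toy pairs are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def match_probability(x, v):
    """Probability that emoji vector x matches description vector v."""
    return sigmoid(sum(a * b for a, b in zip(x, v)))


def accuracy(pairs, threshold=0.5):
    """Fraction of (x, v, label) pairs whose thresholded prediction equals label."""
    correct = 0
    for x, v, label in pairs:
        predicted = match_probability(x, v) > threshold
        correct += int(predicted == label)
    return correct / len(pairs)


# Toy labeled pairs: (emoji vector, description vector, is_true_match).
toy_pairs = [
    ([1.0, 0.0], [2.0, 0.0], True),
    ([1.0, 0.0], [-2.0, 0.0], False),
    ([0.0, 1.0], [0.0, -1.0], True),   # deliberately misclassified example
    ([0.0, 1.0], [0.0, 3.0], True),
]
print(accuracy(toy_pairs))
```

In practice the pairs would come from held-out, manually labeled emoji-description data rather than hand-written vectors.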