TODO: an interesting paper with important references:

Word embedding is an assignment of a vector to each word in a language: $ W: words \rightarrow \mathbb{R}^n $. Typically, the assignment is learned from a large corpus; the vectors are dense and have a relatively small dimensionality (for example, 200 to 500) compared to distributional semantics models.
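Concretely, an embedding is just a lookup table from words to dense vectors, and closeness is usually measured with cosine similarity. A toy sketch (the vectors below are made up, and n = 4 stands in for the 200 to 500 dimensions used in practice):

```python
import numpy as np

# Toy embedding table W: words -> R^n with made-up values.
W = {
    "king":  np.array([0.8, 0.1, 0.7, 0.2]),
    "queen": np.array([0.7, 0.9, 0.6, 0.2]),
    "apple": np.array([0.1, 0.2, 0.0, 0.9]),
}

def cosine(u, v):
    """Cosine similarity, the usual closeness measure for embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(W["king"], W["queen"]))  # related words: high similarity
print(cosine(W["king"], W["apple"]))  # unrelated words: lower similarity
```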

Good practices for training word embeddings, from Lai et al. (2016)[1]:

  1. "First, we discover that corpus domain is more important than corpus size. We recommend choosing a corpus in a suitable domain for the desired task, after that, using a larger corpus yields better results.
  2. Second, we find that faster models provide sufficient performance in most cases, and more complex models can be used if the training corpus is sufficiently large.
  3. Third, the early stopping metric for iterating should rely on the development set of the desired task rather than the validation loss of training embedding"

Characteristics

Proximity of similar words


Figure: t-SNE visualizations of word embeddings. Left: Number Region; Right: Jobs Region. From Turian et al. (2010)[2].

Words in the high-dimensional embedding space tend to form clusters of related meaning, and synonymous words lie closest to each other.

Algebraic relation

Some simple relations are found to be represented by a constant difference vector across pairs of words. For example:

$ W(\textrm{woman}) - W(\textrm{man}) \approx W(\textrm{queen}) - W(\textrm{king}) $

Similar observations were made for other pairs, such as capital–country, celebrity–job, president–country, and chairman–company.[3]
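This regularity is what makes analogy queries possible: look for the word whose vector is closest to $ W(\textrm{woman}) - W(\textrm{man}) + W(\textrm{king}) $. A minimal sketch with made-up 3-d vectors chosen so the offset works (real embeddings learn this structure from a corpus):

```python
import numpy as np

# Illustrative vectors; in real embeddings this structure is learned.
W = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.1, 0.9]),
    "queen": np.array([1.0, 1.0, 0.9]),
    "apple": np.array([0.1, 0.1, 0.0]),
}

def analogy(a, b, c):
    """Return the word closest to W(b) - W(a) + W(c), excluding the inputs."""
    target = W[b] - W[a] + W[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in W if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(W[w], target))

print(analogy("man", "woman", "king"))  # → queen
```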

Basis or the usage of sub-word features

TODO: Bian (2014)[4], Qing Cui et al. (2014)[5].

Sources of information

Text


Knowledge graph

Many methods combine text and knowledge graphs to obtain better word embeddings: retrofitting (Faruqui et al. 2015)[6], Liu et al. (2016)[7], Xu et al. (2014)[8].


  • M. Yu, M. Dredze, Improving lexical embeddings with semantic knowledge, in: ACL (2), 2014, pp. 545–550.
  • J. Bian, B. Gao, T.-Y. Liu, Knowledge-powered deep learning for word embedding, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2014, pp. 132–148.
  • C. Xu, Y. Bai, J. Bian, B. Gao, G. Wang, X. Liu, T.-Y. Liu, RC-NET: A general framework for incorporating knowledge into word representations, in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, ACM, 2014, pp. 1219–1228.
  • Q. Liu, H. Jiang, S. Wei, Z.-H. Ling, Y. Hu, Learning semantic word embeddings based on ordinal knowledge constraints, in: Proceedings of ACL, 2015, pp. 1501–1511.
  • Weston et al. (2013)[9], Yu & Dredze (2014)[10]

TODO: comparisons between methods???

Retrofitting

Faruqui et al. (2015)[6]: "we first train the word vectors independent of the information in the semantic lexicons and then retrofit them".
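The retrofitting step can be sketched as an iterative update that pulls each word vector toward both its pre-trained vector and its lexicon neighbours. The update below follows the shape of Faruqui et al.'s objective with their common defaults (alpha = 1, beta = 1/degree); the tiny vectors and the lexicon here are made up for illustration:

```python
import numpy as np

def retrofit(q_hat, lexicon, iterations=10, alpha=1.0):
    """Pull each vector toward its pre-trained value q_hat[w] and toward
    its lexicon neighbours (e.g. WordNet synonyms)."""
    q = {w: v.copy() for w, v in q_hat.items()}
    for _ in range(iterations):
        for w, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in q]
            if not nbrs:
                continue
            beta = 1.0 / len(nbrs)
            q[w] = (alpha * q_hat[w] + beta * sum(q[n] for n in nbrs)) \
                   / (alpha + beta * len(nbrs))
    return q

# Made-up pre-trained vectors and a one-edge synonym lexicon.
q_hat = {"happy": np.array([1.0, 0.0]),
         "glad":  np.array([0.0, 1.0]),
         "sad":   np.array([-1.0, -1.0])}
lexicon = {"happy": ["glad"], "glad": ["happy"]}

q = retrofit(q_hat, lexicon)
# "happy" and "glad" move toward each other; "sad", absent from the
# lexicon, keeps its original vector.
```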

Inequality

From Liu et al. (2016): "the knowledge constraints are formulized as semantic similarity inequalities between two word pairs... semantic inequalities from WordNet:

  1. Similarities between a word and its synonymous words are larger than similarities between the word and its antonymous words. A typical example is similarity(happy, glad) > similarity(happy, sad).
  2. Similarities of words that belong to the same semantic category would be larger than similarities of words that belong to different categories.
  3. Similarities between words that have shorter distances in a semantic hierarchy should be larger than similarities of words that have longer distances."
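One common way to turn such an inequality into a training signal is a hinge penalty: the constraint contributes zero loss while satisfied by a margin, and a positive loss otherwise. A hedged sketch (the vectors and margin below are illustrative, not the paper's exact model):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def inequality_loss(w, a, b, margin=0.1):
    """Hinge loss for the constraint sim(w, a) > sim(w, b) + margin."""
    return max(0.0, cosine(w, b) - cosine(w, a) + margin)

happy = np.array([1.0, 0.2])
glad  = np.array([0.9, 0.3])   # near-synonym: should stay close to "happy"
sad   = np.array([-0.8, 0.4])  # antonym: should stay farther away

print(inequality_loss(happy, glad, sad))  # → 0.0 (constraint satisfied)
```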

Models

  • CBOW
  • Skip-gram
  • CLOW (continuous list of words): Trask et al. 2015[11]
  • PENN (partitioned embedding neural network): Trask et al. 2015[11]
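CBOW and skip-gram differ in the training pairs they extract from each context window: CBOW predicts the centre word from its context, while skip-gram predicts each context word from the centre word. A minimal sketch of the pair extraction (plain Python, made-up sentence; the neural training step itself is omitted):

```python
def training_pairs(tokens, window=2):
    """Extract CBOW pairs (context -> centre) and skip-gram pairs
    (centre -> context word) from a token sequence."""
    cbow, skipgram = [], []
    for i, centre in enumerate(tokens):
        context = [tokens[j] for j in range(max(0, i - window),
                                            min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, centre))              # CBOW: context -> centre
        skipgram += [(centre, c) for c in context]  # skip-gram: centre -> each
    return cbow, skipgram

cbow, sg = training_pairs(["the", "cat", "sat", "on", "mat"])
print(cbow[2])  # → (['the', 'cat', 'on', 'mat'], 'sat')
print(sg[:2])   # → [('the', 'cat'), ('the', 'sat')]
```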

Applications

External links

Source code

  • Retrofitting: github
  • gensim implementation of Word2vec

References

  1. Lai, S., Liu, K., He, S., & Zhao, J. (2016). How to generate a good word embedding. IEEE Intelligent Systems.
  2. Turian, J., Ratinov, L., & Bengio, Y. (2010, July). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 384-394). Association for Computational Linguistics. PDF
  3. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  4. Bian, J., Gao, B., & Liu, T. Y. (2014). Knowledge-powered deep learning for word embedding. In Machine Learning and Knowledge Discovery in Databases (pp. 132-148). Springer Berlin Heidelberg.
  6. Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., & Smith, N. A. (2015). Retrofitting Word Vectors to Semantic Lexicons. In NAACL 2015 (pp. 1606–1615). Denver, Colorado: ACL.
  7. Liu, Q., Jiang, H., Ling, Z.-H., Zhu, X., Wei, S., & Hu, Y. (2016). Commonsense Knowledge Enhanced Embeddings for Solving Pronoun Disambiguation Problems in Winograd Schema Challenge.
  8. Xu, C., Bai, Y., Bian, J., Gao, B., Wang, G., Liu, X., & Liu, T. Y. (2014). RC-NET: A General Framework for Incorporating Knowledge into Word Representations.
  9. Weston, Jason, et al. "Connecting language and knowledge bases with embedding models for relation extraction." arXiv preprint arXiv:1307.7973 (2013).
  10. Yu, Mo, and Mark Dredze. "Improving lexical embeddings with semantic knowledge." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Vol. 2. 2014.
  11. Trask, A., Gilmore, D., & Russell, M. (2015). Modeling Order in Neural Word Embeddings at Scale. arXiv preprint arXiv:1506.02338. PDF