The simplest approach in NLP is to look for words in the text. Because it is simple and needs few resources, it has always been very common; for example, the best search engines are word-based.
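
As a rough illustration of what word-based lookup can mean in practice, the following minimal sketch builds an inverted index (word to documents) and answers queries by word overlap alone. The function names `tokenize`, `build_index`, and `search`, and the two sample documents, are made up for this example and do not come from any of the projects discussed here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    "a": "The dog chased the ball.",
    "b": "Search engines rank pages.",
}
index = build_index(docs)
```

With this sketch, `search(index, "dog")` matches only document `"a"`, because matching depends entirely on which words literally occur in each text.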

Some projects employing a word-based approach, according to Cambria & White (2014)[1]:

  1. Ortony’s Affective Lexicon (Ortony, Clore, & Collins, 1988), which groups words into affective categories
  2. Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1994), a corpus consisting of over 4.5 million words of American English annotated for part-of-speech (POS) information
  3. PageRank (Page, Brin, Motwani, & Winograd, 1999), the famous ranking algorithm of Google
  4. LexRank (Günes & Radev, 2004), a stochastic graph-based method for computing relative importance of textual units for NLP
  5. TextRank (Mihalcea & Tarau, 2004), a graph-based ranking model for text processing, based on two unsupervised methods for keyword and sentence extraction

Drawback

  • Reliance on surface features: a document about dogs may never use the word "dog" because it refers to specific breed names instead.
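
The drawback above can be reproduced with a small sketch. The helper `mentions` and the sample sentence are hypothetical and only illustrate the point: a text that is clearly about dogs is missed by a lookup for the word "dog".

```python
import re


def mentions(word, text):
    """Return True if `word` occurs as a whole word in `text` (case-insensitive)."""
    return word.lower() in re.findall(r"[a-z]+", text.lower())


# Hypothetical document: it is about dogs but only names breeds.
doc = "Our labrador and poodle love long walks; the beagle prefers naps."

found_dog = mentions("dog", doc)        # False: the surface word is absent
found_poodle = mentions("poodle", doc)  # True: the breed name is present
```

A purely word-based system has no way of knowing that "poodle" implies "dog" unless that relation is supplied from elsewhere, for example from a hand-built list of related words.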

References

  1. Cambria, E., & White, B. (2014). Jumping NLP curves: A review of natural language processing research. IEEE Computational Intelligence Magazine, 9(2), 48-57.