An **exponential language model**, also called a **maximum entropy language model**, expresses the conditional probability of word $ w_i $ given context $ h_i $ as:

$$ P(w_i \mid h_i) = \frac{1}{Z(h_i)} \exp\left( \sum_j \lambda_j f_j(h_i, w_i) \right) $$

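where $ \lambda_j $ are the parameters, $ f_j(h_i, w_i) $ are arbitrary functions of the pair $ (h_i, w_i) $, and $ Z(h_i) $ is a normalization factor that sums over every word $ w $ in the vocabulary $ V $:

$$ Z(h_i) = \sum_{w \in V} \exp\left( \sum_j \lambda_j f_j(h_i, w) \right) $$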
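
The computation can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard log-linear form in which a word's score is the weighted sum of its feature values and $ Z(h_i) $ sums the exponentiated scores over the vocabulary. The vocabulary, the two indicator features, the weights, and the name `exp_lm_prob` are all hypothetical and chosen only for the example.

```python
import math

# Hypothetical toy vocabulary; a real model would use the full word list.
vocab = ["the", "cat", "sat"]

# Hypothetical feature functions f_j(h, w); the history h is a list of words.
features = [
    lambda h, w: 1.0 if w == "the" else 0.0,  # unigram-style indicator
    lambda h, w: 1.0 if h and h[-1] == "of" and w == "the" else 0.0,  # bigram-style indicator
]

# Hypothetical parameters lambda_j, one per feature (not learned here).
lambdas = [0.5, 1.2]


def exp_lm_prob(word, history, vocab, features, lambdas):
    """Return P(word | history) under the exponential model."""
    def score(w):
        # Weighted sum of feature values: sum_j lambda_j * f_j(h, w)
        return sum(l * f(history, w) for l, f in zip(lambdas, features))

    scores = {w: score(w) for w in vocab}
    # Subtract the maximum score before exponentiating for numerical stability;
    # the shift cancels between numerator and denominator.
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())  # normalization factor Z(h)
    return math.exp(scores[word] - m) / z
```

Calling `exp_lm_prob("the", ["of"], vocab, features, lambdas)` fires both features, so "the" receives a higher probability after "of" than after a context that fires only the first feature.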

The parameters are learned from the training data according to the maximum entropy principle. The approach was first introduced into language modeling by Della Pietra et al. (1992)[1] and was later investigated systematically by Rosenfeld (1996)[2].
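
To make the training criterion concrete, the following is a sketch that assumes the parameters are fit by maximizing the log-likelihood of the training pairs $ (h_i, w_i) $, which for this model family yields the same solution as the maximum entropy formulation with feature-expectation constraints. The symbols $ \mathcal{L} $ for the log-likelihood and $ V $ for the vocabulary are notation introduced here. With $ \mathcal{L} = \sum_i \log P(w_i \mid h_i) $, the gradient with respect to each parameter is:

$$ \frac{\partial \mathcal{L}}{\partial \lambda_j} = \sum_i \left[ f_j(h_i, w_i) - \sum_{w \in V} P(w \mid h_i) \, f_j(h_i, w) \right] $$

Setting the gradient to zero shows that, at the optimum, the observed total of each feature in the training data equals its expected total under the model.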

Most neural network LMs use a softmax output layer and can therefore be considered exponential LMs, albeit with sophisticated feature templates.
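
The correspondence can be checked numerically. The sketch below assumes a softmax layer without a bias term, treats each hidden activation paired with each candidate word as one feature, and treats the output weights as the parameters $ \lambda_j $. The hidden vector, the weight matrix `W`, and the helper names are hypothetical toy values, not taken from any particular model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy setup: 3-word vocabulary, 2-dimensional hidden state.
hidden = [0.5, -1.0]  # stands in for a network's encoding of the context
W = [[1.0, 0.2], [0.3, -0.7], [-0.5, 0.4]]  # one row of output weights per word
n_words, n_hidden = len(W), len(hidden)

# Neural-LM view: logits are dot products of output weights and hidden state.
logits = [sum(W[v][j] * hidden[j] for j in range(n_hidden)) for v in range(n_words)]
p_softmax = softmax(logits)


# Exponential-LM view: one feature per (word v, hidden unit j) pair,
# equal to hidden[j] when the candidate word w is v and zero otherwise,
# with weight lambda_{(v, j)} = W[v][j].
def feature(v, j, w):
    return hidden[j] if w == v else 0.0


def score(w):
    return sum(W[v][j] * feature(v, j, w)
               for v in range(n_words) for j in range(n_hidden))


scores = [score(w) for w in range(n_words)]
z = sum(math.exp(s) for s in scores)  # normalization factor
p_exp = [math.exp(s) / z for s in scores]
```

Both `p_softmax` and `p_exp` describe the same distribution; the difference is only that the neural model computes its feature values from the context instead of reading them off hand-written templates.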

## References

1. Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer, and Salim Roukos. Adaptive language modeling using minimum discriminant estimation. In Proceedings of the Workshop on Speech and Natural Language, pages 103–106, 1992.
2. Ronald Rosenfeld. A maximum entropy approach to adaptive statistical language modeling. Computer Speech & Language, 10(3):187–228, 1996.