Maes et al. (2009)^{[1]} established the equivalence between structured prediction and reinforcement learning.

## Structured Prediction - Markov Decision Process Edit

- States. Each state of an SP-MDP contains both an input x and a partial output y. Let Y be the set of all possible partial outputs
- Actions. Actions of SP-MDPs concern elementary modifications of the partial output y.
- Transitions. SP-MDP transitions are deterministic and replace the current partial output by the transformed partial output. Transitions do not change the current input: T((x,y),a) = (x,a(y))
- Rewards. In SP, the aimis to predict outputs that are as similar as possible to the correct outputs w.r.t. the loss function ∆.When dealing with MDPs, the goal is expressed through rewards that should be maximized.

Note that the correct outputs are required to compute rewards. These
outputs are available for training but not for testing on new inputs. Rewards can then be computed only for the subset of the SP-MDP corresponding to states reachable from a initial training state s^{initial}(x^{(i)}). SP-MDP are thus different from traditional RL problems, where the reward function is known anytime. From the point of view of RL, this special setting has two main consequences: a policy can only be trained on a subset of the SP-MDP and approximated policies are needed in order to generalize to the whole state-action space.

## See also Edit

## References Edit

- ↑ Maes, F., Denoyer, L., & Gallinari, P. (2009). Structured prediction with reinforcement learning. doi:10.1007/s10994-009-5140-8