In CoNLL 2008[1], SRL evaluation was formulated as an F1 score over dependencies. A dependency is created between each predicate and each of its arguments and is labeled with the argument's role. In addition, a dependency is created between every predicate and ROOT, labeled with the predicate's sense. This way, an SRL system still receives partial credit when it makes a mistake in predicate disambiguation. The report gives this example:

For example, for the correct proposition:

 verb.01: ARG0, ARG1, ARGM-TMP

the system that generates the following output for the same argument tokens:

 verb.02: ARG0, ARG1, ARGM-LOC

receives a labeled precision score of 2/4, since two of the four semantic dependencies are incorrect: the dependency to ROOT is labeled 02 instead of 01, and the dependency to the ARGM-TMP argument is incorrectly labeled ARGM-LOC.
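This scoring can be sketched by representing each semantic dependency as a (predicate, dependent, label) triple and counting exact matches; the token names below are hypothetical, chosen only to mirror the example:

```python
# Gold proposition: verb.01 with ARG0, ARG1, ARGM-TMP.
# The ROOT dependency carries the predicate sense as its label.
gold = {
    ("verb", "ROOT", "01"),
    ("verb", "tok_a", "ARG0"),
    ("verb", "tok_b", "ARG1"),
    ("verb", "tok_c", "ARGM-TMP"),
}

# System output: verb.02 with ARG0, ARG1, ARGM-LOC on the same tokens.
predicted = {
    ("verb", "ROOT", "02"),        # wrong sense label
    ("verb", "tok_a", "ARG0"),
    ("verb", "tok_b", "ARG1"),
    ("verb", "tok_c", "ARGM-LOC"), # wrong role label
}

correct = len(gold & predicted)              # 2 exact matches
labeled_precision = correct / len(predicted)
labeled_recall = correct / len(gold)
print(labeled_precision, labeled_recall)     # 0.5 0.5
```

Because the sense mismatch only costs the single ROOT dependency, the two correctly labeled arguments still earn credit.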

For joint evaluation of syntax and semantics, they compute macro precision and recall scores:

$ LMP = W_{sem} \times LP_{sem} + (1 - W_{sem}) \times LAS $

$ LMR = W_{sem} \times LR_{sem} + (1 - W_{sem}) \times LAS $

Here, LMP stands for labeled macro precision, LMR for labeled macro recall, and LAS for the (syntactic) labeled attachment score; $W_{sem}$ is the weight assigned to the semantic task.
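The two formulas above can be wrapped in a small helper; this is a sketch, and the default $W_{sem} = 0.5$ (equal weighting of syntax and semantics) is an assumption about how the weight is typically set:

```python
def labeled_macro_scores(lp_sem, lr_sem, las, w_sem=0.5):
    """Combine semantic precision/recall with syntactic LAS.

    lp_sem, lr_sem: labeled semantic precision and recall in [0, 1].
    las: labeled attachment score in [0, 1].
    w_sem: weight on the semantic task (0.5 assumed here).
    """
    lmp = w_sem * lp_sem + (1 - w_sem) * las
    lmr = w_sem * lr_sem + (1 - w_sem) * las
    # Macro F1 as the harmonic mean of LMP and LMR.
    lmf1 = 2 * lmp * lmr / (lmp + lmr) if (lmp + lmr) > 0 else 0.0
    return lmp, lmr, lmf1

# E.g. semantic P=0.80, R=0.70 and LAS=0.90:
lmp, lmr, lmf1 = labeled_macro_scores(0.80, 0.70, 0.90)
print(round(lmp, 3), round(lmr, 3))  # 0.85 0.8
```

With equal weights, a strong syntactic parse can noticeably lift the joint score even when the semantic scores are weaker.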

References

  1. Surdeanu, M., Johansson, R., Meyers, A., Màrquez, L., & Nivre, J. (2008, August). The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of the Twelfth Conference on Computational Natural Language Learning (pp. 159-177). Association for Computational Linguistics.