Algorithm

Spotting

The authors use the extended set of labels in the lexicalization dataset to create a lexicon for spotting. Spotting is performed with the LingPipe Exact Dictionary-Based Chunker[1], which relies on the Aho-Corasick string matching algorithm[2] and returns the longest case-insensitive match.
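The idea of longest case-insensitive dictionary matching can be illustrated with a minimal Python sketch. It is not the LingPipe chunker and does not implement Aho-Corasick: it substitutes a plain character trie that is rescanned from each start position, which is simpler but slower. The function names `build_trie` and `spot`, the word-boundary rule, and the sample lexicon are assumptions made for illustration.

```python
END = "$end"  # multi-character key, so it cannot collide with a single character


def build_trie(labels):
    """Build a character trie over lower-cased lexicon labels."""
    root = {}
    for label in labels:
        node = root
        for ch in label.lower():
            node = node.setdefault(ch, {})
        node[END] = True  # marks the end of a complete lexicon entry
    return root


def is_boundary(text, pos):
    """True if position pos does not fall between two alphanumeric characters."""
    if pos <= 0 or pos >= len(text):
        return True
    return not (text[pos - 1].isalnum() and text[pos].isalnum())


def spot(text, trie):
    """Return (start, end, surface_form) for leftmost, longest, non-overlapping matches.

    Assumes lower-casing keeps the string length unchanged (true for ASCII).
    """
    lowered = text.lower()
    spots = []
    i = 0
    while i < len(lowered):
        longest_end = None
        if is_boundary(lowered, i):
            node = trie
            j = i
            while j < len(lowered) and lowered[j] in node:
                node = node[lowered[j]]
                j += 1
                # Remember the furthest position that ends a whole lexicon entry.
                if END in node and is_boundary(lowered, j):
                    longest_end = j
        if longest_end is None:
            i += 1
        else:
            spots.append((i, longest_end, text[i:longest_end]))
            i = longest_end  # continue after the match, so spots never overlap
    return spots
```

With the hypothetical lexicon `["Berlin", "Berlin Wall", "wall"]`, the text "The BERLIN wall fell; Berliners cheered." yields a single spot covering "BERLIN wall": the longer entry wins over "Berlin" and "wall", and "Berliners" is skipped because the match would end inside a word.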

Ignoring common words

A configuration flag can instruct the system to disregard, at this stage, any spots composed only of verbs, adjectives, adverbs and prepositions. The part-of-speech tagger was the LingPipe implementation based on Hidden Markov Models.
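A rough sketch of this filter, assuming each spot has already been tagged and is represented as a list of `(token, tag)` pairs. The coarse tag names (`VERB`, `ADJ`, `ADV`, `ADP`) and the `ignore_common_words` flag name are hypothetical; the tag set actually produced by the LingPipe tagger may differ.

```python
# Assumed coarse tags for verbs, adjectives, adverbs and prepositions.
IGNORED_TAGS = {"VERB", "ADJ", "ADV", "ADP"}


def keep_spot(tagged_tokens, ignored_tags=IGNORED_TAGS):
    """Keep a spot unless every one of its tokens carries an ignored tag."""
    return any(tag not in ignored_tags for _, tag in tagged_tokens)


def filter_spots(spots, ignore_common_words=True):
    """Apply the filter only when the (hypothetical) configuration flag is set."""
    if not ignore_common_words:
        return list(spots)
    return [spot for spot in spots if keep_spot(spot)]
```

Under these assumptions a spot such as "red cross" (adjective plus noun) is kept, while "up to" (adverb plus preposition) is dropped.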

Candidate selection

This stage maps each spotted surface form to its candidate DBpedia resources, narrowing down the space of disambiguation possibilities.

Modeling DBpedia resources

  1. Aggregate all paragraphs mentioning each concept in Wikipedia
  2. Compute Term Frequency (TF) representing the relevance of a word for a given resource
  3. Compute Inverse Candidate Frequency (ICF) weight. The intuition behind ICF is that the discriminative power of a word is inversely proportional to the number of DBpedia resources it is associated with.
    $ ICF(w_j) = \log \frac{|R_s|}{n(w_j)} = \log |R_s| - \log n(w_j) $,
    where $ R_s $ is the set of candidate resources for a surface form $ s $ and $ n(w_j) $ is the total number of resources in $ R_s $ that are associated with the word $ w_j $.
  4. Create a Vector Space Model (VSM) with TF*ICF weights.
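The four steps above can be sketched in Python as follows. This is a minimal illustration, not the original implementation: whitespace tokenization, lower-casing, raw counts as TF, and the function name `tf_icf_vectors` are all assumptions.

```python
import math
from collections import Counter


def tf_icf_vectors(paragraphs_by_resource):
    """Build TF*ICF vectors for the candidate resources of one surface form.

    paragraphs_by_resource maps each candidate resource in R_s to the list of
    paragraphs that mention it.
    """
    # Steps 1-2: aggregate the paragraphs of each resource and count its terms.
    tf = {
        resource: Counter(word for p in paragraphs for word in p.lower().split())
        for resource, paragraphs in paragraphs_by_resource.items()
    }
    # n(w): number of candidate resources associated with word w.
    n = Counter(word for counts in tf.values() for word in counts)
    num_resources = len(tf)  # |R_s|
    # Steps 3-4: weight each term by TF * ICF, with ICF(w) = log(|R_s| / n(w)).
    return {
        resource: {w: count * math.log(num_resources / n[w]) for w, count in counts.items()}
        for resource, counts in tf.items()
    }
```

A word that occurs with every candidate gets an ICF of log 1 = 0, so it contributes nothing to the vector; a word that occurs with only one of two candidates gets an ICF of log 2.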

Disambiguation

Rank candidate resources according to the similarity score between their context vectors and the context surrounding the surface form. Cosine was used as the similarity measure.
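As an illustration, the ranking could look like the sketch below, assuming both the context around the surface form and each candidate resource are represented as sparse dictionaries from term to weight. The names `cosine` and `rank_candidates` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_candidates(context_vector, resource_vectors):
    """Return (resource, score) pairs sorted by descending similarity."""
    scored = [(r, cosine(context_vector, vec)) for r, vec in resource_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The first element of the returned list is the disambiguation choice, and its score is the value that the configuration parameters operate on.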

Configuration parameters

Topic Pertinence

Topic pertinence is the similarity score returned by the disambiguation step.

In order to constrain annotations to topically related resources, a higher threshold for the topic pertinence can be set.

Contextual Ambiguity

If more than one candidate resource has high topical pertinence to a paragraph, it may be harder to disambiguate between those resources because they remain partly ambiguous in that context.

The score is computed as the relative difference in topic score between the first- and second-ranked resources.

Applications that require high precision may decide to reduce risks by not annotating resources when the contextual ambiguity is high.
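One possible reading of this score is sketched below. The text does not give the exact formula, so the normalization by the top score, the handling of edge cases, and the `min_gap` threshold are all assumptions; under this reading a small gap means high contextual ambiguity.

```python
def relative_gap(scores):
    """Relative difference (s1 - s2) / s1 between the two best topic scores.

    Assumption: the difference is normalized by the top score.
    """
    ranked = sorted(scores, reverse=True)
    if not ranked or ranked[0] <= 0.0:
        return 0.0  # nothing scored: treat as fully ambiguous
    if len(ranked) == 1:
        return 1.0  # a single candidate has no competitor
    return (ranked[0] - ranked[1]) / ranked[0]


def should_annotate(scores, min_gap=0.2):
    """Skip the annotation when the two best candidates are too close (hypothetical threshold)."""
    return relative_gap(scores) >= min_gap
```

For example, topic scores of 0.80 and 0.78 give a gap of 0.025, so a precision-oriented application using this sketch would leave the surface form unannotated.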

Disambiguation confidence

A confidence threshold of 0.7 should eliminate 70% of incorrectly disambiguated test cases.

The confidence is computed with a statistical test based on topic pertinence and contextual ambiguity. Confidence is high when ambiguity is low.

Parameters were estimated on a development set of 100,000 Wikipedia samples.

Usage

REST Web service

  • Spotting
  • Disambiguate
  • Candidates

References

  1. Alias-i. LingPipe 4.0.0 [retrieved on 24.08.2010], 2008.
  2. A. V. Aho and M. J. Corasick. Efficient string matching: an aid to bibliographic search. Commun. ACM, 18:333-340, June 1975.