News

WMBT is able to tag unknown words

Added by Adam Radziszewski over 11 years ago

The previous version of WMBT could not recover tags for tokens on which the morphological analyser failed. The current version (as in the repository) adds a simple algorithm for guessing the tags of unknown words. This algorithm is not intended to replace the guesser used before, only to recover the cases where the external guesser fails.

The algorithm improves accuracy from 88.5507% to 89.6983% (a 10% relative drop in error rate), as tested on NKJP. These figures are a lower bound obtained from plain-text tests, which are probably the closest approximation to real-world tagging made so far.
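The 10% figure is a relative reduction of the error rate, not a gain of ten percentage points in accuracy. A minimal sketch of the arithmetic, using only the two accuracy values quoted above (the function name is an illustrative assumption and is not part of WMBT):

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Return the relative drop in error rate, given accuracies in percent."""
    err_before = 100.0 - acc_before  # error rate before, in percent
    err_after = 100.0 - acc_after    # error rate after, in percent
    return (err_before - err_after) / err_before


# Accuracy values as reported for the NKJP plain-text tests.
reduction = relative_error_reduction(88.5507, 89.6983)
print(f"{reduction:.1%}")  # prints "10.0%"
```

The error rate falls from 11.4493% to 10.3017%, that is by 1.1476 percentage points, which is about one tenth of the original error.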

NOTE: WMBT's guessing algorithm predicts tags only; lemmas are not guessed (guessed lemma = orth.lowercase()).
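To illustrate what this fallback means in practice, here is a rough sketch of the rule in plain Python. It is not taken from the WMBT source; `fallback_lemma` is a hypothetical name, and the example word is chosen only for illustration.

```python
def fallback_lemma(orth: str) -> str:
    """Hypothetical stand-in for the fallback: the lemma is the lowercased orthographic form."""
    return orth.lower()


# An inflected form stays inflected; only the case changes,
# so the result is generally not the true base form.
print(fallback_lemma("Kowalskiego"))  # prints "kowalskiego"
```

Downstream tools that rely on lemmas should therefore treat the lemmas of guessed tokens as placeholders rather than real base forms.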

To get the best results for the NKJP tagset:
  1. get the newest version of the WMBT code from the repository
  2. download model_nkjp10_guess.tar.bz2 as posted on the main wiki site
  3. use a MACA configuration for Morfeusz SGJP with a guesser (morfeusz-nkjp-official-guesser or morfeusz-nkjp-guesser)
  4. use WMBT with the nkjp-guess.ini config (as in here)
