WMBT is able to tag unknown words
WMBT now contains a simple algorithm for tagging words not recognised by Morfeusz+guesser, which raised the real accuracy from 88.6% to 89.7%.
The previous version of WMBT could not recover tags for tokens where the morphological analyser failed. The current version (as in the repository) adds a simple algorithm for guessing the tags of unknown words. This algorithm is not intended to replace the guesser used before, only to recover those cases where the external guesser fails.
The algorithm improves accuracy from 88.5507% to 89.6983% (a 10% relative drop in error rate), as tested on NKJP. These figures are a lower bound obtained from plain-text tests, which are probably the closest approximation to real-world tagging made so far.
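The 10% figure is the relative reduction in error rate, not the absolute gain in accuracy. A quick check of the arithmetic from the two accuracy values above (the function name is illustrative only, not part of WMBT):

```python
def relative_error_reduction(acc_before, acc_after):
    """Return the relative drop in error rate, given accuracies in percent."""
    err_before = 100.0 - acc_before  # error rate before, in percent
    err_after = 100.0 - acc_after    # error rate after, in percent
    return (err_before - err_after) / err_before


reduction = relative_error_reduction(88.5507, 89.6983)
print(f"{reduction:.1%}")  # prints 10.0%
```

The error rate falls from 11.4493% to 10.3017%, i.e. by about 1.15 percentage points, which is roughly a tenth of the original error.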
NOTE: WMBT's guessing algorithm predicts tags only; lemmas are not guessed (guessed lemma = orth.lowercase()).
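In practice this means that a guessed token keeps its inflected form as its lemma. A minimal sketch of that fallback, assuming plain Python strings (the helper name is hypothetical and this is not taken from the WMBT source; `orth.lowercase()` in the note corresponds to `str.lower()` in Python):

```python
def fallback_lemma(orth):
    """Hypothetical helper: use the lowercased orthographic form as the lemma."""
    return orth.lower()


# The inflected form is only lowercased, not reduced to a base form.
print(fallback_lemma("Kowalskiego"))  # prints kowalskiego
```

Downstream tools that rely on lemmas should therefore treat lemmas of guessed tokens as placeholders.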
To get the best results for the NKJP tagset:
- get the newest version of WMBT code from the repository
- download model_nkjp10_guess.tar.bz2 as posted on the main wiki site
- use the MACA configuration for Morfeusz SGJP with guesser (morfeusz-nkjp-official-guesser or morfeusz-nkjp-guesser)
- use WMBT with nkjp-guess.ini config (as in here)