Treatment of unknown words in WMBT

Adam Radziszewski, 2012

This document describes the changes made to the base WMBT algorithm to allow for better treatment of unknown words. The algorithm described here corresponds to the most recent WMBT configuration recommended for general-purpose usage (nkjp-guess.ini).

The previous version of WMBT (as published in the LTC'11 paper) assumed that the tagger's only job was to select among the tags offered by the analyser: if the analyser output no tags, the tagger could not change that. Such a configuration is still available; look for *ltc11.ini.

Here an additional “guessing” module is introduced: if a word is recognised as unknown, a list of tags is assigned to it and then it's subjected to disambiguation as before. This is now the default config (nkjp-guess.ini).

NOTE: this algorithm is to provide a guessing strategy when the analyser+guesser module fails. It is a simple enhancement and lemmas are not guessed.

Note on tagger evaluation and figures reported earlier

The testing procedure employed in the LTC'11 paper made it impossible to see the actual tagging rate for unknown words: only disambiguation capabilities were tested. That is, what was assessed was the ability of the taggers to rule out unwanted interpretations in the reference corpus, which contains full morphological analysis of known words.

Fair tests assume running the tagger on plain text¹ extracted from the testing part of the reference corpus and comparing its output to the original reference. Such tests show the ‘real’ performance, that is, how well the tagger tags plain text (which is the only input normally available), not just how well it disambiguates artificial test material. Two figures are reported: a lower bound (penalising each segmentation change, i.e. every reference token subjected to a segmentation change is counted as mistagged) and an upper bound (every reference token subjected to a segmentation change is counted as correctly tagged, whatever the actual tagging).
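The two bounds can be illustrated with a minimal sketch. It assumes each reference token has already been aligned with the tagger output and reduced to two flags; the function name and the input layout are hypothetical and are not taken from the corpus2 evaluation script.

```python
def accuracy_bounds(tokens):
    """Return (lower_bound, upper_bound) tagging accuracy.

    tokens: list of (tag_correct, seg_changed) pairs, one per reference
    token. This layout is an assumption made for the illustration.
    """
    total = len(tokens)
    if total == 0:
        raise ValueError("no reference tokens")
    # Tokens with unchanged segmentation that received the right tag.
    correct_unchanged = sum(1 for ok, changed in tokens if ok and not changed)
    # Tokens affected by a segmentation change, whatever their tagging.
    seg_changed = sum(1 for _, changed in tokens if changed)
    lower = correct_unchanged / total                  # changes count as errors
    upper = (correct_unchanged + seg_changed) / total  # changes count as correct
    return lower, upper
```

With ten reference tokens of which eight are tagged correctly, one is mistagged, and one is affected by a segmentation change, the lower bound is 0.8 and the upper bound is 0.9.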

The disambiguation accuracy as reported in the LTC'11 paper was quite high, roughly 93%. The ‘real’ figures, i.e. tagging accuracy, are much lower: around 90%.

An implementation of the plain-text evaluation procedure may be found in the corpus2 repository (utils/ This script is to compare real tagger output (obtained through plain text tagging) against the reference/gold standard data. The implementation is able to deal with segmentation changes.

1 Adam Radziszewski and Szymon Acedański, “Taggers gonna tag: an argument against evaluating disambiguation capacities of morphosyntactic taggers”. To appear in the proceedings of TSD 2012. Draft: taggereval.pdf

Algorithm overview

Assumption: the input to be tagged is already morphologically analysed. That is, tokens are assigned sets of (tag, lemma) pairs. The same holds for training, but there one¹ such pair is selected as the correct tagging (the disamb interpretation).

Tokens are divided into known words and unknown words. During tagging, tokens not recognised by the analyser are the unknown words, that is the tokens containing just the “unknown” tag. Which tokens are “unknown” in the training data is less obvious; here we assume that those tokens should be marked by containing an additional “unknown tag” in the set of theoretically possible tags. See next section for details.

Our idea is (1) to exploit the disamb tags assigned to unknown words in the training data and (2) to create separate case bases for known and unknown words.


Training:
  1. A frequency list of correct tags for unknown words is gathered from the training data. The list is essentially a mapping: disamb tag that was assigned to an unknown word -> count. If a token was assigned more than one disamb tag, one is chosen arbitrarily (‘tagset-first’). The list is trimmed to tags appearing at least U times (default: 1) and the tags are stored along with the trained model.
  2. All the tags from the list are appended to tokens recognised as unknown words (the unknown tag remains, to keep such tokens distinguishable for the classifier) before proceeding with iteration over tagset attributes.
  3. As outlined in the paper, a frequency list of word forms is generated (here no distinction is made between known and unknown words).
  4. A separate case base is created for each tagset attribute (as in the paper), but unknown words get case bases of their own (e.g. there are case bases for WORDCLASS-KNOWN and WORDCLASS-UNKNOWN, GENDER-KNOWN and GENDER-UNKNOWN, and so on).

Tagging:
  1. All the tags from the list are appended to tokens recognised as unknown words (the unknown tag remains, to keep such tokens distinguishable for the classifier).
  2. The case base applicable to the given tagset attribute and to the known or unknown status of the word is used.

1 The input may contain more disamb interpretations, but one will be arbitrarily chosen (tagset-first).
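The steps above can be sketched roughly as follows. This is a simplified illustration, not the WMBT code: tokens are reduced to plain tag lists, the unknown tag is assumed to be called "ign", and all function names are hypothetical.

```python
from collections import Counter

UNKNOWN_TAG = "ign"  # assumed name of the "unknown" tag


def is_unknown(possible_tags):
    """A token counts as unknown if its tag set contains the unknown tag."""
    return UNKNOWN_TAG in possible_tags


def collect_unknown_tags(training_tokens, min_count=1):
    """Frequency list of disamb tags seen on unknown words.

    training_tokens: iterable of (possible_tags, disamb_tag) pairs.
    min_count plays the role of the U threshold (default: 1).
    """
    counts = Counter(
        disamb for possible, disamb in training_tokens if is_unknown(possible)
    )
    return [tag for tag, n in counts.most_common() if n >= min_count]


def add_guessed_tags(possible_tags, guessed):
    """Append the guessed tags to an unknown token; the unknown tag stays."""
    result = list(possible_tags)
    if is_unknown(possible_tags):
        result.extend(tag for tag in guessed if tag not in result)
    return result


def case_base_name(attribute, possible_tags):
    """Pick the case base for a tagset attribute and known/unknown status."""
    suffix = "UNKNOWN" if is_unknown(possible_tags) else "KNOWN"
    return f"{attribute}-{suffix}"
```

In this sketch the same `add_guessed_tags` call serves both training and tagging, which mirrors the fact that step 2 of training and step 1 of tagging describe the same operation.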

Data preparation (morphological re-analysis)

To mimic the conditions encountered at tagging time (when the tagger works on morphological analyser output), it is desirable to pre-process the training data with the same analyser.

The procedure goes as follows:
  1. The training data is turned to plain text.
  2. Plain text is fed through the morphological analyser and sentence splitter (for best results it should be exactly the same configuration as used when tagging).
  3. The analyser output is synced with the original training data:
    1. Tokenisation is taken from the original training data; should any segmentation change occur, the tokens subjected to it are taken from the original data intact.
    2. Other tokens (vast majority) are compared; if the disamb interpretation also appears in the reanalysed token, the reanalysed token is taken and the correct interpretations are marked as disamb.
    3. If the disamb interpretation is not present there, the tagger wouldn't be able to recover it, hence it's an unknown word. The token is marked as such: we retain only the disamb interpretation and add a non-disamb interpretation consisting of the “unknown” tag (the lemma is set to the token's orthographic form, lower-cased). This lets the tagger see the token the same way as when tagging plain text.
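The per-token part of the synchronisation (steps 3.2 and 3.3) might look like the sketch below. It covers only tokens whose segmentation is unchanged, represents interpretations as plain (tag, lemma) tuples, and assumes the unknown tag is called "ign"; none of this is taken from tools/reanalyse.

```python
UNKNOWN_TAG = "ign"  # assumed name of the "unknown" tag


def sync_token(orth, disamb, reanalysed):
    """Merge one original token with its re-analysed counterpart.

    orth:       orthographic form of the token
    disamb:     (tag, lemma) pair marked as correct in the training data
    reanalysed: list of (tag, lemma) pairs produced by the analyser
    Returns a list of (tag, lemma, is_disamb) triples.
    """
    if disamb in reanalysed:
        # Known word: keep the analyser output, mark the correct reading.
        return [(tag, lemma, (tag, lemma) == disamb) for tag, lemma in reanalysed]
    # Unknown word: keep only the disamb reading and add the unknown tag
    # with the lower-cased orthographic form as its lemma.
    return [
        (disamb[0], disamb[1], True),
        (UNKNOWN_TAG, orth.lower(), False),
    ]
```

Tokens affected by a segmentation change would bypass this function entirely and be copied from the original data, as described in step 3.1.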

Note: for cross-validation, every training part should be subjected to this procedure. This is implemented in tools/reanalyse (the script requires MACA).

Comparison with the previous (no guess) results

Tests on NKJP 1.0, folds generated using whole paragraphs

The tests have been performed following the methodology proposed in taggereval.pdf, using exactly the same data set and set-up as described there.

Tagger        Re-analysis  Acc lower bound  Acc upper bound  Acc lower known  Acc lower unknown
PANTERA       no           88.79%           89.09%           91.08%           14.70%
PANTERA       yes          88.99%           89.28%           91.27%           14.74%
WMBT noguess  no           87.50%           87.82%           89.78%           13.57%
WMBT noguess  yes          88.75%           89.08%           91.07%           13.62%
WMBT guess    no           88.44%           88.76%           89.89%           41.43%
WMBT guess    yes          89.71%           90.04%           91.20%           41.45%

PANTERA stands for the morphosyntactic tagger based on Brill's algorithm adapted for morphologically rich languages, using a threshold of 6 (recommended by the author).
WMBT noguess corresponds to WMBT with no guessing (as described in the LTC'11 paper).
WMBT guess is the version described here.

Re-analysis denotes whether the training data was synchronised with the output of the employed morphological analyser (yes) or was used intact (no).

taggereval.pdf (305 KB) Adam Radziszewski, 19 Jun 2012 08:52