Training and reanalysis

To train the tagger, specify a configuration file (e.g. the bundled nkjp-guess.ini) and a directory in which to store the model (nkjp_model in the example below):

wmbt/wmbt.py -d path/to/nkjp_model -v config/nkjp-guess.ini --train path/to/train_nkjp.xml

At tagging time the tagger works on the output of a morphological analyser. To mimic these conditions during training, it is desirable to re-analyse the training data with the same analyser.

The recommended procedure is as follows:
  1. Convert the training data to plain text.
  2. Feed the plain text through the morphological analyser and sentence splitter, using exactly the same configuration as will be used when tagging.
  3. Sync the analyser output with the original training data:
    1. Take tokenisation and sentence division from the original training data. Should any segmentation change occur, take the affected tokens from the original data intact (such cases should be infrequent, so they are not worth further effort).
    2. Other tokens (the vast majority) can be compared directly; if the disamb interpretation also appears in the reanalysed token, take the reanalysed token and mark the correct interpretations as disamb.
    3. If the disamb interpretation is not present there, the tagger would not be able to recover it, so the token is effectively an unknown word. It should be marked as such: retain only the disamb interpretation and add a non-disamb interpretation consisting of the “unknown” tag (with the lemma set to the token's orthographic form, lower-cased). This lets the tagger see the token the same way as when tagging plain text.
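
The per-token part of step 3 can be sketched in Python as below. This is only an illustration, not the code of tools/reanalyse: the Interp and Token classes and the sync_token function are hypothetical, interpretations are assumed to be compared as (lemma, tag) pairs, and the default "ign" for the “unknown” tag is an assumption that depends on the tagset in use.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Interp:
    """One morphological interpretation (hypothetical structure)."""
    lemma: str
    tag: str
    disamb: bool = False


@dataclass
class Token:
    """A token with its orthographic form and interpretations (hypothetical)."""
    orth: str
    interps: List[Interp] = field(default_factory=list)


def sync_token(orig: Token, reana: Optional[Token],
               unknown_tag: str = "ign") -> Token:
    """Merge one original token with its reanalysed counterpart.

    reana is None when segmentation changed and there is no counterpart.
    unknown_tag is an assumed name for the tagset's "unknown" tag.
    """
    # Step 3.1: segmentation differs, so keep the original token intact.
    if reana is None or reana.orth != orig.orth:
        return orig

    gold = {(i.lemma, i.tag) for i in orig.interps if i.disamb}
    if not gold:
        # Nothing to project; leave the token as it was.
        return orig

    available = {(i.lemma, i.tag) for i in reana.interps}
    if gold <= available:
        # Step 3.2: take the reanalysed interpretations and mark the
        # correct ones as disamb.
        return Token(orig.orth, [
            Interp(i.lemma, i.tag, (i.lemma, i.tag) in gold)
            for i in reana.interps
        ])

    # Step 3.3: the analyser does not offer the correct interpretation,
    # so keep only the disamb one(s) and add an "unknown" interpretation
    # whose lemma is the lower-cased orthographic form.
    kept = [i for i in orig.interps if i.disamb]
    return Token(orig.orth, kept + [Interp(orig.orth.lower(), unknown_tag)])
```

Aligning the two token streams and detecting segmentation changes is left out of the sketch; the caller is assumed to pass None for tokens that have no one-to-one counterpart.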

This procedure is implemented in tools/reanalyse. The script requires MACA.

By default, it uses Morfeusz SGJP and the guesser from TaKIPI (installing the corpus1 library from TaKIPI is sufficient). To use another MACA configuration and/or tagset, change the variable values defined in the script or use reanalyse.py directly (the reanalyse script is no more than a simple call to reanalyse.py).

The script should be run against the training data. If the training is part of tagger cross-validation, feed each training part through the script separately.

E.g.

tools/reanalyse nkjp_folds/train01.xml reana/train01.xml
wmbt/wmbt.py -d path/to/nkjp_model -v config/nkjp-guess.ini --train reana/train01.xml
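
For cross-validation, the two commands above have to be repeated for every fold. The following Python sketch only builds and prints the command lines, so they can be inspected before being run. The fold_commands helper is hypothetical, and so are the assumptions it makes: ten folds, files named trainNN.xml as in the example, and a separate model directory per fold (models/foldNN).

```python
from typing import List


def fold_commands(fold: int,
                  config: str = "config/nkjp-guess.ini") -> List[List[str]]:
    """Return the reanalysis and training commands for one fold.

    The models/foldNN directory layout is an assumption made for this
    sketch; the other paths follow the example above.
    """
    name = f"train{fold:02d}.xml"
    reana = f"reana/{name}"
    return [
        # Re-analyse this fold's training part on its own.
        ["tools/reanalyse", f"nkjp_folds/{name}", reana],
        # Train on the reanalysed data, one model directory per fold.
        ["wmbt/wmbt.py", "-d", f"models/fold{fold:02d}",
         "-v", config, "--train", reana],
    ]


if __name__ == "__main__":
    # Assumed: ten folds, numbered 01..10.
    for fold in range(1, 11):
        for cmd in fold_commands(fold):
            print(" ".join(cmd))
```

Each inner list can also be passed to subprocess.run(cmd, check=True) to execute the commands instead of printing them.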