Training and reanalysis
To train the tagger, specify the configuration file (e.g. the bundled nkjp-guess.ini) and a directory to store the model (nkjp_model in the example):
wmbt/wmbt.py -d path/to/nkjp_model -v config/nkjp-guess.ini --train path/to/train_nkjp.xml
To mimic the conditions at tagging time, when the tagger processes morphological analyser output, it is advisable to re-analyse the training data with the same analyser. The recommended procedure is as follows:
- Convert the training data to plain text.
- Feed the plain text through the morphological analyser and sentence splitter (use exactly the same configuration as when tagging).
- Synchronise the analyser output with the original training data:
  - Take tokenisation and sentence division from the original training data. Should any segmentation change occur, take the affected tokens intact from the original data (such cases should be infrequent, so they are not worth special handling).
  - Other tokens (the vast majority) can be compared directly: if the disamb interpretation also appears in the reanalysed token, take the reanalysed token and mark the correct interpretations as disamb.
  - If the disamb interpretation is not present there, the tagger would not be able to recover it, so the token is effectively an unknown word and should be marked as such: retain only the disamb interpretation and add a non-disamb interpretation consisting of the “unknown” tag (its lemma is set to the token's orthographic form, lower-cased). This lets the tagger see the token the same way as when tagging plain text.
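The per-token synchronisation rules above can be illustrated with a minimal Python sketch. It is not the code from reanalyse.py: the Interp and Token classes, the sync_token function, the comparison by (lemma, tag) pair and the default unknown tag "ign" are all assumptions made for illustration. A reanalysed token of None stands for a token affected by a segmentation change.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Interp:
    """One morphological interpretation (hypothetical data model)."""
    lemma: str
    tag: str
    disamb: bool = False


@dataclass
class Token:
    orth: str
    interps: List[Interp]


def sync_token(orig: Token, reana: Optional[Token],
               unknown_tag: str = "ign") -> Token:
    """Merge one original token with its reanalysed counterpart.

    unknown_tag defaults to "ign", which is an assumption about the tagset.
    """
    # Segmentation changed: keep the original token intact.
    if reana is None or reana.orth != orig.orth:
        return orig

    gold = {(i.lemma, i.tag) for i in orig.interps if i.disamb}
    offered = {(i.lemma, i.tag) for i in reana.interps}

    # The analyser offers the correct interpretation(s): take the
    # reanalysed token and mark those interpretations as disamb.
    if gold <= offered:
        marked = [replace(i, disamb=(i.lemma, i.tag) in gold)
                  for i in reana.interps]
        return Token(reana.orth, marked)

    # Otherwise treat the token as an unknown word: keep only the disamb
    # interpretation(s) and add a non-disamb "unknown" interpretation
    # whose lemma is the lower-cased orthographic form.
    kept = [i for i in orig.interps if i.disamb]
    kept.append(Interp(orig.orth.lower(), unknown_tag, False))
    return Token(orig.orth, kept)
```

For example, sync_token(Token("Ala", [Interp("Ala", "subst:sg:nom:f", True)]), Token("Ala", [Interp("ala", "adj")])) keeps the original disamb interpretation and adds a non-disamb interpretation with lemma "ala" and the unknown tag.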
This procedure is implemented in tools/reanalyse. The script requires MACA.
By default, it uses Morfeusz SGJP and the guesser from TaKIPI (installing the corpus1 library from TaKIPI is sufficient). If another MACA configuration and/or tagset is to be used, change the variable values defined in the script or use reanalyse.py directly (the reanalyse script is nothing more than a simple call to reanalyse.py).
The script should be run against the training data. If the tagger is being trained for cross-validation, feed each training part through the script separately.
tools/reanalyse nkjp_folds/train01.xml reana/train01.xml
wmbt/wmbt.py -d path/to/nkjp_model -v config/nkjp-guess.ini --train reana/train01.xml
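For cross-validation, the same pair of commands has to be repeated for every training part. The sketch below only builds the command lines and runs nothing. The fold_commands helper, the two-digit fold numbering beyond train01 and the per-fold model directory names are assumptions; the options are the ones shown above.

```python
import shlex
from typing import List


def fold_commands(fold: str, model_dir: str,
                  config: str = "config/nkjp-guess.ini") -> List[List[str]]:
    """Return the reanalysis and training commands for one training part.

    The file naming (nkjp_folds/trainNN.xml, reana/trainNN.xml) follows the
    example above; model_dir is chosen by the caller.
    """
    src = f"nkjp_folds/train{fold}.xml"
    dst = f"reana/train{fold}.xml"
    return [
        ["tools/reanalyse", src, dst],
        ["wmbt/wmbt.py", "-d", model_dir, "-v", config, "--train", dst],
    ]


def print_commands(n_folds: int) -> None:
    """Print the commands for folds 01..n_folds, one per line."""
    for n in range(1, n_folds + 1):
        fold = f"{n:02d}"
        # Hypothetical naming: a separate model directory for each fold.
        for cmd in fold_commands(fold, f"path/to/nkjp_model{fold}"):
            print(shlex.join(cmd))
```

Calling print_commands(10) would list the commands for ten folds (the number of folds is an assumption). A separate model directory per fold is used here so that the folds do not share one model directory.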