To train the tagger, specify the configuration file (e.g. the bundled
nkjp_s2.ini) and a directory in which to store the model (the directory must exist and should preferably be empty;
path/to/model in the example), e.g.
wcrft/wcrft.py -d path/to/model -v config/nkjp_s2.ini --train path/to/train_nkjp.xml
Before training, you need to create an empty directory where the trained model will be placed (path/to/model). If you intend to redistribute the model (or get back to it later), it is good practice to put a brief text file inside the directory containing at least the following information:
- Which configuration was used (nkjp_s2.ini or another)
- Which training file was used (filename and/or short description, e.g. whether it was reanalysed and with what configuration)
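The preparation step above could be automated along the following lines. This is only a minimal sketch: the function name prepare_model_dir and the file name README.txt are hypothetical and not part of WCRFT; any file name will do.

```python
from pathlib import Path


def prepare_model_dir(model_dir, config_name, training_note):
    """Create an empty model directory and put a short description file in it."""
    path = Path(model_dir)
    path.mkdir(parents=True, exist_ok=True)
    # The directory should be empty before training, so refuse to reuse one
    # that already holds files.
    if any(path.iterdir()):
        raise ValueError(f"{path} is not empty")
    readme = path / "README.txt"  # hypothetical file name
    readme.write_text(
        f"Configuration: {config_name}\n"
        f"Training data: {training_note}\n",
        encoding="utf-8",
    )
    return readme
```

For example, prepare_model_dir("path/to/model", "nkjp_s2.ini", "train_nkjp.xml, reanalysed") would create the directory and record both pieces of information before the tagger is run.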
To mimic the conditions encountered at tagging time (when the tagger processes the output of a morphological analyser), it is desirable to reanalyse the training data with the same analyser. The recommended procedure is as follows:
- Convert the training data to plain text.
- Feed the plain text through the morphological analyser and sentence splitter (it should be exactly the same configuration as used when tagging).
- Sync the analyser output with the original training data:
  - Take tokenisation and sentence division from the original training data. Should any segmentation change occur, take the affected tokens from the original data intact (such cases should be infrequent, so they are not worth handling separately).
  - Other tokens (the vast majority) may be compared directly; if the disamb interpretation also appears in the reanalysed token, take the reanalysed token and mark the correct interpretation(s) as disamb.
  - If the disamb interpretation is not present there, the tagger would not be able to recover it, so the token is effectively an unknown word. It should be marked as such: retain only the disamb interpretation and add a non-disamb interpretation consisting of the “unknown” tag (with the lemma set to the token's orthographic form, lower-cased). This lets the tagger see the token the same way as when tagging plain text.
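The per-token part of the synchronisation step can be sketched as follows. This is an illustration only, not the code of reanalyse.py: it works on plain tuples rather than on real corpus objects, the names Interp and sync_token are hypothetical, the default unknown tag "ign" is an assumption to be adjusted to the tagset in use, and interpretations are assumed to match when both lemma and tag are equal. Tokens affected by segmentation changes are not covered here, since they are copied from the original data intact.

```python
from typing import List, NamedTuple


class Interp(NamedTuple):
    """One morphological interpretation of a token (hypothetical structure)."""
    lemma: str
    tag: str
    disamb: bool = False


def sync_token(orth: str, original: List[Interp], reanalysed: List[Interp],
               unknown_tag: str = "ign") -> List[Interp]:
    """Merge the gold-standard interpretations with the analyser output."""
    gold = {(i.lemma, i.tag) for i in original if i.disamb}
    found = {(i.lemma, i.tag) for i in reanalysed}
    if gold <= found:
        # Known word: keep the reanalysed interpretations and mark the
        # correct ones as disamb.
        return [Interp(i.lemma, i.tag, (i.lemma, i.tag) in gold)
                for i in reanalysed]
    # Unknown word: keep only the disamb interpretation(s) and add a
    # non-disamb "unknown" interpretation with the lower-cased form as lemma.
    kept = [i for i in original if i.disamb]
    return kept + [Interp(orth.lower(), unknown_tag, False)]
```

In this sketch a token counts as known only if every disamb interpretation is found in the analyser output; that is a simplifying assumption for the rare case of several disamb interpretations.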
This procedure for NKJP is implemented in
tools/reanalyse. The script requires MACA to be installed with Morfeusz support.
Reanalysis for the NKJP tagset requires Morfeusz SGJP and the guesser from TaKIPI (it is sufficient to install the corpus1 library from TaKIPI). If another MACA configuration and/or tagset is to be used, please change the values of the variables defined in the script or use reanalyse.py directly (the reanalyse script is no more than a simple call to reanalyse.py).
The script should be run against the training data. If the training is part of tagger cross-validation, feed each training part through the script separately.
tools/reanalyse nkjp_folds/train01.xml reana/train01.xml
wcrft/wcrft.py -d path/to/nkjp_model -v config/nkjp_s2.ini --train reana/train01.xml
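For cross-validation, the two commands can be repeated for every fold. The sketch below only assembles and runs the same command lines as shown above; the function names, the number of folds and the per-fold model directories (models/fold01, ...) are assumptions made for illustration, and each model directory must already exist before training.

```python
import subprocess


def fold_commands(fold, config="config/nkjp_s2.ini"):
    """Return the reanalysis and training commands for one fold."""
    name = f"train{fold:02d}.xml"
    reana = f"reana/{name}"
    model_dir = f"models/fold{fold:02d}"  # hypothetical per-fold directory
    return [
        ["tools/reanalyse", f"nkjp_folds/{name}", reana],
        ["wcrft/wcrft.py", "-d", model_dir, "-v", config, "--train", reana],
    ]


def run_folds(n_folds):
    """Reanalyse and train on each training part in turn."""
    for fold in range(1, n_folds + 1):
        for cmd in fold_commands(fold):
            # check=True stops the loop as soon as a command fails.
            subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the execution makes it easy to print the commands first and inspect them before starting a long training run.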