Evaluation procedure

This page describes the technical procedure that was performed to evaluate the tagger (the same procedure applies to other taggers).

It is assumed that you have installed the corpus2 library along with its Python wrappers (corpus2 with Python wrappers is also a requirement of the WCRFT tagger itself).
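A quick way to check that the wrappers are visible to your Python interpreter (assuming the wrapper module is exposed under the name corpus2, as in a standard installation):

$ python -c "import corpus2"

If this exits silently (no ImportError), the bindings are in place.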

Motivation for this procedure may be found in Radziszewski and Acedański, 2012.

Data and conversion

Obtain the desired version of the NKJP corpus (the most recent version is probably here). Unpack it. WCRFT (and our other tools) won't process the TEI format directly (this is unlikely to change, for good reasons). First, you need to convert the data into the XCES format. This may be done using the tei2xces.py script that accompanies the PANTERA tagger. You don't need to install PANTERA or even get the whole sources; the script may be obtained directly from here. Note: the script parses TEI using regexes, so if someone one day decides to change the XML formatting of NKJP, the conversion may silently break. You should run at least basic sanity checks after conversion.

$ cd NKJP… # where it was unpacked

# generate morph.xml files (each file will be in XCES format)
$ find -name "ann_mor*.xml" -exec ~/pantera/scripts/tei2xces.py {} \;

# sanity check -- check the pre-conversion number of tokens
$ time find -name "ann_mor*.xml" -exec cat {} \; | grep -c "</seg>" 
1215513
# …it should be exactly the same post conversion
$ find -name "morph.xml" -exec cat {} \; | grep -c "</tok>" 
1215513

# merge'em
$ find -name "morph.xml" > morphs.txt
$ corpus-merge -t nkjp --input-list=morphs.txt -o xces,flat -C > merged.xml

This will result in the whole corpus being dumped into one huge XCES file (merged.xml here; you may want to include the corpus version in the name, e.g. nkjp11-merged.xml).
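As a final sanity check, the token count of the merged file should still match the figure obtained above (assuming the XCES writer keeps each token on its own line, as in the per-file check):

$ grep -c "</tok>" merged.xml
1215513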

Division into folds

For ten-fold cross-validation you will need an explicit division into training and testing parts. This may be done using the utils/parfolds.py script from the corpus2 repository (the script is not installed system-wide when installing the corpus2 library; just run it straight from your repository copy).

$ mkdir folds
$ time /path/to/local-corpus2-repository-copy/utils/parfolds.py -v nkjp11-merged.xml folds/

This should take a couple of minutes. The result is ten files named train##.xml and ten files named test##.xml, each in XCES format. Each train–test pair with the same number corresponds to one train-and-test run: train01 consists of the paragraphs that belong to all test parts except 01, and the same relationship holds for the other parts.
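Since each train–test pair covers the whole corpus, you can also sanity-check the division by counting tokens per pair; every count should equal the total obtained before conversion (1215513 for the version used here):

$ for i in $(seq -w 10); do
      echo -n "fold $i: "
      cat folds/train$i.xml folds/test$i.xml | grep -c "</tok>"
  done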

The final result of ten-fold cross-validation is the value of the selected measure averaged over all folds (we endorse using the accuracy lower bound for tagger evaluation, as discussed below).

Morphological reanalysis

The original NKJP corpus contains a set of ‘all possible tags’ for each token. When tagging real input (plain text), one does not have such information. To perform a fair evaluation, we have to discard all this information from the files that will be tagged and re-introduce it by running a morphological analyser. This introduces real-life analyser and tokenisation errors, so the setup reflects the actual tagger+analyser behaviour and, hence, the obtained results reflect real-life performance of the tools.

The procedure assumes that you have Maca and a selected version of Morfeusz installed (Maca and Morfeusz are required to tag plain text with WCRFT anyway). Before proceeding, make sure that the Morfeusz version you want to test (normally a recent one) is installed system-wide. The choice of Morfeusz version should be a conscious one, as it may significantly impact tagger performance, both during evaluation and in real-life applications.

  1. You need to create plain-text variants of each folds/test##.xml file. This may be done with corpus2/utils/corptext.py folds/test01.xml folds/test01.txt (and so on).
  2. Create a directory to store reanalysed test data (mkdir testana). Use Maca to analyse the plain-text test files (maca-analyse -qs morfeusz-nkjp-official -o xces < folds/test01.txt > testana/test01.xml; you may want to use a different config, e.g. polimorf-nkjp if using Morfeusz Polimorf). You need to analyse all test files this way; a loop covering all folds is sketched after this list.
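Both steps can be run over all folds at once. A minimal sketch, assuming the corpus2 repository is checked out under corpus2/ and the default Maca config (adjust the paths and config name to your setup):

#!/bin/bash
# For each test fold: dump the XCES file to plain text,
# then reanalyse the plain text with Maca.
mkdir -p testana
for i in $(seq -w 10); do
    corpus2/utils/corptext.py folds/test$i.xml folds/test$i.txt
    maca-analyse -qs morfeusz-nkjp-official -o xces \
        < folds/test$i.txt > testana/test$i.xml
done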

For best tagger performance (this actually concerns any tagger that peeks at the set of ‘possible tags’ in training data, not only WCRFT) you should reanalyse the training data as well. This idea and the procedure are explained in detail here.

There is a script in the WCRFT repository (wcrft/tools) to perform this. It consists of two parts: a simple convenience bash script that runs everything (named reanalyse) and an underlying Python script that synchronises Maca output with the original training data to keep the information on manually selected tags (called ‘disamb lexemes’ in the XCES format). The reanalyse bash script specifies the Maca configuration name, which designates the Morfeusz version to run. By default this is MACA_CONF=morfeusz-nkjp-official, which works well with Morfeusz SGJP installed. This config name should be the same as the one you used during reanalysis of the test data, so change it if necessary. The script should be run with two arguments: the original training file and the resulting reanalysed training file, e.g. ./reanalyse folds/train01.xml reana/train01.xml.
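Again, this can be looped over all folds (a sketch, assuming the reanalyse script sits in the working directory, as in the example above):

$ mkdir reana
$ for i in $(seq -w 10); do ./reanalyse folds/train$i.xml reana/train$i.xml; done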

From here on we assume that after this procedure you'll have ten reana/train##.xml files (and, similarly, ten testana/test##.xml files, as described earlier in this section).

Training the tagger

You have ten re-analysed training files: reana/train##.xml. Each of these should be used to train a separate tagger model.

To train WCRFT, you can use the following bash script (train_wcrft). A similar script may be used to train other taggers (those that accept XCES input).

#!/bin/bash

# Trains tagger in parallel mode.
# Assumes the training data is available in reana/train##.xml files (one per fold).
# This data should result from running wcrft/tools/reanalyse against each train##.xml
# file (each training fold).
# Writes models for each fold to $SUBDIR/f##/.

JOBS=3 # number of taggers to train in parallel
NICE=3 # niceness of tagger training process
SUBDIR=wcrft_reana # here the trained models will be stored

CONF=nkjp_s2.ini

rm -rf $SUBDIR

seq -w 10 | parallel --nice $NICE -j $JOBS "
        echo '--- {} ---' && mkdir -p $SUBDIR/f{} &&
        wcrft --train $CONF -d $SUBDIR/f{} reana/train{}.xml"

The script may take many hours to complete, so consider running it within a screen session or something similar. The result is ten directories containing trained tagger models: wcrft_reana/f##.

Tagging with trained models

The next step is to use the trained models to tag the reanalysed test data (testana/test##.xml). You can use the following script (tag_wcrft).

#!/bin/bash

# Tags using WCRFT trained with train_wcrft script (see there).
# Assumes testana/test##.xml is available for each fold (01..10).
# Those files should result from reanalysing plain text files.
# This is to obtain morpho analysis of plain-text test files
# (test##.txt).

# Output is written to $SUBDIR/tagd##.xml.
# To assess tagger performance, use:
# PYTHONIOENCODING=utf8 corpus2/utils/tagger-eval.py wcrft_reana/tagd??.xml folds/test??.xml | tee results.txt

JOBS=3
NICE=3
SUBDIR=wcrft_reana

CONF=nkjp_s2.ini

seq -w 10 | parallel --nice $NICE -j $JOBS "
        echo '--- {} ---' &&
        wcrft $CONF -d $SUBDIR/f{} testana/test{}.xml -O $SUBDIR/tagd{}.xml"

The result is ten tagged files: wcrft_reana/tagd##.xml (again, in XCES format).

Evaluation of tagged data against original test folds

Now that you have ten output files, one per testing part, what remains is to compare wcrft_reana/tagd##.xml (tagger output) to folds/test##.xml (original testing data). Note the way the tagged files were generated: the original test files were converted to plain text (hence all linguistic annotation was discarded; what remained was the division into paragraphs marked by double newline characters) and then re-analysed from scratch using Maca in the selected configuration. This corresponds to the very same situation you'll most likely come across when using a trained tagger in a real-life application: you'll have access to plain text where the only sensible markup available will at most be the division into paragraphs. Feeding plain text through Maca influences not only the possibility of correctly tagging rare/difficult words (without reanalysis, the proper tag was always present among the possible tags in the reference corpus), but also the possibility of getting the text correctly tokenised.

The final evaluation may be performed using tagger-eval.py script from corpus2 repository, e.g.:

PYTHONIOENCODING=utf8 corpus2/utils/tagger-eval.py wcrft_reana/tagd??.xml folds/test??.xml | tee results.txt

This will report a number of measures. The first records of the script output report performance for the subsequent folds. The last lines (beginning with AVG) contain values averaged over all folds.

Note on accuracy lower bound and weak correctness lower bound

The accuracy lower bound is reported here as “weak corr lower bound” (AVG) and “WC_LOWER” (per-fold). In fact, the measure actually obtained should be called “weak correctness lower bound” (the same holds for the upper bound). Weak correctness concerns the number of tokens where the set of tags output by the tagger and the set of tags in the reference corpus have a non-empty intersection. Strong correctness concerns only tokens where those sets are exactly the same. This naming was introduced by Szymon Acedański and Adam Przepiórkowski (TODO: reference). The distinction may seem totally irrelevant for corpora like NKJP and taggers like WCRFT, where both are guaranteed to assign exactly one correct tag per token.

However, this is not quite so. As a matter of fact, the NKJP tagset defines some optional attributes. This means that a tag may be valid when the attribute is given a value and also when no value is given at all. As noted by Danuta Karwańska, the presence or absence of values for such attributes may change between versions of the reference corpus or of the morphological analyser used, and such differences may not be very relevant in practice. Karwańska proposed to expand tags with omitted optional attribute values into a set of fully-specified tags covering all possible combinations. By default, this is done by the tagger-eval script: it expands unspecified attributes in both the tagger output and the reference corpus before comparing. This is the reason why values of weak correctness (lower/upper bound) may differ from strong correctness even if technically the tagger always outputs one tag per token and the same holds in the reference corpus (as in NKJP).
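To make this concrete, here is an illustrative Python sketch (not the actual tagger-eval.py code) showing how the expansion of an optional attribute makes a single-tag output weakly but not strongly correct; it uses the preposition vocalicity example discussed in the note below:

def expand(tag, optional_values=("wok", "nwok")):
    """Expand a tag with an omitted optional attribute into the set of
    all fully-specified variants; fully-specified tags stay as they are."""
    if any(tag.endswith(":" + v) for v in optional_values):
        return {tag}
    return {tag + ":" + v for v in optional_values}

tagger_output = expand("prep:loc")      # {'prep:loc:wok', 'prep:loc:nwok'}
reference = expand("prep:loc:nwok")     # {'prep:loc:nwok'}

weakly_correct = bool(tagger_output & reference)  # non-empty intersection: True
strongly_correct = tagger_output == reference     # exact set equality: False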

When we perform evaluation, we assume tags are always expanded this way (the script's default behaviour, inspired by Karwańska's observations). Also, we always report values of “weak correctness” (lower/upper bound, also when reporting values for known/unknown words). If you don't want the optional attributes to be expanded at all, you can use the -k switch of the script (see --help).

To check which attributes are optional in a tagset, you can use tagset-tool (an executable installed system-wide with corpus2) or look directly at the tagset definition file, e.g. nkjp.tagset (attributes optional for a given class are given in square brackets). Note that at least some of the optional attributes are lexically motivated. E.g., prepositions have an optional attribute named vocalicity that is valid only for some forms. The tagset would be mathematically purer if prepositions were divided into two separate grammatical classes depending on the presence of this attribute, but this would be impractical.
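For instance, to peek at the preposition definition directly (replace the placeholder with the location of nkjp.tagset in your corpus2 installation or repository copy):

$ grep -w prep /path/to/nkjp.tagset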