Lemmatisation

The present experiments are run against NKJP 1.0.
It would be better to test against NKJP 1.1, but I can't get Pantera to run again. As before, it dislikes my locale and segfaults on any attempt to run it, whether tagging or invoked without arguments. This is particularly frustrating since I'm testing on a virgin Ubuntu 12.04 installation (in a VM) with Pantera installed straight from Bartosz Zaborowski's PPA. To make things worse, the PPA distributes Morfeusz Polimorf, even though the Pantera model was trained on NKJP annotated with Morfeusz SGJP.

Important note

While trying to add an explicit lemmatisation layer to the code, I spotted a possibly serious bug in the WCRFT code. The bug concerns deciding whether any decision problem was present at a given layer (if not, no classifier was trained, as such a layer would not contribute anyway). The bug could have caused many layers to be misjudged as having no data and therefore be skipped. Here is a fragment of the corrected code, currently on a working branch.


        all_attr_vals = corpus2.mask_token(tok, attr_mask, False)
        num_attr_vals = corpus2.mask_card(all_attr_vals)
        disamb_attr_vals = corpus2.mask_token(tok, attr_mask, True)
        # disamb_attr_vals.is_null() is possible, as for some
        # attrs no value is given
        class_label = corpio.mask2text(self.tagset, disamb_attr_vals)
        # generate training example and store to file
        classify.write_example(tr_file, feat_vals, class_label)
        self.stats.num_evals += 1
        # corrected check: the layer counts as having data whenever
        # any token carries a disamb value for this attribute
        got_data_here = (not disamb_attr_vals.is_null())
        if got_data_here:
            got_data = True
            # remove lexemes with non-disamb attr values
            # (a lexeme must be equal to the mask to be kept)
            corpus2.disambiguate_equal(tok, all_attr_vals, disamb_attr_vals)
classify.write_end_of_sent(tr_file)
if got_data: attr_met.add(attr_name)

Tiny-scale experiments show that this makes a difference, so the whole WCRFT evaluation should probably be repeated.

The difference on the whole data set is not very big (and in fact the fixed version scores slightly worse):

r_wcrft_reana_ng_s2_sepX2013.txt:AVG weak lemma lower bound    96.1465%
r_wcrft_reana_ng_s2_sepX2013.txt:AVG strong lemma lower bound    94.2815%

r_wcrft_fixd_sepX2013.txt:AVG weak lemma lower bound    96.1440%
r_wcrft_fixd_sepX2013.txt:AVG strong lemma lower bound    94.2786%

r_wcrft_reana_ng_s2_sepX2013.txt:AVG weak corr lower bound    90.8001%

r_wcrft_fixd_sepX2013.txt:AVG weak corr lower bound    90.7950%

Where are the data

bauer:~/NKJP-10

PYTHONIOENCODING=utf8 ~/workspace/corpus2/utils/tagger-eval.py kubaw/test??.plain.xces folds/test??.xml | tee r_kubaw_oct2013.txt # data copied from spock

PYTHONIOENCODING=utf8 ~/workspace/corpus2/utils/tagger-eval.py wmbt/tagd??.xml folds/test??.xml | tee r_wmbt_oct2013.txt

PYTHONIOENCODING=utf8 ~/workspace/corpus2/utils/tagger-eval.py pantera/f??/out.xml folds/test??.xml | tee r_pantera_reana_oct2013.txt

spock:/mnt/synat2a/eliasz/NKJP-10

PYTHONIOENCODING=utf8 ../../../synat1a/eliasz/corpus2/utils/tagger-eval.py wcrft_reana_ng_s2/tagd??.xml folds/test??.xml | tee r_wcrft_reana_ng_s2_oct2013.txt

Baseline lemma results

Tagger      Lem weak lower  Lem strong lower  POS strong=weak lower  Tag weak lower
Pantera     94.7839%        94.7839%          95.4689%               88.9866%
WMBT        96.0368%        94.1837%          96.7464%               89.7126%
WCRFT (s2)  96.1465%        94.2815%          97.1440%               90.8001%
Concraft    94.8908%        93.0335%          97.1134%               91.1187%

Tag weak lower (AVG weak corr lower bound) != tag strong lower (AVG strong corr lower bound) because of optional attributes: the decision whether to select them may differ between the morphological output and the reference corpus.
POS strong lower corr = POS weak lower corr because this phenomenon is limited to optional attributes and does not affect POS at all (nor does it affect most substantial attributes).
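
To make the weak/strong distinction concrete, here is a minimal per-token sketch. The data shapes and the exact definitions (weak = any overlap between the tagger's and the reference lemma sets, strong = exact set match) are my assumptions for illustration, not tagger-eval.py internals; note the optional case-insensitive comparison (cf. the letter-case notes below).

# Sketch: weak v. strong lemma correctness for one token.
def lemma_hits(tagger_lemmas, ref_lemmas, case_sensitive=True):
    if not case_sensitive:
        tagger_lemmas = set(l.lower() for l in tagger_lemmas)
        ref_lemmas = set(l.lower() for l in ref_lemmas)
    weak_hit = bool(tagger_lemmas & ref_lemmas)   # any shared lemma
    strong_hit = (tagger_lemmas == ref_lemmas)    # exact set match
    return weak_hit, strong_hit

# A tagger that left two lemmas is weakly but not strongly correct:
print(lemma_hits(set(['mur', 'mura']), set(['mur'])))  # (True, False)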

CST lemmatiser

The CST lemmatiser (cstlemma) is a tool from the Centre for Language Technology (Center for Sprogteknologi) at the University of Copenhagen.
Project site: http://cst.dk/online/lemmatiser/uk/
On GitHub: http://github.com/kuhumcst/cstlemma

The CST lemmatiser was shown to perform better than any other lemmatiser or tagger on Croatian data (Agić et al., 2013, PDF). This is why it is worth testing here.

CST is trained on a morphological dictionary only. The paper cited above suggests that results are better when the dictionary is derived from the training corpus, without using any external morphological dictionary (although the one they had was pretty small, much smaller than that of Morfeusz). Anyway, I'm assuming this is the way to go, as we don't want to deal with differences between the Morfeusz dictionary and the real data (see the notes on reanalysis in Training).

Note: CST lowercases lemmas by default. This is why the Croatians lowercased their training/testing data. It is important to evaluate lemmatisation in a case-insensitive manner (and to state this explicitly), as the tool won't handle letter case.

Training CST (courtesy of Nikola Ljubešić)

The command line used by the Croatians to train a CST model:

cstlemma  -D -cFBT -i setimes.types -eU -N setimes.freq -nNFT -o setimes.cstdict

setimes.types contains a list of ‘types’ (orths) with their lemmas and MSDs:

,       ,       Z
.       .       Z
je      biti    Vcr3s
i       i       Cc
u       u       Sl
"       "       Z
za      za      Sa
se      sebe    Px--sa--ypn
su      biti    Vcr3p
kako    kako    Cs

setimes.freq contains ‘types’ (orths) and their MSDs, together with their frequencies:

9099    ,       Z
7508    .       Z
5707    je      Vcr3s
4502    i       Cc
4347    u       Sl
3119    "       Z
2380    za      Sa
2154    se      Px--sa--ypn
1724    su      Vcr3p
1509    kako    Cs

Training results in three files being created: setimes.cstdict, setimes.cstpats and setimes.cstpats0, the last two being identical.
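
To make the two input formats above concrete, here is a rough sketch of deriving setimes.types and setimes.freq from a tagged training corpus. The input shape (lists of (orth, lemma, MSD) triples) and the tab separator are my assumptions; this is not the Croatians' actual script.

from collections import Counter

# Toy input; in reality these triples come from the training corpus.
sentences = [
    [('je', 'biti', 'Vcr3s'), ('i', 'i', 'Cc'), ('.', '.', 'Z')],
    [('su', 'biti', 'Vcr3p'), ('.', '.', 'Z')],
]

type_entries = set()     # one setimes.types line per distinct triple
freq_counts = Counter()  # (orth, MSD) frequencies for setimes.freq

for sentence in sentences:
    for orth, lemma, msd in sentence:
        type_entries.add((orth, lemma, msd))
        freq_counts[(orth, msd)] += 1

with open('setimes.types', 'w') as out:
    for orth, lemma, msd in sorted(type_entries):
        out.write('%s\t%s\t%s\n' % (orth, lemma, msd))

with open('setimes.freq', 'w') as out:
    for (orth, msd), count in freq_counts.most_common():
        out.write('%d\t%s\t%s\n' % (count, orth, msd))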

Using a trained CST model to lemmatise tagged text (courtesy of Nikola Ljubešić)

Commandline:

cstlemma -L -i corpus.tagtemp -f setimes.cstpats -d setimes.cstdict -eU -o corpus.taglem

The file corpus.tagtemp is prepared in the CST-loving format, which looks like this:

Stručnjaci/Ncmpn
navode/Vmr3p
kako/Cs
će/Var3s
metalurški/Agpmsny
sektor/Ncmsn
u/Sl
Makedoniji/Npfsl
tijekom/Ncmsi
2009/Mdc

To keep the sentence delimiters (CST kills empty lines, yes!), we add "<s/>/Z" between sentences and afterwards convert the lemmatised result back to empty lines with sed -r -i 's/<s\/> .*? Z//g' corpus.taglem.

Watch out for slashes or spaces in tokens; if I remember correctly, these crash the procedure.
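
A sketch of the conversion into this format, with the sentence-delimiter trick and a guard for the unsafe tokens just mentioned. The input shape (lists of (orth, MSD) pairs) is assumed, and whether to raise an error (rather than escape the token) is an open choice:

def write_tagtemp(sentences, path):
    with open(path, 'w') as out:
        for sentence in sentences:
            for orth, msd in sentence:
                # slashes or spaces in a token would break the
                # orth/MSD lines (see the warning above)
                if '/' in orth or ' ' in orth:
                    raise ValueError('token unsafe for CST: %r' % orth)
                out.write('%s/%s\n' % (orth, msd))
            # CST drops empty lines, so mark sentence ends with the
            # dummy token, removed again by the sed command above
            out.write('<s/>/Z\n')

write_tagtemp([[('kako', 'Cs'), ('.', 'Z')]], 'corpus.tagtemp')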

I know that CST can also work without MSDs, but with much lower accuracy; working with POS alone would, I assume, hurt accuracy as well.

Problems and experiment plans

Take the four most recent Polish taggers (Concraft, WCRFT, WMBT, Pantera) and assess their lemmatisation accuracy.

It has been a long time since lemmatisation was assessed on a large scale for Polish. Cf. Jongejan and Dalianis (2009), who report 93.88%, but they seem to be testing on artificial data (the morphological dictionary itself, without any real text).

1. Refer to tagger testing (gonna tag): a similar problem arises there with segmentation changes. Define a lemmatisation accuracy lower bound. The upper bound could be defined as there, ignoring lemmas where segmentation changed, but it may be defined more naturally by actually checking lemma concatenation (see the sketch after this list).
2. Problem: the taggers focus on disambiguation and may actually output multiple lemmas per token. This happens fairly often, so it makes sense to study weak/strong variants. In practice this is devastating.
3. Random selection of one lemma?
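
As an illustration of the concatenation check from point 1, a minimal sketch; the alignment of tagger and reference tokens into segments is assumed to exist already, as in the tagger evaluation:

def concat_lemma_match(tagger_lemmas, ref_lemmas):
    # Upper-bound check over one aligned segment: do the
    # concatenated lemmas match, even if segmentation differs?
    return (''.join(tagger_lemmas).lower()
            == ''.join(ref_lemmas).lower())

# Segmentation differs (one token v. two), but concatenations agree:
print(concat_lemma_match(['dluga'], ['dlu', 'ga']))  # True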

Weak / strong lemma: the taggers may leave ambiguous lemmas

AVG weak lemma lower bound v. AVG strong lemma lower bound

Case-sensitivity: Proper v. proper

Seg change: Lower bound / cat-heur

Seg change v. letter case: which should we take?