Present experiments will be run against NKJP 1.0.
It would be better to test against NKJP 1.1, but I can't manage to run Pantera again: it doesn't like my locale and segfaults on any attempt to run it, whether tagging or running without arguments. This is quite bad, since I'm testing on a virgin Ubuntu 12.04 installation (in a VM) with Pantera installed straight from Bartosz Zaborowski's PPA. To make things even worse, the PPA distributes Morfeusz Polimorf, even though the Pantera model was trained on NKJP annotated with Morfeusz SGJP.
While trying to add an explicit lemmatisation layer to the code, I've spotted a possibly serious bug in the WCRFT code. The bug concerns deciding whether any decision problem was present at a layer (if not, no classifier was trained, as such a layer would not contribute anyway). It could have caused many layers to be misjudged as having no data and therefore skipped. Here is a fragment of the corrected code, currently on a working branch.
```python
all_attr_vals = corpus2.mask_token(tok, attr_mask, False)
num_attr_vals = corpus2.mask_card(all_attr_vals)
disamb_attr_vals = corpus2.mask_token(tok, attr_mask, True)
# may be that disamb_attr_vals.is_null(), as for some
# attrs no value is given
class_label = corpio.mask2text(self.tagset, disamb_attr_vals)
# generate training example and store to file
classify.write_example(tr_file, feat_vals, class_label)
self.stats.num_evals += 1
got_data_here = (not disamb_attr_vals.is_null())
if got_data_here:
    got_data = True
# remove lexemes with non-disamb attr values
# must be equal to mask to be left
corpus2.disambiguate_equal(tok, all_attr_vals, disamb_attr_vals)
classify.write_end_of_sent(tr_file)
if got_data:
    attr_met.add(attr_name)
```
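To make the fix concrete, here is a schematic illustration (not actual WCRFT code; I don't know the exact form of the original bug, so the buggy variant is a guess): the per-layer "got data" flag must be sticky across tokens, i.e. set once any token has a disambiguated value and never reset afterwards.

```python
# Schematic sketch of the layer-skipping decision (hypothetical names).
# A layer should get a classifier iff ANY token carries a disambiguated
# value for the attribute.

def layer_has_data_fixed(tokens):
    """tokens: booleans, True if the token has a disamb value."""
    got_data = False
    for has_disamb in tokens:
        if has_disamb:
            got_data = True      # sticky: set once, never cleared
    return got_data

def layer_has_data_buggy(tokens):
    got_data = False
    for has_disamb in tokens:
        got_data = has_disamb    # bug: overwritten at every token
    return got_data

tokens = [True, False, False]    # only the first token has a value
assert layer_has_data_fixed(tokens) is True    # layer gets trained
assert layer_has_data_buggy(tokens) is False   # layer wrongly skipped
```

With the buggy variant, any layer whose last examined token happened to lack a disambiguated value would be judged empty, which matches the symptom described above (many layers misjudged as having no data).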
Tiny-scale experiments show that this makes a difference, so the whole WCRFT evaluation should probably be repeated.
The difference on the whole data set is not very big (in fact it gets slightly worse):
```
r_wcrft_reana_ng_s2_sepX2013.txt:AVG weak lemma lower bound    96.1465%
r_wcrft_reana_ng_s2_sepX2013.txt:AVG strong lemma lower bound  94.2815%
r_wcrft_fixd_sepX2013.txt:AVG weak lemma lower bound           96.1440%
r_wcrft_fixd_sepX2013.txt:AVG strong lemma lower bound         94.2786%
r_wcrft_reana_ng_s2_sepX2013.txt:AVG weak corr lower bound     90.8001%
r_wcrft_fixd_sepX2013.txt:AVG weak corr lower bound            90.7950%
```
Where are the data¶
```
PYTHONIOENCODING=utf8 ~/workspace/corpus2/utils/tagger-eval.py kubaw/test??.plain.xces folds/test??.xml | tee r_kubaw_oct2013.txt  # data copied from spock
PYTHONIOENCODING=utf8 ~/workspace/corpus2/utils/tagger-eval.py wmbt/tagd??.xml folds/test??.xml | tee r_wmbt_oct2013.txt
PYTHONIOENCODING=utf8 ~/workspace/corpus2/utils/tagger-eval.py pantera/f??/out.xml folds/test??.xml | tee r_pantera_reana_oct2013.txt
PYTHONIOENCODING=utf8 ../../../synat1a/eliasz/corpus2/utils/tagger-eval.py wcrft_reana_ng_s2/tagd??.xml folds/test??.xml | tee r_wcrft_reana_ng_s2_oct2013.txt
```
Baseline lemma results¶
|Tagger|Lem weak lower|Lem strong lower|POS strong=weak lower|Tag weak lower|
Tag weak lower (AVG weak corr lower bound) != tag strong lower (AVG strong corr lower bound) because of optional attributes and differing decisions on whether to select them (different in the morphological output and in the reference corpus).
POS strong lower corr = POS weak lower corr because the above phenomenon is limited to optional attributes and does not affect POS at all (nor does it affect most substantial attributes).
The CST lemmatiser (cstlemma) is a tool from the Centre for Language Technology (Center for Sprogteknologi) of the University of Copenhagen.
Project site: http://cst.dk/online/lemmatiser/uk/
On Github: http://github.com/kuhumcst/cstlemma
The CST lemmatiser was shown to perform better than any other lemmatiser or tagger on Croatian data (Agić et al., 2013, PDF). This is why it is worth testing here.
CST is trained on a morphological dictionary only. The paper cited above suggests that results are better when the dictionary is derived from the training corpus, without using any external morphological dictionary (although the one they used was pretty small, much smaller than that of Morfeusz). Anyway, I'm assuming this is the way to go, as we don't want to deal with differences between the Morfeusz dictionary and the real data (see notes on reanalysis in Training).
Note: CST lowercases lemmas by default, which is why the Croatians lowercased their training/testing data. It is important to evaluate lemmatisation in a case-insensitive manner (and state this explicitly), as the tool won't deal with letter case.
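Concretely, the comparison on the evaluation side would fold case on both sides; a minimal sketch (hypothetical helper, not part of tagger-eval.py):

```python
# Case-insensitive lemma comparison, needed because cstlemma
# lowercases its output lemmas.

def lemma_match(gold, predicted, case_sensitive=False):
    """Compare a gold lemma with a predicted lemma."""
    if case_sensitive:
        return gold == predicted
    return gold.lower() == predicted.lower()

assert lemma_match("Polska", "polska") is True                       # folded
assert lemma_match("Polska", "polska", case_sensitive=True) is False # strict
```

The case-sensitive variant is kept around so the Proper/proper distinction discussed below can still be measured separately.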
Training CST (courtesy of Nikola Ljubešić)¶
The command line used by the Croatians to train a CST model:
```
cstlemma -D -cFBT -i setimes.types -eU -N setimes.freq -nNFT -o setimes.cstdict
```
setimes.types contains a list of ‘types’ (orths) with their lemmas and MSD-s:
```
,     ,     Z
.     .     Z
je    biti  Vcr3s
i     i     Cc
u     u     Sl
"     "     Z
za    za    Sa
se    sebe  Px--sa--ypn
su    biti  Vcr3p
kako  kako  Cs
```
setimes.freq contains ‘types’ (orths) and their MSD-s with their frequency.
```
9099  ,     Z
7508  .     Z
5707  je    Vcr3s
4502  i     Cc
4347  u     Sl
3119  "     Z
2380  za    Sa
2154  se    Px--sa--ypn
1724  su    Vcr3p
1509  kako  Cs
```
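Since we plan to derive the dictionary from the training corpus (see above), both files can be generated from (orth, lemma, MSD) triples. A rough sketch, assuming tab-separated fields and the hypothetical file names from the Croatian example:

```python
from collections import Counter

# Hypothetical sketch: build the two cstlemma training inputs from a
# tagged training corpus given as (orth, lemma, msd) triples.
triples = [
    ("je", "biti", "Vcr3s"),
    ("je", "biti", "Vcr3s"),
    ("kako", "kako", "Cs"),
]

# .types: unique (orth, lemma, msd) entries
types = sorted(set(triples))
# .freq: frequency of each (orth, msd) pair
freqs = Counter((orth, msd) for orth, _, msd in triples)

with open("setimes.types", "w", encoding="utf-8") as f:
    for orth, lemma, msd in types:
        f.write(f"{orth}\t{lemma}\t{msd}\n")

with open("setimes.freq", "w", encoding="utf-8") as f:
    for (orth, msd), n in freqs.most_common():
        f.write(f"{n}\t{orth}\t{msd}\n")
```

For our data the triples would come from the disambiguated interpretations in the training folds rather than from the Morfeusz dictionary.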
Training results in three files being created: setimes.cstdict, setimes.cstpats and setimes.cstpats0, the last two being identical.
Using a trained CST model to lemmatise tagged text (courtesy of Nikola Ljubešić)¶
```
cstlemma -L -i corpus.tagtemp -f setimes.cstpats -d setimes.cstdict -eU -o corpus.taglem
```
corpus.tagtemp is prepared in the CST-loving format, which is this:
```
Stručnjaci/Ncmpn navode/Vmr3p kako/Cs će/Var3s metalurški/Agpmsny sektor/Ncmsn u/Sl Makedoniji/Npfsl tijekom/Ncmsi 2009/Mdc
```
To keep the sentence delimiters (CST kills empty lines, yes!), we add "<s/>/Z" between sentences and then convert the lemmatised result back to empty lines with:

```
sed -r -i 's/<s\/> .*? Z//g' corpus.taglem
```
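The same round trip can be sketched in Python (hypothetical helpers; the token format "orth/MSD" and the post-lemmatisation shape of the delimiter line are assumptions based on the sed call):

```python
import re

def to_cst(sentences):
    """sentences: list of lists of 'orth/MSD' tokens.
    Joins sentences with a '<s/>/Z' pseudo-token line."""
    return "\n<s/>/Z\n".join("\n".join(sent) for sent in sentences)

def restore_blank_lines(lemmatised):
    """After lemmatisation the delimiter line looks like '<s/> <lemma> Z';
    replace it with an empty line, mirroring the sed command."""
    return re.sub(r"<s/> .*? Z", "", lemmatised)

text = to_cst([["je/Vcr3s"], ["kako/Cs"]])
assert "<s/>/Z" in text
out = restore_blank_lines("je biti Vcr3s\n<s/> <s/> Z\nkako kako Cs")
assert out == "je biti Vcr3s\n\nkako kako Cs"
```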
Watch out for slashes or spaces inside tokens; if I remember correctly, they crash the procedure.
I know that CST can work without MSD-s as well, but with much lower accuracy. Working with POS only would, I assume, hurt accuracy as well.
Problems and experiment plans¶
Take the four most recent Polish taggers (Concraft, WCRFT, WMBT, Pantera) and assess their lemmatisation accuracy.
It has been a long time since lemmatisation was assessed on a large scale for Polish. Cf. Jongejan, Bart and Dalianis, Hercules (2009), who report 93.88%, but they seem to be testing on artificial data (the morphological dictionary itself, without any real text).
1. Refer to tagger testing (gonna tag): a similar problem arises with segmentation changes. Define a lemmatisation accuracy lower bound. The upper bound could be defined as there, by ignoring lemmas where segmentation changed, but it may be defined more naturally to actually check the concatenation of lemmas.
2. Problem: taggers focus on disambiguation and may actually output multiple lemmas per token. This happens not infrequently, so it makes sense to study weak/strong variants. In practice this is devastating.
3. Random selection of one lemma?
Weak / strong lemma: the taggers leave ambiguous lemmas
AVG weak lemma lower bound v. AVG strong lemma lower bound
Case-sensitivity: Proper v. proper
Seg change: Lower bound / cat-heur
Seg change v. letter case: which should we take?
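The weak/strong/random options above could be scored along these lines (my guess at the definitions, not the actual tagger-eval.py code): weak counts a token as correct if the gold and tagger lemma sets intersect, strong only if they are equal, and "random" resolves ambiguity by picking one lemma at random.

```python
import random

def score(pairs, mode="weak", rng=None):
    """pairs: list of (gold_lemmas, tagged_lemmas), both sets of strings."""
    ok = 0
    for gold, tagged in pairs:
        if mode == "weak":
            ok += bool(gold & tagged)        # any common lemma counts
        elif mode == "strong":
            ok += gold == tagged             # sets must be identical
        elif mode == "random":
            # resolve tagger ambiguity by a random pick
            ok += (rng or random).choice(sorted(tagged)) in gold
    return ok / len(pairs)

pairs = [({"mama"}, {"mama"}), ({"piec"}, {"piec", "pic"})]
assert score(pairs, "weak") == 1.0    # second token: sets intersect
assert score(pairs, "strong") == 0.5  # second token: sets differ
```

Under these definitions the random variant would land between the strong and weak scores in expectation, which is one way to quantify how "devastating" the multiple-lemma output really is.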