WMBT (Wrocław Memory-Based Tagger) is a simple morpho-syntactic tagger for Polish producing state-of-the-art results. WMBT uses the TiMBL API as the underlying Memory-Based Learning implementation. The features for classification are generated using WCCL.
WMBT uses a tiered tagging approach. Grammatical class is disambiguated first, then subsequent attributes (as defined in a config file) are taken care of. Each attribute may be supplied a different set of features.
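The tiered scheme can be illustrated with a toy sketch. This is not WMBT's code: the attribute names, the `tiered_disambiguate` helper and the `choose` callback are hypothetical, and `choose` merely stands in for the memory-based classifier, which in WMBT is TiMBL fed with features generated by WCCL.

```python
def tiered_disambiguate(candidates, tiers, choose):
    """Narrow a token's candidate tags one attribute (tier) at a time.

    candidates -- list of dicts mapping attribute name -> value
    tiers      -- attribute names in tagging order, grammatical class first
    choose     -- callable(attribute, possible_values) -> chosen value;
                  a stand-in for the per-tier classifier (assumption)
    """
    remaining = list(candidates)
    for attr in tiers:
        # Values of this attribute still possible among the surviving tags.
        values = sorted(set(tag.get(attr, "") for tag in remaining))
        if len(values) > 1:
            chosen = choose(attr, values)
            remaining = [tag for tag in remaining if tag.get(attr, "") == chosen]
    return remaining


# Three hypothetical interpretations of one ambiguous token.
candidates = [
    {"class": "subst", "number": "sg", "case": "nom", "gender": "f"},
    {"class": "subst", "number": "sg", "case": "acc", "gender": "f"},
    {"class": "adj", "number": "sg", "case": "nom", "gender": "f"},
]

# Fixed preferences play the role of classifier decisions in this toy example.
preferences = {"class": "subst", "case": "acc"}


def toy_choose(attr, values):
    wanted = preferences.get(attr)
    return wanted if wanted in values else values[0]


result = tiered_disambiguate(
    candidates, ["class", "number", "case", "gender"], toy_choose
)
print(result)
```

Resolving the grammatical class first means that later tiers only have to decide between tags that already agree on the class, which is why each tier can use its own feature set.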
The software package comes with default configurations for the KIPI/IPIC and NKJP tagsets (kipi-guess.ini and nkjp-guess.ini).
- wmbt.pdf Adam Radziszewski and Tomasz Śniatowski — “A memory-based tagger for Polish”. Proceedings of LTC'11 (presentation slides: WMBT-LTC.pdf)
Simple tag guessing algorithm added to treat unknown words
Usage
WMBT is written in Python. Currently there is no installation script (setup.py), but you can use the main module wmbt.py as is, without installation.
To tag a single input file using a trained model:
wmbt/wmbt.py -d path/to/nkjp_model config/nkjp-guess.ini input.xml -O tagged.xml
A model trained on the 1-million-word subcorpus of the NKJP (1.0) is available here: model_nkjp10_guess.tar.bz2 (to be unpacked and used with the nkjp-guess.ini config and the morfeusz-nkjp-official-guesser MACA configuration). Note that the original subcorpus is licensed under GNU GPL 3.0 (download from here). We do not claim any rights to the trained model, yet its licence remains somewhat unclear, as the GPL is based on the notion of "source code", which doesn't have a clear interpretation in the case of corpora.
Batch mode and stream processing are also supported, see -h for details.
WMBT supports a range of input and output formats (thanks to corpus2); see the -i and -o options. The default is to read and write XCES XML.
NOTE: By default, WMBT reads sentences only, discarding paragraphs (non-s ‘chunks’ in XCES XML format). When the input is divided into paragraphs, use -C to preserve them in the output. Note that if the input is not divided into paragraphs, -C may cause the whole input to be loaded into memory before tagging (it may be treated as one huge paragraph).
NOTE 2: By default WMBT outputs the chosen interpretations only (‘disamb lexemes’). To preserve the full ambiguity in the output, use -A.
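When tagging is driven from a script, the command line can be assembled programmatically. The following is a minimal sketch, assuming wmbt.py is invoked as an external process from the repository root; the helper names `wmbt_command` and `tag_file` are hypothetical, and only the options mentioned above (-d, -O, -C, -A) are used.

```python
import subprocess


def wmbt_command(model_dir, config, input_path, output_path,
                 keep_paragraphs=False, keep_ambiguity=False,
                 wmbt_script="wmbt/wmbt.py"):
    """Build the argument list for tagging a single file with a trained model."""
    cmd = [wmbt_script, "-d", model_dir, config, input_path, "-O", output_path]
    if keep_paragraphs:
        cmd.append("-C")  # preserve paragraph division in the output
    if keep_ambiguity:
        cmd.append("-A")  # keep all interpretations, not only the chosen ones
    return cmd


def tag_file(*args, **kwargs):
    """Run WMBT once; raises CalledProcessError if the tagger exits with an error."""
    subprocess.check_call(wmbt_command(*args, **kwargs))


# Example call (not executed here, since it needs a trained model on disk):
# tag_file("path/to/nkjp_model", "config/nkjp-guess.ini", "input.xml", "tagged.xml",
#          keep_paragraphs=True)
```

Calling the script once per file repeats the tagger's start-up cost each time, so for many files the built-in batch mode (see -h) is likely the better choice.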
Tagging plain text
WMBT does not perform morphological analysis itself. The recommended solution is to use MACA. For instance:
maca-analyse morfeusz-nkjp-official-guesser -qs < in.txt > morph.xml -o xces
wmbt/wmbt.py -d model_nkjp10_guess/ config/nkjp-guess.ini morph.xml -O morph-tagged.xml
There are two MACA configurations that we recommend for use with nkjp-guess:
- morfeusz-nkjp-official-guesser, using Morfeusz SGJP and a guesser from TaKIPI (requires both Morfeusz SGJP and the Corpus1 library from the TaKIPI repository)
- morfeusz-nkjp-official, without the guesser
The analogous configurations for the KIPI tagset (kipi-guess) are:
- morfeusz-kipi-guesser, using Morfeusz SIaT + guesser
- morfeusz-kipi, using Morfeusz SIaT only
NOTE: Please read this instruction if you plan to use Morfeusz NKJP (or both).
MACA and WMBT are built upon the same corpus2 library, hence they support the same set of corpus formats. Batch processing is also supported; see the MACA User Guide. You can also use pipeline processing to perform morphological analysis and tagging in one call, although this may be impractical as WMBT takes several seconds to start up.
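The two steps, analysis with MACA followed by tagging with WMBT, can also be chained from a script. This is a sketch under stated assumptions: both tools are called as external processes with the options shown in the example commands, an intermediate XCES file is used rather than a true pipe, and the function name `analyse_and_tag` and its `run` parameter are hypothetical (the latter exists only so the runner can be replaced, for example in tests).

```python
import subprocess


def analyse_and_tag(text_path, morph_path, tagged_path, model_dir, config,
                    maca_config="morfeusz-nkjp-official-guesser",
                    wmbt_script="wmbt/wmbt.py",
                    run=subprocess.check_call):
    """Analyse plain text with MACA, then tag the result with WMBT."""
    # Step 1: morphological analysis, plain text on stdin, XCES XML on stdout.
    maca_cmd = ["maca-analyse", maca_config, "-qs", "-o", "xces"]
    with open(text_path, "rb") as src, open(morph_path, "wb") as dst:
        run(maca_cmd, stdin=src, stdout=dst)

    # Step 2: disambiguation of the analysed file with a trained model.
    wmbt_cmd = [wmbt_script, "-d", model_dir, config, morph_path, "-O", tagged_path]
    run(wmbt_cmd)
    return maca_cmd, wmbt_cmd


# Example call (not executed here, since it needs MACA, WMBT and a model):
# analyse_and_tag("in.txt", "morph.xml", "morph-tagged.xml",
#                 "model_nkjp10_guess/", "config/nkjp-guess.ini")
```

The MACA configuration passed here should match the tagger configuration, for instance one of the nkjp configurations together with nkjp-guess.ini.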
Training
To train the tagger, specify the configuration file (e.g. the bundled nkjp-guess.ini) and a directory to store the model (nkjp_model in the example), e.g.
wmbt/wmbt.py -d path/to/nkjp_model config/nkjp-guess.ini --train path/to/train_nkjp.xml -v
NOTE: For best results, train the tagger after reanalysing the training data with the morphological analyser set-up that will be used when tagging real input. Here is the instruction; please read it carefully before training. This procedure was used to generate the training data distributed on this site.
Download and install
WMBT sources may be obtained from a Git repository:
git clone http://nlp.pwr.wroc.pl/wmbt.git
WMBT requires the following dependencies:
- Python 2.6 with headers
- SWIG
- TiMBL and python-timbl (an updated version of python-timbl is included in the repo; please use it unless the official project site has been updated since 2006)
- Corpus2 library compiled with Python support
- WCCL compiled with Python support
Installing TiMBL Python wrapper
As noted above, the official python-timbl distribution might still be outdated (if it's from 2006, it certainly is). In that case, it is recommended to install one of the versions included in the third_party directory of the WMBT repository. There are two versions:
- python-timbl-2011.08.24 (recommended) — compatible with TiMBL 6.4.1 (and, hopefully, any later version).
- python-timbl-2011.04.01 — compatible with TiMBL 6.3.1. The changes introduced in TiMBL 6.4 include renaming of the underlying soname, hence this won't work with the recent versions of TiMBL.
Note that the changes made to the python-timbl package are workarounds, so the build process is likely still buggy. Please report any bugs.
The TiMBL wrapper requires TiMBL itself, libxml2 and Boost.Python libraries.
Installation. Enter the selected directory and issue the following:
./setup.py build_ext
# If you get any errors, you probably have non-standard boost, timbl or libxml2 paths.
# If it is the case, see README for details on how to set --boost-include-dir etc.
sudo ./setup.py install
# test if the installation is working
python -c "import timbl; print dir(timbl)"
# you should get something like ['Algorithm', 'TimblAPI', 'Weighting', '__doc__', '__file__', '__name__', '__package__']
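The same check can be made slightly more informative with a short script. This is a sketch only: the three expected names are taken from the sample output above, while the helper names `missing_names` and `check_timbl` are hypothetical. The code avoids constructs that would tie it to one Python major version.

```python
from __future__ import print_function

# Names that the sample output above suggests the wrapper should expose.
EXPECTED = ("Algorithm", "TimblAPI", "Weighting")


def missing_names(module, expected=EXPECTED):
    """Return the expected attribute names that the module does not provide."""
    return [name for name in expected if not hasattr(module, name)]


def check_timbl():
    """Try to import the wrapper and describe the outcome as a string."""
    try:
        import timbl
    except ImportError as err:
        return "import failed: %s" % err
    missing = missing_names(timbl)
    if missing:
        return "installed, but missing: " + ", ".join(missing)
    return "ok"


print(check_timbl())
```

An import failure at this point usually means the build step could not find TiMBL, libxml2 or Boost.Python, in which case the README notes on include and library paths apply.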
Contact and bug reporting
To report bugs and feature requests, use our bugtracker.
Comments and discussion are also welcome; please contact the author (Adam Radziszewski, name.surname at pwr.wroc.pl).