WMBT (Wrocław Memory-Based Tagger) is a simple morpho-syntactic tagger for Polish that produces state-of-the-art results. WMBT uses the TiMBL API as the underlying Memory-Based Learning implementation. The features for classification are generated using WCCL.

WMBT uses a tiered tagging approach: the grammatical class is disambiguated first, and then the remaining attributes (as defined in a configuration file) are disambiguated one by one. Each attribute may be supplied with a different set of features.
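
To make the tiered process concrete, here is a minimal sketch in Python (not WMBT's actual code): the attribute order, the classifiers mapping and the extract_features function below are hypothetical stand-ins for the per-attribute TiMBL models and the WCCL-generated features described above.

# Illustrative sketch of tiered tagging; not the actual WMBT implementation.
TIERS = ['class', 'number', 'gender', 'case']   # hypothetical attribute order

def disambiguate(token, classifiers, extract_features):
    """Disambiguate one token tier by tier.

    classifiers      -- mapping: attribute name -> trained classifier
                        (in WMBT, a TiMBL model per attribute)
    extract_features -- function(token, attribute, decisions) -> feature vector
                        (in WMBT the features are generated with WCCL)
    """
    decisions = {}
    for attr in TIERS:
        # Features for later tiers may refer to decisions made on earlier tiers.
        features = extract_features(token, attr, decisions)
        decisions[attr] = classifiers[attr].classify(features)
    return decisions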

The software package comes with default configurations for the KIPI/IPIC and NKJP tagsets (kipi-guess.ini and nkjp-guess.ini).

The underlying disambiguation algorithm is described in the following paper (please cite the paper if using WMBT):
  • Adam Radziszewski and Tomasz Śniatowski, “A memory-based tagger for Polish”, Proceedings of LTC'11 (paper: wmbt.pdf; presentation slides: WMBT-LTC.pdf)

A simple tag guessing algorithm has been added to handle unknown words.

Usage

WMBT is written in Python. Currently there is no installation script (setup.py), but you can use the main module wmbt.py as it is, without installation.

To tag a single input file using a trained model:

wmbt/wmbt.py -d path/to/nkjp_model  config/nkjp-guess.ini input.xml -O tagged.xml

A model trained on the 1-million-word subcorpus of the NKJP (1.0) is available here: model_nkjp10_guess.tar.bz2 (to be unpacked and used with the nkjp-guess.ini config and the morfeusz-nkjp-official-guesser MACA configuration). Note that the original subcorpus is licensed under GNU GPL 3.0 (download from here). We do not claim any rights to the trained model, yet its licence remains somewhat unclear, as the GPL is based on the notion of “source code”, which does not have an obvious interpretation in the case of corpora.

Batch mode and stream processing are also supported; see -h for details.

WMBT supports a range of input and output formats (thanks to corpus2); see the -i and -o options. The default is to read and write XCES XML.
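
For reference, the corpus2 Python bindings can also be used directly to read and write these formats. The snippet below is only a sketch; it assumes the factory methods are named as in recent corpus2 releases (get_named_tagset, TokenReader.create_path_reader, TokenWriter.create_path_writer), so check your installation if the names differ.

import corpus2  # corpus2 Python bindings (assumed to be installed)

# Assumed factory API; exact names may differ between corpus2 versions.
tagset = corpus2.get_named_tagset('nkjp')
reader = corpus2.TokenReader.create_path_reader('xces', tagset, 'input.xml')
writer = corpus2.TokenWriter.create_path_writer('xces', 'copy.xml', tagset)

while True:
    sent = reader.get_next_sentence()
    if not sent:
        break
    writer.write_sentence(sent)
# depending on the corpus2 version, writer.finish() may be needed to flush the output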

NOTE: By default, the tagger reads sentences only, discarding paragraphs (non-s ‘chunks’ in the XCES XML format). When the input is divided into paragraphs, use -C to preserve them in the output. Note that if the input is not divided into paragraphs, using -C may cause the whole input to be loaded into memory before tagging (it may be treated as one huge paragraph).

NOTE 2: By default WMBT outputs only the chosen interpretations (‘disamb lexemes’). To preserve the full ambiguity in the output, use -A.

Tagging plain text

WMBT does not perform morphological analysis itself. The recommended solution is to use MACA. For instance:

maca-analyse morfeusz-nkjp-official-guesser -qs -o xces < in.txt > morph.xml
wmbt/wmbt.py -d model_nkjp10_guess/ config/nkjp-guess.ini morph.xml -O morph-tagged.xml

There are two MACA configurations that we recommend for use with nkjp-guess:
  1. morfeusz-nkjp-official-guesser, using Morfeusz SGJP and a guesser from TaKIPI (requires both Morfeusz SGJP and the Corpus1 library from the TaKIPI repository)
  2. morfeusz-nkjp-official, without the guesser
Configurations compatible with the IPI PAN Corpus / Morfeusz SIaT also exist:
  1. morfeusz-kipi-guesser, using Morfeusz SIaT + guesser
  2. morfeusz-kipi, using Morfeusz SIaT only

NOTE: Please read this instruction if you plan to use Morfeusz NKJP (SGJP), Morfeusz SIaT, or both.

MACA and WMBT are built upon the same corpus2 library, hence they support the same set of corpus formats. Batch processing is also supported; see the MACA User Guide. You can also use pipeline processing to perform morphological analysis and tagging in one call, although this may be impractical as WMBT takes several seconds to start up.
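
If you prefer to drive both steps from a single script rather than a shell pipeline, the two documented commands can simply be run one after another, e.g. with Python's subprocess module. This is only a convenience sketch; the paths and model directory are taken from the examples above.

import subprocess

# Morphological analysis with MACA (same command as in the example above).
inp = open('in.txt')
outp = open('morph.xml', 'w')
try:
    subprocess.check_call(
        ['maca-analyse', 'morfeusz-nkjp-official-guesser', '-qs', '-o', 'xces'],
        stdin=inp, stdout=outp)
finally:
    inp.close()
    outp.close()

# Tagging with WMBT (same command as in the example above).
subprocess.check_call(
    ['wmbt/wmbt.py', '-d', 'model_nkjp10_guess/', 'config/nkjp-guess.ini',
     'morph.xml', '-O', 'morph-tagged.xml'])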

Training

To train the tagger, specify the configuration file (e.g. the bundled nkjp-guess.ini) and a directory to store the model (nkjp_model in the example), e.g.

wmbt/wmbt.py -d path/to/nkjp_model config/nkjp-guess.ini --train path/to/train_nkjp.xml -v

NOTE: For best results, it is recommended to train the tagger on training data that has been reanalysed with the morphological analyser set-up that is going to be used when tagging real input. Here is the instruction; please read it carefully before training. This procedure was used to generate the training data distributed on this site.

Download and install

WMBT sources may be obtained from a Git repository:

git clone http://nlp.pwr.wroc.pl/wmbt.git

WMBT requires the following dependencies:
  • Python 2.6 with headers
  • SWIG
  • TiMBL and python-timbl (an updated version of python-timbl is included in the repo; please use it unless the official project site has been updated since 2006)
  • Corpus2 library compiled with Python support
  • WCCL compiled with Python support
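
A quick way to check that the Python bindings listed above are in place is to try importing them. The module names below are assumptions based on the projects' usual naming: timbl is confirmed by the test further down, while the exact module names installed by your corpus2 and WCCL builds may differ.

# Rough sanity check for the Python bindings (module names are assumptions;
# adjust them if your builds install differently named modules).
for name in ('timbl', 'corpus2', 'wccl'):
    try:
        __import__(name)
        print name, 'OK'
    except ImportError, e:
        print name, 'MISSING:', e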

Installing TiMBL Python wrapper

As noted above, the official python-timbl distribution might still be outdated (if it's from 2006, it certainly is). In such a case it is recommended to install one of the versions included in the third_party directory of the WMBT repository. There are two versions:

  • python-timbl-2011.08.24 (recommended): compatible with TiMBL 6.4.1 (and, hopefully, any later version).
  • python-timbl-2011.04.01: compatible with TiMBL 6.3.1. The changes introduced in TiMBL 6.4 include a renamed soname, hence this version will not work with recent versions of TiMBL.

Note that the changes made to the python-timbl package are workarounds, so the build process may still be buggy. Please report any bugs.

The TiMBL wrapper requires TiMBL itself, as well as the libxml2 and Boost.Python libraries.

Installation. Enter the selected directory and issue the following:

./setup.py build_ext
# If you get any errors, you probably have non-standard boost, timbl or libxml2 paths.
# If it is the case, see README for details on how to set --boost-include-dir etc.
sudo ./setup.py install

# test if the installation is working
python -c "import timbl; print dir(timbl)" 
# you should get something like ['Algorithm', 'TimblAPI', 'Weighting', '__doc__', '__file__', '__name__', '__package__']

Contact and bug reporting

To report bugs and feature requests, use our bugtracker.
Comments and discussion are also welcome; please contact the author (Adam Radziszewski, name.surname at pwr.wroc.pl).

wmbt.pdf - Adam Radziszewski and Tomasz Śniatowski — “A memory-based tagger for Polish”. To appear in LTC'11 proceedings. (139 KB) Adam Radziszewski, 25 Oct 2011 12:25

model_nkjp10_guess.tar.bz2 - A model trained on 1-million subcorpus of the NKJP, see http://clip.ipipan.waw.pl/LRT (10.9 MB) Adam Radziszewski, 02 Jan 2012 11:08

WMBT-LTC.pdf - LTC'11 conference presentation (no guessing module) (149 KB) Adam Radziszewski, 22 Feb 2012 15:50