WCRFT (Wrocław CRF Tagger) is a simple morpho-syntactic tagger for Polish producing state-of-the-art results.

The tagger combines tiered tagging, conditional random fields (CRF) and features tailored for inflective languages; the features are expressed in WCCL.
The algorithm and code are inspired by the Wrocław Memory-Based Tagger.

WCRFT uses the CRF++ API as the underlying CRF implementation.

Tagging is tiered: the grammatical class is disambiguated first, followed by the remaining attributes in the order defined in a config file. Each attribute is handled by a separate CRF, which may be given its own set of feature templates.
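The tiered scheme can be illustrated with a minimal Python sketch. This is not WCRFT's implementation: the function name, the tag representation (a dict of attribute values) and the scorer functions are all hypothetical, and each tier is scored token by token here, whereas a real CRF decodes the whole sentence jointly.

```python
from typing import Callable, Dict, List

Tag = Dict[str, str]  # e.g. {"class": "subst", "case": "nom"}
# A tier scorer maps (sentence, token index, attribute value) to a score.
# It is a hypothetical stand-in for the per-attribute CRF of the real tagger.
Scorer = Callable[[List[str], int, str], float]


def tiered_disambiguate(words: List[str],
                        candidates: List[List[Tag]],
                        tiers: List[str],
                        scorers: Dict[str, Scorer]) -> List[List[Tag]]:
    """Narrow each token's candidate tags one attribute (tier) at a time."""
    remaining = [list(tags) for tags in candidates]
    for attr in tiers:
        score = scorers[attr]
        for i, tags in enumerate(remaining):
            # Values of this attribute still possible for token i.
            values = {t[attr] for t in tags if attr in t}
            if not values:
                continue  # attribute does not apply to the surviving tags
            # Ties are broken alphabetically because of sorted().
            best = max(sorted(values), key=lambda v: score(words, i, v))
            # Keep tags that agree with the chosen value or lack the attribute.
            remaining[i] = [t for t in tags if t.get(attr, best) == best]
    return remaining


# Toy input: "mam" may be a verb form or a noun form, "kota" is ambiguous in case.
words = ["mam", "kota"]
candidates = [
    [{"class": "fin", "number": "sg"},
     {"class": "subst", "number": "pl", "case": "gen"}],
    [{"class": "subst", "number": "sg", "case": "gen"},
     {"class": "subst", "number": "sg", "case": "acc"}],
]
scorers = {
    "class": lambda ws, i, v: 1.0 if v == "fin" else 0.5,
    "case": lambda ws, i, v: 1.0 if v == "acc" else 0.0,
}
result = tiered_disambiguate(words, candidates, ["class", "case"], scorers)
print(result)
```

Because the class tier runs first, the case tier only ever chooses among tags whose class has already been accepted, which is the point of ordering the tiers.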

For details of the underlying algorithms, as well as tagger evaluation, please refer to the following paper (draft in PDF):

@incollection{wcrft,
title  = "A tiered {CRF} tagger for {P}olish",
author = "Radziszewski, Adam",
pages  = "to appear",
year = "2013",
booktitle = "Intelligent Tools for Building a Scientific Information Platform:  Advanced Architectures and Solutions",
editor = "R. Bembenik, {\L}. Skonieczny, H. Rybi\'{n}ski, M. Kryszkiewicz, M. Niezg{\'o}dka",
publisher = "Springer Verlag" 
}

WCRFT is licensed under GNU LGPL v3.0.

The sources of the newest version may be obtained from our Git repository:

git clone http://nlp.pwr.wroc.pl/wcrft2.git

Versions

NEW: WCRFT has been ported to C++, which completely eliminated Python dependencies and made the tagger significantly faster (about 3 times) and easier to build. The C++ version is called WCRFT2 and may be obtained from the above repository. It is now the default and recommended version; the older version is no longer supported.
Programming languages aside, both versions are very similar (in fact, the Python version was heavily based on C++ components, hence the rewrite was relatively straightforward). They produce exactly the same results when using the same trained models. Also, training produces the same binary models (existing models are compatible between versions).

If you're looking for details on the older (Python) version, consult WCRFT1.

Usage

For typical usage scenarios, please refer to the User guide. It assumes that you are interested in using a default configuration (nkjp_e2.ini or nkjp_s2.ini) and one of the default trained tagger models (trained on NKJP).

If you're interested in using WCRFT in your C++ or Python code, or in adding tagging support to an existing application based on the Corpus2 library, please read the manual for WcrftReader, a simple and universal reader interface to the tagger.

WCRFT is highly configurable and may be used in various ways. Below we give some more detailed information on WCRFT installation, directories where data is sought and tagger training.

Installation

WCRFT requires the following dependencies:
  • CMake (at least 2.8) for installation
  • Boost C++ libraries (at least 1.41), including program-options and filesystem
  • CRF++
  • Corpus2 library
  • MACA package (for morphological analysis of plain text)
  • Morfeusz SGJP (if you want to use v. 1.0 of Morfeusz, please install it before installing MACA so that Morfeusz plugin is also built)
  • WCCL

If all the dependencies are installed correctly, WCRFT2 installation is simple:

# assuming you're within wcrft2 root directory
mkdir bin
cd bin
cmake -i ..
# confirm the default values with ENTER
# (the interactive -i mode was removed in CMake 3.0; with newer CMake run plain "cmake .." instead)
# analyse the output; if any required dependencies are missing, install them, remove the CMakeCache.txt file and re-run cmake
make
sudo make install
sudo ldconfig

After a successful installation you will have access to the WCRFT2 binary (wcrft-app), the standard configurations and the nkjp_e2 model, as well as the shared libraries and headers to use with your own applications.

Configurations and trained models

WCRFT's behaviour is defined by a config file (ini). The file specifies which attributes are disambiguated (tagging tiers) and in which order. It also provides features and configuration for CRF++ classifiers.
To tag text you also need a trained model. A trained model is a collection of statistical data gathered from a training corpus. Each trained model is tied to a config file and works only with it (although many models may work with one config file, e.g. one trained on the whole corpus and another on a small part of it).

The software package comes with some configurations for tagging Polish using NKJP tagset, among others:
  • nkjp_e2.ini: faster, consumes little memory and works out-of-the-box but mispredicts tags ~5% more often than the one below;
  • nkjp_s2.ini: previous default; more accurate but requires downloading of a large model and works slower;
  • nkjp_e2-morfeusz2.ini and its s2 counterpart: same as above, but they use the new version of Morfeusz when tagging plain text documents (see the section Tagging plain text)

The model for nkjp_e2.ini config is small and hence included in the source distribution. If WCRFT is installed, you don't need to specify the location of the model, e.g.:

wcrft-app nkjp_e2 -i txt my_docs/input.txt -O my_docs/tagged.xml

NOTE: when referencing configurations, you may use the full name (with .ini), but this is optional.

Large model for more accurate tagging

To obtain higher accuracy, please use the nkjp_s2 configuration and a model trained with this configuration. The differences in accuracy are reported on the Evaluation page.

This model has been trained on the 1-million subcorpus of the NKJP (1.0) and is available for download here (436 MB; to be unpacked and used with the nkjp_s2.ini config and morfeusz-nkjp-official MACA configuration). Note that the original subcorpus is licensed under GNU GPL 3.0 (download from here).

Example usage:

wcrft-app nkjp_s2 -d downloaded_wcrft_models/model_nkjp10_wcrft_s2/ -i txt my_docs/input.txt -O my_docs/tagged.xml
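When the tagger is called from a script, the command line above can be assembled programmatically. The sketch below is only an illustration: the helper names build_tag_command and run_tagger are hypothetical, and the only options used are those shown on this page (-d, -i and -O).

```python
import subprocess
from typing import List, Optional


def build_tag_command(config: str, input_path: str, output_path: str,
                      model_dir: Optional[str] = None,
                      input_format: str = "txt") -> List[str]:
    """Assemble a wcrft-app command line from the options shown on this page."""
    cmd = ["wcrft-app", config]
    if model_dir is not None:
        # -d points at the directory holding the trained model.
        cmd += ["-d", model_dir]
    # -i sets the input format, -O the output file.
    cmd += ["-i", input_format, input_path, "-O", output_path]
    return cmd


def run_tagger(cmd: List[str]) -> None:
    """Run the command; requires wcrft-app to be installed and on PATH."""
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, check=True)
```

For example, build_tag_command("nkjp_s2", "my_docs/input.txt", "my_docs/tagged.xml", model_dir="downloaded_wcrft_models/model_nkjp10_wcrft_s2/") reproduces the command above as an argument list that can be passed to run_tagger.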

Where WCRFT looks for configs and models

WCRFT resolves relative paths of config files and the model directory in the following order:
  1. current working directory
  2. installation directory

By default WCRFT is installed under PREFIX/share/wcrft.

To see the current search path, you can run WCRFT with a non-existent model directory, e.g.:

$ wcrft-app nkjp_e2 -d nonexistent someinput
Error: model dir file 'nonexistent' not found in search path .;./config;./model;/usr/local/share/wcrft;/usr/local/share/wcrft/config;/usr/local/share/wcrft/model
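The lookup order visible in this error message can be mimicked in a few lines of Python. This is a sketch inferred from the message only, not WCRFT's actual code: the function names are hypothetical, and the real tagger may apply further rules (such as appending .ini to config names) that are not modelled here.

```python
import os
from typing import Callable, List, Optional


def search_path(install_dir: str) -> List[str]:
    """Directories tried in order: the working directory, then the install directory."""
    roots = [".", install_dir]
    # Each root is followed by its config and model subdirectories.
    return [p for root in roots
            for p in (root, root + "/config", root + "/model")]


def resolve(name: str, install_dir: str,
            exists: Callable[[str], bool] = os.path.exists) -> Optional[str]:
    """Return the first existing candidate for a relative name, or None."""
    for directory in search_path(install_dir):
        candidate = directory + "/" + name
        if exists(candidate):
            return candidate
    return None


print(";".join(search_path("/usr/local/share/wcrft")))
```

The exists parameter defaults to os.path.exists and is exposed only so that the lookup can be exercised without touching the file system.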

Correcting morphological dictionary deficiencies

You can customise Maca (morphological analyser) configuration to add missing tags and lemmas, override unwanted entries in Morfeusz or provide your own analyser.

For details, please refer to this manual from the Maca project: http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki/Custom_dictionaries

Tagging plain text

WCRFT can tag plain text when run with the option -i text (or -i txt). In this mode it uses the MACA analyser to perform morphological analysis (see the MACA project page for details):

wcrft-app nkjp_e2.ini -i text plain.txt -O morph-tagged.xml
wcrft-app -d model_nkjp10_wcrft_s2/ nkjp_s2 -i text plain.txt -O morph-tagged.xml

The MACA configuration is given to WCRFT in the .ini config file (e.g. nkjp_e2.ini, which is installed system-wide; option macacfg in section general). It can be overridden on the command line with option -mc.
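To check which MACA configuration a given config file selects, the option can be read with Python's standard configparser module. The sketch below works on a made-up two-line sample; only the section name (general) and the option name (macacfg) come from this page, and it is an assumption that a real WCRFT config file parses cleanly with configparser's default settings.

```python
import configparser

# Hypothetical minimal sample; real config files contain many more options.
SAMPLE_INI = """\
[general]
macacfg = morfeusz-nkjp-official
"""


def maca_config_name(ini_text: str) -> str:
    """Return the value of option macacfg from section general."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.get("general", "macacfg")


print(maca_config_name(SAMPLE_INI))
```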

Note: WCRFT is also able to tag simple XML files that contain plain text plus division into paragraphs (XCES/premorph or similar).

In general there are three MACA configurations that may be used with nkjp_e2 or nkjp_s2 config:
  1. morfeusz2-nkjp, using version 2.0 of Morfeusz SGJP (http://sgjp.pl/morfeusz/); recommended
  2. morfeusz-nkjp-official, using old version of "Morfeusz SGJP" (v. 1.0)
  3. morfeusz-nkjp-official-guesser, as above, but also using tag guesser from TaKIPI (requires both the Morfeusz SGJP and Corpus1 library from the TaKIPI repository)

The configuration with the guesser will probably produce better lemmatisation of the input, as the Corpus1 guesser is able to guess lemmas (WCRFT has a simple built-in tag guessing algorithm, but it does not attempt to guess lemmas).

Note: the guesser is licensed under the GNU GPL. To use it with MACA, you have to compile MACA with guesser support, which places all of the resulting code under the GNU GPL (MACA without the GPL plugins is dual-licensed under GNU LGPL or GNU GPL).

Maca and WCRFT are built upon the same Corpus2 library, hence they support the same set of corpus formats.

Tagging morphologically analysed corpora

If you have a corpus that has already been morphologically analysed (e.g. using Maca and Morfeusz SGJP), you can pass it directly to WCRFT. WCRFT supports a range of input formats, including XCES (-i xces) and CCL format (-i ccl). Note: using CCL format makes it possible to tag files that also have shallow annotation (e.g. syntactic chunks, proper names) -- this information will be kept intact in tagger output.

Batch mode and stream processing are also supported, see -h for details.

WCRFT supports a range of input and output formats (thanks to corpus2), see -i and -o options. The default is to read and write XCES XML.
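Tagged XCES output can be post-processed with any XML library. The following Python sketch extracts the interpretation chosen for each token. The element layout in the sample (tok, orth, lex with disamb="1", base, ctag) is an assumption about how Corpus2 writes XCES, and the sample itself is hand-written, so please verify it against a file actually produced by the tagger.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hand-written sample; the layout is assumed, not copied from real output.
SAMPLE_XCES = """\
<chunkList>
 <chunk type="s">
  <tok>
   <orth>Ala</orth>
   <lex disamb="1"><base>Ala</base><ctag>subst:sg:nom:f</ctag></lex>
  </tok>
  <tok>
   <orth>kota</orth>
   <lex><base>kot</base><ctag>subst:sg:gen:m2</ctag></lex>
   <lex disamb="1"><base>kot</base><ctag>subst:sg:acc:m2</ctag></lex>
  </tok>
 </chunk>
</chunkList>
"""


def chosen_interpretations(xml_text: str) -> List[Tuple[str, str, str]]:
    """Return (orth, lemma, tag) for every interpretation marked disamb="1"."""
    root = ET.fromstring(xml_text)
    chosen = []
    for tok in root.iter("tok"):
        orth = tok.findtext("orth")
        for lex in tok.findall("lex"):
            if lex.get("disamb") == "1":
                chosen.append((orth, lex.findtext("base"), lex.findtext("ctag")))
    return chosen


for orth, base, ctag in chosen_interpretations(SAMPLE_XCES):
    print(orth, base, ctag, sep="\t")
```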

Training and evaluation

To train the tagger, specify the configuration file (e.g. the bundled nkjp_s2.ini) and a directory in which to store the model (nkjp_model in the example below):

wcrft-app -d path/to/nkjp_model nkjp_s2.ini --train path/to/train_nkjp.xml -v

NOTE: for best results, train the tagger on training data that has been reanalysed with the same morphological analyser set-up that will be used when tagging real input. A separate instruction describes this procedure; please read it carefully before training. The same procedure was used to generate the trained model distributed on this site.

For evaluation results, as well as detailed instructions on reproducing the published results, please consult the Evaluation page.

Contact and bug reporting

To report bugs or request features, use our bugtracker.
Comments and discussion are also welcome; please contact the author (Adam Radziszewski, name.surname at pwr.wroc.pl).