This page describes the older version of WCRFT, WCRFT ver. 1, which is written in Python.

Installation

The source code may be obtained from our Git repository:

git clone http://nlp.pwr.wroc.pl/wcrft.git

WCRFT (ver. 1) requires the following dependencies:
  • Python 2.6 or 2.7 with headers
  • Python setuptools for installation
  • SWIG (SWIG and python-dev must be installed before installation of Corpus2 and WCCL to get the python wrappers built)
  • CRF++ with Python support (install CRF++ itself first, then enter the `python' subdir and install Python wrappers)
  • Corpus2 library compiled with Python support
  • MACA package compiled with Python support (for morphological analysis of plain text)
  • Morfeusz SGJP (please install it before installing MACA so that the Morfeusz plugin is also built)
  • WCCL compiled with Python support

If all the dependencies are installed correctly, the installation is simple:

sudo ./setup.py install

NOTE: installation is recommended, but not necessary. When installed, you may run the wcrft binary and use only config names without specifying full directories, e.g.:

wcrft nkjp_e2.ini -i txt my_docs/input.txt -O my_docs/tagged.xml

Without installation, you can still run the main Python module and give paths to the config file and the trained model, e.g.:

python wcrft/wcrft/wcrft.py wcrft/wcrft/config/nkjp_s2.ini -d downloaded_wcrft_models/model_nkjp10_wcrft_s2/ -i txt my_docs/input.txt -O my_docs/tagged.xml
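If WCRFT is to be called from a script, the two invocation styles above can be assembled programmatically. The following is only a minimal sketch: the helper names build_wcrft_command and run_wcrft are hypothetical (they are not part of WCRFT), and the sketch uses only the options that appear in the examples above (-d, -i, -O).

```python
import subprocess


def build_wcrft_command(config, input_path, output_path,
                        model_dir=None, input_format=None,
                        executable=("wcrft",)):
    """Assemble an argument list in the same order as the examples above.

    Hypothetical helper: it only arranges strings and does not check
    that WCRFT, the config or the model actually exist.
    """
    args = list(executable) + [config]
    if model_dir is not None:
        args += ["-d", model_dir]
    if input_format is not None:
        args += ["-i", input_format]
    args += [input_path, "-O", output_path]
    return args


def run_wcrft(args):
    """Run the assembled command; requires WCRFT and its dependencies."""
    subprocess.check_call(args)


# Installed variant: config name only, bundled model.
installed = build_wcrft_command(
    "nkjp_e2.ini", "my_docs/input.txt", "my_docs/tagged.xml",
    input_format="txt")

# Non-installed variant: explicit module, config path and model directory.
not_installed = build_wcrft_command(
    "wcrft/wcrft/config/nkjp_s2.ini", "my_docs/input.txt",
    "my_docs/tagged.xml",
    model_dir="downloaded_wcrft_models/model_nkjp10_wcrft_s2/",
    input_format="txt",
    executable=("python", "wcrft/wcrft/wcrft.py"))

print(" ".join(installed))
print(" ".join(not_installed))
```

Passing the result to run_wcrft executes the tagger; the sketch deliberately does not do so itself, since that requires a working installation.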

Usage

WCRFT is highly configurable and may be used in various ways. For typical usage scenarios please refer to the user guide (User_guide_ver1). The guide assumes that you are interested in using one of the default configurations (nkjp_e2.ini or nkjp_s2.ini) and one of the default trained tagger models (trained on NKJP).

Configurations and trained models

WCRFT's behaviour is defined by a config file (ini). The file specifies which attributes are disambiguated (tagging tiers) and in which order. It also provides features and configuration for CRF++ classifiers.
To tag text you also need a trained model. A trained model is a collection of statistical data gathered from a training corpus. Each trained model is tied to a config file and may only work with it (although many models may work with one config file, e.g. one trained on the whole corpus and another on a small part of it).

NOTE: WCRFT1 and WCRFT2 use the same configuration file syntax and the same trained model format, so the same files may be used with both versions. This holds for versions 1.* and 2.0; newer versions of WCRFT2 may introduce changes.

The software package comes with two configurations for tagging Polish using NKJP tagset:
  • nkjp_e2.ini: faster, consumes little memory and works out-of-the-box but mispredicts tags ~5% more often than the one below;
  • nkjp_s2.ini: previous default; more accurate but requires downloading a large model and runs more slowly.

The model for nkjp_e2.ini config is small and hence included in the source distribution. If WCRFT is installed, you don't need to specify the location of the model, e.g.:

wcrft nkjp_e2.ini -i txt my_docs/input.txt -O my_docs/tagged.xml

Large model for more accurate tagging

To obtain higher accuracy, please use the nkjp_s2.ini configuration and a model trained using this configuration. The differences in accuracy may be found on the Evaluation page.

This model has been trained on the 1-million subcorpus of the NKJP (1.0) and is available for download here (436 MB; to be unpacked and used with the nkjp_s2.ini config and morfeusz-nkjp-official MACA configuration). Note that the original subcorpus is licensed under GNU GPL 3.0 (download from here).

Example usage:

wcrft nkjp_s2.ini -d downloaded_wcrft_models/model_nkjp10_wcrft_s2/ -i txt my_docs/input.txt -O my_docs/tagged.xml

Where WCRFT looks for configs and models

WCRFT resolves relative paths of config files and the model directory in the following order:
  1. current working directory
  2. installation directory
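The lookup order above can be illustrated with a short sketch. This is not WCRFT's actual code: resolve_path is a hypothetical helper, and the temporary directories merely stand in for the working directory and the installation directory.

```python
import os
import tempfile


def resolve_path(name, install_dir, cwd=None):
    """Return the first existing candidate for a relative path.

    The current working directory is tried first, then the
    installation directory; None is returned if neither has it.
    """
    cwd = os.getcwd() if cwd is None else cwd
    for base in (cwd, install_dir):
        candidate = os.path.join(base, name)
        if os.path.exists(candidate):
            return candidate
    return None


# Demonstration on throwaway directories standing in for the real ones.
cwd_dir = tempfile.mkdtemp()
install_dir = tempfile.mkdtemp()

# The config exists only in the "installation" directory at first.
open(os.path.join(install_dir, "nkjp_e2.ini"), "w").close()
found_in_install = resolve_path("nkjp_e2.ini", install_dir, cwd=cwd_dir)

# Once a copy exists in the "working" directory, it takes precedence.
open(os.path.join(cwd_dir, "nkjp_e2.ini"), "w").close()
found_in_cwd = resolve_path("nkjp_e2.ini", install_dir, cwd=cwd_dir)

print(found_in_install)
print(found_in_cwd)
```

In practice this means a config file or model directory in the current directory shadows an installed one with the same name.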
By default WCRFT is installed under the Python packages directory (either dist-packages or site-packages):

Platform        | Standard installation location          | Default value
Unix (pure)     | prefix/lib/pythonX.Y/dist-packages      | /usr/local/lib/pythonX.Y/dist-packages
Unix (non-pure) | exec-prefix/lib/pythonX.Y/dist-packages | /usr/local/lib/pythonX.Y/dist-packages
Windows         | prefix\Lib\dist-packages                | C:\PythonXY\Lib\dist-packages

There you can find two directories for config files and model files:

./dist-packages/wcrft-(version)-(python-version).egg/wcrft/config
./dist-packages/wcrft-(version)-(python-version).egg/wcrft/model

If you wish to run WCRFT without having to specify the full path to the tagger model, place the model subdir in the wcrft/model directory of the installation site (e.g. /usr/local/lib/python2.6/dist-packages/wcrft-0.8.0-py2.6.egg/wcrft/model). This also allows all system users to use the same model.

If no model directory is given, WCRFT defaults to the installation directory. You can link your model files into ./site-packages/wcrft-(version)-(pyversion).egg/wcrft/model (copying also works, but linking is recommended because reinstalling WCRFT deletes all files from that directory) and then call WCRFT from any directory as:

wcrft nkjp_s2.ini input.xml -O tagged.xml

Or with a path to the model:

wcrft -d path/to/nkjp_model nkjp_s2.ini input.xml -O tagged.xml
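Linking model files into the installation model directory, as recommended above, can be scripted. The sketch below is an illustration under assumptions: link_model_files is a hypothetical helper, the file name example.model is made up (the real names of the model files are not listed on this page), and temporary directories stand in for the downloaded model and for the wcrft/model directory. Creating symbolic links may need extra privileges on Windows.

```python
import os
import tempfile


def link_model_files(model_dir, install_model_dir):
    """Symlink every entry of model_dir into install_model_dir.

    Entries that already exist in the target directory are left
    untouched. Returns the list of links that were created.
    """
    linked = []
    for name in sorted(os.listdir(model_dir)):
        source = os.path.abspath(os.path.join(model_dir, name))
        target = os.path.join(install_model_dir, name)
        if os.path.lexists(target):
            continue  # do not overwrite files or links already in place
        os.symlink(source, target)
        linked.append(target)
    return linked


# Demonstration on throwaway directories; "example.model" is a made-up name.
model_dir = tempfile.mkdtemp()
install_model_dir = tempfile.mkdtemp()
open(os.path.join(model_dir, "example.model"), "w").close()

linked = link_model_files(model_dir, install_model_dir)
print(linked)
```

For a real installation, the second argument would be the wcrft/model directory inside the installed egg, and writing to it typically requires administrator rights.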

Tagging plain text

WCRFT can tag plain text when run with the option -i text (or -i txt). WCRFT then uses the MACA analyser to perform morphological analysis (details on MACA can be found here):

wcrft -d model_nkjp10_wcrft_s2/ nkjp_s2.ini -i text plain.txt -O morph-tagged.xml

The MACA configuration is passed to WCRFT in the .ini config file (option macacfg in section general). It can be overridden on the command line with the option -c.

Using MACA configuration described in config/nkjp.ini file:

wcrft -d model_nkjp10_wcrft_s2/ nkjp_s2.ini -i text plain.txt -O morph-tagged.xml

Using MACA configuration specified on the command line:

wcrft -d model_nkjp10_wcrft_s2/ nkjp_s2.ini -i text plain.txt -O morph-tagged.xml -c morfeusz-nkjp-guesser
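The precedence described above (the macacfg option in section general, overridden by -c) can be sketched as follows. This is an illustration only: effective_maca_config is a hypothetical helper, and SAMPLE_INI is a made-up fragment, since real WCRFT config files contain many more options. The sketch uses the Python 3 configparser module; under Python 2, which WCRFT ver. 1 itself requires, the module is named ConfigParser.

```python
import configparser

# Made-up fragment for illustration; a real WCRFT config has more content.
SAMPLE_INI = """
[general]
macacfg = morfeusz-nkjp-official
"""


def effective_maca_config(ini_text, cli_override=None):
    """Return the MACA configuration name that would take effect.

    A value given on the command line (the -c option) wins; otherwise
    the macacfg option from the [general] section is used.
    """
    if cli_override is not None:
        return cli_override
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.get("general", "macacfg")


from_ini = effective_maca_config(SAMPLE_INI)
from_cli = effective_maca_config(SAMPLE_INI, cli_override="morfeusz-nkjp-guesser")

print(from_ini)
print(from_cli)
```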

Note: WCRFT is also able to tag simple XML files containing division of text into paragraphs (XCES/premorph or similar).

In general there are two MACA configurations that may be used with nkjp_s2.ini config:
  1. morfeusz-nkjp-official, using Morfeusz SGJP (recommended)
  2. morfeusz-nkjp-official-guesser, as above, but also using tag guesser from TaKIPI (requires both the Morfeusz SGJP and Corpus1 library from the TaKIPI repository)

The configuration with the guesser will probably produce better lemmatisation of the input, as the Corpus1 guesser is able to guess lemmas (WCRFT has a simple built-in tag guessing algorithm, but it does not attempt to guess lemmas).

Note: the guesser is licensed under GNU GPL. To use it with MACA you have to compile MACA with guesser support, which places the whole resulting code under GNU GPL (MACA without the GPL plugins is dual-licensed: GNU LGPL or GNU GPL).

Maca and WCRFT are built upon the same corpus2 library, hence they support the same set of corpus formats.

Tagging morphologically analysed corpora

If you have a corpus that has already been morphologically analysed (e.g. using Maca and Morfeusz SGJP), you can pass it directly to WCRFT. WCRFT supports a range of input formats, including XCES (-i xces) and CCL format (-i ccl). Note: using CCL format makes it possible to tag files that also have shallow annotation (e.g. syntactic chunks, proper names) -- this information will be kept intact in tagger output.

Batch mode and stream processing are also supported, see -h for details.

WCRFT supports a range of input and output formats (thanks to corpus2), see -i and -o options. The default is to read and write XCES XML.

NOTE: By default, WCRFT reads sentences only, discarding paragraphs (non-s ‘chunks’ in XCES XML format). When the input is divided into paragraphs, use -C to preserve them in the output. Note that if the input is not divided into paragraphs, using -C may cause the whole input to be loaded into memory before tagging (it may be treated as one huge paragraph).

NOTE 2: By default WCRFT outputs the chosen interpretations only (‘disamb lexemes’). To preserve whole ambiguity in output, use -A.
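Since -C is only useful when the input really is divided into paragraphs, a script may want to check for paragraph chunks first. The following is a rough sketch under assumptions: has_paragraph_chunks is a hypothetical helper, and SAMPLE_XCES is a simplified, made-up fragment that follows the description above (sentences are chunks of type s, any other chunk counts as a paragraph).

```python
import xml.etree.ElementTree as ET

# Simplified, made-up XCES-like fragment: one paragraph with one sentence.
SAMPLE_XCES = """<chunkList>
  <chunk id="p1" type="p">
    <chunk type="s">
      <tok><orth>Ala</orth></tok>
    </chunk>
  </chunk>
</chunkList>"""


def has_paragraph_chunks(xml_text):
    """Return True if any chunk element is not a sentence (type "s")."""
    root = ET.fromstring(xml_text)
    return any(chunk.get("type") != "s" for chunk in root.iter("chunk"))


# Pass -C only when there are paragraphs to preserve.
extra_flags = ["-C"] if has_paragraph_chunks(SAMPLE_XCES) else []
print(extra_flags)
```

The sketch parses the whole document at once, so for large corpora an incremental parser such as ET.iterparse would be a better fit.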

Training and evaluation

To train the tagger, specify the configuration file (e.g. bundled nkjp_s2.ini) and a directory to store the model (nkjp_model in the example), e.g.

wcrft -d path/to/nkjp_model nkjp_s2.ini --train path/to/train_nkjp.xml -v

NOTE: for best results, train the tagger only after reanalysing the training data with the morphological analyser set-up that will be used when tagging real input. The instructions are here; please read them carefully before training. This procedure was used to generate the trained model that is distributed on this site.

For evaluation results, as well as detailed instructions for reproducing the published results, please consult the Evaluation page.