The WrocUT Language Technology Group G4.19
Morphosyntactic toolchain

This site collectively describes tools for morphosyntactic processing of Polish developed at Wrocław University of Technology. The tools may be used to perform the following tasks:

  1. Tokenisation — division into tokens and sentences
  2. Morphosyntactic analysis using the available analysers and dictionaries (including Morfeusz SGJP/SIAT), but also user-supplied dictionaries
  3. Morphosyntactic tagging
  4. Shallow parsing (understood as chunking)
  5. Turning running text into a sequence of feature vectors (using WCCL formalism, useful for further NLP tasks)
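To make task 5 more concrete, here is a minimal pure-Python sketch of what "one feature vector per token" can look like. It does not use WCCL or Corpus2; the `Token` type, the chosen features and the sample tags are assumptions made for illustration only. In the toolchain itself, features are defined as WCCL expressions.

```python
from typing import Callable, List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for a tagged token (not the Corpus2 API)."""
    orth: str   # surface form as it appears in text
    lemma: str  # base form
    tag: str    # positional tag, colon-separated, e.g. "subst:sg:nom:f"


def word_class(tok: Token) -> str:
    """Return the first position of the tag, i.e. the grammatical class."""
    return tok.tag.split(":")[0]


def features(sentence: List[Token], i: int) -> List[str]:
    """Build an illustrative feature vector for the token at index i."""
    tok = sentence[i]
    # Neighbouring classes; "NONE" marks a sentence boundary.
    prev_class = word_class(sentence[i - 1]) if i > 0 else "NONE"
    next_class = word_class(sentence[i + 1]) if i + 1 < len(sentence) else "NONE"
    return [tok.orth.lower(), word_class(tok), prev_class, next_class]


def feature_lines(
    sentence: List[Token],
    predicate: Callable[[Token], bool] = lambda tok: True,
) -> List[str]:
    """One tab-separated line per token that satisfies the predicate."""
    return [
        "\t".join(features(sentence, i))
        for i, tok in enumerate(sentence)
        if predicate(tok)
    ]


if __name__ == "__main__":
    sentence = [
        Token("Ala", "Ala", "subst:sg:nom:f"),
        Token("ma", "mieć", "fin:sg:ter:imperf"),
        Token("kota", "kot", "subst:sg:acc:m2"),
    ]
    for line in feature_lines(sentence):
        print(line)
```

Passing a predicate such as `lambda tok: word_class(tok) == "subst"` restricts the output to nouns, which mirrors the idea of generating vectors only for tokens that satisfy a morphosyntactic condition.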

The tools may work together as a whole toolchain, or be used separately. They are all free software (most components are available under GNU LGPL, see the “Licence” section).

All of the tools have been tested under GNU/Linux. Some will also work elsewhere.

Contents

Most of the tools provide command-line utilities. All of them provide libraries with a C++ or Python API (or both). The list below contains links to the project sites, where you may find documentation, downloads and installation instructions.

  1. Corpus2 — library for rapid NLP application development (C++ and Python APIs). Supports positional tagsets as well as a number of corpus I/O formats. Also provides a utility to read and convert corpus formats (corpus-get).
  2. Toki — a configurable tokeniser supporting the SRX standard. Provided as a simple utility and a C++ library.
  3. MACA — acts as a morphological analyser, combining various sources of morphological data (including user-supplied dictionaries and Morfeusz). Employs Toki for tokenisation. Also supports simple on-the-fly tagset conversion. Utilities and library (C++ and Python APIs).
  4. WCCL — a formalism of functional expressions evaluated against morphosyntactically annotated sentences (e.g. MACA or tagger output). May be used to generate advanced morphosyntactic features, which is especially helpful in NLP applications that use a machine learning approach. WCCL predicates may also be used as a corpus query language. The implementation consists of utilities and a library (C++ and Python APIs).
  5. WCRFT — morphosyntactic tagger for Polish. May tag plain text and simple XML files (thanks to the MACA library). The tagger uses tiered tagging and CRF. WCRFT is a Python module that may be run as a command-line utility or imported from Python code.
  6. IOBBER — chunker for Polish. The default configuration recognises NP, VP and AdjP chunks. Like WCRFT, it is a Python module that also works as a command-line utility.

The tools may work together. Using the tagger requires installation of items 1–4. These items are used as libraries, so the user only needs to run the tagger itself.

Using the chunker requires installation of all the tools (or, alternatively, of Corpus2, WCCL and another tagger). To get text chunked, the user must run the tagger first and then run IOBBER against the tagger output.
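The two-step order (tag first, then chunk the tagger output) can be scripted. The sketch below is a generic wrapper, not part of the toolchain: the function names are hypothetical, and the command templates are placeholders that must be replaced with the actual command lines given on the WCRFT and IOBBER project sites.

```python
import subprocess
from typing import Callable, List, Sequence


def fill(template: Sequence[str], **paths: str) -> List[str]:
    """Substitute {input} and {output} placeholders in a command template."""
    return [part.format(**paths) for part in template]


def run_two_stage(
    tagger_template: Sequence[str],
    chunker_template: Sequence[str],
    text_path: str,
    tagged_path: str,
    chunked_path: str,
    runner: Callable = subprocess.run,
) -> List[List[str]]:
    """Run the tagger, then run the chunker on the tagger output.

    The templates are supplied by the caller; this sketch does not know
    the real options of any tool.
    """
    tag_cmd = fill(tagger_template, input=text_path, output=tagged_path)
    # The chunker reads exactly the file that the tagger wrote.
    chunk_cmd = fill(chunker_template, input=tagged_path, output=chunked_path)
    for cmd in (tag_cmd, chunk_cmd):
        # check=True stops the pipeline if a stage exits with an error.
        runner(cmd, check=True)
    return [tag_cmd, chunk_cmd]


if __name__ == "__main__":
    # Placeholder executables and argument layout (assumptions): replace
    # them with the invocations documented for the installed tools.
    TAGGER = ["TAGGER_EXECUTABLE", "{input}", "{output}"]
    CHUNKER = ["CHUNKER_EXECUTABLE", "{input}", "{output}"]
    # Dry run: print the commands instead of executing them.
    run_two_stage(TAGGER, CHUNKER, "text.txt", "tagged.xml", "chunked.xml",
                  runner=lambda cmd, check: print(" ".join(cmd)))
```

Passing a different `runner` keeps the sketch usable for dry runs; with the default `subprocess.run`, a failure in the tagging stage prevents the chunker from being started on missing or incomplete output.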

It is also possible to see the tools in action without installation — we provide a VM image for that purpose (see below).

Licence

Basic versions of all the tools (that is, without the add-ons listed below) are licensed under GNU LGPL 3.0. The licence allows for both commercial and non-commercial usage.

The Corpus2 and MACA packages are distributed with optional add-ons. The user is free to choose whether to install the add-ons or not. Installation of any of the add-ons listed below entails a change of licence of the whole toolchain to GNU GPL 3.0. This is due to the licensing of software required by the add-ons.

The GPL add-ons are:

  1. Poliqarp Reader — a reader for the binary corpus storage format of Poliqarp, which is distributed here with Corpus2. Installation of the add-on allows direct reading of Poliqarp-encoded corpora. If Corpus2 is compiled with Poliqarp Reader, all the tools will be able to read Poliqarp corpora directly. This is especially convenient in the case of the tagger and the feature generation toolkit WCCL (a compact Poliqarp corpus may be turned directly into a simple text file, whose lines contain feature vectors generated for all the tokens or only for tokens satisfying a given morphosyntactic predicate).
  2. Guesser Plugin — add-on for MACA providing an unknown-form guesser (guesses tags and lemmas). The add-on wraps the guesser from the TaKIPI tagger package. The add-on is required to run some MACA configurations, e.g. morfeusz-nkjp-official-guesser.
  3. SFST Plugin — add-on for MACA providing support for user-supplied dictionaries compiled into transducers. Transducers offer compact dictionary storage and efficient processing. They are compiled using the SFST package (the MACA repository includes a manual for dictionary compilation, see the doc subdir). The SFST library is used for reading the compiled transducers, which requires the add-on to be licensed under the GPL.


Virtual machine image

Running the toolchain requires installation of a number of dependencies. For your convenience, we also provide a virtual machine disk image (VirtualBox) containing Ubuntu 12.04 (64-bit) and the whole toolchain pre-installed. The installation includes the GPL add-ons. For more information, please read the README file.

http://156.17.134.43/share/ubuntu_vm/

Contact and bug reporting

To help us improve our software, please report any bugs and send us your comments.

Please follow these instructions to report bugs or feature requests.

Any other comments are also welcome. Please use the contact e-mail given at the bottom of the site of the particular tool (if it is not there, please use the address given at the WCRFT tagger site).