WCCL

WCCL (Wrocław Corpus Constraint Language) is a formalism for writing functional expressions evaluated on morpho-syntactically annotated text. These expressions may be used directly as features for Machine Learning classification.

Implementation-wise, WCCL consists of a set of simple command-line utilities and an underlying C++ library with Python wrappers, suitable for rapid development of taggers, chunkers, and similar tools.

WCCL is targeted at Polish, although the only obstacle to processing other inflectional languages is probably the assumed string representation of tags and corpus I/O formats.

More specifically, the WCCL formalism may be used to:
  • express simple morpho-syntactic features such as possible values of grammatical case for each token,
  • express advanced morpho-syntactic features such as tests for morphological agreement,
  • refer to any positional tagset (tagset attributes automatically become valid functions),
  • filter word forms and lemmas against frequency lists,
  • transform word forms and lemmas with user-supplied dictionaries,
  • express constraints to capture multi-word units,
  • use variables over different domains (strongly-typed),
  • write disambiguation rules (“tag rule” sub-language of WCCL),
  • write syntactic/semantic annotation rules (“match rule” sub-language of WCCL).
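
To make the first two items more concrete, here is a minimal pure-Python sketch of what "possible values of grammatical case" and a "test for morphological agreement" compute over ambiguous tokens. It is not WCCL syntax and does not use the WCCL or Corpus2 API. The colon-separated tag strings, the value sets, the example readings and all function names are assumptions made for illustration only.

```python
# Illustration only: plain Python, not WCCL expressions or the WCCL API.
# Assumption: each possible reading of a token is a colon-separated tag string.

CASES = {"nom", "gen", "dat", "acc", "inst", "loc", "voc"}
NUMBERS = {"sg", "pl"}
GENDERS = {"m1", "m2", "m3", "f", "n"}


def attribute(tag, domain):
    """Return the first part of a tag that belongs to `domain`, or None."""
    return next((part for part in tag.split(":") if part in domain), None)


def possible_values(readings, domain):
    """Collect the values from `domain` over all possible tags of one token."""
    return {attribute(tag, domain) for tag in readings} - {None}


def agree(readings_a, readings_b, domains):
    """True if some pair of readings shares a defined value for every domain."""
    return any(
        all(
            attribute(a, d) is not None and attribute(a, d) == attribute(b, d)
            for d in domains
        )
        for a in readings_a
        for b in readings_b
    )


# Hypothetical ambiguous readings for an adjective followed by a noun.
adjective = ["adj:sg:nom:n:pos", "adj:sg:acc:n:pos", "adj:pl:nom:f:pos"]
noun = ["subst:sg:nom:n", "subst:sg:acc:n"]

print(sorted(possible_values(noun, CASES)))            # ['acc', 'nom']
print(agree(adjective, noun, [NUMBERS, GENDERS, CASES]))  # True
```

A set-valued result such as the first one is the kind of value that can be passed on directly as a classifier feature, while the agreement test yields a Boolean.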

The implementation has the following features:
  • Unicode and regex support,
  • compatibility with Maca and Corpus2 (enables pipeline processing and usage of tagset-related tools),
  • available as a C++ library with a simple API,
  • provides ready-to-use command-line tools for feature generation and tagging with rules,
  • Python wrappers for rapid NLP application development.

A bundled utility called wccl-run can directly transform corpora into simple tab-separated files with feature values, ready for training and testing ML classifiers.
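
A tab-separated feature file of this kind can be loaded with a few lines of Python before it is handed to a classifier. The sketch below is only an illustration: the column layout (one row per token, one column per feature) and the sample values are assumptions, since the actual output depends on the expressions that are evaluated. The User guide describes the real format.

```python
import csv
import io


def read_feature_rows(stream):
    """Yield one list of feature values per non-empty tab-separated line."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines, e.g. separators between sentences
            yield row


# Hypothetical file content; a real file would be opened with
# open(path, encoding="utf-8", newline="") instead of io.StringIO.
sample = io.StringIO("subst\tnom,acc\tokno\nadj\tnom\tbiałe\n\n")
rows = list(read_feature_rows(sample))

for row in rows:
    print(row)
```

Quoting is disabled so that quotation marks occurring in word forms are kept as ordinary characters.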

WCCL is a successor of JOSKIPI, a formalism made for the TaKIPI tagger.

More (in Polish): What's new in the language compared to JOSKIPI.

Documentation

Installation

The most recent version of WCCL may be obtained from a git repository:

git clone http://nlp.pwr.wroc.pl/wccl.git

The source code is released under the GNU LGPL 3.0.

Proceed to the installation instructions.

Papers

Please cite the following paper when using WCCL:

@inproceedings{wccl,
  author = {Adam Radziszewski and Adam Wardy\'{n}ski and Tomasz {\'{S}}niatowski},
  title = {{WCCL}: A Morpho-syntactic Feature Toolkit},
  booktitle = {Proceedings of the Balto-Slavonic Natural Language Processing Workshop (BSNLP 2011)},
  year = {2011},
  publisher = {Springer}
}

Full text: wccl.pdf
Presentation slides: BSNLP.pdf

Basic usage

For normal (command-line) usage, read the User guide.

API usage

There are two APIs: the original C++ API and convenient Python wrappers. The same holds for the underlying corpus2 library.

Read more: API overview

Language specification (in Polish):

Specification: the main document; it contains an introduction and references to the sub-language specifications.

Reporting bugs and feature requests

How to report bugs?

Internal documentation

Corpus format (CCL format, in Polish; an up-to-date English version is here)

Requirements gathering

Implementation design (partly outdated)

Data structures used internally

Bit representation of a tag

Annotation representation: annotation, sentence, relations

Testing the match operator

wccl.pdf - Adam Radziszewski, Adam Wardyński and Tomasz Śniatowski — “WCCL: A Morpho-syntactic Feature Toolkit”. BSNLP'11 (draft) (257 KB) Adam Radziszewski, 27 Jul 2011 15:23

wccl.jpg (51.2 KB) Adam Radziszewski, 23 Dec 2011 10:56

BSNLP.pdf - BSNLP'11 presentation slides (101 KB) Adam Radziszewski, 23 May 2012 14:52

wccl.vim - WCCL Syntax File (3.27 KB) Paweł Kędzia, 03 Jan 2013 11:38