WCCL¶

WCCL (Wrocław Corpus Constraint Language) is a formalism for writing functional expressions evaluated on morpho-syntactically annotated text. These expressions may be used directly as features for Machine Learning classification.
Implementation-wise, WCCL is a set of simple command-line utilities together with the underlying C++ library and its Python wrappers, suitable for rapid development of taggers, chunkers, etc.
WCCL is targeted at Polish, although the only obstacle to processing other inflectional languages is probably the assumed string representation of tags and corpus I/O formats.
More specifically, the WCCL formalism may be used to:
- express simple morpho-syntactic features such as possible values of grammatical case for each token,
- express advanced morpho-syntactic features such as tests for morphological agreement,
- refer to any positional tagset (tagset attributes automatically become valid functions),
- filter word forms and lemmas against frequency lists,
- transform word forms and lemmas with user-supplied dictionaries,
- express constraints to capture multi-word units,
- use variables over different domains (strongly-typed),
- write disambiguation rules (“tag rule” sub-language of WCCL),
- write syntactic/semantic annotation rules (“match rule” sub-language of WCCL).
Other features:
- Unicode and regex support,
- compatibility with Maca and Corpus2 (enables pipeline processing and usage of tagset-related tools),
- available as a C++ library with a simple API,
- provides ready-to-use command-line tools for feature generation and tagging with rules,
- Python wrappers for rapid NLP application development.
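The kinds of features listed above, such as the possible values of grammatical case for a token or a test for morphological agreement, can be illustrated with a small Python sketch. It does not use the WCCL or Corpus2 API: the toy tagset, the colon-separated tag strings, and the helper names `attribute_values` and `agree` are assumptions made purely for illustration.

```python
# Toy illustration only: this is NOT the WCCL or Corpus2 API.
# Assumption: tags are colon-separated strings such as "subst:sg:nom:f",
# and the tiny positional tagset below is hypothetical.

TOY_TAGSET = {
    "nmb": {"sg", "pl"},
    "cas": {"nom", "gen", "dat", "acc", "inst", "loc", "voc"},
    "gnd": {"m1", "m2", "m3", "f", "n"},
}


def attribute_values(tags, attribute):
    """Return the set of values of `attribute` found in any candidate tag."""
    allowed = TOY_TAGSET[attribute]
    found = set()
    for tag in tags:
        # The first field is the grammatical class; the rest are attribute values.
        for value in tag.split(":")[1:]:
            if value in allowed:
                found.add(value)
    return found


def agree(tags_a, tags_b, attributes):
    """Weak agreement test: some pair of tags shares a value for every attribute."""
    for a in tags_a:
        for b in tags_b:
            if all(
                attribute_values([a], attr) & attribute_values([b], attr)
                for attr in attributes
            ):
                return True
    return False


# Morphologically ambiguous tokens: each string is one candidate tag.
noun = ["subst:sg:nom:f", "subst:pl:gen:f"]
adjective = ["adj:sg:nom:f:pos", "adj:sg:acc:n:pos"]

noun_cases = attribute_values(noun, "cas")
agreement = agree(noun, adjective, ["nmb", "gnd", "cas"])
```

Because each attribute is just a named set of values, adding an attribute to the tagset makes it queryable without further code, which loosely mirrors the idea that tagset attributes become valid functions.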
A bundled utility called wccl-run can directly transform corpora into simple tab-separated files with feature values, ready for training and testing ML classifiers.
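Loading such a tab-separated file for a classifier takes only a few lines. The sketch below uses Python's standard `csv` module. The column layout (one row per token, feature values first, class label in the last column) and the sample values are assumptions, not the documented output format of wccl-run.

```python
import csv
import io


def read_feature_table(handle):
    """Split tab-separated rows into feature lists and class labels.

    Assumed (hypothetical) layout: feature values first, label in the last
    column. Blank lines are skipped.
    """
    features, labels = [], []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        features.append(row[:-1])
        labels.append(row[-1])
    return features, labels


# Made-up sample standing in for a real file, which would be opened with
# open(path, encoding="utf-8", newline="").
SAMPLE = "subst\tnom\tsg\tB-NP\nadj\tnom\tsg\tI-NP\n\nfin\t-\tsg\tO\n"
X, y = read_feature_table(io.StringIO(SAMPLE))
```

The values are kept as strings because morpho-syntactic features are categorical. Encoding them for a particular learner is left to the ML toolkit in use.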
WCCL is a successor of JOSKIPI, a formalism made for the TaKIPI tagger.
More (in Polish): What's new in the language compared to JOSKIPI.
Documentation¶
Installation¶
The most recent version of WCCL may be obtained from a git repository:
git clone http://nlp.pwr.wroc.pl/wccl.git
The source code is released under the GNU LGPL 3.0.
Proceed to the installation instructions.
Papers¶
Please cite the following paper when using WCCL:
@inproceedings{wccl,
  author = {Adam Radziszewski and Adam Wardy\'{n}ski and Tomasz {\'{S}}niatowski},
  title = {{WCCL}: A Morpho-syntactic Feature Toolkit},
  booktitle = {Proceedings of the Balto-Slavonic Natural Language Processing Workshop (BSNLP 2011)},
  year = {2011},
  publisher = {Springer}
}
Full text: wccl.pdf
Presentation slides: BSNLP.pdf
Basic usage¶
For normal command-line usage, read the User guide.
API usage¶
There are two APIs: the original C++ API and convenient Python wrappers. The same holds for the underlying corpus2 library.
Read more: API overview
Language specification (Polish):¶
Specification: the main document, containing an introduction and references to the specifications of the sub-languages:
- Functional expressions: the core WCCL language
- Disambiguation rules: morpho-syntactic disambiguation and primitive marking of annotations
- Match rules: marking of shallow syntactic/semantic annotations
- WCCL file: the unified syntax of a whole WCCL file
Reporting bugs and feature requests¶
Internal documentation¶
Corpus format (CCL format, in Polish; an up-to-date English version is here)
Implementation design (partly outdated)
Struktury_danych: data structures used internally
Reprezentacja_anotacji: annotation representation (annotation, sentence, relations)