WCCL (Wrocław Corpus Constraint Language) is a formalism for writing functional expressions evaluated on morpho-syntactically annotated text. These expressions may be used directly as features for Machine Learning classification.
Implementation-wise, WCCL consists of a set of simple command-line utilities and an underlying C++ library with Python wrappers, suitable for rapid development of taggers, chunkers, etc.
WCCL is targeted at Polish, although the only obstacle to processing other inflectional languages is probably the assumed string representation of tags and corpus I/O formats.
More specifically, the WCCL formalism may be used to:
- express simple morpho-syntactic features such as possible values of grammatical case for each token,
- express advanced morpho-syntactic features such as tests for morphological agreement,
- refer to any positional tagset (tagset attributes automatically become valid functions),
- filter word forms and lemmas against frequency lists,
- transform word forms and lemmas with user-supplied dictionaries,
- express constraints to capture multi-word units,
- use variables over different domains (strongly-typed),
- write disambiguation rules (“tag rule” sub-language of WCCL),
- write syntactic/semantic annotation rules (“match rule” sub-language of WCCL).
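To make the first two items more concrete, here is a minimal sketch in plain Python of what a "possible case values" feature and a simple agreement test compute. It does not use WCCL, its Python wrappers, or WCCL syntax. It assumes that tags are colon-separated strings such as `subst:sg:nom:m1`; the function names and sample tags are hypothetical and chosen only for illustration.

```python
# Illustrative only: plain Python, not the WCCL library or its Python wrappers.
# Assumption: each candidate tag is a colon-separated string, e.g. "subst:sg:nom:m1".

# Assumed mnemonics for grammatical case values.
CASES = {"nom", "gen", "dat", "acc", "inst", "loc", "voc"}


def possible_cases(tags):
    """Return the set of case values found across a token's candidate tags."""
    return {part for tag in tags for part in tag.split(":") if part in CASES}


def agree_on_case(tags_a, tags_b):
    """Weak agreement test: the two tokens share at least one possible case."""
    return bool(possible_cases(tags_a) & possible_cases(tags_b))


# Hypothetical ambiguous adjective followed by an unambiguous noun.
adj = ["adj:sg:nom:m1:pos", "adj:sg:acc:m3:pos"]
noun = ["subst:sg:nom:m1"]

print(sorted(possible_cases(adj)))
print(agree_on_case(adj, noun))
```

In WCCL itself such features are written as functional expressions over the tagset's attributes rather than as string operations; the sketch only shows the kind of value a feature of this sort yields for an ambiguous token.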
The implementation has the following features:
- Unicode and regex support,
- compatibility with Maca and Corpus2 (enables pipeline processing and usage of tagset-related tools),
- available as a C++ library with a simple API,
- provides ready-to-use command-line tools for feature generation and tagging with rules,
- Python wrappers for rapid NLP application development.
A bundled utility called wccl-run can directly transform corpora into simple tab-separated files with feature values, ready for training and testing ML classifiers.
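A tab-separated feature file of this kind can be loaded with a few lines of Python before being passed to a classifier. The sketch below is an assumption-laden example: it assumes one row per line with one feature value per column, and the sample values are invented. The actual column layout depends on the expressions supplied to wccl-run and is not specified here.

```python
import csv
import io


def read_feature_rows(stream):
    """Parse tab-separated lines into lists of feature-value strings.

    Assumption: one row per line, one feature value per column.
    Blank lines are skipped.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Invented sample standing in for a feature file; not real wccl-run output.
sample = io.StringIO("subst\tnom\tsg\n\nadj\tnom\tsg\n")
rows = read_feature_rows(sample)
print(rows)
```

For a real file, the same function can be given an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is whatever file the feature values were written to.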
WCCL is a successor of JOSKIPI, a formalism made for the TaKIPI tagger.
More details as well as pointers to the source code may be found on the project site.