Basic assumptions

The WCCL library offers the following functionality:
  • parser for the WCCL formalism, able to parse files and strings with WCCL expressions,
  • a class representing a parsed WCCL file, possibly containing rules and named expressions; the contents of the parsed file may be accessed by type or by label,
  • classes representing parsed functional expressions, which may be evaluated against a given sentence with one of its token set as the centre,
  • classes representing tagging rules and annotation/match rules.
WCCL is tightly integrated with and built upon the Corpus2 library. The Corpus2 library itself offers the following features:
  • data structures representing tokens, tags, annotated sentences,
  • support for configurable positional tagsets,
  • routines for reading and writing annotated sentences in various formats (including XCES, so-called CCL format and simple plain text format),
  • data structures and routines for dealing with shallow syntactic annotations (chunking).
Both libraries contain two APIs:
  1. The original C++ API, consisting of header files.
  2. Python modules wrapping almost the whole C++ API (made using SWIG), making it possible to write NLP applications in Python while offloading the core functionality to the C++ implementation.

The Python wrappers are built automatically when CMake (for both the Corpus2 and WCCL projects) finds SWIG and a Python installation with headers. To check whether they are installed, try importing the corpus2 and wccl modules.
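
The import check can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name wrappers_available is hypothetical and not part of either library.

```python
import importlib.util


def wrappers_available(names=("corpus2", "wccl")):
    """Map each module name to True if Python can locate it."""
    # find_spec only looks the module up; it does not import it.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in wrappers_available().items():
        print(name, "found" if found else "missing")
```

A "missing" result usually means that the wrappers were not built or that they were installed for a different Python interpreter.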

API overview

Typical usage includes some of the following steps:
  1. Create a Tagset object using a tagset name.
  2. Create a TokenReader object to read sentences, tokens or whole paragraphs.
  3. Create a TokenWriter object to write processed input.
  4. Create a WCCL Parser and parse a given file with WCCL expressions.
  5. Apply functional operators against sentences.
  6. Apply tag or annotation/match rules.

Tagsets (Corpus2)

A tagset defines a set of valid tags. More specifically, a tagset defines the following:
  1. set of grammatical class mnemonics (grammatical class ~= part of speech; e.g. subst in the KIPI tagset)
  2. set of attribute name mnemonics (attributes ~= grammatical categories; the mnemonics might be somewhat arbitrary, e.g. cas is used by default Corpus2 config to represent grammatical case in the KIPI tagset)
  3. set of attribute value mnemonics (e.g. nom stands for nominative case in the KIPI tagset)
  4. assignment of attributes to grammatical classes (some attributes may be marked as optional: their value may be omitted and the tag is still valid)
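
To make the four components concrete, here is a toy model in plain Python. It is only an illustration of the idea, not the Corpus2 data structure or its file format. Of the mnemonics used, only subst, cas and nom come from the text above; nmb, extra and their values are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ToyTagset:
    """Illustrative stand-in for a tagset; not the Corpus2 data structure."""
    attributes: dict  # attribute mnemonic -> set of value mnemonics
    classes: dict     # class mnemonic -> list of (attribute, is_required)

    def is_valid(self, grammatical_class, values):
        """Return True if the values form a valid tag for the given class."""
        if grammatical_class not in self.classes:
            return False
        remaining = list(values)
        for attribute, is_required in self.classes[grammatical_class]:
            given = [v for v in remaining if v in self.attributes[attribute]]
            if len(given) > 1 or (is_required and not given):
                return False
            for value in given:
                remaining.remove(value)
        # Anything left over belongs to no attribute of this class.
        return not remaining


toy = ToyTagset(
    attributes={"nmb": {"sg", "pl"}, "cas": {"nom", "gen"}, "extra": {"x", "y"}},
    classes={"subst": [("nmb", True), ("cas", True), ("extra", False)]},
)

print(toy.is_valid("subst", ["sg", "nom"]))       # optional attribute omitted
print(toy.is_valid("subst", ["sg", "nom", "x"]))  # optional attribute given
print(toy.is_valid("subst", ["sg"]))              # required case is missing
```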

A tagset is defined by its INI-style file (e.g. kipi.tagset), see the corpus2 documentation for details. Tagset definition files are contained in prefix/share/corpus2; they are also sought in the current directory.

Any functionality that needs to parse existing tags, create new tags or perform inference based on knowledge of the tagset needs to know which tagset to use. Hence, you will have to create an object representing the tagset to be used. Tagsets are immutable and are cached in a singleton dictionary, which prevents unnecessary I/O operations.

Tagsets are instantiated with the Corpus2::get_named_tagset call (corpus2.get_named_tagset in Python).
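
In Python this boils down to a single call. The sketch below assumes that the wrappers are installed and that a tagset named kipi (i.e. kipi.tagset) can be found; the helper name load_tagset is hypothetical.

```python
def load_tagset(name="kipi"):
    """Return the Corpus2 tagset object for the given tagset name."""
    # Imported lazily so that this file can be loaded without the wrappers.
    import corpus2
    return corpus2.get_named_tagset(name)
```

Since tagsets are cached, calling the helper repeatedly with the same name should not re-read the definition file.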

Token readers and writers (Corpus2)

Corpus2 offers a rich library of readers and writers. There are three possible modes of reader operation, corresponding to three different methods of the reader object:
  1. get_next_sentence: reading sentence-by-sentence, discarding possible paragraph information. This is the recommended mode for most applications, unless you know that the input contains paragraph division and you need it. It is perfectly valid to read input divided into paragraphs in this mode, but you will not see the paragraph division.
  2. get_next_chunk: reading paragraph-by-paragraph. NOTE: if the input contains no paragraph division (e.g. no chunk XML tags other than type="s" in XCES), it is likely that the whole input will be read into memory as one huge paragraph. A paragraph (Chunk class) contains a list of sentence objects, which may be accessed.
  3. get_next_token: reading token-by-token, discarding any sentence and paragraph information (if present). This is the recommended mode when only token-level information is needed, i.e. orthographic forms, sets of morpho-syntactic interpretations.
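
The sentence-by-sentence mode is conveniently wrapped in a generator. This is a minimal sketch that works with any object offering get_next_sentence; it assumes that the method returns a null (falsy) value once the input is exhausted, which should be verified against your Corpus2 version.

```python
def sentences(reader):
    """Yield sentences from a reader until the input is exhausted."""
    while True:
        sentence = reader.get_next_sentence()
        if not sentence:  # assumption: a null/falsy result marks end of input
            break
        yield sentence
```

Analogous generators can be written for get_next_chunk and get_next_token.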

Writers are also suited for writing sentences, tokens or paragraphs (“chunks”).

There are two ways to create a reader: create_path_reader (reading from a file) and create_stream_reader (using an open stream). The Python wrapper also provides a convenience create_stdin_reader. A similar API is available in the TokenWriter class.

Token readers and writers are created by using format names, possibly containing options, e.g. xces,flat. This idea is explained in the MACA User Guide.
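
A reader for a file could be created roughly as follows. The argument order (format name, tagset object, path) is an assumption recalled from the scripts shipped with Corpus2 and should be checked against your installation; the helper name open_path_reader is hypothetical.

```python
def open_path_reader(input_format, tagset, path):
    """Create a Corpus2 reader for the file at `path`."""
    # Imported lazily so that this file can be loaded without the wrappers.
    import corpus2
    # Assumed argument order: format name, tagset object, input path.
    return corpus2.TokenReader.create_path_reader(input_format, tagset, path)
```

For example, open_path_reader("xces,flat", tagset, "input.xml") would request the XCES reader with the flat option; input.xml is a placeholder file name.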

WCCL Parser and WCCL files

The WCCL parser is constructed with a tagset object. A parser object is ready to parse one of the following:
  1. Whole WCCL file (recommended usage): parseWcclFileFromPath
  2. A single functional operator from a string or a stream. There are several variants: one for any-type operators (the parsed operator is returned as a generic FunctionalOperator) and one for each type, e.g. parseStringOperator. This is useful for short operators, such as predicates for filtering corpora that are taken from user-supplied command-line arguments.

There are also routines for parsing rules, yet the recommended way to parse rules is to parse a whole WCCL file (the syntax is identical in this case).
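
The recommended whole-file route might look as follows in Python. The sketch assumes that the wrapper exposes the parser class as wccl.Parser and that parseWcclFileFromPath can be called with the path alone; the helper name parse_wccl_file is hypothetical.

```python
def parse_wccl_file(tagset, path):
    """Parse a whole WCCL file and return the parsed-file object."""
    # Imported lazily so that this file can be loaded without the wrappers.
    import wccl
    parser = wccl.Parser(tagset)  # assumed Python name of the parser class
    return parser.parseWcclFileFromPath(path)
```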

Functional expressions

Functional expressions are applied against a sentence wrapped as a SentenceContext object. SentenceContext is essentially a pointer to a sentence with one position highlighted as the centre (referred to as 0 in the WCCL code). The following C++ code instantiates a SentenceContext:

Wccl::SentenceContext sc(sentence);

The centre may be set using sc.set_position(i), where i is an absolute position, 0 <= i < sentence.size().

Functional expressions may be generic (any-type) or have a specified type. Any-type functional expressions should be applied using the base_apply method, while the typed operators may be used with the apply function.
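
A common pattern is to evaluate an operator with each token in turn as the centre. The sketch below relies only on set_position and base_apply as described above; the helper name and the way the sentence length is passed in are illustrative choices.

```python
def apply_at_every_position(operator, context, sentence_length):
    """Apply an any-type operator with each token in turn as the centre."""
    results = []
    for position in range(sentence_length):
        # Absolute position, 0 <= position < sentence_length.
        context.set_position(position)
        results.append(operator.base_apply(context))
    return results
```

The returned list holds one Value per token position.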

The resulting type of the application is Value. Values may represent one of the basic WCCL data types. Since a set of tagset symbols is one of these types, in principle you need to provide a tagset object to retrieve the string representation of a Value object.

When parsing a WCCL file, operators are stored in named sections. Each section may contain one or more operators. Sections may be queried for, and lists of parsed expressions may be retrieved. See wcclfile.h for details of the C++ API, or issue dir(wccl_file) on a parsed WCCL file in your Python interpreter to see the available methods; the names and docs should be fairly self-explanatory. Examples are described below.

Python note: please use the get_…_ptr variants of the methods, they use smart pointers, which guarantee correct memory management.

Rules

There are two special sections of the WCCL file: one for tagging rules, one for match/annotation rules. Both sections may be present or not (you can test it with the corresponding has… function).

Tagging rules may be fired once (recommended) or repeatedly until no changes occur (in extreme cases this may lead to almost-infinite loops).

Annotation rules may only be fired once.
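
The difference between the two firing modes for tagging rules can be illustrated with a toy model in plain Python. This is a conceptual sketch, not the WCCL rule engine: rules are hypothetical callables that edit a list of candidate-tag sets in place, and the max_passes cap is an addition of this example to guard against the long loops mentioned above.

```python
def delete_after(trigger, victim):
    """Build a toy rule: after an unambiguous `trigger`, drop reading `victim`."""
    def rule(tags):
        changed = False
        for i in range(len(tags) - 1):
            following = tags[i + 1]
            if tags[i] == {trigger} and victim in following and len(following) > 1:
                following.discard(victim)
                changed = True
        return changed
    return rule


def fire_once(rules, tags):
    """Apply each rule once; return True if any rule changed the tags."""
    changed = False
    for rule in rules:
        if rule(tags):
            changed = True
    return changed


def fire_until_no_changes(rules, tags, max_passes=100):
    """Repeat fire_once until a pass changes nothing or the cap is reached."""
    passes = 0
    while passes < max_passes:
        passes += 1
        if not fire_once(rules, tags):
            break
    return passes


rules = [delete_after("noun", "adj"), delete_after("prep", "verb")]
tags = [{"prep"}, {"noun", "verb"}, {"adj", "noun"}]
print(fire_until_no_changes(rules, tags), tags)
```

With a single pass, the last token stays ambiguous, because the first rule can only fire after the second one has disambiguated the middle token. Repeated firing resolves it, at the cost of extra passes.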

For details, see examples (sections below).

Python note: please use the get_…_ptr variants of the functions.

Examples using the Python API

See doc/wccl-rules.py for usage of the Python API to Corpus2 readers and writers, the WCCL parser and practical usage of WCCL rules. The example also demonstrates how to use the data structures related to shallow syntactic annotation.

See doc/wccl-run.py for usage of Corpus2 readers and WCCL functional expressions. This is a simplified but working version of the wccl-run utility, written in Python.

More examples of the Corpus2 API may be found in corpus2/doc. Also see the sources of the corpus-get script provided with Corpus2 (it is written in Python).

Examples using the C++ API

See the source code of all the utils: wccl-run, wccl-rules, wccl-features and wccl-parser. There is currently no C++ example that would illustrate the usage of the underlying data structures for shallow syntactic annotation — please look at the Python script wccl-rules.py — the method and class names are generally the same across APIs.