Basic assumptions

The WCCL library offers the following functionality:
- parser for the WCCL formalism, able to parse files and strings with WCCL expressions,
- a class representing a parsed WCCL file, possibly containing rules and named expressions; the contents of the parsed file may be accessed by their type or labels,
- classes representing parsed functional expressions, which may be evaluated against a given sentence with one of its tokens set as the centre,
- classes representing tagging rules and annotation/match rules,
- data structures representing tokens, tags and annotated sentences,
- support for configurable positional tagsets,
- routines for reading and writing annotated sentences in various formats (including XCES, the so-called CCL format and simple plain text),
- data structures and routines for dealing with shallow syntactic annotations (chunking).
The library is accessible through two interfaces:
- the original C++ API, consisting of header files,
- Python modules wrapping almost the whole C++ API (generated with SWIG), which make it possible to write NLP applications in Python while offloading the core functionality to the C++ implementation.
The Python wrappers are built automatically when CMake (of both the Corpus2 and WCCL projects) detects SWIG and a Python installation with development headers. To check whether they are installed, try to import the corpus2 and wccl modules in Python.
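Such a check can be scripted with the standard library alone; a minimal sketch:

```python
import importlib.util

# The SWIG wrappers, once built and installed, are importable as plain
# Python modules named "corpus2" and "wccl".
for name in ("corpus2", "wccl"):
    found = importlib.util.find_spec(name) is not None
    print(name, "available" if found else "missing")
```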
API overview

Typical usage includes some of the following steps:
- Create a Tagset object using a tagset name.
- Create a TokenReader object to read sentences, tokens or whole paragraphs.
- Create a TokenWriter object to write processed input.
- Create a WCCL Parser and parse a given file with WCCL expressions.
- Apply functional operators against sentences.
- Apply tag or annotation/match rules.
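Put together, the steps above can be sketched as a single Python function. This is only a schematic sketch: the wccl.Parser and parseWcclFileFromPath names, the rule-related calls and the "xces" format name are assumptions to be checked against your build (imports are deferred so the sketch parses without the wrappers installed).

```python
def run_wccl(tagset_name, wccl_path, in_path, out_path):
    """Schematic WCCL pipeline; method names assumed, verify against your build."""
    import corpus2  # SWIG wrappers; require builds with Python support
    import wccl

    tagset = corpus2.get_named_tagset(tagset_name)              # step 1
    reader = corpus2.TokenReader.create_path_reader(
        "xces", tagset, in_path)                                # step 2
    writer = corpus2.TokenWriter.create_path_writer(
        "xces", out_path, tagset)                               # step 3
    parser = wccl.Parser(tagset)                                # step 4 (assumed name)
    rules = parser.parseWcclFileFromPath(wccl_path, ".")        # step 4 (assumed name)

    while True:
        sentence = reader.get_next_sentence()
        if not sentence:
            break
        if rules.has_tag_rules():                               # step 6 (assumed names)
            rules.get_tag_rules_ptr().execute_once(sentence)
        writer.write_sentence(sentence)
```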
Tagsets (Corpus2)

A tagset defines a set of valid tags. More specifically, a tagset defines the following:
- set of grammatical class mnemonics (grammatical class ~= part of speech; e.g. subst in the KIPI tagset),
- set of attribute name mnemonics (attributes ~= grammatical categories; the mnemonics might be somewhat arbitrary, e.g. cas is used by the default Corpus2 config to represent grammatical case in the KIPI tagset),
- set of attribute value mnemonics (e.g. nom stands for nominative case in the KIPI tagset),
- assignment of attributes to grammatical classes (some attributes may be marked as optional: their value may be left unspecified and the tag is still valid).
A tagset is defined by its INI-style file (e.g. kipi.tagset); see the Corpus2 documentation for details. Tagset definition files are contained in prefix/share/corpus2; they are also sought in the current directory.
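For illustration, a schematic (hypothetical, not a verbatim excerpt of kipi.tagset) INI-style definition might look as follows; the exact section names and the bracket notation for optional attributes should be checked against the Corpus2 documentation:

```
[ATTR]
nmb = sg pl
cas = nom gen dat acc inst loc voc
gnd = m1 m2 m3 f n

[POS]
subst = nmb cas gnd
adv   = [deg]
```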
Any functionality that needs to parse existing tags, create new tags or perform inference based on knowledge of the tagset must know which tagset to use. Hence, you will have to create an object representing the tagset in use. Tagsets are immutable and are cached in a singleton dictionary, which prevents unnecessary I/O operations.
Tagsets are instantiated with the Corpus2::get_named_tagset call (corpus2.get_named_tagset in Python).
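In Python this amounts to one call (deferred import; assumes the kipi tagset definition file is installed):

```python
def load_kipi_tagset():
    # Requires the Corpus2 Python wrappers to be built and installed.
    import corpus2
    # Repeated calls hit the singleton cache, so no extra I/O is performed.
    return corpus2.get_named_tagset("kipi")
```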
Token readers and writers (Corpus2)

Corpus2 offers a rich library of readers and writers. There are three possible modes of reader operation, corresponding to three different methods of the reader object:
- get_next_sentence: reading sentence-by-sentence, discarding any paragraph information. This is the recommended mode for most applications, unless you need the paragraph division. It is perfectly valid to read input that is divided into paragraphs in this mode, but you will not see the paragraph boundaries.
- get_next_chunk: reading paragraph-by-paragraph. NOTE: if the input contains no paragraph division (e.g. no chunk XML tags other than type="s" in XCES), it is likely that the whole input will be read into memory as one huge paragraph. A paragraph (the Chunk class) contains a list of sentence objects, which may be accessed.
- get_next_token: reading token-by-token, discarding any sentence and paragraph information (if present). This is the recommended mode when only token-level information is needed, i.e. orthographic forms and sets of morphosyntactic interpretations.
Writers are also suited for writing sentences, tokens or paragraphs (“chunks”).
There are two ways to create a reader: create_path_reader (reading from a file) and create_stream_reader (using an open stream). The Python wrapper also provides a convenience create_stdin_reader. A similar API is available in the TokenWriter class.
Token readers and writers are created using format names, possibly containing options, e.g. xces,flat. This idea is explained in the MACA User Guide.
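As a sketch, token-by-token reading with a format string could look like this (the accessor name orth_utf8 and the format options are assumptions to verify against your Corpus2 build; the import is deferred):

```python
def print_orths(in_path, tagset_name="kipi"):
    """Print the orthographic form of every token in a file (sketch)."""
    import corpus2  # requires the Corpus2 Python wrappers
    tagset = corpus2.get_named_tagset(tagset_name)
    # Format names may carry options, e.g. "xces,flat" (see the MACA guide).
    reader = corpus2.TokenReader.create_path_reader("xces,flat", tagset, in_path)
    while True:
        token = reader.get_next_token()  # token mode: no sentence/paragraph info
        if not token:
            break
        print(token.orth_utf8())
```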
WCCL Parser and WCCL files

The WCCL parser is constructed with a tagset object. A parser object is ready to parse one of the following:
- a whole WCCL file (recommended usage),
- a single functional operator from a string or a stream. There are several variants: one for any-type operators (the parsed operator will be returned as a generic FunctionalOperator) and one for each concrete type, e.g. parseStringOperator. This is useful for short operators, such as predicates to filter corpora, supplied via the user's command-line arguments.
There are also routines for parsing rules, yet the recommended way to parse rules is to parse a whole WCCL file (the syntax is identical in this case).
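A sketch of parsing a single operator from a string; wccl.Parser and parseAnyOperator are assumed names (only parseStringOperator is named above), and the import is deferred:

```python
def parse_operator(expr, tagset_name="kipi"):
    """Parse one functional operator from a string (schematic sketch)."""
    import corpus2
    import wccl  # requires the WCCL Python wrappers
    tagset = corpus2.get_named_tagset(tagset_name)
    parser = wccl.Parser(tagset)
    # Any-type variant; typed variants such as parseStringOperator also exist.
    return parser.parseAnyOperator(expr)
```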
Functional expressions are applied against a sentence wrapped in a SentenceContext object. A SentenceContext is essentially a pointer to a sentence with one position highlighted as the centre (referred to as 0 in WCCL code).
The centre may be set to any absolute position i, where 0 <= i < sentence.size().
Functional expressions may be generic (any-type) or have a specified type. Any-type functional expressions should be applied using the base_apply method, while typed operators additionally provide a type-specific application method.
The result of the application is a Value. A Value may represent any of the basic WCCL data types; since a set of tagset symbols is one of those types, you generally need to supply a tagset object to retrieve the string representation of a Value.
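A sketch of applying a parsed operator at every position of a sentence; set_position and to_string are assumed names, to be checked with dir() on the respective objects:

```python
def eval_everywhere(op, sentence, tagset):
    """Apply an operator with each token as the centre (schematic sketch)."""
    import wccl  # requires the WCCL Python wrappers
    ctx = wccl.SentenceContext(sentence)
    results = []
    for i in range(sentence.size()):
        ctx.set_position(i)          # centre = absolute position i (assumed name)
        value = op.base_apply(ctx)   # generic application; returns a Value
        # The tagset is needed to render values holding tagset symbols.
        results.append(value.to_string(tagset))
    return results
```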
When parsing a WCCL file, operators are stored in named sections. Each section may contain one or more operators. Sections may be queried for, and lists of parsed expressions may be retrieved. See wcclfile.h for details of the C++ API, or issue dir(wccl_file) on a parsed WCCL file in your Python interpreter to see the available methods; the names and docs should be fairly self-explanatory. Examples are described below.
Python note: please use the get_…_ptr variants of the methods; they use smart pointers, which guarantee correct memory management.
There are two special sections of the WCCL file: one for tagging rules, one for match/annotation rules. Both sections may be present or not (you can test it with the corresponding has… function).
Tagging rules may be fired once (recommended) or repeatedly until no changes occur (which in extreme cases may lead to near-infinite loops).
Annotation rules may only be fired once.
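Combining the has… tests with the rule sections might look like the sketch below; every method name here (has_tag_rules, get_tag_rules_ptr, execute_once, has_match_rules, get_match_rules_ptr, apply_all) is an assumption modelled on the patterns mentioned above, so verify them against wcclfile.h or dir(wccl_file):

```python
def fire_rules(wccl_file, sentence):
    """Apply tag rules once, then match/annotation rules (schematic sketch)."""
    if wccl_file.has_tag_rules():
        # Firing once is recommended; a run-until-no-changes variant exists
        # but may loop for a very long time on pathological rule sets.
        wccl_file.get_tag_rules_ptr().execute_once(sentence)
    if wccl_file.has_match_rules():
        wccl_file.get_match_rules_ptr().apply_all(sentence)
```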
For details, see examples (sections below).
Examples using the Python API

See doc/wccl-rules.py for usage of the Python API to Corpus2 readers and writers, the WCCL parser and practical usage of WCCL rules. The example also demonstrates how to use the data structures related to shallow syntactic annotation.
See doc/wccl-run.py for usage of Corpus2 readers and WCCL functional expressions. This is a simplified but working version of the wccl-run utility written in Python.
More examples of the Corpus2 API may be found in corpus2/doc. Also see the sources of the corpus-get script provided with Corpus2 (it is written in Python).
Examples using the C++ API
See the source code of the utils, e.g. wccl-parser. There is currently no C++ example that would illustrate the usage of the underlying data structures for shallow syntactic annotation; please look at the Python script wccl-rules.py instead (the method and class names are generally the same across both APIs).