Corpus2

Corpus2 — data structures & routines for processing annotated corpora with positional tagsets

The Corpus2 library offers the following features:
  • data structures representing tokens, tags, annotated sentences,
  • support for configurable positional tagsets (defined as simple INI files),
  • routines for reading and writing annotated sentences in various formats (including XCES, so-called CCL format and simple plain text format),
  • data structures and routines for dealing with shallow syntactic annotations (chunking),
  • optional support for reading the Poliqarp format (binary indexed corpora).
The library contains two APIs:
  1. The original C++ API, consisting of header files
  2. Python modules wrapping almost the whole C++ API (made using SWIG), making it possible to write NLP applications in Python while offloading the core functionality to the C++ implementation.

Installation

Corpus2 in its basic installation (without Poliqarp reader) is available under GNU LGPL 3.0.
CAUTION: installing the Poliqarp library with Corpus2 results in the whole of Corpus2 being licensed under GNU GPL 3.0.

The sources may be obtained from our public git repository:

git clone http://nlp.pwr.wroc.pl/corpus2.git

The following libraries must be installed beforehand (headers are also needed, so if you're using a package manager from your Linux distro, please install development versions):
  • Boost 1.41 or later (tested with 1.41, 1.42 and 1.47), packages: program_options, system, filesystem, regex (Ubuntu packages libboost-(...)-dev / Fedora package boost-devel; remember to check the version).
  • ICU library (tested with 4.2; Ubuntu package libicu-dev / Fedora package libicu-devel)
  • LibXML++ (libxml++2.6-dev / libxml++-devel)
  • bison and flex (bison, flex)
  • Loki (Ubuntu package libloki-dev / Fedora package loki-lib-devel)

For the build process, you will need CMake 2.8 (or later; cmake) as well as gcc with C++ support (g++ / gcc-c++).

The following packages are highly recommended — they are needed to build Python wrappers (the wrappers may be required by other software using Corpus2, e.g. WMBT, Fextor):
  • SWIG (tested with 1.3 and 2.0, swig)
  • Python with development files (tested with 2.6 and 2.7; Ubuntu package python-dev / Fedora package python-devel)

After installing the required dependencies, proceed to the usual CMake && make compilation procedure:

mkdir corpus2/bin
cd corpus2/bin
cmake ..
# confirm the default values with ENTER
# analyse the output; if some required dependencies are missing, install the lacking packages, remove the CMakeCache.txt file and re-run cmake
make
sudo make install
sudo ldconfig
# optionally run the test suite: make test

The build process has been tested on Ubuntu (10.04) and Fedora 16.
It is also possible to build and install Corpus2 on Windows using Visual Studio, although the .sln file generated by CMake may require some tweaking to get the dependencies right.

API overview

Typical usage includes some of the following steps:
  1. Create a Tagset object using a tagset name.
  2. Create a TokenReader object to read sentences, tokens or whole paragraphs.
  3. Create a TokenWriter object to write processed input.
  4. Access sentences, tokens, tags, process or analyse/gather statistics.

Tagsets

A tagset defines a set of valid tags. More specifically, a tagset defines the following:
  1. set of grammatical class mnemonics (grammatical class ~= part of speech; e.g. subst in the KIPI tagset)
  2. set of attribute name mnemonics (attributes ~= grammatical categories; the mnemonics for attributes are defined in .tagset files and may be somewhat arbitrary, e.g. cas is used in the default Corpus2 tagset definition files to represent grammatical case in the KIPI and NKJP tagsets)
  3. set of attribute value mnemonics (e.g. nom stands for nominative case in the KIPI tagset)
  4. assignment of attributes to grammatical classes (some attributes may be marked as optional: their value may be omitted and the tag is still valid)

A tagset is defined by its INI-style file (e.g. kipi.tagset), see the corpus2 documentation for details. Tagset definition files are contained in prefix/share/corpus2; they are also sought in the current directory.
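As an illustration only, a minimal tagset definition might look roughly like the sketch below. The section names, mnemonics and exact syntax here are hypothetical; consult the shipped files (e.g. kipi.tagset in prefix/share/corpus2) and the corpus2 documentation for the authoritative format.

```ini
; Hypothetical, simplified tagset definition sketch -- NOT guaranteed
; to match the real corpus2 syntax; see the shipped .tagset files.
[ATTR]
; attribute name = its possible value mnemonics
nmb = sg pl
cas = nom gen dat acc inst loc voc

[POS]
; grammatical class = attributes that apply to it
subst = nmb cas
adv =
```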

Any functionality that parses existing tags, creates new tags or performs inference based on knowledge of the tagset needs to know which tagset to use. Hence, you will have to create an object representing the tagset being used. Tagsets are immutable and they are cached in a singleton dictionary, which prevents unnecessary I/O operations.

Tagsets are instantiated with the Corpus2::get_named_tagset call (corpus2.get_named_tagset in Python).

Accessing tags and tagset symbols

NOTE: this section describes low-level tag manipulation methods, which you may want to skip reading.

Internally, a tag is essentially a bit mask. The mask consists of two parts:
  1. description of the grammatical class/Part-Of-Speech (POS, e.g. noun)
  2. description of the values of attributes that apply for the grammatical class (e.g. nominative, the value of grammatical case, which may apply to nouns, adjectives…)

The POS part's bits correspond to subsequent possible grammatical classes. The attribute part's bits correspond to all possible values of subsequent attributes as defined in the tagset, e.g. attr1-val1 attr1-val2 attr1-val3 attr2-val1 attr2-val2.

Under normal circumstances, exactly one bit is set in the POS part, and some (or no) bits are set in the attribute part, depending on which attributes are defined for the POS/class. Such tags are called singular; they are essentially what you'd expect from a valid tag. For internal purposes it is also useful to create 'invalid' tags that may have arbitrary bits set. Such tags may be used as bit masks to extract the desired parts from other tags. They can't be represented with the usual tagset.tag_to_string, but there are dedicated methods to represent them as comma-separated symbol strings and to parse such representations; see tagset.parse_symbol, tagset.parse_symbol_string, tag_to_symbol_string and tag_to_symbol_string_vector. Note that when parsing, you may also use the name of a whole attribute to get a mask consisting of all the possible values of that attribute.
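The bit layout described above can be modeled in plain Python. This is only an illustration of the bit-mask idea, with made-up symbols and bit positions; it does not use corpus2's actual internal representation.

```python
# Toy model of a positional tag as a bit mask, mirroring the layout
# described above: one bit per grammatical class (POS part), then one
# bit per attribute value (attribute part). Symbols are made up.
POS_BITS = {'subst': 1 << 0, 'adj': 1 << 1}       # POS part
VAL_BITS = {'nom': 1 << 2, 'gen': 1 << 3,         # values of 'cas'
            'sg':  1 << 4, 'pl':  1 << 5}         # values of 'nmb'

# A whole-attribute mask: all possible values of one attribute.
CAS_MASK = VAL_BITS['nom'] | VAL_BITS['gen']
NMB_MASK = VAL_BITS['sg'] | VAL_BITS['pl']

# A singular tag: exactly one POS bit, at most one bit per attribute.
tag = POS_BITS['subst'] | VAL_BITS['nom'] | VAL_BITS['sg']

# Masking out one part of the tag (what get_masked does conceptually):
case_part = tag & CAS_MASK
number_part = tag & NMB_MASK
```

Here `case_part` ends up equal to the `nom` bit and `number_part` to the `sg` bit, which is how mask tags let you test individual attributes of a singular tag.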

Regular (valid) tags may be manipulated using such masks, e.g.:

# init
# ts = corpus2.get_named_tagset(…)
# an empty string means 'get me the all-POS mask' here
full_pos_mask = corpus2.get_attribute_mask(ts, '')

# code using tag t1
t1_pos_mask = t1.get_masked(full_pos_mask)
t1_pos_str = ts.tag_to_symbol_string(t1_pos_mask)

# init (assuming the tagset defines those symbols)
ngt_mask = ts.parse_symbol('ngt')  # negation attribute
neg_mask = ts.parse_symbol('neg')  # negation value 'neg'
aff_mask = ts.parse_symbol('aff')  # negation value 'aff'
# using tag t1
t1_ngt_mask = t1.get_masked(ngt_mask)
if t1_ngt_mask == aff_mask:
    print 'affirmative'
elif t1_ngt_mask == neg_mask:
    print 'negated'
else:
    print 'negation unspec'

C++ NOTE: some of the functions used above are defined in libcorpus2/tagging.h, you may need to include it if writing C++ code.

Token readers and writers

Corpus2 offers a rich library of readers and writers. There are three possible modes of reader operation, corresponding to three different methods of the reader object:
  1. get_next_sentence: reading sentence-by-sentence, discarding possible paragraph information. This is the recommended mode for most normal applications, unless you need to preserve the paragraph division. It is perfectly valid to read paragraph-divided input in this mode; you just will not see the paragraph boundaries.
  2. get_next_chunk: reading paragraph-by-paragraph. NOTE: if the input contains no paragraph division (in XCES, paragraphs are chunk XML tags with a type other than "s"), it is likely that the whole input will be read into memory as one huge paragraph. A paragraph (Chunk class) contains a list of sentence objects, which may be accessed.
  3. get_next_token: reading token-by-token, discarding any sentence and paragraph information (if present). This is the recommended mode when only token-level information is needed, i.e. orthographic forms, sets of morpho-syntactic interpretations.
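The three modes can be pictured with a toy model in plain Python. This only illustrates the access granularity (a paragraph is a list of sentences, a sentence a list of tokens, and each mode flattens one level more); it is not the corpus2 API itself, which parses real input formats and returns Chunk/Sentence/Token objects.

```python
# Toy corpus: paragraphs -> sentences -> tokens (orthographic forms only).
corpus = [
    [['Ala', 'ma', 'kota'], ['Kot', 'spi']],   # paragraph 1 (2 sentences)
    [['To', 'jest', 'zdanie']],                # paragraph 2 (1 sentence)
]

def get_next_chunk(corpus):
    """Paragraph-by-paragraph: yields whole paragraphs."""
    for paragraph in corpus:
        yield paragraph

def get_next_sentence(corpus):
    """Sentence-by-sentence: paragraph boundaries are discarded."""
    for paragraph in corpus:
        for sentence in paragraph:
            yield sentence

def get_next_token(corpus):
    """Token-by-token: sentence and paragraph boundaries are discarded."""
    for sentence in get_next_sentence(corpus):
        for token in sentence:
            yield token

chunks = list(get_next_chunk(corpus))        # 2 paragraphs
sentences = list(get_next_sentence(corpus))  # 3 sentences
tokens = list(get_next_token(corpus))        # 8 tokens
```

The same input yields 2 chunks, 3 sentences or 8 tokens depending on which mode you pick, which is exactly the trade-off described above.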

Writers are also suited for writing sentences, tokens or paragraphs (“chunks”).

There are two ways to create a reader: create_path_reader (reading from a file) and create_stream_reader (using an open stream). The Python wrapper also provides a convenience create_stdin_reader. A similar API is available in the TokenWriter class.

Token readers and writers are created by using format names, possibly containing options, e.g. xces,flat. This idea is explained in the MACA User Guide (MACA uses Corpus2 for this purpose).

Example code

Python code:
  • corpus2/doc/corpstats.py (reads a corpus and reports simple statistics)
  • corpus2/corpus2tools/corpus-{get,merge} (useful tools to convert corpus format & merge multiple files)
  • corpus2/utils/tagger-eval.py (tagger evaluation script able to deal with segmentation changes)
  • wccl/doc/wccl-rules.py (running WCCL rules, also: inspecting chunk-style annotations in corpus2.AnnotatedSentence objects)
C++ code:
  • corpus2tools/tagset-tool.cpp (tagset inspection tool, interactive tag validation; example of Tagset and Tag routines)
  • wccl/wccl-apps/wccl-rules.cpp (runs WCCL rules, also: using corpus2 readers and writers)

Note: C++ and Python APIs are very similar. See documentation in the source code (or run Doxygen to get it as HTML).

corpus2_whole library

Along with corpus2, a small library called corpus2_whole is distributed. The library also offers data structures and routines for processing annotated corpora, but its assumptions are different:
  1. corpus2 assumes sequential processing of possibly very large corpora and hence, only one token, sentence or (at most) paragraph is loaded into memory at a time;
  2. corpus2_whole is built on top of corpus2 and assumes loading a whole corpus into memory; in return, it offers a simple API for accessing the documents the corpus consists of.

corpus2 is designed to support tagging, chunking, named entity recognition, etc., where the whole processing may be enclosed within sentence boundaries. Corpus format conversion and tagset conversion also fit into this usage scenario (see MACA). corpus2 doesn't have a data structure for a whole corpus or document. Corpora or documents may be processed, but they are just input streams (or files in the case of some readers, e.g. the Poliqarp reader).

corpus2_whole is useful when there is a need to refer to a whole corpus and manipulate its documents. It also supports inter-annotation relations that may cross sentence boundaries (hence the need for reading larger pieces of input at a time).

NOTE: corpus2_whole also comes with a Python API, which is available directly from the corpus2 Python module (no separate Python module is designated for corpus2_whole).

Basic concepts and classes

Corpus (Corpus2::whole::Corpus or corpus2.Corpus in Python) is essentially a container for documents.
  • supports iteration over documents (next_document) as well as obtaining the whole collection (documents call),
  • its only parameter is the corpus name.
Document (Corpus2::whole::Document or corpus2.Document) stores two types of information:
  1. a list of paragraphs, each being a corpus2.Chunk object (a list of sentences, each annotated morphosyntactically and possibly with additional chunk-style annotations)
  2. a list of relations (relations may cross sentence boundaries, that's why they are stored separately).
Relation (Corpus2::whole::Relation or corpus2.Relation) represents a directed relation between two annotations.
  • source and target are referred to with DirectionPoint objects,
  • each DirectionPoint specifies a sentence id, a channel name (the channel describes the annotation type, e.g. NP) and an in-channel annotation number.
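The relation model described above can be sketched with plain Python named tuples. This is an illustrative model of the described fields only; the actual Corpus2::whole classes may differ in naming and behaviour.

```python
from collections import namedtuple

# Toy model of the corpus2_whole relation structures described above.
# Field names follow the prose description, not the real class layout.
DirectionPoint = namedtuple('DirectionPoint',
                            ['sentence_id', 'channel', 'annotation_no'])
Relation = namedtuple('Relation', ['name', 'source', 'target'])

# A hypothetical relation crossing sentence boundaries: annotation 1
# in channel 'NP' of sentence s1 points at annotation 2 in sentence s3.
rel = Relation(name='coreference',
               source=DirectionPoint('s1', 'NP', 1),
               target=DirectionPoint('s3', 'NP', 2))
```

Because the two endpoints carry their own sentence ids, a relation can link annotations in different sentences, which is why relations are stored in the Document separately from the paragraph list.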

Readers

TODO simple API overview for readers; also, which is the standard CCL+rel reader and which is Poliqarp?

NOTE: currently there is no support for writing whole documents; at least relation writing is planned.

Contact and reporting issues

How to report issues

Tag internal representation (Polish)

Reprezentacja tagu (Tag representation)

Description of input and output formats (Polish)

Formaty wejściowe i wyjściowe (Input/output formats)

Other

Konfiguracja klastra (Cluster configuration)