Corpus2
Corpus2 — data structures & routines for processing annotated corpora with positional tagsets
Corpus2 library offers the following features:
- data structures representing tokens, tags, annotated sentences,
- support for configurable positional tagsets (defined as simple INI files),
- routines for reading and writing annotated sentences in various formats (including XCES, so-called CCL format and simple plain text format),
- data structures and routines for dealing with shallow syntactic annotations (chunking),
- optional support for reading the Poliqarp format (binary indexed corpora),
- the original C++ API, consisting of header files,
- Python modules wrapping almost the whole C++ API (made using SWIG), making it possible to write NLP applications in Python while offloading the core functionality to the C++ implementation.
Installation
Corpus2 in its basic installation (without the Poliqarp reader) is available under GNU LGPL 3.0.
CAUTION: installing the Poliqarp library with Corpus2 results in Corpus2 being licensed under GNU GPL 3.0.
The sources may be obtained from our public git repository:
git clone http://nlp.pwr.wroc.pl/corpus2.git
The following libraries must be installed beforehand (headers are also needed, so if you're using a package manager from your Linux distro, please install the development versions):
- Boost 1.41 or later (tested with 1.41, 1.42 and 1.47), packages: program-options, system, filesystem, regex (Ubuntu packages libboost-(...)-dev / Fedora package boost-devel; remember to check the version)
- ICU library (tested with 4.2; Ubuntu package libicu-dev / Fedora package libicu-devel)
- LibXML++ (libxml++2.6-dev / libxml++-devel)
- bison and flex (bison, flex)
- Loki (Ubuntu package libloki-dev / Fedora package loki-lib-devel)
For the build process, you will need CMake 2.8 (or later; cmake) as well as gcc with C++ support (g++ / gcc-c++).
To build the Python wrappers, you will also need:
- SWIG (tested with 1.3 and 2.0; swig)
- Python with development files (tested with 2.6 and 2.7; Ubuntu package python-dev / Fedora package python-devel)
After installing the required dependencies, proceed with the usual CMake && make compilation procedure:
mkdir corpus2/bin
cd corpus2/bin
cmake ..  # confirm the default values with ENTER
# analyse the output; if some required dependencies are missing, install
# the lacking packages, remove the CMakeCache.txt file and re-run cmake
make
sudo make install
sudo ldconfig
# optionally
make test
The build process has been tested on Ubuntu (10.04) and Fedora 16.
It is also possible to build and install Corpus2 on Windows using Visual Studio, although the .sln file generated by CMake may require some tweaking to get the dependencies right.
API overview
Typical usage includes some of the following steps (a minimal sketch follows the list):
- Create a Tagset object using a tagset name.
- Create a TokenReader object to read sentences, tokens or whole paragraphs.
- Create a TokenWriter object to write processed input.
- Access sentences, tokens and tags; process them or analyse/gather statistics.
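A minimal sketch of this flow in Python (the tagset name, format names and file names are assumptions, so substitute your own; reader and writer creation is described in detail below):
import corpus2

tagset = corpus2.get_named_tagset('nkjp')
reader = corpus2.TokenReader.create_path_reader('xces', tagset, 'in.xml')
writer = corpus2.TokenWriter.create_path_writer('ccl', 'out.xml', tagset)
while True:
    sent = reader.get_next_sentence()
    if not sent:
        break
    for tok in sent.tokens():
        pass  # process tokens or gather statistics here
    writer.write_sentence(sent)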
Tagsets
A tagset defines a set of valid tags. More specifically, a tagset defines the following:
- a set of grammatical class mnemonics (grammatical class ~= part of speech; e.g. subst in the KIPI tagset),
- a set of attribute name mnemonics (attributes ~= grammatical categories; the mnemonics for attributes are defined in .tagset files and may be somewhat arbitrary, e.g. cas is used by the default Corpus2 tagset definition files to represent grammatical case in the KIPI and NKJP tagsets),
- a set of attribute value mnemonics (e.g. nom stands for the nominative case in the KIPI tagset),
- an assignment of attributes to grammatical classes (some attributes may be marked as optional: their value may be left unspecified and the tag is still valid).
A tagset is defined by its INI-style file (e.g. kipi.tagset); see the corpus2 documentation for details. Tagset definition files are contained in prefix/share/corpus2; they are also sought in the current directory.
Any functionality that parses existing tags, creates new tags or performs inference assuming some knowledge about the tagset needs to know which tagset to use. Hence, you will have to create an object representing the tagset to be used. Tagsets are immutable in nature and they are cached in a singleton dictionary, preventing unnecessary I/O operations.
Tagsets are instantiated with the Corpus2::get_named_tagset call (corpus2.get_named_tagset in Python).
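For instance, a short sketch (the tagset name 'kipi' is just an example, and the symbol subst assumes a KIPI-like tagset; parse_symbol and tag_to_symbol_string are explained in the next section):
import corpus2

# loads the tagset defined in kipi.tagset; a repeated call with the
# same name returns the cached object
tagset = corpus2.get_named_tagset('kipi')
# parse a tagset symbol into a tag mask
subst_mask = tagset.parse_symbol('subst')
print tagset.tag_to_symbol_string(subst_mask)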
Accessing tags and tagset symbols
NOTE: this section describes low-level tag manipulation methods, which you may want to skip reading.
Internally, a tag is essentially a bit mask. The mask consists of two parts:
- description of the grammatical class/part of speech (POS, e.g. noun),
- description of the values of attributes that apply to the grammatical class (e.g. nominative, the value of grammatical case, which may apply to nouns, adjectives…).
The POS part's bits correspond to the subsequent possible grammatical classes. The attribute part's bits correspond to all possible values of subsequent attributes as defined in the tagset, e.g. attr1-val1, attr1-val2, attr1-val3, attr2-val1, attr2-val2.
Under normal circumstances, exactly one bit is set in the POS part and some (or no) bits are set in the attribute part, depending on which attributes are defined for the POS/class. Such tags are called singular; they are essentially what you'd expect from a valid tag. For internal purposes it is useful to create ‘invalid’ tags that may have arbitrary bits set. Those tags may be used as bit masks to extract the desired parts from other tags. Such invalid tags can't be represented with the usual tagset.tag_to_string, but there are dedicated methods to represent them as comma-separated symbol strings and to parse such representations; see tagset.parse_symbol, tagset.parse_symbol_string, tag_to_symbol_string and tag_to_symbol_string_vector. Note that when parsing, you may also use the name of a whole attribute to get a mask consisting of all the possible values of that attribute.
Regular (valid) tags may be manipulated using such masks, e.g.:
# init
# ts = corpus2.get_named_tagset(…)
# an empty string means 'get me the all-pos mask' here
full_pos_mask = corpus2.get_attribute_mask(ts, '')
# code using tag t1
t1_pos_mask = t1.get_masked(full_pos_mask)
t1_pos_str = ts.tag_to_symbol_string(t1_pos_mask)
# init (assuming the tagset defines those symbols)
ngt_mask = tagset.parse_symbol('ngt')  # negation attribute
neg_mask = tagset.parse_symbol('neg')  # negation value 'neg'
aff_mask = tagset.parse_symbol('aff')  # negation value 'aff'
# using tag t1
t1_ngt_mask = t1.get_masked(ngt_mask)
if t1_ngt_mask == aff_mask:
    print 'affirmative'
elif t1_ngt_mask == neg_mask:
    print 'negated'
else:
    print 'negation unspec'
C++ NOTE: some of the functions used above are defined in libcorpus2/tagging.h; you may need to include it when writing C++ code.
Token readers and writers
Corpus2 offers a rich library of readers and writers. There are three possible modes of reader operation, corresponding to three different methods of the reader object (a short sketch follows the list):
- get_next_sentence: reading sentence-by-sentence, discarding possible paragraph information. This is the recommended mode for most normal applications, unless you know that the input contains paragraph division. It is perfectly valid to read input divided into paragraphs in this mode, yet you will not see the paragraph division.
- get_next_chunk: reading paragraph-by-paragraph. NOTE: if the input contains no paragraph division (e.g. chunk XML tags other than type="s" in XCES), it is likely that the whole input will be read into memory as one huge paragraph. A paragraph (the Chunk class) contains a list of sentence objects, which may be accessed.
- get_next_token: reading token-by-token, discarding any sentence and paragraph information (if present). This is the recommended mode when only token-level information is needed, i.e. orthographic forms and sets of morpho-syntactic interpretations.
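For example, a minimal token-by-token reading sketch (the tagset name, format name and file name are assumptions; reader creation is described below):
import corpus2

tagset = corpus2.get_named_tagset('kipi')
reader = corpus2.TokenReader.create_path_reader('xces', tagset, 'corpus.xml')
while True:
    tok = reader.get_next_token()
    if not tok:
        break
    # orth_utf8() yields the orthographic form, lexemes() the interpretations
    print tok.orth_utf8(), len(tok.lexemes())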
Writers are also suited for writing sentences, tokens or paragraphs (“chunks”).
There are two ways to create a reader: create_path_reader (reading from a file) and create_stream_reader (using an open stream). The Python wrapper also provides a convenience create_stdin_reader. A similar API is available in the TokenWriter class.
Token readers and writers are created by using format names, possibly containing options, e.g. xces,flat. This idea is explained in the MACA User Guide (MACA uses Corpus2 for this purpose).
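For instance, a sketch of passing a format name with an option when creating a reader (the tagset name and file name are assumptions):
import corpus2

tagset = corpus2.get_named_tagset('nkjp')
# 'xces,flat' selects the XCES reader with its 'flat' option enabled
reader = corpus2.TokenReader.create_path_reader('xces,flat', tagset, 'in.xml')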
Example code
Python code:
- corpus2/doc/corpstats.py (reads a corpus and reports simple statistics)
- corpus2/corpus2tools/corpus-{get,merge} (useful tools to convert corpus format & merge multiple files)
- corpus2/utils/tagger-eval.py (tagger evaluation script able to deal with segmentation changes)
- wccl/doc/wccl-rules.py (running WCCL rules, also: inspecting chunk-style annotations in corpus2.AnnotatedSentence objects)
C++ code:
- corpus2tools/tagset-tool.cpp (tagset inspection tool, interactive tag validation; example of Tagset and Tag routines)
- wccl/wccl-apps/wccl-rules.cpp (runs WCCL rules, also: using corpus2 readers and writers)
Note: C++ and Python APIs are very similar. See documentation in the source code (or run Doxygen to get it as HTML).
corpus2_whole library
Along with corpus2, a small library called corpus2_whole is distributed. The library also offers data structures and routines for processing annotated corpora, but its assumptions are different:
- corpus2 assumes sequential processing of possibly very large corpora and hence, only one token, sentence or (at most) paragraph is loaded into memory at a time;
- corpus2_whole is built on top of corpus2 and assumes loading a whole corpus into memory; thanks to that, it offers a simple API for accessing the documents the corpus consists of.
corpus2 is designed to support tagging, chunking, named entity recognition, etc., where the whole processing may be enclosed within sentence boundaries. Corpus format conversion and tagset conversion also fit into this usage scenario (see MACA). corpus2 doesn't have a data structure for a whole corpus or document. Corpora or documents may be processed, but they are just input streams (or files, in the case of some readers, e.g. the Poliqarp reader).
corpus2_whole is useful when there is a need to refer to a whole corpus and manipulate its documents. It also supports inter-annotation relations that may cross sentence boundaries (hence the need for reading larger pieces of input at a time).
NOTE: corpus2_whole also comes with a Python API, which is available directly from the corpus2 Python module (no separate Python module is designated for corpus2_whole).
Basic concepts and classes
Corpus (Corpus2::whole::Corpus or corpus2.Corpus in Python) is essentially a container for documents:
- supports iteration over documents (next_document) or obtaining the whole collection (the documents call),
- the only parameter of Corpus is the corpus name.
Document (Corpus2::whole::Document or corpus2.Document in Python) stores two types of information:
- a list of paragraphs, each being a corpus2.Chunk object (a list of sentences, each annotated morphosyntactically and possibly with additional chunk-style annotations),
- a list of relations (relations may cross sentence boundaries, which is why they are stored separately).
Relation (Corpus2::whole::Relation or corpus2.Relation in Python) represents a directed relation between two annotations:
- source and target are referred to with DirectionPoint objects,
- each DirectionPoint points to a sentence id, a channel name (the channel describes the annotation type, e.g. NP) and an in-channel annotation number.
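A sketch of iterating over a loaded corpus; corpus is assumed to be a corpus2.Corpus instance obtained from one of the corpus2_whole readers, and the paragraphs and relations accessors are assumptions based on the description above (next_document is documented):
while True:
    doc = corpus.next_document()
    if not doc:
        break
    for par in doc.paragraphs():  # assumed accessor; each item is a corpus2.Chunk
        for sent in par.sentences():
            pass  # sentence-level processing as usual
    for rel in doc.relations():   # assumed accessor; relations may cross sentences
        pass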
Readers
TODO simple API overview for readers; also, which is the standard CCL+rel reader and which is Poliqarp?
NOTE: currently there is no support for writing whole documents; at least relation writing is planned.
Contact and reporting issues
Tag internal representation (Polish)
Description of input and output formats (Polish)
Formaty wejściowe i wyjściowe (Input/output formats)