User guide

This user guide describes how to tag plain text input using one of the standard configurations (nkjp_e2.ini or nkjp_s2.ini) and a standard tagger model trained on the NKJP.

Which configuration to use?
  1. If speed and low memory consumption are the priority, use the default nkjp_e2 configuration. It works with a very small trained model that is already installed system-wide (you don't need to download it separately).
  2. If tagging accuracy is the priority, use the nkjp_s2 configuration. This configuration makes about 5% fewer errors. To use it, you need to download the trained model and point to it with -d path/to/model_nkjp10_wcrft_s2.
  3. If you want to use Morfeusz SGJP v. 2.0, use the nkjp_e2-morfeusz2 or nkjp_s2-morfeusz2 configuration.

To use nkjp_e2 configuration with the shipped model, use:

wcrft-app nkjp_e2 …

To use nkjp_s2, use:

wcrft-app nkjp_s2 -d path/to/model_nkjp10_wcrft_s2 …

The rest of this guide assumes the nkjp_e2 configuration and a system-wide installation of WCRFT (version 2).

Some additional functionality is deliberately omitted here for simplicity, including:
  1. Tagging morphologically analysed corpora (some info here)
  2. Tagging semi-structured text (pre-morph XML files; some info here)
  3. Training the tagger with your corpus (read here)

Glossary

WCRFT is a morphosyntactic tagger made for Polish. It is able to process plain text and output the following structure:
  1. text is divided into paragraphs,
  2. each paragraph is divided into sentences,
  3. each sentence is divided into tokens (words, punctuation, numbers, symbols; in rare cases words are split into tokens),
  4. each token is assigned an interpretation (in some rare cases more than one interpretation),
  5. an interpretation consists of a lemma (base form, dictionary form) and a morphosyntactic tag,
  6. tags are structured as well; their structure is described at the end of this manual.
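The nesting described above can be pictured as a small set of data classes. This is only an illustrative sketch: the class names below are hypothetical and are not part of WCRFT or the Corpus2 library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Interpretation:
    lemma: str  # base (dictionary) form
    tag: str    # morphosyntactic tag, e.g. "subst:sg:gen:n"


@dataclass
class Token:
    orth: str  # token text as it appeared in the input
    interpretations: List[Interpretation] = field(default_factory=list)


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)


@dataclass
class Paragraph:
    sentences: List[Sentence] = field(default_factory=list)


# A document is a list of paragraphs; here: one paragraph, one sentence, one token.
doc = [Paragraph([Sentence([Token("prawa", [Interpretation("prawo", "subst:sg:gen:n")])])])]
print(doc[0].sentences[0].tokens[0].interpretations[0].lemma)  # prints: prawo
```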

Tagging text files

The WCRFT executable tags a text file and writes the result in one of the supported output formats.
The input should be a text file encoded in UTF-8.

Usage:

wcrft-app nkjp_e2 -i txt input.txt -O tagged.xml

The default output format is XCES. The above call is equivalent to one with explicit output format specification (-o xces):

wcrft-app nkjp_e2 -i txt -o xces input.txt -O tagged.xml
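When the tagger is called from a script, it can be convenient to assemble the argument list in one place. The sketch below does this in Python using only the options shown in this guide (-i, -o, -d, -O). The function name and its parameters are hypothetical helpers, not part of WCRFT.

```python
from typing import List, Optional


def build_wcrft_command(
    input_path: str,
    config: str = "nkjp_e2",
    output_path: Optional[str] = None,
    output_format: Optional[str] = None,
    model_dir: Optional[str] = None,
) -> List[str]:
    """Assemble a wcrft-app argument list from the options shown in this guide."""
    cmd = ["wcrft-app", config, "-i", "txt"]
    if output_format is not None:
        cmd += ["-o", output_format]   # e.g. "xces"
    if model_dir is not None:
        cmd += ["-d", model_dir]       # needed for nkjp_s2
    cmd.append(input_path)
    if output_path is not None:
        cmd += ["-O", output_path]     # omit to write to stdout
    return cmd


print(" ".join(build_wcrft_command("input.txt", output_path="tagged.xml", output_format="xces")))
# prints: wcrft-app nkjp_e2 -i txt -o xces input.txt -O tagged.xml
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where wcrft-app is installed.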

Reading stdin, writing to stdout

WCRFT is able to write its output to the standard output (stdout). It will do so if you don't use the -O FILENAME option, e.g.:

wcrft-app nkjp_e2 -i txt input.txt

It is also possible to read the input data stream from the standard input (stdin) and write directly to the standard output (stdout):

wcrft-app nkjp_e2 -i txt -

echo 'Przedsiębiorcy, którzy skorzystali z prawa do zwolnienia nie rozliczają podatku VAT. Nie składają więc deklaracji VAT i nie wystawiają faktur.' | wcrft-app nkjp_e2 -i txt -

This is especially useful when feeding the output into later processing stages, e.g. using IOBBER to get syntactic chunks:

echo 'Przedsiębiorcy, którzy skorzystali z prawa do zwolnienia nie rozliczają podatku VAT. Nie składają więc deklaracji VAT i nie wystawiają faktur.' | wcrft-app nkjp_e2 -i txt - | iobber kpwr.ini -i xces -o ccl -d model-kpwr11-H -
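The same stdin/stdout chaining can be driven from Python. The sketch below is a generic helper (run_pipeline is a hypothetical name) that feeds text to the first command and hands each command's output to the next one. The two argument lists simply repeat the shell example above, so they assume that wcrft-app, iobber and the model-kpwr11-H model are available.

```python
import subprocess
from typing import List, Sequence


def run_pipeline(text: str, commands: Sequence[List[str]]) -> str:
    """Feed text to the first command and pass each stdout to the next stdin."""
    data = text.encode("utf-8")
    for cmd in commands:
        # check=True raises CalledProcessError if a stage exits with an error.
        result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data.decode("utf-8")


# The two stages from the shell example; "-" makes each tool read stdin.
TAG_AND_CHUNK = [
    ["wcrft-app", "nkjp_e2", "-i", "txt", "-"],
    ["iobber", "kpwr.ini", "-i", "xces", "-o", "ccl", "-d", "model-kpwr11-H", "-"],
]

# Requires both tools to be installed:
# ccl_xml = run_pipeline("Nie składają więc deklaracji VAT.", TAG_AND_CHUNK)
```

Unlike a shell pipe, this version buffers the whole output of one stage before starting the next, which is simpler but uses more memory on large inputs.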

Processing multiple files

When processing multiple small files, it is recommended to run WCRFT once for a number of files. This reduces the overhead of tagger start-up time. On the other hand, the peak memory usage will be somewhat higher, as WCRFT loads its model incrementally on demand. This does not mean that the peak memory usage is proportional to input size: memory usage grows only when the tagger meets situations not encountered earlier, which becomes less and less likely as tagging proceeds.

There are two modes of tagging multiple files. The simpler one is just to give multiple input files as arguments:

wcrft-app nkjp_e2 -i txt input1.txt input2.txt input3.txt

The other option is to prepare file lists (a file list is a text file with paths to input files, each path in a separate line) and use the batch mode:

wcrft-app nkjp_e2 -i txt --batch list.txt

You may specify more than one file list (files from all the lists will be processed):

wcrft-app nkjp_e2 -i txt --batch list1.txt list2.txt

In each of the above scenarios, the output is written to INPUTFILE.tag; e.g. the tagged version of input3.txt is written to input3.txt.tag, in the same directory as input3.txt.
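A file list for the batch mode is easy to generate from a script. The following sketch writes one path per line and computes where the tagged files are expected to appear, following the INPUTFILE.tag rule stated above. Both function names are hypothetical helpers.

```python
from pathlib import Path
from typing import Iterable, List


def write_file_list(list_path: str, input_paths: Iterable[str]) -> None:
    """Write one input path per line, the layout expected by --batch."""
    Path(list_path).write_text(
        "".join(f"{p}\n" for p in input_paths), encoding="utf-8"
    )


def expected_outputs(input_paths: Iterable[str]) -> List[str]:
    """Output lands next to each input, with '.tag' appended to the full name."""
    return [f"{p}.tag" for p in input_paths]


inputs = ["corpus/input1.txt", "corpus/input2.txt", "corpus/input3.txt"]
print(expected_outputs(inputs))
# prints: ['corpus/input1.txt.tag', 'corpus/input2.txt.tag', 'corpus/input3.txt.tag']
```

After write_file_list("list.txt", inputs), the list can be passed to the tagger with --batch list.txt as shown above.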

Output formats

The output format is selected with -o FORMAT, e.g. -o xces, -o ccl, -o iob-chan.

Below is a summary of the formats recommended for tagging. The list is not exhaustive; to get a list of all the supported output formats, run wcrft-app -h.

Format mnemonic | Full name         | Type | Suitable for tagging (WCRFT) | Suitable for chunking (IOBBER) | Division into paragraphs   | Whitespace between tokens
plain           | simple plain text | text | yes                          | no                             | no                         | yes
xces            | XCES              | XML  | yes                          | no                             | yes (optional)             | yes
ccl             | CCL               | XML  | yes                          | yes                            | yes (obligatory)           | yes
iob-chan        | IOB-CHAN          | text | yes                          | yes                            | yes (written but not read) | no
Legend:
  • Format mnemonic is the codename used with the -o switch
  • All the formats described here are suitable for tagging, meaning that they allow attaching an interpretation (= a morphosyntactic tag and a lemma) to each token; some formats allow attaching more than one interpretation, see details below
  • Suitable for chunking means that this format may be used to store information on syntactic chunks. CCL also allows storing information about the location of syntactic heads of chunks.
  • Whitespace between tokens means that the format keeps information on whether any whitespace occurred between two consecutive tokens. This information may be useful to restore plain text from tagger/chunker output (e.g. that a full stop occurred directly after the last word, with no space).

Simple plain text format

This format should not be confused with plain text files containing just text (input to the tagger).

The format is simple plain text (UTF-8) containing information written in subsequent lines. There are three types of lines:
  1. Token orthographic form and space information (orth line)
  2. An interpretation attached to the most recent token orth line (interpretation line)
  3. Sentence delimiter (empty line)
Each plain output file has the following syntax:
  1. A token description is one orth line followed by one or more interpretation lines
  2. A sentence description is a sequence of token descriptions (one or more) followed by an empty line
  3. A paragraph description is a sequence of sentence descriptions (one or more) followed by an empty line
  4. The whole file is a sequence of sentence descriptions or paragraph descriptions

NOTE: currently the paragraph boundaries are actually written (marked with two consecutive empty lines), although the reader implemented in the Corpus2 library does not read this information.

An orth line has the following form: ORTH TAB SPACE (the token's orthographic form followed by a tab character, followed by a space-info string). The orthographic form is the unchanged text of the token as encountered in the original text it was taken from. The space-info string is one of the following:
  • newline if the token came after a newline (or at the beginning of the file),
  • space if the token came after a space,
  • none if the token came directly after the previous token (e.g. the token is a comma after a word).

An example orth line: prawa space

An interpretation line has one of the following forms:
  • TAB LEMMA TAB TAG TAB disamb if the interpretation is chosen as the correct one (disamb is just this 6-letter string),
  • TAB LEMMA TAB TAG (unless you tweak the tagger options, you will never get this form)

An example interpretation line: prawo subst:sg:gen:n disamb
Note: this line starts with a tab character, which may not be visible in this manual.

Example file in the simple plain text format (note: each run of spaces used below should in fact be a single tab character; this might be corrupted in this manual):

Mam    newline
    mieć    fin:sg:pri:imperf    disamb
kręgi    space
    krąg    subst:pl:acc:m3    disamb
    kręg    subst:pl:acc:m3    disamb
.    none
    .    interp    disamb

Ona    space
    on    ppron3:sg:nom:f:ter:akc:npraep    disamb
nie    space
    nie    qub    disamb
.    none
    .    interp    disamb

The above example has been generated using this call:

echo 'Mam kręgi. Ona nie.' | wcrft nkjp_s2.ini -i txt -o plain -d model_dir/model_nkjp10_wcrft_s2 -
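To make the line types concrete, here is a minimal Python sketch of a reader for this format, together with a function that restores the running text from the space-info strings. It is written from the description above and is not the Corpus2 reader; parse_plain and restore_text are hypothetical names. It accepts disamb separated from the tag by either a tab or a space, and it treats a second empty line (paragraph end) as a no-op.

```python
from typing import List, Tuple

# (orth, space_info, [(lemma, tag, is_disamb), ...])
Token = Tuple[str, str, List[Tuple[str, str, bool]]]


def parse_plain(text: str) -> List[List[Token]]:
    """Parse the simple plain text format into a list of sentences."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.split("\n"):
        if not line.strip():
            # An empty line closes the current sentence, if one is open.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("\t"):
            # Interpretation line: TAB LEMMA TAB TAG, optionally followed by disamb.
            lemma, _, rest = line[1:].partition("\t")
            parts = rest.split()
            current[-1][2].append((lemma, parts[0], parts[1:] == ["disamb"]))
        else:
            # Orth line: ORTH TAB SPACE.
            orth, space = line.rsplit("\t", 1)
            current.append((orth, space, []))
    if current:
        sentences.append(current)
    return sentences


def restore_text(sentences: List[List[Token]]) -> str:
    """Rebuild running text from the space-info strings."""
    glue = {"newline": "\n", "space": " ", "none": ""}
    out = "".join(glue[sp] + orth for sent in sentences for orth, sp, _ in sent)
    return out.lstrip("\n")


sample = (
    "Mam\tnewline\n\tmieć\tfin:sg:pri:imperf\tdisamb\n"
    "kręgi\tspace\n\tkrąg\tsubst:pl:acc:m3\tdisamb\n\tkręg\tsubst:pl:acc:m3\tdisamb\n"
    ".\tnone\n\t.\tinterp\tdisamb\n"
    "\n"
    "Ona\tspace\n\ton\tppron3:sg:nom:f:ter:akc:npraep\tdisamb\n"
    "nie\tspace\n\tnie\tqub\tdisamb\n"
    ".\tnone\n\t.\tinterp\tdisamb\n"
    "\n"
)
parsed = parse_plain(sample)
print(len(parsed), restore_text(parsed))  # prints: 2 Mam kręgi. Ona nie.
```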

XCES

The format described here is a dialect of the XCES format. This dialect has been made for the IPI PAN Corpus of Polish.

XCES is an XML format. The document element is cesAna, and its chunkList child holds the content. Within chunkList, there should be either a paragraph list or a sentence list. The XCES dialect supported by WCRFT and other Corpus2-based tools does not allow nested paragraphs: a document may have at most one level of paragraphs, and paragraphs must consist of sentences.

Paragraphs are marked with chunk XML nodes. A node may optionally be assigned a unique id value (names should start with a letter) and a type string. Types may be used to distinguish between regular paragraphs and, for instance, document headers. When tagging plain text, no paragraph types are assigned. Note: chunk type="s" is reserved for sentences; you cannot use this type for paragraphs.

Sentences are marked as chunk XML nodes with type="s" attribute.

Note: this naming is confusing. The term chunk here denotes a group of sentences, i.e. a paragraph. This has nothing to do with syntactic chunks (sequences of tokens corresponding to syntactic phrases as recognised by a chunker).

Sentences consist of tokens and no-space nodes. Each token is marked with a tok XML node (no attributes allowed), consisting of the following items (which should be given in this order):
  • orth -- token's orthographic form (as encountered in running text),
  • list of interpretations, each marked by lex XML node.

No-space nodes are empty ns XML nodes: <ns/>. No-space nodes are placed between tokens to mark that no space came between the tokens in running text before tokenisation.

Each lex may have no attributes or disamb="1", denoting that this interpretation has been chosen by the tagger. By default, the tagger leaves only the chosen interpretations, hence every lex node will be marked as disamb="1".

A lex node consists of two nodes:
  • base -- lemma
  • ctag -- morphosyntactic tag

Here is a very short example XCES document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cesAna SYSTEM "xcesAnaIPI.dtd">
<cesAna xmlns:xlink="http://www.w3.org/1999/xlink" version="1.0" type="lex disamb">
<chunkList>
 <chunk>
  <chunk type="s">
   <tok>
    <orth>Bez</orth>
    <lex><base>bez</base><ctag>prep:gen:nwok</ctag></lex>
    <lex><base>bez</base><ctag>subst:sg:nom:m3</ctag></lex>
    <lex><base>bez</base><ctag>subst:sg:acc:m3</ctag></lex>
    <lex><base>beza</base><ctag>subst:pl:gen:f</ctag></lex>
   </tok>
   <tok>
    <orth>pracy</orth>
    <lex><base>praca</base><ctag>subst:sg:gen:f</ctag></lex>
    <lex><base>praca</base><ctag>subst:sg:dat:f</ctag></lex>
    <lex><base>praca</base><ctag>subst:sg:loc:f</ctag></lex>
   </tok>
  </chunk>
 </chunk>
</chunkList>
</cesAna>
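Tagger output in this format can be read with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module to list the tokens of each sentence, keep only the interpretations marked disamb="1", and record where an ns node indicated missing whitespace. The embedded document is a hand-written illustration of what tagged output could look like, and read_sentences is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List, Tuple

XCES = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cesAna SYSTEM "xcesAnaIPI.dtd">
<cesAna xmlns:xlink="http://www.w3.org/1999/xlink" version="1.0" type="lex disamb">
<chunkList>
 <chunk>
  <chunk type="s">
   <tok>
    <orth>Ona</orth>
    <lex disamb="1"><base>on</base><ctag>ppron3:sg:nom:f:ter:akc:npraep</ctag></lex>
   </tok>
   <tok>
    <orth>nie</orth>
    <lex disamb="1"><base>nie</base><ctag>qub</ctag></lex>
   </tok>
   <ns/>
   <tok>
    <orth>.</orth>
    <lex disamb="1"><base>.</base><ctag>interp</ctag></lex>
   </tok>
  </chunk>
 </chunk>
</chunkList>
</cesAna>
"""

# (orth, no_space_before, [(lemma, tag), ...]) with only disamb="1" interpretations
Token = Tuple[str, bool, List[Tuple[str, str]]]


def read_sentences(xml_text: str) -> Iterator[List[Token]]:
    """Yield each <chunk type="s"> as a list of tokens."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    for sent in root.iter("chunk"):
        if sent.get("type") != "s":
            continue  # paragraph-level chunk, not a sentence
        tokens: List[Token] = []
        glued = False
        for node in sent:
            if node.tag == "ns":
                glued = True  # the next token follows without whitespace
            elif node.tag == "tok":
                chosen = [
                    (lex.findtext("base"), lex.findtext("ctag"))
                    for lex in node.findall("lex")
                    if lex.get("disamb") == "1"
                ]
                tokens.append((node.findtext("orth"), glued, chosen))
                glued = False
        yield tokens


for sentence in read_sentences(XCES):
    print([(orth, glued) for orth, glued, _ in sentence])
# prints: [('Ona', False), ('nie', False), ('.', True)]
```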

CCL

CCL is a conservative modification of XCES that allows storing information on syntactic chunks and their heads. The format was initially developed for use with WCCL, hence the name.

Note: when using the tagger and no chunker, no chunks/syntactic information will be generated in the output file, even if you choose -o ccl. Annotating chunks is not the job of the tagger. In case of the CCL format, it means that there will be no ann XML nodes.

CCL format description may be found on this site: http://nlp.pwr.wroc.pl/redmine/projects/corpus2/wiki/CCL_format

IOB-CHAN

IOB-CHAN is a very simple text-based format that can store the following information:
  • division into sentences (division into paragraphs is ignored)
  • division into tokens (without no-space information)
  • morphosyntactic annotations (limited to one interpretation per token)
  • chunk-style annotations (no possibility to annotate heads)
The format is simple plain text (UTF-8) consisting of two types of lines:
  1. Token line
  2. Sentence delimiter (empty line)
Token line contains information about:
  1. Token orthographic form (ORTH)
  2. Lemma (LEMMA)
  3. Morphosyntactic tag (TAG)
  4. IOB-string describing syntactic chunks that cross the given token (IOB)

Note: in case of using the tagger only, the IOB-string will always be empty (no chunk information).

Every token line consists of the above elements in the above order separated by the TAB character:

ORTH  LEMMA  TAG  IOB

(note: the separator is in fact a TAB character; this may not be rendered properly in this manual)

IOB-string is a comma-separated sequence of labels, one label for each of the channels. An example IOB-string is chunk_np-I,chunk_agp-B.
Each label describes the state of the channel with respect to the current token. The label consists of two parts:
  1. channel name (e.g. chunk_np)
  2. IOB tag, that is I, O, or B.

IOB tags are used to describe chunk annotation in a concise per-token way. The B tag denotes that a chunk within the channel begins with this token. The I tag denotes that this token belongs to a chunk of the given type (according to the channel name) but is not its first token (I stands for inside). The O tag denotes that this token is outside of any chunk in the given channel.
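The token line and the IOB-string can be split with a few lines of Python. This is a sketch based only on the description above; both function names are hypothetical, and the sample line simply combines the example token and the example IOB-string used earlier in this manual.

```python
from typing import Dict, Tuple


def parse_iob_string(iob: str) -> Dict[str, str]:
    """Map each channel name to its IOB tag, e.g. {'chunk_np': 'I'}."""
    channels: Dict[str, str] = {}
    for label in filter(None, iob.split(",")):
        # The IOB tag is the part after the last hyphen of the label.
        name, _, state = label.rpartition("-")
        if state not in ("I", "O", "B"):
            raise ValueError(f"bad IOB tag in label {label!r}")
        channels[name] = state
    return channels


def parse_token_line(line: str) -> Tuple[str, str, str, Dict[str, str]]:
    """Split ORTH TAB LEMMA TAB TAG TAB IOB; the IOB field may be empty or absent."""
    fields = line.rstrip("\n").split("\t")
    orth, lemma, tag = fields[:3]
    iob = fields[3] if len(fields) > 3 else ""
    return orth, lemma, tag, parse_iob_string(iob)


print(parse_token_line("prawa\tprawo\tsubst:sg:gen:n\tchunk_np-I,chunk_agp-B"))
# prints: ('prawa', 'prawo', 'subst:sg:gen:n', {'chunk_np': 'I', 'chunk_agp': 'B'})
```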

Linguistic information

NKJP tagset

The default configurations of the tagger and the chunker assume the NKJP tagset, that is, the tagset of the National Corpus of Polish (http://nkjp.pl).

The tagset is described in the following paper in English:

Adam Przepiórkowski. A comparison of two morphosyntactic tagsets of Polish. In: Violetta Koseska-Toszewa, Ludmila Dimitrova and Roman Roszko, eds., Representing Semantics in Digital Lexicography: Proceedings of MONDILEX Fourth Open Workshop, Warsaw, 29-30 June 2009, pp. 138-144

Article full text may be obtained from http://nlp.ipipan.waw.pl/~adamp/Papers/2009-mondilex/

A more detailed description of the tagset, the underlying annotation principles and the tokenisation may be found in the following book (in Polish, though):

Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski and Barbara Lewandowska-Tomaszczyk (eds.) Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN, Warszawa

Book full text is available under Creative Commons here: http://nkjp.pl/settings/papers/NKJP_ksiazka.pdf

It is recommended to consult the above references for detailed information. Below is only a brief tagset summary.

Each tag starts with the grammatical class (roughly, part of speech). Depending on the grammatical class, the tag may also carry attribute values (values of grammatical categories). For instance, the class of nouns (subst) requires specifying the value of grammatical number, case and gender. An example noun tag is subst:sg:nom:f, meaning a noun in singular number (sg), nominative case (nom) and feminine gender (f).

Note that there are more grammatical classes than traditional parts of speech. There are multiple verb classes, for instance inf (infinitives) and fin (finite verb forms, including some present and future verbs). There is a special class for punctuation tokens, interp (no values assigned), and a class for abbreviations (brev).

Also note that the classes are distinguished primarily on the grounds of inflection. Therefore some forms will be assigned classes differently than according to "school" grammars. E.g., there is no general class for pronouns; pronouns inflecting like adjectives are marked as adjectives (adj), pronouns inflecting as nouns are marked as subst. There are two classes for personal pronouns: first/second-person pronouns (ppron12) and third-person pronouns (ppron3).

Every grammatical class is assigned a lemmatisation strategy. In some cases the strategy may seem controversial, e.g. gerunds and participles are lemmatised to infinitive forms (e.g. jedzenie -> jeść, zjedzony -> jeść).

Also note that the segmentation strategy is untraditional when it comes to some verb forms. Past verb forms are interpreted as consisting of the l-participle and agglutinative form of BYĆ. E.g., poszedłem is split into poszedł and em. Please consult the above references for a description of the strategy and its motivation.

Below is a full list of grammatical classes defined in the NKJP tagset. The left column gives the class and the right column lists its attributes. Attributes written in square brackets are considered optional, i.e. a tag will still be valid if no value is given for such attributes.

adja
adjp
adjc
conj
comp
interp
pred
xxx
adv     [deg]
imps    asp
inf     asp
pant    asp
pcon    asp
qub     [vcl]
prep    cas [vcl]
siebie  cas
subst   nmb cas gnd
depr    nmb cas gnd
ger     nmb cas gnd asp ngt
ppron12 nmb cas gnd per [acn]
ppron3  nmb cas gnd per [acn] [ppr]
num     nmb cas gnd [acm]
numcol  nmb cas gnd [acm]
adj     nmb cas gnd deg
pact    nmb cas gnd asp ngt
ppas    nmb cas gnd asp ngt
winien  nmb gnd asp
praet   nmb gnd asp [agg]
bedzie  nmb per asp
fin     nmb per asp
impt    nmb per asp
aglt    nmb per asp vcl
ign
brev    dot
burk
interj

The following list specifies the possible values (right column) of each attribute (left column).

nmb     sg pl
cas     nom gen dat acc inst loc voc
gnd     m1 m2 m3 f n
per     pri sec ter
deg     pos com sup
asp     imperf perf
ngt     aff neg
acm     congr rec
acn     akc nakc
ppr     npraep praep
agg     agl nagl
vcl     nwok wok
dot     pun npun
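The two tables above are enough to split a tag into named attribute values and to check it for validity. The Python sketch below encodes both tables as strings and walks a tag left to right, skipping optional attributes whose value is absent. It is an illustration only: parse_tag is a hypothetical helper, not a WCRFT or Corpus2 function, and it covers only the tag shapes listed in this manual.

```python
from typing import Dict

CLASS_SPEC = (
    "adja; adjp; adjc; conj; comp; interp; pred; xxx; ign; burk; interj; "
    "adv [deg]; imps asp; inf asp; pant asp; pcon asp; qub [vcl]; "
    "prep cas [vcl]; siebie cas; subst nmb cas gnd; depr nmb cas gnd; "
    "ger nmb cas gnd asp ngt; ppron12 nmb cas gnd per [acn]; "
    "ppron3 nmb cas gnd per [acn] [ppr]; num nmb cas gnd [acm]; "
    "numcol nmb cas gnd [acm]; adj nmb cas gnd deg; pact nmb cas gnd asp ngt; "
    "ppas nmb cas gnd asp ngt; winien nmb gnd asp; praet nmb gnd asp [agg]; "
    "bedzie nmb per asp; fin nmb per asp; impt nmb per asp; "
    "aglt nmb per asp vcl; brev dot"
)
VALUE_SPEC = (
    "nmb sg pl; cas nom gen dat acc inst loc voc; gnd m1 m2 m3 f n; "
    "per pri sec ter; deg pos com sup; asp imperf perf; ngt aff neg; "
    "acm congr rec; acn akc nakc; ppr npraep praep; agg agl nagl; "
    "vcl nwok wok; dot pun npun"
)

# class name -> ordered attribute names (optional ones keep their brackets)
CLASSES = {row.split()[0]: row.split()[1:] for row in CLASS_SPEC.split("; ")}
# attribute name -> allowed values
VALUES = {row.split()[0]: row.split()[1:] for row in VALUE_SPEC.split("; ")}


def parse_tag(tag: str) -> Dict[str, str]:
    """Split a tag into its class and named attribute values, or raise ValueError."""
    cls, *values = tag.split(":")
    if cls not in CLASSES:
        raise ValueError(f"unknown grammatical class: {cls}")
    result = {"class": cls}
    pending = list(values)
    for spec in CLASSES[cls]:
        optional = spec.startswith("[")
        attr = spec.strip("[]")
        if pending and pending[0] in VALUES[attr]:
            result[attr] = pending.pop(0)
        elif not optional:
            raise ValueError(f"{tag}: missing or invalid value for {attr}")
    if pending:
        raise ValueError(f"{tag}: unexpected values {pending}")
    return result


print(parse_tag("subst:sg:nom:f"))
# prints: {'class': 'subst', 'nmb': 'sg', 'cas': 'nom', 'gnd': 'f'}
print(parse_tag("prep:gen:nwok"))
# prints: {'class': 'prep', 'cas': 'gen', 'vcl': 'nwok'}
```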