User guide

This user guide describes how to process plain text using the iobber_txt utility and standard configurations. The utility serves two purposes:
  1. Performs morphosyntactic tagging of the input using the WCRFT tagger (http://nlp.pwr.wroc.pl/redmine/projects/wcrft/wiki/)
  2. Performs syntactic chunking, that is, recognises syntactic phrases without analysing their internal structure. Each chunk is also assigned its syntactic head.

Linux version

The manual assumes that both IOBBER (the chunker) and WCRFT (the tagger) have been installed system-wide. As a result, the iobber_txt executable should be available to any user, regardless of the current directory.

IOBBER is distributed with its trained model, and a successful installation of the code also makes the model available. There should therefore be no need to point to the directory with the model (iobber_txt assumes a default model location that should work).

WCRFT is also distributed with its trained model (for its nkjp_e2.ini configuration). In case you want to use another configuration and another model, you can specify it using -w (config) and -W (model) options.

Windows version

The Windows version is distributed as a directory with precompiled binaries for iobber_txt (iobber_txt.exe) and all its dependencies, including WCRFT and Maca. The trained models for both the tagger and the chunker are also included as subdirectories. Their names match the default values assumed by iobber_txt, so under normal circumstances you don't have to use the -W or -C switches to override them.

Note: for the Windows version to work, you must change the current directory (e.g. with the cd command in a cmd window) to the directory containing iobber_txt.exe and run it from there.

Glossary

WCRFT is a morphosyntactic tagger made for Polish. It is able to process plain text and output the following structure:
  1. text is divided into paragraphs,
  2. each paragraph is divided into sentences,
  3. each sentence is divided into tokens (words, punctuation, numbers, symbols; in rare cases words are split into tokens),
  4. each token is assigned an interpretation (in some rare cases more than one interpretation [1]),
  5. interpretation consists of a lemma (base form, dictionary form) and a morphosyntactic tag,
  6. tags are structured as well; this is described at the end of this manual.
IOBBER is a syntactic chunker for Polish. It is able to extend the above structure in the following way:
  1. each sentence is annotated with syntactic chunks (description of recognised chunk types may be found at the end of this manual); each syntactic chunk is a sequence of tokens grouped together, corresponding to a syntactic phrase, e.g. noun phrase or verb phrase; in other words, chunks are flat phrases (that is, only phrase boundaries are detected, but no internal structure);
  2. syntactic chunks are extended with the location of their syntactic heads; syntactic head is one of the chunk's tokens that represents the whole chunk syntactically; heads of noun phrase chunks are most often nouns.

The iobber_txt tool runs both WCRFT and IOBBER (if you need only tagging, it is possible to run WCRFT only: use the --no-chunk switch).

Note: chunk recognition is a form of shallow parsing, that is partial syntactic analysis. A chunk of a given type (e.g. noun phrase chunk) may not overlap with any other chunk of the same type. In full parsing, a noun phrase may consist of several other noun phrases; in chunking, one level of nesting is selected and only this level is represented as chunks, discarding the information about inner or outer phrases.

Chunks are organised in channels. A channel is the set of chunks in a sentence that have the same type. For instance, chunk_np is the channel for noun phrase chunks. Channels are a simple concept that helps represent chunks of various types at the same time. The idea follows the assumption that chunks of a given type cannot overlap, while chunks of different types may overlap. Below is an example sentence fragment, Ministerstwo Edukacji Narodowej i sąd, with two channels, chunk_np and chunk_agp:

tokens      Ministerstwo   Edukacji   Narodowej   i   sąd
chunk_np   [=====***===========================]     [***]
chunk_agp  [=====***====] [===***==============]     [***]

The asterisks (***) above are used to indicate that a token is a chunk's syntactic head, e.g. for the chunk_np Ministerstwo Edukacji Narodowej, the head is located at the Ministerstwo token.

[1] The tagger disambiguates only tags; if there are two (or more) possible lemmas with the same tag, the output remains partially ambiguous. E.g., the form kręgi may be of lemma kręg or krąg. Some output formats (e.g. iob-chan) force one interpretation per token. In such cases, only the first lemma according to alphabetical order will remain.

Processing text files

The iobber_txt executable is able to process a text file and generate output in one of the supported output formats.
The input should be a text file encoded in UTF-8.

Usage:

iobber_txt input.txt -O output.xml

Note: if you get an error message cannot locate model….ini, you might need to provide the path to the tagger and/or chunker model explicitly. This will happen if the model directories are not located in the system or current directory, or have been renamed. You can give the paths to the models explicitly using the -C (chunker) and -W (tagger) switches, e.g.:

iobber_txt -W path/to/model_nkjp10_wcrft_e2 -C path/to/model-kpwr11-H input.txt -O output.xml

The default output format is CCL (-o ccl). You can select a different output format, e.g. iob-chan:

iobber_txt input.txt -o iob-chan -O output.txt

Output formats are described below.
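
For scripted use, the calls shown above can be assembled programmatically. The following Python sketch builds the argument list from the switches described in this section (-o, -O, -W, -C). The helper names build_command and run are illustrative only and are not part of IOBBER; the sketch also assumes that iobber_txt can be found on the PATH.

```python
import subprocess


def build_command(inputs, out_format="ccl", output=None,
                  tagger_model=None, chunker_model=None):
    """Return an argument list for iobber_txt (hypothetical helper)."""
    cmd = ["iobber_txt"]
    if tagger_model:
        cmd += ["-W", tagger_model]    # explicit tagger (WCRFT) model path
    if chunker_model:
        cmd += ["-C", chunker_model]   # explicit chunker (IOBBER) model path
    cmd += list(inputs)
    cmd += ["-o", out_format]          # ccl is the documented default
    if output:
        cmd += ["-O", output]          # without -O the result goes to stdout
    return cmd


def run(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

For example, run(build_command(["input.txt"], "iob-chan", output="output.txt")) corresponds to the iob-chan call shown above.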

Reading stdin, writing to stdout

IOBBER is able to write its output to the standard output (stdout). It will do so if you don't use the -O FILENAME option, e.g.:

iobber_txt input.txt
# or any other call, e.g.
iobber_txt input.txt -o iob-chan

iobber_txt is also able to read the input data stream from the standard input (stdin) when - is given as the input file name, and write directly to the standard output (stdout):

iobber_txt -

E.g.

echo 'Przedsiębiorcy, którzy skorzystali z prawa do zwolnienia nie rozliczają podatku VAT. Nie składają więc deklaracji VAT i nie wystawiają faktur.' | iobber_txt -

NOTE: the behaviour of the above call depends on the input encoding of your operating system and/or terminal. IOBBER expects the input in UTF-8. Under modern Linux distributions the input encoding is UTF-8 by default, which is fine, but the same cannot be said about Windows: the input encoding varies and is most likely not UTF-8 on your Windows computer. This shouldn't, however, affect redirecting UTF-8 text files to IOBBER; that should work correctly anywhere as long as the files are correct UTF-8 plain text.
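
One way to sidestep the terminal encoding issue is to feed the text from a script that encodes it explicitly. Below is a minimal Python sketch of this idea. The function name tag_text and the runner parameter are hypothetical (the latter exists only so the function can be exercised without IOBBER installed), and the sketch assumes that the output is UTF-8 as well.

```python
import subprocess


def tag_text(text, extra_args=(), runner=subprocess.run):
    """Send text to 'iobber_txt -' as UTF-8 bytes and return decoded stdout."""
    cmd = ["iobber_txt", *extra_args, "-"]
    # Encode explicitly so the result does not depend on the terminal encoding.
    result = runner(cmd, input=text.encode("utf-8"),
                    stdout=subprocess.PIPE, check=True)
    # Assumption: the tool writes its output as UTF-8.
    return result.stdout.decode("utf-8")
```

A call such as tag_text("Mam kręgi. Ona nie.", ["-o", "plain"]) would then mirror piping the same sentence through echo.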

Processing multiple files

When processing multiple small files it is recommended to run IOBBER once for a number of files. This will reduce the overhead of chunker start-up time. On the other hand, the peak memory usage will be somewhat higher as WCRFT loads its model incrementally on demand (the same happens for IOBBER itself, although IOBBER's trained model is much smaller than that of WCRFT). This obviously does not mean that the peak memory usage is proportional to input size — the memory usage grows only when there are situations not encountered earlier, which gets less and less likely as the tagging goes.

There are two modes of processing multiple files. The simpler one is just to give multiple input files as arguments:

iobber_txt input.txt in2.txt in3.txt

The other option is to prepare file lists (a file list is a text file with paths to input files, each path in a separate line) and use the batch mode:

iobber_txt --batch list.txt

You may specify more than one file list (files from all the lists will be processed):

iobber_txt --batch list1.txt list2.txt

In each of the above scenarios, the output will be written to INPUTFILE.tag, e.g. the result of processing in3.txt will be written to in3.txt.tag, in the same directory as in3.txt.
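
A file list for the batch mode can be generated by a short script. The Python sketch below writes one path per line and derives the output name according to the INPUTFILE.tag rule; all three function names are illustrative and not part of IOBBER.

```python
def write_file_list(paths, list_path):
    """Write one input path per line, as expected by --batch."""
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")


def batch_command(list_paths):
    """Argument list for a batch run over one or more file lists."""
    return ["iobber_txt", "--batch", *list_paths]


def expected_output(input_path):
    """Name of the result file according to the INPUTFILE.tag rule."""
    return input_path + ".tag"
```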

Each processed file will be saved in the selected output format, by default CCL. You can use another format, e.g.:

iobber_txt input.txt in2.txt in3.txt -o iob-chan

Using the tagger only

If you need only morphosyntactic tags (and lemmas) but no chunking information, it is recommended to switch off chunking. This will make the process a bit faster and the resulting file will be smaller (no chunking information).

To use iobber_txt for tagging only, use the --no-chunk switch. In this mode, IOBBER is used only as an interface to the WCRFT tagger.

The switch works for all the scenarios described above, including tagging text files, standard input and batch mode.

You can use any of the supported output formats for tagging only. Note that some formats (plain and xces) cannot store chunking information. If you plan to use one of those, you should use the --no-chunk switch to avoid the unnecessary computation of chunks, which would not be visible in the output in those formats anyway.

Some examples:

iobber_txt -W ~/corp/model_nkjp10_wcrft_e2  input.txt --no-chunk # CCL format with no chunks to stdout
iobber_txt --no-chunk -o xces -O out-xces.xml input.txt # XCES output saved to file

Practical tips

The tools will perform best when run against standard Polish input, with correct punctuation and correct spelling. Lack of Polish diacritics will significantly worsen the quality of the resulting tagging and chunking.

The tagger contains a module for paragraph and sentence splitting. Sentence splits are based mainly on the presence of punctuation marks. If you are going to process a number of separate fragments put into one text file, it is recommended to introduce at least two newlines between the fragments. Two or more newline characters will force the tagger to mark a paragraph boundary, which will in turn force a sentence boundary.

Without the above trick, most likely the tagger will merge the ending of one fragment with the beginning of the next one (as long as the first fragment is not finished with a punctuation mark and there is at most one newline between the fragments).
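
The two-newline trick is easy to apply when the input file is assembled by a script. A minimal Python sketch (the function name join_fragments is illustrative):

```python
def join_fragments(fragments):
    """Join fragments with two newlines so each one starts a new paragraph."""
    cleaned = [fragment.strip() for fragment in fragments if fragment.strip()]
    return "\n\n".join(cleaned) + "\n"
```

The result should be written to a file opened with UTF-8 encoding before it is passed to iobber_txt.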

The first run of iobber_txt takes significantly longer. Subsequent runs are much faster, since part of the linguistic data read during processing is cached by the operating system.

Output formats

The output format is selected with -o FORMAT, e.g. -o xces, -o ccl, -o iob-chan.

Below is a summary of the formats recommended for tagging. The list is not exhaustive; to get a list of all the supported output formats, run wcrft -h.

Format mnemonic  Full name          Type  Suitable for tagging (WCRFT)  Suitable for chunking (IOBBER)  Division into paragraphs  Whitespace between tokens
plain            simple plain text  text  yes                           no                              yes                       yes
xces             XCES               XML   yes                           no                              yes                       yes
ccl              CCL                XML   yes                           yes                             yes                       yes
iob-chan         IOB-CHAN           text  yes                           yes                             no                        no

Legend:
  • Format mnemonic is the codename used with -o switch
  • All the formats described here are suitable for tagging, meaning that they allow each token to be assigned an interpretation (= a morphosyntactic tag and a lemma); some formats allow more than one interpretation per token, see details below
  • Suitable for chunking means that this format may be used to store information on syntactic chunks. CCL can also store the location of the syntactic heads of chunks.
  • Whitespace between tokens means that the format records whether any whitespace occurred between two consecutive tokens. This information may be useful to restore plain text from tagger/chunker output (e.g. that a full stop occurred directly after the last word, with no space).

Format choice. The CCL format can store all the information that may be generated by IOBBER (tagging, chunks and their heads) and should be preferred if there are no other constraints. On the other hand, the other formats are simpler, so if you don't need all of this information, you may prefer iob-chan or plain: it is very simple to write a parser for those formats using standard string processing routines available in all modern programming languages.

Simple plain text format

This format should not be confused with plain text files containing just text (input to the tagger).

The format is simple plain text (UTF-8) containing information written in subsequent lines. There are three types of lines:
  1. Token orthographic form and space information (orth line)
  2. An interpretation attached to the token of the most recent orth line (interpretation line)
  3. Sentence delimiter (empty line)
Each plain output file has the following syntax:
  1. A token description is one orth line followed by one or more interpretation lines
  2. A sentence description is a sequence of token lines (one or more) followed by an empty line
  3. A paragraph description is a sequence of sentences (one or more) followed by an empty line
  4. Whole file is a sequence of sentence descriptions or paragraph descriptions

Paragraph boundaries are marked as two empty lines.

An orth line has the following form: ORTH TAB SPACE (the token's orthographic form followed by a tab character, followed by a space-info string). The orthographic form is the unchanged text of the token as encountered in the original text it was taken from. The space-info string is one of the following:
  • newline if the token came after a newline (or at the beginning of the file),
  • space if the token came after a space,
  • none if the token came directly after the previous token (e.g. the token is a comma after a word).

An example orth line: prawa space

An interpretation line has one of the following forms:
  • TAB LEMMA TAB TAG disamb if the interpretation is chosen as the correct one (disamb is just this 6-letter string),
  • TAB LEMMA TAB TAG (unless you tweak the tagger options, you will never get this form)

An example interpretation line: prawo subst:sg:gen:n disamb
Note: this line starts with a tab character, which may not be visible in this manual.

Example file in the simple plain text format (note: all the runs of spaces used below should in fact be single tab characters; this might not be rendered correctly in this manual):

Mam     newline
        mieć         fin:sg:pri:imperf               disamb
kręgi   space
        krąg         subst:pl:acc:m3                 disamb
        kręg         subst:pl:acc:m3                 disamb
.       none
        .            interp                          disamb

Ona     space
        on           ppron3:sg:nom:f:ter:akc:npraep  disamb
nie     space
        nie          qub                             disamb
.       none
        .            interp                          disamb

The above example has been generated using this call:

echo 'Mam kręgi. Ona nie.' | iobber_txt -o plain -
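
The line types described above can be read with a few lines of string processing. The following Python sketch is one possible reader, not a reference implementation: the function name parse_plain and the returned data layout are illustrative, the disamb marker is assumed to be separated from the tag by a tab, and paragraph boundaries are not tracked.

```python
def parse_plain(text):
    """Parse the simple plain text format into a list of sentences.

    Each sentence is a list of tokens; each token is a dict with the keys
    'orth', 'space' and 'lexemes' (a list of (lemma, tag, disamb) tuples).
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Empty line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("\t"):
            # Interpretation line: TAB LEMMA TAB TAG [TAB disamb]
            fields = line[1:].split("\t")
            lemma, tag = fields[0], fields[1]
            disamb = len(fields) > 2 and fields[2] == "disamb"
            current[-1]["lexemes"].append((lemma, tag, disamb))
        else:
            # Orth line: ORTH TAB SPACE-INFO
            orth, space = line.rsplit("\t", 1)
            current.append({"orth": orth, "space": space, "lexemes": []})
    if current:
        sentences.append(current)
    return sentences
```

Applied to the example file above, the token kręgi ends up with two lexemes (krąg and kręg), both marked as disamb.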

XCES

The format described here is a dialect of the XCES format. This dialect has been made for the IPI PAN Corpus of Polish.

XCES is an XML format. The content of each document is held in a chunkList node (wrapped in the top-level cesAna node, as in the example below). Within the chunkList node, there should be either a paragraph list or a sentence list. The XCES format supported by WCRFT and other Corpus2-based tools does not allow nested paragraphs: a document may have one level of paragraphs, and the paragraphs must consist of sentences.

Paragraphs are marked with chunk XML nodes. The nodes may optionally be assigned unique id values (names should start with a letter) and a type string. Types may be used to distinguish between regular paragraphs and, for instance, document headers. When tagging plain text, no paragraph types are assigned. Note: chunk type="s" is reserved for sentences; you cannot use this type for paragraphs.

Sentences are marked as chunk XML nodes with type="s" attribute.

Note: this naming is confusing. The term chunk is here used to denote a bunch of sentences = a paragraph. This has nothing to do with syntactic chunks (sequences of tokens corresponding to syntactic phrases as recognised by a chunker).

Sentences consist of tokens and no-space nodes. Each token is marked with a tok XML node (no attributes allowed), consisting of the following items (which should be given in the following order):
  • orth -- token's orthographic form (as encountered in running text),
  • list of interpretations, each marked by lex XML node.

No-space nodes are empty ns XML nodes: <ns/>. No-space nodes are placed between tokens to mark that no space came between the tokens in running text before tokenisation.

Each lex may have no attributes or disamb="1", denoting that this interpretation has been chosen by the tagger. By default, the tagger leaves only the chosen interpretations, hence every lex node will be marked as disamb="1".

lex node consists of two nodes:
  • base -- lemma
  • ctag -- morphosyntactic tag

Here is a very short example XCES document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cesAna SYSTEM "xcesAnaIPI.dtd">
<cesAna xmlns:xlink="http://www.w3.org/1999/xlink" version="1.0" type="lex disamb">
<chunkList>
 <chunk>
  <chunk type="s">
   <tok>
    <orth>Kot</orth>
    <lex disamb="1"><base>kot</base><ctag>subst:sg:nom:m2</ctag></lex>
   </tok>
   <tok>
    <orth>mruczy</orth>
    <lex disamb="1"><base>mruczeć</base><ctag>fin:sg:ter:imperf</ctag></lex>
   </tok>
   <ns/>
   <tok>
    <orth>.</orth>
    <lex disamb="1"><base>.</base><ctag>interp</ctag></lex>
   </tok>
  </chunk>
 </chunk>
</chunkList>
</cesAna>

The DTD for XCES format may be obtained from http://nlp.pwr.wroc.pl/redmine/projects/corpus2/repository/revisions/master/show/doc (xcesAnaIPI.dtd and xheaderIPI.elt).
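
Since XCES is plain XML, it can be read with any XML library. The Python sketch below uses the standard xml.etree.ElementTree module to list the tokens of each sentence and to rebuild running text from the no-space nodes. The function names are illustrative, and the sketch assumes that every tok node contains at least one lex node.

```python
import xml.etree.ElementTree as ET


def xces_sentences(xml_bytes):
    """Yield each sentence as a list of (orth, lemma, tag, no_space_before)."""
    root = ET.fromstring(xml_bytes)
    for sent in root.iter("chunk"):
        if sent.get("type") != "s":
            continue  # paragraph-level chunk, not a sentence
        tokens, no_space = [], False
        for node in sent:
            if node.tag == "ns":
                no_space = True  # the next token follows without whitespace
            elif node.tag == "tok":
                # Prefer the interpretation chosen by the tagger.
                lex = node.find("lex[@disamb='1']")
                if lex is None:
                    lex = node.find("lex")
                tokens.append((node.findtext("orth"), lex.findtext("base"),
                               lex.findtext("ctag"), no_space))
                no_space = False
        yield tokens


def restore_text(tokens):
    """Rebuild running text, honouring the no-space markers."""
    parts = []
    for index, (orth, _lemma, _tag, no_space) in enumerate(tokens):
        if index > 0 and not no_space:
            parts.append(" ")
        parts.append(orth)
    return "".join(parts)
```

For the example document above, restore_text gives back Kot mruczy. with no space before the full stop.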

CCL

CCL is a conservative modification of XCES that can also store information on syntactic chunks and their heads. The format was initially developed for use with WCCL, hence the name.

Note: when using the tagger and no chunker, no chunks/syntactic information will be generated in the output file, even if you choose -o ccl. Annotating chunks is not the job of the tagger.

CCL can store:
  • division into paragraphs and sentences
  • paragraphs and sentences may be given IDs (id attribute; in XCES only paragraphs may have IDs)
  • division into tokens and no-space information (XCES-like)
  • morphosyntactic annotations (XCES-like)
  • chunk-style annotations with possible discontinuities (IOBBER chunker will not mark any discontinuities)
  • syntactic heads of chunks
  • properties of tokens and, implicitly, properties of annotations (IOBBER does not use this, so it won't be discussed here)

CCL's root node is also called chunkList. The division into paragraphs is mandatory for CCL files (or, at least, strongly recommended).

Paragraphs are marked with chunk nodes (as in XCES). The nodes may optionally be assigned unique id values (names should start with a letter) and a type string. Types may be used to distinguish between regular paragraphs and, for instance, document headers. When tagging plain text, paragraphs are not assigned any types. Note: chunk type="s" should be avoided; it is reserved for sentences in XCES and may cause trouble when converting to XCES.

Sentences are marked with sentence XML nodes (unlike XCES, chunk node is not used here). Sentences may have ids (id attribute). Ids should start with a letter, digits may follow.

Each sentence consists of tokens. No-space nodes are also allowed. No-space nodes are empty ns XML nodes: <ns/>. No-space nodes are placed between tokens to mark that no space came between the tokens in running text before tokenisation.

Each token is marked with a tok XML node (no attributes allowed), consisting of the following items (which should be given in the following order):
  • orth -- token's orthographic form (as encountered in running text),
  • list of interpretations, each marked by lex XML node.
  • list of syntactic annotation information.
lex node consists of two nodes (exactly as in XCES format):
  • base -- lemma
  • ctag -- morphosyntactic tag

Syntactic annotation information consists of ann XML nodes. Each ann node should have chan attribute specifying the channel name. ann node may also contain information that the current token is a syntactic head — this is expressed as head="1" attribute. There should be at most one head token for each chunk. Inside the ann node there should be a number denoting to which syntactic chunk the current token belongs. 0 means that the token belongs to no chunk in the given channel, positive values designate number/identifier of the chunk within the channel and the sentence. The numbers are usually ordered, but it is not required — any positive number may be used, ordering is optional.

Here is an example syntactic chunk annotation:

tokens      Ministerstwo   Edukacji   Narodowej   i   sąd
chunk_np   [=====***===========================]     [***]
chunk_agp  [=====***====] [===***==============]     [***]

The above annotation would result in the following CCL output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk>
  <sentence>
   <tok>
    <orth>Ministerstwo</orth>
    <lex disamb="1"><base>ministerstwo</base><ctag>subst:sg:nom:n</ctag></lex>
    <ann chan="chunk_agp" head="1">1</ann>
    <ann chan="chunk_np" head="1">1</ann>
   </tok>
   <tok>
    <orth>Edukacji</orth>
    <lex disamb="1"><base>edukacja</base><ctag>subst:sg:gen:f</ctag></lex>
    <ann chan="chunk_agp" head="1">2</ann>
    <ann chan="chunk_np">1</ann>
   </tok>
   <tok>
    <orth>Narodowej</orth>
    <lex disamb="1"><base>narodowy</base><ctag>adj:sg:gen:f:pos</ctag></lex>
    <ann chan="chunk_agp">2</ann>
    <ann chan="chunk_np">1</ann>
   </tok>
   <tok>
    <orth>i</orth>
    <lex disamb="1"><base>i</base><ctag>conj</ctag></lex>
    <ann chan="chunk_agp">0</ann>
    <ann chan="chunk_np">0</ann>
   </tok>
   <tok>
    <orth>sąd</orth>
    <lex disamb="1"><base>sąd</base><ctag>subst:sg:nom:m3</ctag></lex>
    <ann chan="chunk_agp" head="1">3</ann>
    <ann chan="chunk_np" head="1">2</ann>
   </tok>
  </sentence>
 </chunk>
</chunkList>

The chunk numbers given within the ann nodes have the following interpretation:

tokens      Ministerstwo   Edukacji   Narodowej   i   sąd
chunk_np   [    *1*            1          1    ]     [*2*]
chunk_agp  [    *1*     ] [   *2*         2    ]     [*3*]

The numbers should be interpreted separately in each sentence and each channel. The sequence of three 1's in the chunk_np channel means that there is a continuous chunk stretching through the first three tokens. It is not important that the number used is 1; what matters is that the same non-zero number is attached to each of the three tokens, meaning that the three tokens belong to one chunk. The chunk_agp channel contains three chunks: the first one consists just of the first token (Ministerstwo), the second one consists of the second and third token (Edukacji Narodowej) and the third one consists of the last token (sąd). Again, the numbers themselves are not important. What is important is that the same number is assigned to the second and third token, and that a different one is assigned to the first token; the same holds for the last chunk. Note that the numbering should be interpreted separately in each channel: the first chunk in the chunk_np channel is assigned the number 1, and the first chunk in the chunk_agp channel is also assigned the number 1, but this is coincidental.

The CCL format also makes it possible to specify which tokens are chunks' heads. IOBBER does this by default. In the above CCL output, tokens being chunks' heads are marked with head="1". This was marked with asterisks surrounding the numbers in the above visualisation. The above CCL output therefore contains the following chunks:
  • chunk_np Ministerstwo Edukacji Narodowej with Ministerstwo set as syntactic head
  • chunk_np sąd with sąd set as syntactic head
  • chunk_agp Ministerstwo with Ministerstwo set as syntactic head
  • chunk_agp Edukacji Narodowej with Edukacji set as syntactic head
  • chunk_agp sąd with sąd set as syntactic head

NOTE: if you run the default IOBBER configuration, it will also attempt to mark chunk_vp and chunk_adjp chunks. Even if no chunk of those types is found in a sentence, the corresponding channels will be present in the output (but empty). For clarity we omit those empty channels in this example and in the following ones.

The DTD for CCL format may be obtained from http://nlp.pwr.wroc.pl/redmine/projects/corpus2/repository/revisions/master/show/doc (ccl.dtd).
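
The interpretation of the ann nodes described above can be sketched in Python with the standard xml.etree.ElementTree module. Tokens are grouped by channel and chunk number, and the token carrying head="1" is recorded as the head. The function name ccl_chunks and the returned tuple layout are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def ccl_chunks(sentence):
    """Collect the chunks of one <sentence> element.

    Returns a list of (channel, tokens, head) tuples, where tokens is the
    list of orthographic forms in the chunk and head is the orthographic
    form of the head token (None if no head is marked).
    """
    groups = defaultdict(list)   # (channel, chunk number) -> orths
    heads = {}                   # (channel, chunk number) -> head orth
    for tok in sentence.iter("tok"):
        orth = tok.findtext("orth")
        for ann in tok.iter("ann"):
            number = int(ann.text)
            if number == 0:
                continue         # token is outside any chunk in this channel
            key = (ann.get("chan"), number)
            groups[key].append(orth)
            if ann.get("head") == "1":
                heads[key] = orth
    return [(channel, tokens, heads.get((channel, number)))
            for (channel, number), tokens in sorted(groups.items())]
```

A document would be processed sentence by sentence, e.g. by calling ccl_chunks for every element returned by ET.fromstring(data).iter("sentence"), where data holds the bytes of a CCL file. Because grouping relies only on equal numbers, the sketch should also cope with discontinuous chunks.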

IOB-CHAN

IOB-CHAN is a very simple text-based format that can store the following information:
  • division into sentences (division into paragraphs is ignored)
  • division into tokens (without no-space information)
  • morphosyntactic annotations (limited to one interpretation per token)
  • chunk-style annotations (no possibility to annotate heads)
The format is simple plain text (UTF-8) consisting of two types of lines:
  1. Token line
  2. Sentence delimiter (empty line)
Token line contains information about:
  1. Token orthographic form (ORTH)
  2. Lemma (LEMMA)
  3. Morphosyntactic tag (TAG)
  4. IOB-string describing syntactic chunks that cross the given token (IOB)

Every token line consists of the above elements in the above order separated by the TAB character:

ORTH  LEMMA  TAG  IOB

(note: TAB character is used in fact, this may be not rendered properly in this manual)

If the sentence contains no syntactic chunks, the IOB-strings will be empty. The format may still be useful for storing tagging output as it is very simple.

IOB-string is a comma-separated sequence of labels, one label for each of the channels. An example IOB-string is chunk_np-I,chunk_agp-B.
Each label describes the state of the channel with respect to the current token. The label consists of two parts:
  1. channel name (e.g. chunk_np)
  2. IOB tag, that is I, O, or B.

IOB tags are used to describe chunk annotation in a concise per-token way. The B tag denotes that a chunk within the channel begins with this token. The I tag denotes that this token belongs to a chunk of the given type (according to the channel name) but is not its first token (I stands for inside). The O tag denotes that this token is outside of any chunk in the given channel.

Here is an example fragment:

tokens      Ministerstwo   Edukacji   Narodowej   i   sąd
chunk_np   [===================================]     [===]
chunk_agp  [============] [====================]     [===]

This fragment would get the following IOB tag representation:

tokens      Ministerstwo   Edukacji   Narodowej   i   sąd
chunk_np   [      B            I          I    ]  O  [ B ]
chunk_agp  [      B     ] [    B          I    ]  O  [ B ]

Note that every chunk is realised as a sequence of IOB tags starting with B and continuing with I tags. If the chunk is one token long, it is limited to a single B.

In the IOB-CHAN format itself, the fragment is written as follows:

Ministerstwo    ministerstwo    subst:sg:nom:n    chunk_agp-B,chunk_np-B
Edukacji    edukacja    subst:sg:gen:f    chunk_agp-B,chunk_np-I
Narodowej    narodowy    adj:sg:gen:f:pos    chunk_agp-I,chunk_np-I
i    i    conj    chunk_agp-O,chunk_np-O
sąd    sąd    subst:sg:nom:m3    chunk_agp-B,chunk_np-B
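
Token lines of this kind can be turned back into chunks with simple string operations. The Python sketch below is one possible reader; the function names parse_iob_chan and chunks_in_channel are illustrative. An I tag that does not follow a B or I tag is treated here as the start of a new chunk, which is a defensive choice of this sketch rather than something the format description states.

```python
def parse_iob_chan(text):
    """Parse IOB-CHAN text into sentences of (orth, lemma, tag, labels).

    labels maps each channel name to its IOB tag ('I', 'O' or 'B').
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Empty line: sentence delimiter.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        orth, lemma, tag = fields[:3]
        iob = fields[3] if len(fields) > 3 else ""
        labels = {}
        for label in filter(None, iob.split(",")):
            channel, _, iob_tag = label.rpartition("-")
            labels[channel] = iob_tag
        current.append((orth, lemma, tag, labels))
    if current:
        sentences.append(current)
    return sentences


def chunks_in_channel(sentence, channel):
    """Return the chunks of one channel as lists of orthographic forms."""
    chunks, inside = [], False
    for orth, _lemma, _tag, labels in sentence:
        iob_tag = labels.get(channel, "O")
        if iob_tag == "B" or (iob_tag == "I" and not inside):
            chunks.append([orth])    # start a new chunk
            inside = True
        elif iob_tag == "I":
            chunks[-1].append(orth)  # continue the current chunk
        else:
            inside = False
    return chunks
```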

Linguistic information

NKJP tagset

The default configurations of the tagger and the chunker assume the NKJP tagset, that is, the tagset of the National Corpus of Polish (http://nkjp.pl).

The tagset is described in the following publications in English:

Adam Przepiórkowski, Aleksander Buczyński, Jakub Wilk. The National Corpus of Polish Cheatsheet (on-line manual).

Tagset specification is available at http://nkjp.pl/poliqarp/help/ense2.html

Adam Przepiórkowski. A comparison of two morphosyntactic tagsets of Polish. In: Violetta Koseska-Toszewa, Ludmila Dimitrova and Roman Roszko, eds., Representing Semantics in Digital Lexicography: Proceedings of MONDILEX Fourth Open Workshop, Warsaw, 29-30 June 2009, pp. 138-144

Article full text may be obtained from http://nlp.ipipan.waw.pl/~adamp/Papers/2009-mondilex/

A more detailed description of the tagset and the underlying annotation principles and tokenisation may be found in the following book (in Polish, though):

Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski and Barbara Lewandowska-Tomaszczyk (eds.) Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN, Warszawa

Book full text is available under Creative Commons here: http://nkjp.pl/settings/papers/NKJP_ksiazka.pdf

It is recommended to consult the above references for detailed information (the first reference should be sufficient). Below is only a brief tagset summary.

Each tag begins with the grammatical class (roughly, part of speech). Depending on the grammatical class, the tag may then be assigned attribute values (values of grammatical categories). For instance, the class of nouns (subst) requires specifying the value of grammatical number, case and gender. An example noun tag is subst:sg:nom:f, meaning a noun in singular number (sg), nominative case (nom) and feminine gender (f).

Note that there are more grammatical classes than traditional parts of speech. There are multiple verb classes, for instance inf (infinitives), fin (finite verb forms, including some present and future verbs). There is a special class for punctuation tokens: interp (no values assigned), there is a class for abbreviations (brev).

Also note that the classes are distinguished primarily on the grounds of inflection. Therefore some forms will be assigned classes differently than according to "school" grammars. E.g., there is no general class for pronouns; pronouns inflecting like adjectives are marked as adjectives (adj), pronouns inflecting as nouns are marked as subst. There are two classes for personal pronouns: first/second-person pronouns (ppron12) and third-person pronouns (ppron3).

Every grammatical class is assigned a lemmatisation strategy. In some cases the strategy may seem controversial, e.g. gerunds and participles are lemmatised to infinitive forms (e.g. jedzenie -> jeść, zjedzony -> jeść).

Also note that the segmentation strategy is untraditional when it comes to some verb forms. Past verb forms are interpreted as consisting of the l-participle and agglutinative form of BYĆ. E.g., poszedłem is split into poszedł and em. Please consult the above references for a description of the strategy and its motivation.

Below is a full list of grammatical classes defined in the NKJP tagset. The right column assigns attributes to classes. Attributes written in square brackets are considered optional, i.e. a tag will still be valid if no value is given for such attributes.

adja
adjp
adjc
conj
comp
interp
pred
xxx
adv     [deg]
imps    asp
inf     asp
pant    asp
pcon    asp
qub     [vcl]
prep    cas [vcl]
siebie  cas
subst   nmb cas gnd
depr    nmb cas gnd
ger     nmb cas gnd asp ngt
ppron12 nmb cas gnd per [acn]
ppron3  nmb cas gnd per [acn] [ppr]
num     nmb cas gnd [acm]
numcol  nmb cas gnd [acm]
adj     nmb cas gnd deg
pact    nmb cas gnd asp ngt
ppas    nmb cas gnd asp ngt
winien  nmb gnd asp
praet   nmb gnd asp [agg]
bedzie  nmb per asp
fin     nmb per asp
impt    nmb per asp
aglt    nmb per asp vcl
ign
brev    dot
burk
interj

The following list specifies the possible values (right column) of each attribute (left column).

nmb     sg pl
cas     nom gen dat acc inst loc voc
gnd     m1 m2 m3 f n
per     pri sec ter
deg     pos com sup
asp     imperf perf
ngt     aff neg
acm     congr rec
acn     akc nakc
ppr     npraep praep
agg     agl nagl
vcl     nwok wok
dot     pun npun
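
Because every value in the list above belongs to exactly one attribute, a tag can be decoded by looking up each value, without knowing which optional attributes are present. The Python sketch below illustrates this. The names ATTRIBUTE_VALUES, VALUE_TO_ATTRIBUTE and decode_tag are illustrative, and the sketch assumes tags written with colon-separated single values, as in the tagger output shown in this manual.

```python
# Attribute -> possible values, copied from the list above.
ATTRIBUTE_VALUES = {
    "nmb": "sg pl",
    "cas": "nom gen dat acc inst loc voc",
    "gnd": "m1 m2 m3 f n",
    "per": "pri sec ter",
    "deg": "pos com sup",
    "asp": "imperf perf",
    "ngt": "aff neg",
    "acm": "congr rec",
    "acn": "akc nakc",
    "ppr": "npraep praep",
    "agg": "agl nagl",
    "vcl": "nwok wok",
    "dot": "pun npun",
}

# Every value belongs to exactly one attribute, so a reverse lookup suffices.
VALUE_TO_ATTRIBUTE = {
    value: attribute
    for attribute, values in ATTRIBUTE_VALUES.items()
    for value in values.split()
}


def decode_tag(tag):
    """Split a tag such as 'subst:sg:nom:f' into (class, {attribute: value})."""
    grammatical_class, *values = tag.split(":")
    # An unknown value raises KeyError, which flags a tag outside this tagset.
    return grammatical_class, {VALUE_TO_ATTRIBUTE[v]: v for v in values}
```

For instance, decode_tag("subst:sg:nom:f") yields the class subst with nmb=sg, cas=nom and gnd=f.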

Chunks

The default configuration of the chunker uses chunk definitions from the KPWr corpus (Polish Corpus of Wrocław University of Technology; http://nlp.pwr.wroc.pl/kpwr).

For a detailed description of the employed chunk types, please consult the following article:

Adam Radziszewski, Marek Maziarz and Jan Wieczorek. Shallow syntactic annotation in the Corpus of Wrocław University of Technology. Cognitive Studies 12, SOW, Warszawa 2012

Here is the full text of the article: http://nlp.pwr.wroc.pl/en/117/show/publication

Below is a brief summary of the chunk types:

  1. Noun Phrases (chunk_np): possibly complex noun and prepositional phrases (both are labelled NP here), limited to clause boundaries. Also, top-level coordination is always split (i.e. if the coordinated elements have no common syntactic superordinate NP, they constitute separate chunks).
  2. Adjective Phrases (chunk_adjp): top-level adjective phrases, i.e. annotated only when not modifying any superordinate NP.
  3. Verb Phrases (chunk_vp): (complex) verbs + adverbs that clearly modify the verbs + infinitive modifiers. Nominal arguments are not included; they constitute separate chunks.
  4. Agreement Phrases (chunk_agp): simple noun or adjective phrases based on morphological agreement in number, gender and case, possibly also containing indeclinable elements that modify other parts of a chunk. AgPs are based on local accommodation, while NPs, AdjPs and VPs are based on sentence predicate-argument structure.

The above set of chunks is grouped into two layers: one for Agreement Phrases, the other for NPs, VPs and AdjPs together (chunks defined within one layer shouldn't overlap; overlaps across layers do happen).