User guide¶
This user guide describes how to process plain text using the iobber_txt
utility and standard configurations. The utility serves two purposes:
- Performs morphosyntactic tagging of the input using the WCRFT tagger (http://nlp.pwr.wroc.pl/redmine/projects/wcrft/wiki/)
- Performs syntactic chunking, that is, recognises syntactic phrases without analysing their internal structure. Each chunk is also assigned its syntactic head.
Linux version¶
The manual assumes that both IOBBER (the chunker) and WCRFT (the tagger) have been installed system-wide. This should have resulted in the iobber_txt
executable available for any user, whatever the current directory.
IOBBER is distributed with its trained model and successful installation of the codes will also make the model available, hence there should be no need for pointing to the directory with the model (iobber_txt
assumes a default value for the model that should work).
WCRFT is also distributed with its trained model (for its nkjp_e2.ini
configuration). In case you want to use another configuration and another model, you can specify it using -w
(config) and -W
(model) options.
Windows version¶
The Windows version is distributed as a directory with precompiled codes for iobber_txt
(iobber_txt.exe
) and all its dependencies, including WCRFT and Maca. The trained models for both the tagger and the chunker are also available in the directory as subdirectories. Their names match the default values assumed by iobber_txt
and under normal circumstances you don't have to use the -W
or -C
switches to override them.
Note: for the Windows version to work, you must change the current directory (e.g. with cd
command in cmd
window) to the directory containing iobber_txt.exe
and run it from there.
Glossary¶
WCRFT is a morphosyntactic tagger made for Polish. It is able to process plain text and output the following structure:
- text is divided into paragraphs,
- each paragraph is divided into sentences,
- each sentence is divided into tokens (words, punctuation, numbers, symbols; in rare cases words are split into tokens),
- each token is assigned an interpretation (in some rare cases more than one interpretation; see footnote 1 below),
- interpretation consists of a lemma (base form, dictionary form) and a morphosyntactic tag,
- tags are structured as well; their structure is described at the end of this manual.
- each sentence is annotated with syntactic chunks (description of recognised chunk types may be found at the end of this manual); each syntactic chunk is a sequence of tokens grouped together, corresponding to a syntactic phrase, e.g. noun phrase or verb phrase; in other words, chunks are flat phrases (that is, only phrase boundaries are detected, but no internal structure);
- syntactic chunks are extended with the location of their syntactic heads; syntactic head is one of the chunk's tokens that represents the whole chunk syntactically; heads of noun phrase chunks are most often nouns.
The iobber_txt
tool runs both WCRFT and IOBBER (if you need only tagging, you can run WCRFT alone by using the --no-chunk
switch).
Note: chunk recognition is a form of shallow parsing, that is partial syntactic analysis. A chunk of a given type (e.g. noun phrase chunk) may not overlap with any other chunk of the same type. In full parsing, a noun phrase may consist of several other noun phrases; in chunking, one level of nesting is selected and only this level is represented as chunks, discarding the information about inner or outer phrases.
Chunks are organised in channels. A channel is a set of chunks in a sentence that have the same type. For instance, chunk_np
is a channel for noun phrase chunks. Channels are a simple concept that helps represent chunks of various types at the same time. The idea follows the assumption that chunks of a given type cannot overlap, while chunks of different types may overlap. Below is an example sentence fragment Ministerstwo Edukacji Narodowej i sąd and two channels: chunk_np
and chunk_agp
:
tokens      Ministerstwo    Edukacji Narodowej    i  sąd
chunk_np   [=====***===========================]   [***]
chunk_agp  [=====***====] [===***==============]   [***]
The asterisks (***
) above are used to indicate that a token is a chunk's syntactic head, e.g. for the chunk_np
Ministerstwo Edukacji Narodowej, the head is located at the Ministerstwo token.
1 The tagger disambiguates only tags; if there are two (or more) possible lemmas with the same tag, the output remains partially ambiguous. E.g., the form kręgi may be of lemma kręg or krąg. Some output formats (e.g. iob-chan
) force one interpretation per token. In such cases, only the first lemma according to alphabetical order will remain.
Processing text files¶
The iobber_txt
executable can process a text file and generate output in one of the supported output formats.
The input should be a text file encoded in UTF-8.
Usage:
iobber_txt input.txt -O output.xml
Note: if you get an error message cannot locate model….ini
, you might need to provide the path to the tagger and/or chunker model explicitly. This will happen if the model directories are not located in the system or current directory, or have been renamed. You can give the paths to the models explicitly using the -C (chunker) and -W (tagger) switches, e.g.:
iobber_txt -W path/to/model_nkjp10_wcrft_e2 -C path/to/model-kpwr11-H input.txt -O output.xml
The default output format is CCL (-o ccl
). You can select a different output format, e.g. iob-chan:
iobber_txt input.txt -o iob-chan -O output.txt
Output formats are described below.
Reading stdin, writing to stdout¶
IOBBER is able to write its output to the standard output (stdout). It will do so if you don't use the -O FILENAME
option, e.g.:
iobber_txt input.txt # or any other call, e.g. iobber_txt input.txt -o iob-chan
IOBBER is also able to read the input data stream from the standard input (stdin) and write directly to the standard output (stdout):
iobber_txt -
E.g.
echo 'Przedsiębiorcy, którzy skorzystali z prawa do zwolnienia nie rozliczają podatku VAT. Nie składają więc deklaracji VAT i nie wystawiają faktur.' | iobber_txt -
NOTE: the behaviour of the above call depends on the input encoding of your operating system and/or terminal. IOBBER expects the input in UTF-8. Under modern Linux distributions the input encoding is by default UTF-8, which is ok, but the same cannot be said about Windows — the input encoding varies and is most likely not UTF-8 on your Windows computer. This shouldn't, however, affect redirecting UTF-8 txt files to IOBBER — it should work correctly anywhere as long as the files are correct UTF-8 plain text.
Processing multiple files¶
When processing multiple small files it is recommended to run IOBBER once for a number of files. This will reduce the overhead of chunker start-up time. On the other hand, the peak memory usage will be somewhat higher as WCRFT loads its model incrementally on demand (the same happens for IOBBER itself, although IOBBER's trained model is much smaller than that of WCRFT). This obviously does not mean that the peak memory usage is proportional to input size — the memory usage grows only when there are situations not encountered earlier, which gets less and less likely as the tagging goes.
There are two modes of processing multiple files. The simpler one is just to give multiple input files as arguments:
iobber_txt input.txt in2.txt in3.txt
The other option is to prepare file lists (a file list is a text file with paths to input files, each path in a separate line) and use the batch mode:
iobber_txt --batch list.txt
You may specify more than one file list (files from all the lists will be processed):
iobber_txt --batch list1.txt list2.txt
In each of the above scenarios, the output will be written to INPUTFILE.tag
, e.g. processed input3.txt
will be written to input3.txt.tag
in the same directory where input3.txt
was.
Each processed file will be saved in the selected output format, by default CCL. You can use another format, e.g.:
iobber_txt input.txt in2.txt in3.txt -o iob-chan
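The batch workflow described above can be scripted. Below is a minimal Python sketch, not part of IOBBER itself: the helper names (write_file_list, batch_command, expected_output_path) are hypothetical, and it assumes that the -o option may follow the --batch file lists in the same way it follows the input files in the multi-file example. The command is only built here, not executed.

```python
from pathlib import Path


def write_file_list(input_paths, list_path):
    """Write a file list: one input path per line."""
    lines = "".join(f"{path}\n" for path in input_paths)
    Path(list_path).write_text(lines, encoding="utf-8")


def batch_command(list_paths, output_format="ccl"):
    """Build the argument list for a batch run (assumed option order)."""
    return ["iobber_txt", "--batch", *list_paths, "-o", output_format]


def expected_output_path(input_path):
    """Multi-file and batch runs write INPUTFILE.tag next to the input."""
    return f"{input_path}.tag"


command = batch_command(["list1.txt", "list2.txt"], "iob-chan")
print(" ".join(command))
print(expected_output_path("in3.txt"))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where iobber_txt is installed.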
Using the tagger only¶
If you need only morphosyntactic tags (and lemmas) but no chunking information, it is recommended to switch off chunking. This will make the process a bit faster and the resulting file will be smaller (no chunking information).
To use iobber_txt
to get input tagged, use the --no-chunk
switch. In this mode, IOBBER is used only as an interface to the WCRFT tagger.
The switch works for all the scenarios described above, including tagging text files, standard input, batch mode.
You can use any of the supported output formats for tagging only. Note that some formats (plain
and xces
) cannot store chunking information. If you plan to use one of those, you really should use the --no-chunk
switch to avoid the unnecessary computation of chunks, which would not be visible in the output in those formats.
Some examples:
iobber_txt -W ~/corp/model_nkjp10_wcrft_e2 input.txt --no-chunk # CCL format with no chunks to stdout
iobber_txt --no-chunk -o xces -O out-xces.xml input.txt # XCES output saved to file
Practical tips¶
The tools will perform best when run against standard Polish input, with correct punctuation and correct spelling. Lack of Polish diacritics will significantly worsen the quality of the resulting tagging and chunking.
The tagger contains a module for paragraph and sentence splitting. Sentence splits are based mainly on the presence of punctuation marks. If you are going to process a number of separate fragments put into one text file, it is recommended to introduce at least two newlines between the fragments. Two or more newline characters will force the tagger to mark a paragraph boundary, which will in turn force a sentence boundary.
Without the above trick, most likely the tagger will merge the ending of one fragment with the beginning of the next one (as long as the first fragment is not finished with a punctuation mark and there is at most one newline between the fragments).
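The tip above can be applied when preparing the input file. A minimal Python sketch follows; join_fragments is a hypothetical helper name and the example fragments are made up for illustration.

```python
def join_fragments(fragments):
    """Join separate text fragments so that each starts a new paragraph."""
    cleaned = [fragment.strip() for fragment in fragments if fragment.strip()]
    # Two newline characters between fragments mark a paragraph boundary,
    # which in turn forces a sentence boundary.
    return "\n\n".join(cleaned) + "\n"


text = join_fragments(["Pierwszy fragment bez kropki", "Drugi fragment."])
print(text)
```

The returned string should be saved as UTF-8 before it is passed to iobber_txt.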
The first run of iobber_txt
lasts significantly longer. Next runs will exhibit much faster performance, since a part of the linguistic data that is read during processing gets buffered by the operating system.
Output formats¶
The output format is selected with -o FORMAT
, e.g. -o xces
, -o ccl
, -o iob-chan
.
Below is a summary of the formats recommended for tagging. The list is not exhaustive; to get a list of all the supported output formats, run wcrft -h
.
Format mnemonic | Full name | Type | Suitable for tagging (WCRFT) | Suitable for chunking (IOBBER) | Division into paragraphs | Whitespace between tokens |
---|---|---|---|---|---|---|
plain | simple plain text | text | yes | no | yes | yes |
xces | XCES | XML | yes | no | yes | yes |
ccl | CCL | XML | yes | yes | yes | yes |
iob-chan | IOB-CHAN | text | yes | yes | no | no |
- Format mnemonic is the codename used with the
-o
switch.
- All the formats described here are suitable for tagging, meaning that they allow an interpretation (= a morphosyntactic tag and a lemma) to be attached to each token; some formats allow more than one interpretation per token, see details below.
- Suitable for chunking means that the format may be used to store information on syntactic chunks. CCL also allows storing the location of the syntactic heads of chunks.
- Whitespace between tokens means that the format records whether any whitespace occurred between two consecutive tokens. This information may be useful to restore plain text from tagger/chunker output (e.g. that a full stop occurred directly after the last word, with no space).
Format choice. The CCL format can store all the information that may be generated by IOBBER (tagging, chunks and their heads) and should be preferred if there are no other constraints. On the other hand, the other formats are simpler, so if you don't need all this information, you may prefer to use iob-chan
or plain
— it is very simple to write a parser for those formats using standard string processing routines available in all modern programming languages.
Simple plain text format¶
This format should not be confused with plain text files containing just text (input to the tagger).
The format is simple plain text (UTF-8) containing information written in subsequent lines. There are three types of lines:
- Token orthographic form and space information (orth line)
- An interpretation attached to last token orth line (interpretation line)
- Sentence delimiter (empty line)
A plain
output file has the following syntax:
- A token description is one orth line followed by one or more interpretation lines
- A sentence description is a sequence of token lines (one or more) followed by an empty line
- A paragraph description is a sequence of sentences (one or more) followed by an empty line
- Whole file is a sequence of sentence descriptions or paragraph descriptions
Paragraph boundaries are marked as two empty lines.
An orth line has the following form: ORTH TAB SPACE
(the token's orthographic form followed by a tab character, followed by a space-info string). The orthographic form is the unchanged text of the token as encountered in the original text it was taken from. The space-info string is one of the following:
- newline if the token came after a newline (or at the beginning of the file),
- space if the token came after a space,
- none if the token came directly after the previous token (e.g. the token is a comma after a word).
An example orth line: prawa space
An interpretation line has one of the following forms:
- TAB LEMMA TAB TAG disamb
if the interpretation is chosen as the correct one (disamb
is just this 6-letter string),
- TAB LEMMA TAB TAG
if the interpretation has not been chosen (unless you tweak the tagger options, you will never get this form).
An example interpretation line: prawo subst:sg:gen:n disamb
Note: this line starts with a tab character, which may be not visible in this manual.
Example file in the simple plain text format (note: all the spaces used below should in fact be singular tab characters; this might be not rendered correctly in this manual):
Mam    newline
    mieć    fin:sg:pri:imperf    disamb
kręgi    space
    krąg    subst:pl:acc:m3    disamb
    kręg    subst:pl:acc:m3    disamb
.    none
    .    interp    disamb

Ona    space
    on    ppron3:sg:nom:f:ter:akc:npraep    disamb
nie    space
    nie    qub    disamb
.    none
    .    interp    disamb
The above example has been generated using this call:
echo 'Mam kręgi. Ona nie.' | iobber_txt -o plain -
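As a rough illustration of how simple it is to read this format, here is a minimal Python sketch of a parser. The function names (parse_plain, detokenise) are hypothetical, and the sketch assumes that all fields, including the trailing disamb marker, are separated by single tab characters as stated in the note above. It ignores paragraph boundaries and returns a flat list of sentences.

```python
def parse_plain(text):
    """Parse simple plain text output into a list of sentences.

    Each sentence is a list of tokens; each token is a dict with the keys
    'orth', 'space' and 'lexemes' (a list of (lemma, tag, disamb) tuples).
    """
    sentences, current = [], []
    for line in text.split("\n"):
        if not line.strip():
            # Empty line: sentence delimiter (two in a row end a paragraph).
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("\t"):
            # Interpretation line, attached to the last orth line.
            fields = line[1:].split("\t")
            lemma, tag = fields[0], fields[1]
            disamb = len(fields) > 2 and fields[2] == "disamb"
            current[-1]["lexemes"].append((lemma, tag, disamb))
        else:
            # Orth line: orthographic form and space-info string.
            orth, space = line.split("\t")
            current.append({"orth": orth, "space": space, "lexemes": []})
    if current:
        sentences.append(current)
    return sentences


def detokenise(sentence):
    """Restore the running text of one sentence from the space-info strings."""
    separators = {"newline": "\n", "space": " ", "none": ""}
    return "".join(separators[t["space"]] + t["orth"] for t in sentence).lstrip()


SAMPLE = (
    "Mam\tnewline\n"
    "\tmieć\tfin:sg:pri:imperf\tdisamb\n"
    "kręgi\tspace\n"
    "\tkrąg\tsubst:pl:acc:m3\tdisamb\n"
    "\tkręg\tsubst:pl:acc:m3\tdisamb\n"
    ".\tnone\n"
    "\t.\tinterp\tdisamb\n"
    "\n"
)

sentences = parse_plain(SAMPLE)
print(detokenise(sentences[0]))
```

Tokens with more than one entry in 'lexemes' are the partially ambiguous cases described in footnote 1.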
XCES¶
The format described here is a dialect of the XCES format. This dialect has been made for the IPI PAN Corpus of Polish.
XCES is an XML format. The root node for each document is chunkList
. Within the node, there should be either a paragraph list or a sentence list. The XCES format supported by WCRFT and other Corpus2-based tools does not allow nested paragraphs: a document may have one level of paragraphs, and the paragraphs must consist of sentences.
Paragraphs are marked with chunk
XML nodes. The nodes may be assigned unique id
values (names should start with letters), but also type
string. This is optional. Types may be used to distinguish between regular paragraphs and, for instance, document headers. When tagging plain text, no paragraph types are assigned. Note: chunk type="s"
is reserved for sentences, you cannot use this type for paragraphs.
Sentences are marked as chunk
XML nodes with type="s"
attribute.
Note: this naming is confusing. The term chunk is here used to denote a bunch of sentences = a paragraph. This has nothing to do with syntactic chunks (sequences of tokens corresponding to syntactic phrases as recognised by a chunker).
Sentences consist of tokens and no-space nodes. Each token is marked with a tok
XML node (no attributes allowed), consisting of the following items (which should be given in the following order):
- orth -- token's orthographic form (as encountered in running text),
- list of interpretations, each marked by a lex XML node.
No-space nodes are empty ns
XML nodes: <ns/>. No-space nodes are placed between tokens to mark that no space came between the tokens in running text before tokenisation.
Each lex
may have no attributes or disamb="1"
denoting that this interpretation has been chosen by the tagger. By default, the tagger leaves only the chosen interpretations, hence every lex
node will be marked as disamb="1"
.
A lex
node consists of two nodes:
- base -- lemma
- ctag -- morphosyntactic tag
Here is a very short example XCES document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cesAna SYSTEM "xcesAnaIPI.dtd">
<cesAna xmlns:xlink="http://www.w3.org/1999/xlink" version="1.0" type="lex disamb">
<chunkList>
 <chunk>
  <chunk type="s">
   <tok>
    <orth>Kot</orth>
    <lex disamb="1"><base>kot</base><ctag>subst:sg:nom:m2</ctag></lex>
   </tok>
   <tok>
    <orth>mruczy</orth>
    <lex disamb="1"><base>mruczeć</base><ctag>fin:sg:ter:imperf</ctag></lex>
   </tok>
   <ns/>
   <tok>
    <orth>.</orth>
    <lex disamb="1"><base>.</base><ctag>interp</ctag></lex>
   </tok>
  </chunk>
 </chunk>
</chunkList>
</cesAna>
The DTD for XCES format may be obtained from http://nlp.pwr.wroc.pl/redmine/projects/corpus2/repository/revisions/master/show/doc (xcesAnaIPI.dtd and xheaderIPI.elt).
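The structure described above can be read with any XML library. The following is a minimal Python sketch using the standard xml.etree.ElementTree module; read_xces_sentences is a hypothetical helper name, and the sample is a trimmed version of the example document (XML declaration, DOCTYPE and the cesAna wrapper are left out for brevity).

```python
import xml.etree.ElementTree as ET


def read_xces_sentences(xml_text):
    """Return a list of sentences; each token is (orth, lexemes, no_space_before).

    lexemes holds (lemma, tag) pairs for the interpretations marked disamb="1".
    """
    root = ET.fromstring(xml_text)
    sentences = []
    for sentence in root.iter("chunk"):
        if sentence.get("type") != "s":
            continue  # a chunk without type="s" is a paragraph, not a sentence
        tokens, no_space = [], False
        for node in sentence:
            if node.tag == "ns":
                no_space = True  # no whitespace before the next token
            elif node.tag == "tok":
                lexemes = [
                    (lex.findtext("base"), lex.findtext("ctag"))
                    for lex in node.findall("lex")
                    if lex.get("disamb") == "1"
                ]
                tokens.append((node.findtext("orth"), lexemes, no_space))
                no_space = False
        sentences.append(tokens)
    return sentences


SAMPLE = """<chunkList><chunk><chunk type="s">
<tok><orth>Kot</orth><lex disamb="1"><base>kot</base><ctag>subst:sg:nom:m2</ctag></lex></tok>
<tok><orth>mruczy</orth><lex disamb="1"><base>mruczeć</base><ctag>fin:sg:ter:imperf</ctag></lex></tok>
<ns/>
<tok><orth>.</orth><lex disamb="1"><base>.</base><ctag>interp</ctag></lex></tok>
</chunk></chunk></chunkList>"""

xces_sentences = read_xces_sentences(SAMPLE)
for orth, lexemes, no_space in xces_sentences[0]:
    print(orth, lexemes, no_space)
```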
CCL¶
CCL is a conservative modification of XCES that allows storing information on syntactic chunks and their heads. The format was initially developed for use with WCCL, hence the name.
Note: when using the tagger and no chunker, no chunks/syntactic information will be generated in the output file, even if you choose -o ccl
. Annotating chunks is not the job of the tagger.
The CCL format can store the following information:
- division into paragraphs and sentences
- paragraphs and sentences may be given IDs (id attribute; in XCES only paragraphs may have IDs)
- division into tokens and no-space information (XCES-like)
- morphosyntactic annotations (XCES-like)
- chunk-style annotations with possible discontinuities (IOBBER chunker will not mark any discontinuities)
- syntactic heads of chunks
- properties of tokens and, implicitly, properties of annotations (IOBBER does not use this, so it won't be discussed here)
CCL's root node is also called chunkList
. The division into paragraphs is mandatory for CCL files (or, at least, strongly recommended).
Paragraphs are marked with chunk
nodes (as in XCES). The nodes may be assigned unique id
values (names should start with letters), but also type
string. This is optional. Types may be used to distinguish between regular paragraphs and, for instance, document headers. When tagging plain text, paragraphs are not assigned any types. Note: chunk type="s"
should be avoided, this is reserved for sentences in XCES and may result in trouble when converting to XCES.
Sentences are marked with sentence
XML nodes (unlike XCES, chunk
node is not used here). Sentences may have ids (id
attribute). Ids should start with a letter, digits may follow.
Each sentence consists of tokens. No-space nodes are also allowed. No-space nodes are empty ns
XML nodes: <ns/>. No-space nodes are placed between tokens to mark that no space came between the tokens in running text before tokenisation.
Each token is marked with a tok
XML node (no attributes allowed), consisting of the following items (should be given in the following order):
- orth -- token's orthographic form (as encountered in running text),
- list of interpretations, each marked by a lex XML node,
- list of syntactic annotation information.
A lex
node consists of two nodes (exactly as in the XCES format):
- base -- lemma
- ctag -- morphosyntactic tag
Syntactic annotation information consists of ann
XML nodes. Each ann
node should have chan
attribute specifying the channel name. ann
node may also contain information that the current token is a syntactic head — this is expressed as head="1"
attribute. There should be at most one head token for each chunk. Inside the ann
node there should be a number denoting to which syntactic chunk the current token belongs. 0 means that the token belongs to no chunk in the given channel, positive values designate number/identifier of the chunk within the channel and the sentence. The numbers are usually ordered, but it is not required — any positive number may be used, ordering is optional.
Here is an example syntactic chunk annotation:
tokens      Ministerstwo    Edukacji Narodowej    i  sąd
chunk_np   [=====***===========================]   [***]
chunk_agp  [=====***====] [===***==============]   [***]
The above annotation would result in the following CCL output:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk>
  <sentence>
   <tok>
    <orth>Ministerstwo</orth>
    <lex disamb="1"><base>ministerstwo</base><ctag>subst:sg:nom:n</ctag></lex>
    <ann chan="chunk_agp" head="1">1</ann>
    <ann chan="chunk_np" head="1">1</ann>
   </tok>
   <tok>
    <orth>Edukacji</orth>
    <lex disamb="1"><base>edukacja</base><ctag>subst:sg:gen:f</ctag></lex>
    <ann chan="chunk_agp" head="1">2</ann>
    <ann chan="chunk_np">1</ann>
   </tok>
   <tok>
    <orth>Narodowej</orth>
    <lex disamb="1"><base>narodowy</base><ctag>adj:sg:gen:f:pos</ctag></lex>
    <ann chan="chunk_agp">2</ann>
    <ann chan="chunk_np">1</ann>
   </tok>
   <tok>
    <orth>i</orth>
    <lex disamb="1"><base>i</base><ctag>conj</ctag></lex>
    <ann chan="chunk_agp">0</ann>
    <ann chan="chunk_np">0</ann>
   </tok>
   <tok>
    <orth>sąd</orth>
    <lex disamb="1"><base>sąd</base><ctag>subst:sg:nom:m3</ctag></lex>
    <ann chan="chunk_agp" head="1">3</ann>
    <ann chan="chunk_np" head="1">2</ann>
   </tok>
  </sentence>
 </chunk>
</chunkList>
The chunk numbers given within the ann
nodes have the following interpretation:
tokens      Ministerstwo    Edukacji Narodowej    i  sąd
chunk_np   [    *1*            1         1     ]   [*2*]
chunk_agp  [    *1*     ] [   *2*        2     ]   [*3*]
The numbers should be interpreted separately in each sentence and each channel. The sequence of three 1's in the chunk_np
means that there is a continuous chunk stretching through the three first tokens. It is not important that the number used is 1, it is important that the same non-zero number is attached to each of the three tokens -- meaning that the three tokens belong to one chunk. The chunk_agp
channel contains three chunks: the first one consists just of the first token (Ministerstwo), the second one consists of the second and third token (Edukacji Narodowej) and the third one consists of the last token (sąd). Again, the numbers themselves are not important. What is important is that the same number is assigned to the second and third token, that a different one is assigned to the first token, and likewise for the last chunk. Note that the numbering should be interpreted separately in each channel: the first chunk in the chunk_np
channel is assigned the number of 1, the first chunk in the chunk_agp
channel is also assigned number 1, but this is accidental.
Syntactic heads of chunks are marked with the attribute head="1"
. This was marked with asterisks surrounding the numbers in the above visualisation. This should be read as follows: there are the following chunks in the above CCL output:
- chunk_np Ministerstwo Edukacji Narodowej with Ministerstwo set as syntactic head
- chunk_np sąd with sąd set as syntactic head
- chunk_agp Ministerstwo with Ministerstwo set as syntactic head
- chunk_agp Edukacji Narodowej with Edukacji set as syntactic head
- chunk_agp sąd with sąd set as syntactic head
NOTE: if you run the default IOBBER configuration, it will also attempt to mark chunk_vp
and chunk_adjp
chunks. Even if no chunk of those types is found in a sentence, the corresponding channels will be present in the output (but empty). For clarity we omit those empty channels in this example and in the following ones.
The DTD for CCL format may be obtained from http://nlp.pwr.wroc.pl/redmine/projects/corpus2/repository/revisions/master/show/doc (ccl.dtd).
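The interpretation of channels, chunk numbers and heads described above can be sketched in Python with the standard xml.etree.ElementTree module. The function name read_ccl_chunks is hypothetical, and the sample is a trimmed version of the example document in which the lex nodes are left out for brevity.

```python
import xml.etree.ElementTree as ET


def read_ccl_chunks(xml_text):
    """Return, for each sentence, a list of (channel, tokens, head) tuples."""
    root = ET.fromstring(xml_text)
    result = []
    for sentence in root.iter("sentence"):
        chunks = {}  # (channel, chunk number) -> {"tokens": [...], "head": ...}
        for tok in sentence.iter("tok"):
            orth = tok.findtext("orth")
            for ann in tok.findall("ann"):
                number = int(ann.text.strip())
                if number == 0:
                    continue  # token belongs to no chunk in this channel
                key = (ann.get("chan"), number)
                chunk = chunks.setdefault(key, {"tokens": [], "head": None})
                chunk["tokens"].append(orth)
                if ann.get("head") == "1":
                    chunk["head"] = orth
        result.append(
            [(chan, c["tokens"], c["head"]) for (chan, _), c in sorted(chunks.items())]
        )
    return result


# Trimmed sample: lex nodes omitted for brevity.
SAMPLE = """<chunkList><chunk><sentence>
<tok><orth>Ministerstwo</orth><ann chan="chunk_agp" head="1">1</ann><ann chan="chunk_np" head="1">1</ann></tok>
<tok><orth>Edukacji</orth><ann chan="chunk_agp" head="1">2</ann><ann chan="chunk_np">1</ann></tok>
<tok><orth>Narodowej</orth><ann chan="chunk_agp">2</ann><ann chan="chunk_np">1</ann></tok>
<tok><orth>i</orth><ann chan="chunk_agp">0</ann><ann chan="chunk_np">0</ann></tok>
<tok><orth>sąd</orth><ann chan="chunk_agp" head="1">3</ann><ann chan="chunk_np" head="1">2</ann></tok>
</sentence></chunk></chunkList>"""

ccl_chunks = read_ccl_chunks(SAMPLE)[0]
for channel, tokens, head in ccl_chunks:
    print(channel, " ".join(tokens), "head:", head)
```

Grouping by the pair (channel, number) reflects the rule that chunk numbers are meaningful only within one channel of one sentence.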
IOB-CHAN¶
IOB-CHAN is a very simple text-based format that allows storing the following information:
- division into sentences (division into paragraphs is ignored)
- division into tokens (without no-space information)
- morphosyntactic annotations (limited to one interpretation per token)
- chunk-style annotations (no possibility to annotate heads)
There are two types of lines:
- Token line
- Sentence delimiter (empty line)
A token line consists of the following elements:
- Token orthographic form (ORTH)
- Lemma (LEMMA)
- Morphosyntactic tag (TAG)
- IOB-string describing syntactic chunks that cross the given token (IOB)
Every token line consists of the above elements in the above order separated by the TAB character:
ORTH LEMMA TAG IOB
(note: TAB character is used in fact, this may be not rendered properly in this manual)
If the sentence contains no syntactic chunks, the IOB-strings will be empty. The format may still be useful for storing tagging output as it is very simple.
An IOB-string is a comma-separated sequence of labels, one label for each of the channels. An example IOB-string is chunk_np-I,chunk_agp-B
. Each label describes the state of the channel with respect to the current token. The label consists of two parts:
- channel name (e.g. chunk_np),
- IOB tag, that is I, O, or B.
IOB tags are used to describe chunk annotation in a concise per-token way. B
tag denotes that a chunk within the channel begins with this token. I
tag denotes that this token belongs to the given chunk type (according to the channel name) but it is not the first one (I
is for inside). O
tag denotes that this token is outside of any chunk in the given channel.
Here is an example fragment:
tokens      Ministerstwo    Edukacji Narodowej    i  sąd
chunk_np   [===================================]   [===]
chunk_agp  [============] [====================]   [===]
This fragment would get the following IOB tag representation:
tokens      Ministerstwo    Edukacji Narodowej    i  sąd
chunk_np   [     B             I         I     ] O [ B ]
chunk_agp  [     B      ] [    B         I     ] O [ B ]
Note that every chunk is realised as a sequence of IOB tags starting with B
and continuing with I
tags. If the chunk is one-token-long, it is limited to single B
.
The whole fragment is written in the IOB-CHAN format as follows (the fields are in fact separated by TAB characters):
Ministerstwo    ministerstwo    subst:sg:nom:n      chunk_agp-B,chunk_np-B
Edukacji        edukacja        subst:sg:gen:f      chunk_agp-B,chunk_np-I
Narodowej       narodowy        adj:sg:gen:f:pos    chunk_agp-I,chunk_np-I
i               i               conj                chunk_agp-O,chunk_np-O
sąd             sąd             subst:sg:nom:m3     chunk_agp-B,chunk_np-B
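To show how the IOB tags map back to chunks, here is a minimal Python sketch of an IOB-CHAN reader. The function names (parse_iob_chan, chunks_from_iob) are hypothetical; the sketch assumes TAB-separated fields as described above and tolerates a missing or empty IOB-string.

```python
def parse_iob_chan(text):
    """Parse IOB-CHAN text into sentences of (orth, lemma, tag, labels) tuples."""
    sentences, tokens = [], []
    for line in text.split("\n"):
        if not line.strip():
            # Empty line: sentence delimiter.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        # Pad with an empty IOB-string in case the fourth field is missing.
        orth, lemma, tag, iob = (line.split("\t") + [""])[:4]
        labels = {}
        for label in filter(None, iob.split(",")):
            channel, _, state = label.rpartition("-")
            labels[channel] = state
        tokens.append((orth, lemma, tag, labels))
    if tokens:
        sentences.append(tokens)
    return sentences


def chunks_from_iob(sentence):
    """Group the tokens of one sentence into (channel, [orth, ...]) chunks."""
    chunks = []        # all chunks, in order of their first token
    open_chunks = {}   # channel -> token list of the chunk in progress
    for orth, _lemma, _tag, labels in sentence:
        for channel, state in labels.items():
            if state == "B":
                open_chunks[channel] = [orth]
                chunks.append((channel, open_chunks[channel]))
            elif state == "I" and channel in open_chunks:
                open_chunks[channel].append(orth)
            else:
                open_chunks.pop(channel, None)  # O tag closes any open chunk
    return chunks


SAMPLE = (
    "Ministerstwo\tministerstwo\tsubst:sg:nom:n\tchunk_agp-B,chunk_np-B\n"
    "Edukacji\tedukacja\tsubst:sg:gen:f\tchunk_agp-B,chunk_np-I\n"
    "Narodowej\tnarodowy\tadj:sg:gen:f:pos\tchunk_agp-I,chunk_np-I\n"
    "i\ti\tconj\tchunk_agp-O,chunk_np-O\n"
    "sąd\tsąd\tsubst:sg:nom:m3\tchunk_agp-B,chunk_np-B\n"
    "\n"
)

iob_sentences = parse_iob_chan(SAMPLE)
iob_chunks = chunks_from_iob(iob_sentences[0])
for channel, tokens in iob_chunks:
    print(channel, " ".join(tokens))
```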
Linguistic information¶
NKJP tagset¶
The default configuration of the tagger and the chunker assumes the NKJP tagset, that is, the tagset of the National Corpus of Polish (http://nkjp.pl).
The tagset is described in the following papers in English:
Adam Przepiórkowski, Aleksander Buczyński, Jakub Wilk. The National Corpus of Polish Cheatsheet (on-line manual).
Tagset specification is available at http://nkjp.pl/poliqarp/help/ense2.html
Adam Przepiórkowski. A comparison of two morphosyntactic tagsets of Polish. In: Violetta Koseska-Toszewa, Ludmila Dimitrova and Roman Roszko, eds., Representing Semantics in Digital Lexicography: Proceedings of MONDILEX Fourth Open Workshop, Warsaw, 29-30 June 2009, pp. 138-144
Article full text may be obtained from http://nlp.ipipan.waw.pl/~adamp/Papers/2009-mondilex/
A more detailed description of the tagset and the underlying annotation principles and tokenisation may be found in the following book (in Polish, though):
Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski and Barbara Lewandowska-Tomaszczyk (eds.) Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN, Warszawa
Book full text is available under Creative Commons here: http://nkjp.pl/settings/papers/NKJP_ksiazka.pdf
It is recommended to consult the above references for detailed information (the first reference should be sufficient). Below is only a brief tagset summary.
Each tag consists of the grammatical class (roughly, part-of-speech). Depending on the grammatical class, the tag may be assigned attribute values (values of grammatical categories). For instance, the class of nouns (subst
) requires specifying the value of grammatical number, case and gender. An example noun tag is subst:sg:nom:f
, meaning a noun in singular number (sg
), nominative case (nom
) and feminine gender (f
).
Note that there are more grammatical classes than traditional parts of speech. There are multiple verb classes, for instance inf
(infinitives), fin
(finite verb forms, including some present and future verbs). There is a special class for punctuation tokens: interp
(no values assigned), there is a class for abbreviations (brev
).
Also note that the classes are distinguished primarily on the grounds of inflection. Therefore some forms will be assigned classes differently than according to "school" grammars. E.g., there is no general class for pronouns; pronouns inflecting like adjectives are marked as adjectives (adj
), pronouns inflecting as nouns are marked as subst
. There are two classes for personal pronouns: first/second-person pronouns (ppron12
) and third-person pronouns (ppron3
).
Every grammatical class is assigned a lemmatisation strategy. In some cases the strategy may seem controversial, e.g. gerunds and participles are lemmatised to infinitive forms (e.g. jedzenie -> jeść, zjedzony -> jeść).
Also note that the segmentation strategy is untraditional when it comes to some verb forms. Past verb forms are interpreted as consisting of the l-participle and agglutinative form of BYĆ. E.g., poszedłem is split into poszedł and em. Please consult the above references for a description of the strategy and its motivation.
Below is a full list of grammatical classes defined in the NKJP tagset. The right column assigns attributes to classes. Attributes written in square brackets are considered optional, i.e. a tag will still be valid if no value is given for such attributes.
adja
adjp
adjc
conj
comp
interp
pred
xxx
adv       [deg]
imps      asp
inf       asp
pant      asp
pcon      asp
qub       [vcl]
prep      cas [vcl]
siebie    cas
subst     nmb cas gnd
depr      nmb cas gnd
ger       nmb cas gnd asp ngt
ppron12   nmb cas gnd per [acn]
ppron3    nmb cas gnd per [acn] [ppr]
num       nmb cas gnd [acm]
numcol    nmb cas gnd [acm]
adj       nmb cas gnd deg
pact      nmb cas gnd asp ngt
ppas      nmb cas gnd asp ngt
winien    nmb gnd asp
praet     nmb gnd asp [agg]
bedzie    nmb per asp
fin       nmb per asp
impt      nmb per asp
aglt      nmb per asp vcl
ign
brev      dot
burk
interj
The following list specifies the possible values (right column) of each attribute (left column).
nmb   sg pl
cas   nom gen dat acc inst loc voc
gnd   m1 m2 m3 f n
per   pri sec ter
deg   pos com sup
asp   imperf perf
ngt   aff neg
acm   congr rec
acn   akc nakc
ppr   npraep praep
agg   agl nagl
vcl   nwok wok
dot   pun npun
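Because no value mnemonic in the list above is shared by two attributes, a tag can be decoded by looking up each value. The following Python sketch illustrates this; decode_tag is a hypothetical helper name, the table is copied from the list above, and the sketch does not check whether the attributes are valid for the given grammatical class.

```python
ATTRIBUTE_VALUES = {
    "nmb": ["sg", "pl"],
    "cas": ["nom", "gen", "dat", "acc", "inst", "loc", "voc"],
    "gnd": ["m1", "m2", "m3", "f", "n"],
    "per": ["pri", "sec", "ter"],
    "deg": ["pos", "com", "sup"],
    "asp": ["imperf", "perf"],
    "ngt": ["aff", "neg"],
    "acm": ["congr", "rec"],
    "acn": ["akc", "nakc"],
    "ppr": ["npraep", "praep"],
    "agg": ["agl", "nagl"],
    "vcl": ["nwok", "wok"],
    "dot": ["pun", "npun"],
}

# Reverse lookup: value mnemonic -> attribute name.
VALUE_TO_ATTRIBUTE = {
    value: attribute
    for attribute, values in ATTRIBUTE_VALUES.items()
    for value in values
}


def decode_tag(tag):
    """Split a tag such as 'subst:sg:nom:f' into class and attribute values."""
    grammatical_class, *values = tag.split(":")
    decoded = {"class": grammatical_class}
    for value in values:
        # Raises KeyError for a value that is not in the table above.
        decoded[VALUE_TO_ATTRIBUTE[value]] = value
    return decoded


print(decode_tag("subst:sg:nom:f"))
print(decode_tag("ppron3:sg:nom:f:ter:akc:npraep"))
```

A reverse lookup is used instead of positional decoding because optional attributes may be left out of a tag.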
Chunks¶
The default configuration of the chunker uses chunk definitions from the KPWr corpus (Polish Corpus of Wrocław University of Technology; http://nlp.pwr.wroc.pl/kpwr).
For a detailed description of the employed chunk types, please consult the following article:
Adam Radziszewski, Marek Maziarz and Jan Wieczorek. Shallow syntactic annotation in the Corpus of Wrocław University of Technology. Cognitive Studies 12, SOW, Warszawa 2012
Here is the full text of the article: http://nlp.pwr.wroc.pl/en/117/show/publication
Below is a brief summary of the chunk types:
- Noun Phrases (chunk_np): possibly complex noun and prepositional phrases (both are labelled NP here), limited to clause boundaries. Also, top-level coordination is always split (i.e. if the coordinated elements have no common syntactic superordinate NP, they constitute separate chunks).
- Adjective Phrases (chunk_adjp): top-level adjective phrases, i.e. annotated only when not modifying any superordinate NP.
- Verb Phrases (chunk_vp): (complex) verbs + adverbs that clearly modify the verbs + infinitive modifiers. Nominal arguments are not included; they constitute separate chunks.
- Agreement Phrases (chunk_agp): simple noun or adjective phrases based on morphological agreement on number, gender and case, possibly also containing indeclinable elements that modify other parts of a chunk. AgPs are based on local accommodations, while NPs, AdjPs and VPs are based on sentence predicate-argument structure.
The above set of chunks is grouped into two layers: one for Agreement Phrases, the other for NPs, VPs and AdjPs together (chunks defined within one layer shouldn't overlap; overlaps across layers do happen).