User guide for WCRFT ver. 1¶
This user guide describes how to tag plain text input using the standard configuration (
nkjp_s2.ini) and standard tagger model trained on the NKJP (
The current version of the guide assumes that WCRFT has been installed system-wide.
Some additional functionality is deliberately omitted here for simplicity, including:
- Tagging morphologically analysed corpora (some info here)
- Tagging semi-structured text (pre-morph XML files; some info here)
- Training the tagger with your corpus (read here)
Glossary¶
WCRFT is a morphosyntactic tagger for Polish. It processes plain text and outputs the following structure:
- text is divided into paragraphs (this is optional),
- each paragraph is divided into sentences,
- each sentence is divided into tokens (words, punctuation, numbers, symbols; in rare cases words are split into tokens),
- each token is assigned an interpretation (in some rare cases more than one interpretation),
- each interpretation consists of a lemma (base form, dictionary form) and a morphosyntactic tag,
- tags are structured as well; their structure is described at the end of this manual.
Tagging text files¶
The WCRFT executable can tag a text file and write the result in one of several output formats.
The input should be a text file encoded in UTF-8.
wcrft nkjp_s2.ini -C -i txt -d model_dir/model_nkjp10_wcrft_s2 input.txt -O tagged.xml
The default output format is XCES. The above call is equivalent to one with explicit output format specification (
wcrft nkjp_s2.ini -C -i txt -o xces -d model_dir/model_nkjp10_wcrft_s2 input.txt -O tagged.xml
Reading stdin, writing to stdout¶
WCRFT is able to write its output to the standard output (stdout). It will do so if you don't use the
-O FILENAME option, e.g.:
wcrft nkjp_s2.ini -C -i txt -d model_dir/model_nkjp10_wcrft_s2 input.txt
WCRFT is also able to read the input data stream from the standard input (stdin) and write directly to the standard output (stdout):
wcrft nkjp_s2.ini -C -i txt -d model_dir/model_nkjp10_wcrft_s2 -
echo 'Przedsiębiorcy, którzy skorzystali z prawa do zwolnienia nie rozliczają podatku VAT. Nie składają więc deklaracji VAT i nie wystawiają faktur.' | wcrft nkjp_s2.ini -C -i txt -d model_dir/model_nkjp10_wcrft_s2 -
This mode is especially useful when the output is piped into later processing stages, e.g. using IOBBER to get syntactic chunks:
echo 'Przedsiębiorcy, którzy skorzystali z prawa do zwolnienia nie rozliczają podatku VAT. Nie składają więc deklaracji VAT i nie wystawiają faktur.' | wcrft nkjp_s2.ini -C -i txt -d model_dir/model_nkjp10_wcrft_s2 - | iobber kpwr.ini -i xces -o ccl -d model-kpwr04 -
The above calls used the
-C switch. It is highly recommended to use it. It turns on the division of input text into paragraphs. A new paragraph is opened whenever there are at least two consecutive newline characters. Running without this switch will cause each sentence to be wrapped in a separate paragraph. Some of the output formats don't support division into paragraphs, but it does no harm to process with
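The stdin/stdout mode can also be driven from a script. Below is a minimal Python sketch that builds the same command line as the calls above and feeds the text through a pipe. The helper names `build_command` and `tag_text` are hypothetical, and the sketch assumes that `wcrft` is on the PATH, that the configuration and model paths match your installation, and that the output is UTF-8 encoded.

```python
import subprocess


def build_command(config="nkjp_s2.ini",
                  model_dir="model_dir/model_nkjp10_wcrft_s2",
                  output_format=None, paragraphs=True):
    """Build a wcrft command line that reads stdin and writes stdout."""
    cmd = ["wcrft", config]
    if paragraphs:
        cmd.append("-C")           # divide the input into paragraphs
    cmd += ["-i", "txt"]
    if output_format is not None:
        cmd += ["-o", output_format]
    cmd += ["-d", model_dir, "-"]  # "-" stands for stdin
    return cmd


def tag_text(text, **kwargs):
    """Run wcrft on a string and return its output as a string.

    Assumption: the tagger output is UTF-8, like its input.
    """
    result = subprocess.run(build_command(**kwargs),
                            input=text.encode("utf-8"),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8")
```

For example, `tag_text('Mam kręgi. Ona nie.', output_format='plain')` would run the tagger once per call, so for many short texts the multi-file modes are likely to be faster.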
Processing multiple files¶
When processing many small files, it is recommended to run WCRFT once over a batch of files rather than once per file. This reduces the overhead of tagger start-up time. On the other hand, peak memory usage will be somewhat higher, as WCRFT loads its model incrementally on demand. This does not mean that peak memory usage is proportional to input size: memory usage grows only when the tagger meets situations it has not encountered earlier, which becomes less and less likely as tagging proceeds.
There are two modes of tagging multiple files. The simpler one is just to give multiple input files as arguments:
wcrft nkjp_s2.ini -C -i txt -d model_dir/model_nkjp10_wcrft_s2 input1.txt input2.txt input3.txt
The other option is to prepare file lists (a file list is a text file with paths to input files, each path on a separate line) and use the batch mode:
wcrft nkjp_s2.ini -C -i txt --batch -d model_dir/model_nkjp10_wcrft_s2 list.txt
You may specify more than one file list (files from all the lists will be processed):
wcrft nkjp_s2.ini -C -i txt --batch -d model_dir/model_nkjp10_wcrft_s2 list1.txt list2.txt
In each of the above scenarios, the output will be written to
INPUTFILE.tag, e.g. processed
input3.txt will be written to
input3.txt.tag in the same directory where
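A file list for the batch mode can be generated with a few lines of Python. This is only a sketch: the helper name `write_file_list` and the `*.txt` pattern are assumptions, and it writes the paths exactly as given (relative or absolute), one per line.

```python
from pathlib import Path


def write_file_list(input_dir, list_path, pattern="*.txt"):
    """Write the paths of matching files to list_path, one per line."""
    paths = sorted(str(p) for p in Path(input_dir).glob(pattern))
    Path(list_path).write_text("".join(p + "\n" for p in paths),
                               encoding="utf-8")
    return paths
```

Keep the list file itself outside the matched pattern (e.g. give it a different extension), otherwise a later run would list it as an input file.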
Output formats¶
The output format is selected with
-o FORMAT, e.g.
Below is a summary of the formats recommended for tagging. The list is not exhaustive; to get a list of all the supported output formats, run
|Format mnemonic|Full name|Type|Suitable for tagging (WCRFT)|Suitable for chunking (IOBBER)|Division into paragraphs|Whitespace between tokens|
|plain|simple plain text|text|yes|no|no|yes|
|iob-chan|IOB-CHAN|text|yes|yes|yes (written but not read)|no|
- Format mnemonic is the codename used with the
- All the formats described here are suitable for tagging, meaning that they allow an interpretation (a morphosyntactic tag and a lemma) to be attached to each token; some formats allow more than one interpretation per token, see details below
- Suitable for chunking means that the format can store information on syntactic chunks. CCL can also store information about the location of the syntactic heads of chunks.
- Whitespace between tokens means that the format records whether any whitespace occurred between two consecutive tokens. This information is useful for restoring plain text from tagger/chunker output (e.g. to know that a full stop occurred directly after the last word, with no space).
Simple plain text format¶
This format should not be confused with plain text files containing just text (the input to the tagger).
The format is simple plain text (UTF-8), with information written on consecutive lines. There are three types of lines:
- Token orthographic form and space information (orth line)
- An interpretation attached to the most recent orth line (interpretation line)
- Sentence delimiter (empty line)
A plain output file has the following syntax:
- A token description is one orth line followed by one or more interpretation lines
- A sentence description is a sequence of token descriptions (one or more) followed by an empty line
- A paragraph description is a sequence of sentence descriptions (one or more) followed by an empty line
- The whole file is a sequence of sentence descriptions or paragraph descriptions
NOTE: currently the paragraph boundaries are actually written (marked with two consecutive empty lines), although the reader implemented in the Corpus2 library does not read this information.
An orth line has the following form:
ORTH TAB SPACE (the token's orthographic form, followed by a tab character, followed by a space-info string). The orthographic form is the unchanged text of the token as encountered in the original text. The space-info string is one of the following:
- newline if the token came after a newline (or at the beginning of the file),
- space if the token came after a space,
- none if the token came directly after the previous token (e.g. the token is a comma after a word).
An example orth line:
An interpretation line has one of the following forms:
- TAB LEMMA TAB TAG disamb if the interpretation is chosen as the correct one (disamb is just this 6-letter string),
- TAB LEMMA TAB TAG otherwise (unless you change the tagger options, you will never get this form).
An example interpretation line:
prawo subst:sg:gen:n disamb
Note: this line starts with a tab character, which may not be visible in this manual.
Example file in the simple plain text format (note: the fields are separated by single tab characters and interpretation lines start with a tab; this may not be rendered properly in this manual):
Mam	newline
	mieć	fin:sg:pri:imperf	disamb
kręgi	space
	krąg	subst:pl:acc:m3	disamb
	kręg	subst:pl:acc:m3	disamb
.	none
	.	interp	disamb

Ona	space
	on	ppron3:sg:nom:f:ter:akc:npraep	disamb
nie	space
	nie	qub	disamb
.	none
	.	interp	disamb
The above example has been generated using this call:
echo 'Mam kręgi. Ona nie.' | wcrft nkjp_s2.ini -i txt -o plain -d model_dir/model_nkjp10_wcrft_s2 -
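The line types described above map onto a very small reader. The following Python sketch (the function names `parse_plain` and `restore_text` are hypothetical, not part of WCRFT or Corpus2) assumes that fields are separated by single tab characters and, like the Corpus2 reader, treats any run of empty lines as a sentence boundary without recovering paragraphs.

```python
def parse_plain(text):
    """Parse simple plain text output into a list of sentences.

    Each sentence is a list of tokens; each token is a dict with
    'orth', 'space' and 'lexemes' (a list of (lemma, tag, disamb) tuples).
    """
    sentences, current = [], []
    for line in text.split("\n"):
        if not line.strip():
            # Empty line: sentence (or paragraph) delimiter.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("\t"):
            # Interpretation line: TAB LEMMA TAB TAG, optionally "disamb".
            fields = line[1:].split("\t")
            lemma, tag = fields[0], fields[1]
            disamb = len(fields) > 2 and fields[2] == "disamb"
            current[-1]["lexemes"].append((lemma, tag, disamb))
        else:
            # Orth line: ORTH TAB SPACE.
            orth, space = line.split("\t")
            current.append({"orth": orth, "space": space, "lexemes": []})
    if current:
        sentences.append(current)
    return sentences


def restore_text(sentences):
    """Rebuild running text from the space-info strings."""
    glue = {"newline": "\n", "space": " ", "none": ""}
    parts = []
    for sentence in sentences:
        for token in sentence:
            parts.append(glue[token["space"]] + token["orth"])
    return "".join(parts).lstrip("\n")
```

Applied to the example file above, `restore_text(parse_plain(...))` gives back the original input text, which is what the space-info strings are for.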
XCES¶
XCES is an XML format. The root node for each document is
chunkList. Within this node there should be either a list of paragraphs or a list of sentences. The XCES variant supported by WCRFT and other Corpus2-based tools does not allow nested paragraphs: a document may have at most one level of paragraphs, and paragraphs must consist of sentences.
Paragraphs are marked with
chunk XML nodes. The nodes may be assigned unique
id values (names should start with letters) and a
type string. These attributes are optional. Types may be used to distinguish between regular paragraphs and, for instance, document headers. When tagging plain text, no paragraph types are assigned. Note:
chunk type="s" is reserved for sentences; you cannot use this type for paragraphs.
Sentences are marked as
chunk XML nodes with
Note: this naming is confusing. The term chunk is used here to denote a group of sentences, i.e. a paragraph. This has nothing to do with syntactic chunks (sequences of tokens corresponding to syntactic phrases as recognised by a chunker).
Sentences consist of tokens and no-space nodes. Each token is marked with a
tok XML node (no attributes allowed), consisting of the following items (which should be given in this order):
- orth -- the token's orthographic form (as encountered in running text),
- list of interpretations, each marked by
No-space nodes are empty
ns XML nodes: <ns/>. No-space nodes are placed between tokens to mark that no space came between the tokens in running text before tokenisation.
lex may have no attributes or
disamb="1" denoting that this interpretation has been chosen by the tagger. By default, the tagger leaves only the chosen interpretations, hence every
lex node will be marked as
A lex node consists of two nodes:
- base -- lemma (base form),
- ctag -- morphosyntactic tag.
Here is a very short example XCES document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cesAna SYSTEM "xcesAnaIPI.dtd">
<cesAna xmlns:xlink="http://www.w3.org/1999/xlink" version="1.0" type="lex disamb">
<chunkList>
 <chunk>
  <chunk type="s">
   <tok>
    <orth>Bez</orth>
    <lex><base>bez</base><ctag>prep:gen:nwok</ctag></lex>
    <lex><base>bez</base><ctag>subst:sg:nom:m3</ctag></lex>
    <lex><base>bez</base><ctag>subst:sg:acc:m3</ctag></lex>
    <lex><base>beza</base><ctag>subst:pl:gen:f</ctag></lex>
   </tok>
   <tok>
    <orth>pracy</orth>
    <lex><base>praca</base><ctag>subst:sg:gen:f</ctag></lex>
    <lex><base>praca</base><ctag>subst:sg:dat:f</ctag></lex>
    <lex><base>praca</base><ctag>subst:sg:loc:f</ctag></lex>
   </tok>
  </chunk>
 </chunk>
</chunkList>
</cesAna>
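Documents of this shape can be read with any XML parser. As an illustration, here is a Python sketch using the standard `xml.etree.ElementTree` module; the function name `read_xces` and the dictionary layout are assumptions made for this example and are not part of Corpus2.

```python
import xml.etree.ElementTree as ET


def read_xces(data):
    """Yield the sentences of an XCES document (bytes) as token lists."""
    root = ET.fromstring(data)
    for sent in root.iter("chunk"):
        if sent.get("type") != "s":
            continue  # paragraph-level chunk, not a sentence
        tokens, no_space = [], False
        for node in sent:
            if node.tag == "ns":
                no_space = True  # the next token follows without a space
            elif node.tag == "tok":
                lexemes = [(lex.findtext("base"), lex.findtext("ctag"),
                            lex.get("disamb") == "1")
                           for lex in node.findall("lex")]
                tokens.append({"orth": node.findtext("orth"),
                               "no_space_before": no_space,
                               "lexemes": lexemes})
                no_space = False
        yield tokens
```

The sketch loads the whole document into memory, which is fine for small files; for large corpora a streaming approach such as `ET.iterparse` may be preferable.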
CCL¶
CCL is a conservative modification of XCES that can also store information on syntactic chunks and their heads. The format was initially developed for use with WCCL, hence the name.
Note: when using the tagger and no chunker, no chunks/syntactic information will be generated in the output file, even if you choose
-o ccl. Annotating chunks is not the job of the tagger. In case of the CCL format, it means that there will be no
ann XML nodes.
CCL format description may be found on this site: http://nlp.pwr.wroc.pl/redmine/projects/corpus2/wiki/CCL_format
IOB-CHAN¶
IOB-CHAN is a very simple text-based format that can store the following information:
- division into sentences (division into paragraphs is ignored)
- division into tokens (without no-space information)
- morphosyntactic annotations (limited to one interpretation per token)
- chunk-style annotations (no possibility to annotate heads)
There are two types of lines:
- Token line
- Sentence delimiter (empty line)
A token line consists of the following elements:
- Token orthographic form (ORTH)
- Lemma (LEMMA)
- Morphosyntactic tag (TAG)
- IOB-string describing syntactic chunks that cross the given token (IOB)
Note: in case of using the tagger only, the IOB-string will always be empty (no chunk information).
Every token line consists of the above elements in the above order separated by the TAB character:
ORTH LEMMA TAG IOB
(note: the elements are in fact separated by TAB characters; this may not be rendered properly in this manual)
IOB-string is a comma-separated sequence of labels, one label for each of the channels. An example IOB-string is
Each label describes the state of the channel with respect to the current token. The label consists of two parts:
- channel name (e.g.
- IOB tag, that is
IOB tags are used to describe chunk annotation in a concise per-token way.
B tag denotes that a chunk within the channel begins with this token.
I tag denotes that this token belongs to a chunk of the given type (according to the channel name) but is not its first token (
I is for inside).
O tag denotes that this token is outside of any chunk in the given channel.
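To turn per-token IOB tags back into chunks, it is enough to scan the tags of one channel from left to right. The Python sketch below does this for a single channel; the function name `iob_to_spans` is hypothetical, and splitting an IOB-string into per-channel labels is deliberately left out because it depends on the exact label syntax.

```python
def iob_to_spans(tags):
    """Convert the IOB tags of one channel to (start, end) token spans.

    'end' is exclusive, so tokens[start:end] gives the tokens of a chunk.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))  # the previous chunk ends here
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i))
            start = None
        elif tag == "I":
            if start is None:
                start = i  # tolerate a chunk that starts with I
        else:
            raise ValueError("unknown IOB tag: %r" % tag)
    if start is not None:
        spans.append((start, len(tags)))
    return spans
```

Two adjacent chunks in the same channel are kept apart by the B tag: a B directly after an I or another B closes the previous span and opens a new one.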
Tagset¶
The default configurations of the tagger and the chunker assume the NKJP tagset -- that is, the tagset of the National Corpus of Polish (http://nkjp.pl).
The tagset is described in the following paper in English:
Adam Przepiórkowski. A comparison of two morphosyntactic tagsets of Polish. In: Violetta Koseska-Toszewa, Ludmila Dimitrova and Roman Roszko, eds., Representing Semantics in Digital Lexicography: Proceedings of MONDILEX Fourth Open Workshop, Warsaw, 29-30 June 2009, pp. 138-144
The full text of the article may be obtained from http://nlp.ipipan.waw.pl/~adamp/Papers/2009-mondilex/
A more detailed description of the tagset, the underlying annotation principles and the tokenisation may be found in the following book (in Polish):
Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski and Barbara Lewandowska-Tomaszczyk (eds.) Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN, Warszawa
The full text of the book is available under a Creative Commons licence here: http://nkjp.pl/settings/papers/NKJP_ksiazka.pdf
It is recommended to consult the above references for detailed information. Below is only a brief tagset summary.
Each tag begins with the grammatical class (roughly, part of speech). Depending on the grammatical class, the tag may also carry attribute values (values of grammatical categories). For instance, the class of nouns (
subst) requires specifying the value of grammatical number, case and gender. An example noun tag is
subst:sg:nom:f, meaning a noun in singular number (
sg), nominative case (
nom) and feminine gender (
f).
Note that there are more grammatical classes than traditional parts of speech. There are multiple verb classes, for instance
fin (finite verb forms, including some present and future verbs). There is a special class for punctuation tokens:
interp (no values assigned), there is a class for abbreviations (
Also note that the classes are distinguished primarily on the grounds of inflection. Therefore some forms will be assigned classes differently than according to "school" grammars. E.g., there is no general class for pronouns; pronouns inflecting like adjectives are marked as adjectives (
adj), pronouns inflecting as nouns are marked as
subst. There are two classes for personal pronouns: first/second-person pronouns (
ppron12) and third-person pronouns (
Every grammatical class is assigned a lemmatisation strategy. In some cases the strategy may seem controversial, e.g. gerunds and participles are lemmatised to infinitive forms (e.g. jedzenie -> jeść, zjedzony -> jeść).
Also note that the segmentation strategy is untraditional when it comes to some verb forms. Past verb forms are interpreted as consisting of the l-participle and agglutinative form of BYĆ. E.g., poszedłem is split into poszedł and em. Please consult the above references for a description of the strategy and its motivation.
Below is a full list of grammatical classes defined in the NKJP tagset. The right column assigns attributes to classes. Attributes written in square brackets are considered optional, i.e. a tag will still be valid if no value is given for such attributes.
adja
adjp
adjc
conj
comp
interp
pred
xxx
adv       [deg]
imps      asp
inf       asp
pant      asp
pcon      asp
qub       [vcl]
prep      cas [vcl]
siebie    cas
subst     nmb cas gnd
depr      nmb cas gnd
ger       nmb cas gnd asp ngt
ppron12   nmb cas gnd per [acn]
ppron3    nmb cas gnd per [acn] [ppr]
num       nmb cas gnd [acm]
numcol    nmb cas gnd [acm]
adj       nmb cas gnd deg
pact      nmb cas gnd asp ngt
ppas      nmb cas gnd asp ngt
winien    nmb gnd asp
praet     nmb gnd asp [agg]
bedzie    nmb per asp
fin       nmb per asp
impt      nmb per asp
aglt      nmb per asp vcl
ign
brev      dot
burk
interj
The following list specifies the possible values (right column) of each attribute (left column).
nmb   sg pl
cas   nom gen dat acc inst loc voc
gnd   m1 m2 m3 f n
per   pri sec ter
deg   pos com sup
asp   imperf perf
ngt   aff neg
acm   congr rec
acn   akc nakc
ppr   npraep praep
agg   agl nagl
vcl   nwok wok
dot   pun npun
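Since no value name in the list above is shared between two attributes, a tag can be decomposed by looking up each colon-separated value. Here is a Python sketch of that idea; `parse_tag` is a hypothetical helper, and it does not check whether the grammatical class actually permits the attributes found.

```python
ATTRIBUTE_VALUES = {
    "nmb": ["sg", "pl"],
    "cas": ["nom", "gen", "dat", "acc", "inst", "loc", "voc"],
    "gnd": ["m1", "m2", "m3", "f", "n"],
    "per": ["pri", "sec", "ter"],
    "deg": ["pos", "com", "sup"],
    "asp": ["imperf", "perf"],
    "ngt": ["aff", "neg"],
    "acm": ["congr", "rec"],
    "acn": ["akc", "nakc"],
    "ppr": ["npraep", "praep"],
    "agg": ["agl", "nagl"],
    "vcl": ["nwok", "wok"],
    "dot": ["pun", "npun"],
}

# Reverse index: value name -> attribute name.
VALUE_TO_ATTRIBUTE = {value: attr
                      for attr, values in ATTRIBUTE_VALUES.items()
                      for value in values}


def parse_tag(tag):
    """Split a tag such as 'subst:sg:nom:f' into class and attributes."""
    parts = tag.split(":")
    parsed = {"class": parts[0]}
    for value in parts[1:]:
        # Raises KeyError for a value that is not in the list above.
        parsed[VALUE_TO_ATTRIBUTE[value]] = value
    return parsed
```

Looking values up by name rather than by position means that optional attributes, such as [vcl] in prep:gen:nwok, need no special handling.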