User guide
This guide describes how to use the command-line utilities provided with Maca, Toki and Corpus2.
Technical note: these utilities are simple programs that call the corresponding API functions, hence similar functionality may be easily obtained by using the libraries directly.
It is assumed here that Maca, Toki and the SFST plugin have been installed. Whenever Morfeusz is mentioned, it is assumed that Maca has been built with Morfeusz support (if the Morfeusz library is found in the system, the plugin will be built by default).
Getting started
Maca provides two command-line utils:
- maca-analyse: the actual analyser
- maca-convert: the tagset converter
Toki provides one util called toki-app. The Corpus2 library provides a simple tool to inspect tagsets and perform tag validation.
You can invoke each util with --help to get a short description and the list of allowed options.
The utilities, including toki-app, operate on streams, allowing for pipeline processing (e.g. analyse-and-convert sequences). To see the analyser in action, run maca-analyse with the name of a configuration.
With a Morfologik-based configuration, this will load the free morphological data converted from Morfologik and the default tokenisation strategy. Enter some text and hit ^D. You can also set a different output mode, e.g. XCES.
maca-analyse uses the concept of configuration files that define particular analyser and tokeniser behaviour. Such configuration files are sought according to the current search path (the simplest way to see it is to call maca-analyse with a non-existent config name). Note that a config file may reference other files, e.g. compiled morphological dictionaries. Such files are also sought in the search path.
Maca configurations are tied to particular Toki (tokeniser) configurations. This may be overridden if desired (see maca-analyse -h for details).
Using the tokeniser
Toki is a configurable tokeniser that performs segmentation into tokens and, optionally, sentence splitting. Toki is used by Maca, but it is also useful as a stand-alone tool.
Toki attaches the following information to each token it outputs:
- orthographic form (as encountered in the running text),
- token type label (which may be any string), useful for early classification, such as punctuation mark, symbol or regular word form,
- qualitative description of the amount of whitespace that preceded the token.
The segmentation strategy and the token types attached depend on the configuration used. The default configuration is targeted at Polish. To see it working, run toki-app, enter some text and hit ^D. To see the division into sentences, use a custom format string, e.g. toki-app -f "\$bs|\t\$orth\$type:\$ws\n".
There is also an option of using toki-app only as a sentence splitter (using SRX rules). More information is available in doc/Overview.txt as well as in the help message.
Input and output types
maca-analyse may process plain text (default) or simple XML files containing plain text divided into paragraphs.
In the latter case, the XML file structure is output intact, while each text element between XML tags (PCDATA) is fed through the sentencer+analyser pipeline. As a result, each input block of text is replaced by a sequence of sentence nodes. The intended usage was to analyse pre-morph files (as in the IPI PAN Corpus). This input mode is triggered with the -i premorph-stream switch. NOTE: there is also an input mode labelled just premorph, which requires the input to be valid pre_morph. This mode additionally skips empty paragraphs (“chunk” nodes containing nothing but whitespace). It should be preferred when the input is supposed to be valid pre_morph.
# the input must be a valid pre_morph file
maca-analyse -q morfeusz-kipi -i premorph -o xces < pre_morph.xml > morph.xml
# XML document divided into paragraphs
maca-analyse -q morfeusz-kipi -i premorph-stream -o xces < document.xml > morph.xml
# as above but don't output sentence boundaries
# (useful if they are already marked in the input document.xml)
maca-analyse -q morfeusz-kipi -i premorph-stream-nosent -o xces < document.xml > morph.xml
To analyse multiple pre-morph files at a time (with reduced start-up overhead), use maca-analyse-batch (see its help message).
maca-analyse is intended to deal with text (possibly partially segmented). The utilities that read morphologically analysed input (maca-convert) offer more input formats, as defined by the corpus2 I/O API. To see the list of possible input formats, issue maca-convert -h (they are listed under the -i options). For instance:
- xces, the XCES format for morphologically analysed files, as used in the IPI PAN Corpus
- rft, the simple text format as expected by RFTagger (this format is lossy, e.g. lemmas are not stored)
- ccl, a simplified XCES derivative extended with means of storing chunk annotations
The format may be parametrised by options, e.g. xces,disamb_only reads only the interpretations marked as disambiguated.
There is also a range of available output formats, most of them corresponding to the offered input formats. Some of the output formats are lossy, e.g. -o premorph attempts to reproduce the pre_morph XML file containing only text divided into paragraphs. The output formats may also be parametrised by options, e.g.
- xces,flat prevents XML indenting (useful for Spejd);
- rft,mbt outputs in the MBT dialect.
MACA may also be used just to convert file formats. This may be accomplished by using maca-convert with the nop conversion routine.
NOTE on outputting in XCES: when analysing plain text, by default the output is not divided into paragraphs (non-“s” chunk nodes). Most applications will probably want their input divided into paragraphs. This is where the --split option comes in handy: it automatically generates a paragraph boundary each time many-newline whitespace is encountered. Note that this option does not affect the processing of pre-morph-like files, as they already include paragraph boundaries.
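The paragraph-splitting rule can be pictured with a small shell sketch. It is not Maca's implementation: it assumes that "many-newline whitespace" means at least one empty line between blocks of text, and it uses awk's paragraph mode to count the paragraphs such a rule would produce. The function name count_paragraphs is made up for this illustration.

```sh
# Illustration only: approximates the paragraph boundaries that --split
# would generate, ASSUMING a boundary is one or more empty lines.
# count_paragraphs is a hypothetical helper, not part of Maca.
count_paragraphs() {
    # RS="" switches awk to paragraph mode: records are separated by
    # runs of empty lines, so NR at the end is the paragraph count.
    awk 'BEGIN { RS = "" } END { print NR }'
}

# Two blocks of text separated by empty lines
printf 'First paragraph,\nsecond line.\n\n\nSecond paragraph.\n' | count_paragraphs
```

Lines that contain only spaces or tabs are not treated as empty by this sketch, which may differ from how Toki classifies whitespace.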
API note: the whole I/O framework is available in the corpus2 library, with two simple factories able to create the desired reader/writer based on a user-provided string.
Using Maca as an analyser shell for Morfeusz SIaT and Morfeusz SGJP
- using the configurations that assume that the only Morfeusz installed is the official Morfeusz SGJP (recommended),
- if you need both versions coexisting peacefully, please contact the MACA authors; we have a tweaked version of Morfeusz SGJP.
Either way, to use data from any version of Morfeusz, you have to compile MACA with the Morfeusz plug-in (if the Morfeusz library is discovered during the CMake run, it will be installed automatically).
The following configurations allow using the data from Morfeusz SIaT (assuming this version is installed, no tweaks are needed):
- morfeusz: outputs in the unchanged Morfeusz tagset
- morfeusz-kipi: outputs in the KIPI (IPIC) tagset as in korpus.pl (fewer genders)
- morfeusz-kipi-guesser: as above, also uses the guesser from TaKIPI's libcorpus1 (Maca must be compiled with the guesser plugin, libcorpus1 must be present during the CMake run)
- sgjp-official: outputs in the original Morfeusz SGJP tagset (many genders)
- morfeusz-nkjp-official: outputs in the real NKJP tagset (fewer genders, recommended for general-purpose usage)
- morfeusz-nkjp-official-guesser: as above, but also uses the guesser from TaKIPI/libcorpus1 (guesser output is converted into the NKJP tagset)
- morfsgjp-kipi: performs naive conversion into the KIPI tagset; the guesser is not used
If you're using the tweaked version of Morfeusz SGJP (PWr repo), use the config variants without the word official. Those versions explicitly specify the library soname to be loaded (this soname has been tweaked).
In case of trouble, please inspect the configuration (INI file) first and try removing any custom Morfeusz library name from it, if present. You can copy the system installation of the configuration file to any local directory and make changes there; MACA searches the current directory first. If this does not fix the problem, please contact the authors.
Using MACA with TaKIPI
TaKIPI is able to tag plain text (using -it TXT). This is achieved by using hard-coded tokenisation and sentence splitting rules, and its own Morfeusz wrapper. There are several reasons to prefer external means of performing these tasks, e.g.:
- when tagging plain text, TaKIPI uses data from Morfeusz SIaT, whose licence is quite restrictive,
- TaKIPI segmentation rules are not flexible,
- sentence segmentation heuristics are rudimentary,
- TaKIPI won't split sentences if there is no explicit punctuation mark (whatever the vertical whitespace amount),
- there is no control over morphological analysis (besides guesser on/off switch).
Fortunately, TaKIPI is able to read already tokenised and morphologically analysed input (using -it CORPUS). This is where Maca comes in.
Note: before version 1.8-2 (revision 534) TaKIPI didn't read the sentence division in the input XCES/XML file (the input was re-segmented with libcorpus1's sentencer anyway). To get the benefit of Marcin Miłkowski's SRX rules that are bundled with Maca, make sure that you have a recent version of TaKIPI.
To tag plain text using the Maca-bundled Morfeusz wrapper, issue the following (optionally, use -q to suppress diagnostic messages):
maca-analyse morfeusz-kipi -s -o xces < INPUT_TEXT > out-mor.xml
takipi -i out-mor.xml -o out-tagged.xml -it CORPUS
NOTE that to comply with TaKIPI, you have to use one of the configurations that output in the KIPI tagset (as a matter of fact, the KIPI tagset is hardcoded in TaKIPI sources).
The --split option forces Maca to divide the text into paragraphs whenever many newline characters occur. This is recommended. Note that without this option, the output text will not contain paragraph division at all (the output XML will consist of sentence chunks only; there will be no paragraph chunks). Some utils (including maca-convert) expect paragraphs in the XML.
To use TaKIPI's morphological guesser, use morfeusz-kipi-guesser instead. Note that this requires Maca built with the guesser plug-in (it is probably already installed if TaKIPI had been installed in the system when you installed Maca).
Note: if you are interested in processing multiple files at once, there is a util called maca-analyse-batch; consult its --help message. This is a batch-mode version of maca-analyse, suitable for working in tandem with TaKIPI in its own batch mode (-is). Batch processing is recommended, since it reduces start-up overhead (loading of morphological dictionaries, especially important when using the guesser plug-in). TODO: add example usage.
To analyse pre-morph-style XML files (i.e. text divided only into paragraphs by XML tags), use the -i premorph-stream mode. E.g.:
maca-analyse -q morfeusz-kipi -i premorph-stream -o xces < premorph.xml > morph.xml
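When several pre-morph-style files have to be processed and maca-analyse-batch is not an option, the command above can simply be repeated per file. The sketch below only prints the commands it would run, so they can be inspected first; it reuses the flags shown above and nothing else. The helper name plan_premorph_analysis and the .morph.xml output suffix are assumptions made for this example. Each invocation reloads the dictionaries, so the start-up overhead is paid once per file.

```sh
# Hypothetical helper: prints one maca-analyse command per input file.
# Usage: plan_premorph_analysis CONFIG FILE...
# ASSUMPTIONS: output files get a made-up ".morph.xml" suffix, and file
# names contain no spaces or shell metacharacters.
plan_premorph_analysis() {
    config=$1
    shift
    for input in "$@"; do
        # Strip a trailing ".xml" (if any) before adding the new suffix
        output="${input%.xml}.morph.xml"
        printf 'maca-analyse -q %s -i premorph-stream -o xces < %s > %s\n' \
            "$config" "$input" "$output"
    done
}

plan_premorph_analysis morfeusz-kipi doc1.xml doc2.xml
```

After checking the printed lines, they can be executed by piping the output to sh.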
To use the data resulting from the Morfologik conversion (a free Morfeusz alternative), use the morfo1122-ikipi configuration. Note that this configuration outputs in the intermediate “IKIPI” tagset. To have it converted into the IPIC (KIPI) tagset, use maca-convert, e.g.:
echo "Zjadłaś dwa śledzie. Znikły bez śladu." | maca-analyse morfo1222-ikipi --split -o xces | maca-convert ikipi2kipi.conv -o xces > out-kipi.xml
Tagsets and tagset converter
Maca is able to perform simple tagset conversions.
To see different tagsets, run tagset-tool, e.g. to see the differences between the KIPI (IPIC) and IKIPI tagsets, you can compare tagset-tool kipi with tagset-tool ikipi. Tagsets are defined by INI files; you can supply your own definitions. To browse the existing ones, check the tagset-tool search path and browse the directory.
The tagset converter (maca-convert) works on streams. Usually it makes sense to set the output to XCES (-o xces). It is parametrised with tagset conversion routines, which are INI files (note that, currently, you must specify the full name including the extension). For instance, to convert from IKIPI to KIPI, use the following:
maca-convert ikipi2kipi.conv < in-ikipi.xml > out-kipi.xml
maca-convert may also be useful as a tool to convert file formats or even dialects of the same format (e.g. to reformat XCES XML and sort tags, which is useful before calling diff). To use it that way, provide nop as the converter name and specify the tagset name (-t name). E.g.
maca-convert -q nop -t kipi -o xces,sorttags < diag.retagd > diag.mretagd
We also provide three conversion routines related to the NKJP tagset: nkjp2kipi.conv (NKJP to KIPI conversion), morfsgjp2kipi.conv (the actual tagset of Morfeusz SGJP to KIPI) and sgjp2nkjp.conv (actual Morfeusz SGJP to NKJP). Note that the conversion into KIPI is by no means perfect; some additional comments may be found in the conversion routines themselves.
Generating random samples for training and testing of taggers
maca-convert may be used to generate N random splits into train and test files:
maca-convert -c nop -t ... -i .... -I ..... -F NUM_SPLITS -f FILE_PREFIX -r FRAC_TRAIN -R FRAC_TEST
For example:
maca-convert -t nkjp -c nop -I nkjp-whole.xml -F 10 -f ~/nkjp-folds/ -r 0.9
This will generate ~/nkjp-folds/train01.xml, ~/nkjp-folds/test01.xml and so on up to 10, each split consisting of 90% training and 10% test data.
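The naming scheme can be made explicit with a short shell sketch that lists the files expected for a given prefix and number of splits. It assumes two-digit, zero-padded numbering, as the train01.xml and test01.xml names suggest; list_fold_files is a hypothetical helper and does not call maca-convert.

```sh
# Hypothetical helper: lists the train/test file names expected for
# NUM_SPLITS folds under FILE_PREFIX.
# ASSUMPTION: fold numbers are zero-padded to two digits (train01.xml).
# Usage: list_fold_files FILE_PREFIX NUM_SPLITS
list_fold_files() {
    prefix=$1
    num_splits=$2
    i=1
    while [ "$i" -le "$num_splits" ]; do
        printf '%strain%02d.xml\n' "$prefix" "$i"
        printf '%stest%02d.xml\n' "$prefix" "$i"
        i=$((i + 1))
    done
}

list_fold_files "$HOME/nkjp-folds/" 10
```

Such a list may be handy for checking that all folds were written before starting tagger training.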