User guide

This guide describes how to use the command-line utilities provided with Maca, Toki and Corpus2.

Technical note: these utilities are thin wrappers around the corresponding API functions, so the same functionality is available directly through the libraries.

It is assumed here that Maca, Toki and the SFST plugin have been installed. Whenever Morfeusz is mentioned, it is assumed that Maca has been built with Morfeusz support (if the Morfeusz library is found in the system, the plugin will be built by default).

Getting started

Maca provides two command-line utils:
  • maca-analyse: the actual analyser
  • maca-convert: the tagset converter

Toki provides one util called toki-app. The Corpus2 library provides a simple tool to inspect tagsets and perform tag validation (tagset-tool).

You can invoke each util with -h (--help) to get some description and allowed options.

maca-analyse, maca-convert and toki-app operate on streams, allowing for pipeline processing (e.g. analyse-and-convert sequences). To see the analyser in action, issue:

maca-analyse morfo1222-ikipi

This will load the free morphological data converted from Morfologik and the default tokenisation strategy. Enter some text and hit ^D. You can also set a different output mode, e.g. XCES (-o xces).
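The same analysis can be run non-interactively by piping text in. The sketch below uses only options mentioned in this guide (-q, -o); the analyse_text function and the MACA_ANALYSE variable are illustrative names, not part of Maca.

```shell
#!/bin/sh
# Assumption: MACA_ANALYSE is a hypothetical override for the analyser binary,
# handy for trying the wrapper on a machine without Maca installed.
: "${MACA_ANALYSE:=maca-analyse}"

# analyse_text CONFIG [FORMAT]
# Reads plain text from stdin and writes the analysis to stdout.
# -q suppresses diagnostic messages; -o selects the output mode (xces here
# unless another format name is given).
analyse_text() {
    "$MACA_ANALYSE" -q "$1" -o "${2:-xces}"
}
```

echo "Zjadłaś dwa śledzie." | analyse_text morfo1222-ikipi > out.xml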

maca-analyse uses the concept of configuration files that define particular analyser and tokeniser behaviour. Such configuration files are sought according to the current search path (the simplest way to see it is to call maca-analyse with a non-existent config name). Note that a config file may reference other files, e.g. compiled morphological dictionaries. Such files are also sought in the search path.

Maca configurations are tied to particular Toki (tokeniser) configurations. This may be overridden if desired (see maca-analyse -h for details).

Using the tokeniser

Toki is a configurable tokeniser, performing segmentation into tokens and sentence-splitting (optional). Toki is used by Maca, but it is also useful as a stand-alone tool.
Toki attaches the following information to each token it outputs:
  • orthographical form (as encountered in the running text),
  • token type label (which may be any string), useful for early classification, such as punctuation mark, symbol or regular word form,
  • qualitative description of the amount of whitespace that preceded the token.

The segmentation strategy and token types attached depend on the configuration used. The default configuration is targeted at Polish. To see it working, run toki-app, enter some text and hit ^D. To see the division into sentences, use a custom format string, e.g. toki-app -f "\$bs|\t\$orth\$type:\$ws\n".
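A note on shell quoting for the format string: inside double quotes each $ has to be escaped so that the shell does not expand it, whereas single quotes pass the string through untouched. The sketch below shows that both spellings yield the same literal string; the placeholders are copied from the example above, while the wrapper function and the TOKI_APP variable are illustrative names.

```shell
#!/bin/sh
# Both variables end up holding the same literal text,
# so toki-app receives an identical format string either way.
fmt_double="\$bs|\t\$orth\$type:\$ws\n"
fmt_single='$bs|\t$orth$type:$ws\n'

# Assumption: TOKI_APP is a hypothetical override for the tokeniser binary.
: "${TOKI_APP:=toki-app}"

# tokenise_with_sentences: reads text from stdin and prints tokens using
# the custom format string.
tokenise_with_sentences() {
    "$TOKI_APP" -f "$fmt_single"
}
```

echo "Ala ma kota. Kot ma Alę." | tokenise_with_sentences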

There is also an option of using toki-app only as a sentence splitter (using SRX rules). More information is available in doc/Overview.txt as well as in the help message (see toki-app -h).

Input and output types

maca-analyse may process plain text (default) or simple XML files containing plain text divided into paragraphs.

In the latter case, the XML file structure is output intact, while each text element between XML tags (PCDATA) is fed through the sentencer+analyser pipeline. As a result, each input block of text is replaced with a sequence of sentence nodes. The intended usage is the analysis of pre-morph files (as in the IPI PAN Corpus). This input mode is triggered with the -i premorph-stream switch.

NOTE: there is also an input mode labelled just premorph, which requires the input to be valid pre_morph. This mode additionally skips empty paragraphs (“chunk” nodes containing nothing but whitespace). It should be preferred when the input is supposed to be valid pre_morph.

Examples:

# the input must be a valid pre_morph file
maca-analyse -q morfeusz-kipi -i premorph -o xces < pre_morph.xml > morph.xml

# XML document divided into paragraphs
maca-analyse -q morfeusz-kipi -i premorph-stream -o xces < document.xml > morph.xml
# as above but don't output sentence boundaries
# (useful if they are already marked in the input document.xml)
maca-analyse -q morfeusz-kipi -i premorph-stream-nosent -o xces < document.xml > morph.xml

To analyse multiple pre-morph files at a time (reduced start-up overhead), use maca-analyse-batch (see its help message, -h).
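Where maca-analyse-batch is not an option, a plain shell loop over maca-analyse is a simple fallback, at the cost of reloading the dictionaries for every file. The sketch below reuses the premorph invocation from the examples above; analyse_dir and MACA_ANALYSE are illustrative names, and the *.xml naming of the input files is an assumption.

```shell
#!/bin/sh
# Assumption: MACA_ANALYSE is a hypothetical override for the analyser binary.
: "${MACA_ANALYSE:=maca-analyse}"

# analyse_dir IN_DIR OUT_DIR
# Runs the analyser once per *.xml file in IN_DIR and writes a same-named
# XCES file to OUT_DIR. Stops at the first failure.
analyse_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for f in "$in_dir"/*.xml; do
        [ -e "$f" ] || continue   # unmatched glob stays literal; skip it
        "$MACA_ANALYSE" -q morfeusz-kipi -i premorph -o xces \
            < "$f" > "$out_dir/$(basename "$f")" || return 1
    done
}
```

analyse_dir premorph-files/ morph-files/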

Note that maca-analyse is intended to deal with text (possibly partially segmented). The utilities that read morphologically analysed input (maca-convert) offer more input formats, as defined by the corpus2 I/O API. To see the list of possible input formats, issue maca-convert -h (they are listed under the -i option). For instance:
  • xces, the XCES format for morphologically analysed files, as used in the IPI PAN Corpus
  • rft, the simple text format as expected by RFTagger (this format is lossy, e.g. lemmas are not stored)
  • ccl, a simplified XCES derivative extended with support for chunk annotations

The format may be parametrised by options, e.g. xces,disamb_only reads only interpretations marked with disamb="1".

There is also a range of available output formats, most of them corresponding to the offered input formats. Some of the output formats are lossy, e.g. -o premorph attempts to reproduce the pre_morph XML file containing only text divided into paragraphs. The output formats may also be parametrised by options, e.g. xces,flat prevents XML indenting (useful for Spejd); rft,mbt outputs in the MBT dialect.

MACA may also be used just to convert file formats. This is done by using maca-convert with the nop conversion routine.
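A minimal sketch of such a format-only conversion, combining the nop routine with the -i and -o format strings described above. The tagset must be named with -t when nop is used. The convert_format function and the MACA_CONVERT variable are illustrative names.

```shell
#!/bin/sh
# Assumption: MACA_CONVERT is a hypothetical override for the converter binary.
: "${MACA_CONVERT:=maca-convert}"

# convert_format TAGSET IN_FORMAT OUT_FORMAT
# Reads stdin in IN_FORMAT and writes stdout in OUT_FORMAT. "nop" is the
# no-op conversion routine, so tags are left alone and only the file format
# changes. Format strings may carry options, e.g. xces,disamb_only.
convert_format() {
    "$MACA_CONVERT" -q nop -t "$1" -i "$2" -o "$3"
}
```

convert_format kipi xces,disamb_only rft < morph.xml > morph.rft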

NOTE on outputting in XCES: when analysing plain text, by default the output is not divided into paragraphs (non-“s” chunk nodes). Most applications that consume the output will probably expect it to be divided. This is where the -s (--split) option comes in handy: it automatically generates a paragraph boundary each time whitespace containing multiple newlines is encountered. Note that this option does not affect the processing of pre-morph-like files, as they already include paragraph boundaries.

API note: the whole I/O framework is available in the corpus2 library, with two simple factories able to create the desired reader/writer based on a user-provided string (e.g. xces,sorttags).

Using Maca as an analyser shell for Morfeusz SIaT and Morfeusz SGJP

There are two versions of Morfeusz: SIaT (non-free) and SGJP (two-clause BSD licence). Both of them are distributed as a shared library hard-coded with data.

Normally those two versions cannot be installed side-by-side: although the versions behave quite differently, both have the same internal name (soname) and version. Unless the soname/version gets fixed, you've got two ways to use Morfeusz SGJP:
  1. using the configurations that assume that the only Morfeusz installed is the official Morfeusz SGJP (recommended; configs: morfeusz-nkjp-official and morfeusz-nkjp-official-guesser),
  2. if you need both versions coexisting peacefully, please contact the MACA authors; we have a tweaked version of Morfeusz SGJP (to be used with the morfeusz-nkjp and morfeusz-nkjp-guesser configs).

In either case, to use data from any version of Morfeusz, you have to compile MACA with the Morfeusz plug-in (if the Morfeusz library is discovered during the CMake run, the plug-in will be built and installed automatically).

The following configurations make it possible to use the data from Morfeusz SIaT (assuming this version is installed; no tweaks are needed):
  • morfeusz — outputs in unchanged Morfeusz tagset
  • morfeusz-kipi — outputs in KIPI (IPIC) tagset as in korpus.pl (fewer genders)
  • morfeusz-kipi-guesser — as above, also uses guesser from TaKIPI's libcorpus1 (Maca must be compiled with the guesser plugin; libcorpus1 must be present during the CMake run).

To use the official version of Morfeusz SGJP (assuming it's the only Morfeusz installed), use the following configurations:
  • sgjp-official — outputs in the original Morfeusz SGJP tagset (many genders)
  • morfeusz-nkjp-official — outputs in the real NKJP tagset (fewer genders, recommended for general purpose usage)
  • morfeusz-nkjp-official-guesser — as above, but also uses the guesser from TaKIPI/libcorpus1 (guesser output is converted into the NKJP tagset)
  • morfsgjp-kipi — performs naive conversion into the KIPI tagset; guesser is not used

If you're using the tweaked version of Morfeusz SGJP (PWr repo), use the config variants without the word official. Those versions explicitly specify the library soname to be loaded (this soname has been tweaked).

In case of trouble, please inspect the configuration (INI file) first and try removing any custom Morfeusz library name from the config, if present. You can copy the system installation of the configuration file to any local directory and make changes there; MACA searches the current directory first. If this does not fix the problem, please contact the authors.

Using MACA with TaKIPI

TaKIPI is able to tag plain text (using -it TXT). This is achieved by using hard-coded tokenisation and sentence splitting rules, and its own Morfeusz wrapper. There are several reasons to prefer external means of performing these tasks, e.g.:
  • when tagging plain text, TaKIPI uses data from Morfeusz SIaT, whose licence is quite restrictive,
  • TaKIPI segmentation rules are not flexible,
  • sentence segmentation heuristics are rudimentary,
  • TaKIPI won't split sentences if there is no explicit punctuation mark (whatever the vertical whitespace amount),
  • there is no control over morphological analysis (besides guesser on/off switch).

Fortunately, TaKIPI is able to read an already tokenised and morphologically analysed input (using -it CORPUS). This is where Maca comes in.

Note: before version 1.8-2 (revision 534) TaKIPI didn't read sentence division in the input XCES/XML file (the input was re-segmented with libcorpus1's sentencer anyway). To get the benefit of Marcin Miłkowski's SRX rules that are bundled with Maca, make sure that you have a recent version of TaKIPI.

To tag plain text using the Morfeusz wrapper bundled with Maca, issue the following (optionally, use -q to suppress diagnostic messages):

maca-analyse morfeusz-kipi -s -o xces < INPUT_TEXT > out-mor.xml
takipi -i out-mor.xml -o out-tagged.xml -it CORPUS

NOTE that to comply with TaKIPI, you have to use one of the configurations that output in the KIPI tagset (as a matter of fact, the KIPI tagset is hardcoded in TaKIPI sources).

The -s (--split) option forces Maca to divide text into paragraphs whenever many newline characters occur. This is recommended. Note that without this option, the output text will not contain paragraph division at all (the output XML will consist of sentence chunks only, there will be no paragraph chunks). Some utils (including maca-convert) expect paragraphs in the XML.

To use TaKIPI's morphological guesser, use morfeusz-kipi-guesser instead. Note that this requires Maca to be built with the guesser plug-in (it is probably already installed if TaKIPI was present in the system when you installed Maca).
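The two steps can be wrapped in a small shell function so that TaKIPI only runs when the analysis succeeded. This is a sketch that uses only the options shown in this section; tag_plain_text and the MACA_ANALYSE and TAKIPI variables are illustrative names.

```shell
#!/bin/sh
# Assumption: both variables are hypothetical overrides for the real binaries.
: "${MACA_ANALYSE:=maca-analyse}"
: "${TAKIPI:=takipi}"

# tag_plain_text INPUT_TEXT MORPH_XML TAGGED_XML [CONFIG]
# Step 1: tokenise, split into paragraphs (-s) and analyse into XCES.
# Step 2: tag the analysed file with TaKIPI (-it CORPUS), only if step 1
#         succeeded.
# CONFIG must output in the KIPI tagset; it defaults to morfeusz-kipi.
tag_plain_text() {
    config=${4:-morfeusz-kipi}
    "$MACA_ANALYSE" -q "$config" -s -o xces < "$1" > "$2" &&
        "$TAKIPI" -i "$2" -o "$3" -it CORPUS
}
```

tag_plain_text INPUT_TEXT out-mor.xml out-tagged.xml morfeusz-kipi-guesser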

Note: If you are interested in processing multiple files at once, there is a util called maca-analyse-batch (consult its --help message). This is a batch-mode version of maca-analyse, suitable for working in tandem with TaKIPI in its own batch mode (-is). Batch processing is recommended, since it reduces start-up overhead (loading of morphological dictionaries, which is especially important when using the guesser plug-in). TODO: add example usage.

To analyse pre-morph-style XML files (i.e. text divided only into paragraphs by XML tags), use the -i premorph-stream mode. E.g.:

maca-analyse -q morfeusz-kipi -i premorph-stream -o xces < premorph.xml > morph.xml

To use the data resulting from the Morfologik conversion (a free alternative to Morfeusz), use the morfo1222-ikipi configuration. Note that this configuration outputs in the intermediate “IKIPI” tagset. To have it converted into the IPIC (KIPI) tagset, use maca-convert:

echo "Zjadłaś dwa śledzie. Znikły bez śladu." | maca-analyse morfo1222-ikipi --split -o xces | maca-convert ikipi2kipi.conv -o xces > out-kipi.xml

Tagsets and tagset converter

Maca is able to perform simple tagset conversions.

To inspect a tagset, run tagset-tool. For example, to see the differences between the KIPI (IPIC) and IKIPI tagsets, compare the output of tagset-tool kipi with that of tagset-tool ikipi. Tagsets are defined by INI files; you can supply your own definitions. To browse the existing definitions, check the tagset-tool search path and look through the directories on it.

The tagset converter (maca-convert) works on streams. Usually it makes sense to set output to XCES (-o xces). It is parametrised with tagset conversion routines, being INI files (note that, currently, you must specify the full name including the extension). For instance, to convert from IKIPI to KIPI, use the following:

maca-convert ikipi2kipi.conv < in-ikipi.xml > out-kipi.xml

maca-convert may also be useful as a tool to convert file formats or even dialects of the same format (e.g. to reformat XCES XML and sort tags, which is useful before calling diff). To use it that way, provide nop as the converter name and specify the tagset name (-t name). E.g.

maca-convert -q nop -t kipi -o xces,sorttags < diag.retagd > diag.mretagd
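To compare two XCES files with diff, both can be passed through the same normalisation first. The sketch below builds on the command above; normalise_xces, diff_xces and MACA_CONVERT are illustrative names, and mktemp is assumed to be available.

```shell
#!/bin/sh
# Assumption: MACA_CONVERT is a hypothetical override for the converter binary.
: "${MACA_CONVERT:=maca-convert}"

# normalise_xces TAGSET
# Reads XCES from stdin and writes it back with tags sorted.
normalise_xces() {
    "$MACA_CONVERT" -q nop -t "$1" -o xces,sorttags
}

# diff_xces TAGSET FILE_A FILE_B
# Normalises both files into temporary copies, then diffs the copies.
# Returns the status of the first step that failed, or diff's status
# (0 = identical, 1 = different).
diff_xces() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    normalise_xces "$1" < "$2" > "$a" &&
        normalise_xces "$1" < "$3" > "$b" &&
        diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

diff_xces kipi tagged-old.xml tagged-new.xml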

We also provide three conversion routines related to the NKJP tagset: nkjp2kipi.conv (NKJP to KIPI conversion), morfsgjp2kipi.conv (the actual tagset of Morfeusz SGJP to KIPI) and sgjp2nkjp.conv (actual Morfeusz SGJP to NKJP). Note that the conversion into KIPI is by no means perfect; some additional comments may be found in the conversion routines themselves.

Generating random samples for training and testing of taggers

maca-convert may be used to generate N random splits into train and test files:

maca-convert -c nop -t ... -i .... -I ..... -F NUM_SPLITS -f FILE_PREFIX -r FRAC_TRAIN -R FRAC_TEST

e.g.

maca-convert -t nkjp -c nop -I nkjp-whole.xml -F 10 -f ~/nkjp-folds/ -r 0.9

This will generate ~/nkjp-folds/train01.xml, ~/nkjp-folds/test01.xml, etc., up to 10, each split consisting of 90% training and 10% test data.
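After generating the folds, it may be worth checking that every train/test pair is present before starting a long tagger evaluation. The sketch below assumes the naming suggested by the example (trainNN.xml and testNN.xml with a two-digit, zero-padded index); check_folds is an illustrative helper, not part of Maca.

```shell
#!/bin/sh
# check_folds DIR NUM_SPLITS
# Reports every missing trainNN.xml / testNN.xml on stderr.
# Returns 0 if all files exist, 1 otherwise.
# Assumption: two-digit zero-padded numbering, as in train01.xml.
check_folds() {
    dir=$1
    n=$2
    missing=0
    i=1
    while [ "$i" -le "$n" ]; do
        nn=$(printf '%02d' "$i")
        for kind in train test; do
            if [ ! -f "$dir/$kind$nn.xml" ]; then
                echo "missing: $dir/$kind$nn.xml" >&2
                missing=1
            fi
        done
        i=$((i + 1))
    done
    return "$missing"
}
```

check_folds ~/nkjp-folds 10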