User guide¶
This guide describes how to use the command-line utilities provided with Maca, Toki and Corpus2.
Technical note: those utilities are simple programs referencing the corresponding API functions, hence similar functionality may be easily obtained by using the libraries.
It is assumed here that Maca, Toki and the SFST plugin have been installed. Whenever Morfeusz is mentioned, it is assumed that Maca has been built with Morfeusz support (if the Morfeusz library is found in the system, the plugin will be built by default).
Getting started¶
Maca provides two command-line utils:
- maca-analyse: the actual analyser,
- maca-convert: the tagset converter.
Toki provides one util called toki-app. The Corpus2 library provides a simple tool to inspect tagsets and perform tag validation (tagset-tool).
You can invoke each util with -h (--help) to get a description and the allowed options.
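Before trying the examples, it may help to confirm that the utilities are actually on the PATH. The following is a minimal sketch; the helper name check_tools is made up for this illustration and is not part of Maca, Toki or Corpus2.

```shell
# check_tools is a hypothetical helper, not shipped with Maca.
# For each name given, it reports whether a command of that name is on PATH.
# It returns non-zero if at least one of them is missing.
check_tools() {
    status=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Example: check_tools maca-analyse maca-convert toki-app tagset-tool
```

For every util reported as found, calling it with -h prints its description and options.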
maca-analyse, maca-convert and toki-app operate on streams, allowing for pipeline processing (e.g. analyse-and-convert sequences). To see the analyser in action, issue:
maca-analyse morfo1222-ikipi
This will load the free morphological data converted from Morfologik and the default tokenisation strategy. Enter some text and hit ^D. You can also set a different output mode, e.g. XCES (-o xces).
maca-analyse uses configuration files that define particular analyser and tokeniser behaviour. Such configuration files are looked up along the current search path (the simplest way to see the path is to call maca-analyse with a non-existent config name). Note that a config file may reference other files, e.g. compiled morphological dictionaries. Such files are also looked up along the search path.
Maca configurations are tied to particular Toki (tokeniser) configurations. This may be overridden if desired (see maca-analyse -h for details).
Using the tokeniser¶
Toki is a configurable tokeniser that performs segmentation into tokens and, optionally, sentence splitting. Toki is used by Maca, but it is also useful as a stand-alone tool.
Each token output by Toki carries the following information:
- orthographical form (as encountered in the running text),
- token type label (which may be any string), useful for early classification, e.g. as a punctuation mark, symbol or regular word form,
- qualitative description of whitespace amount that preceded the token.
The segmentation strategy and the token types attached depend on the configuration used. The default configuration is targeted at Polish. To see it working, run toki-app, enter some text and hit ^D. To see the division into sentences, use a custom format string, e.g. toki-app -f "\$bs|\t\$orth\$type:\$ws\n".
toki-app may also be used only as a sentence splitter (using SRX rules). More information is available in doc/Overview.txt as well as in the help message (see toki-app -h).
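The custom format string can be wrapped in a small shell function so that it does not have to be retyped. This is only a sketch: the function name toki_show_sentences and the TOKI_APP variable are assumptions of this example, while the format string itself is the one quoted in the text.

```shell
# TOKI_APP is an assumption of this sketch: it lets you point at a binary
# under a different name or location; it defaults to toki-app.
TOKI_APP="${TOKI_APP:-toki-app}"

# Tokenise standard input using the format string from this guide.
# Single quotes keep $bs, $orth, $type and $ws away from the shell, so no
# backslash is needed before each dollar sign.
toki_show_sentences() {
    "$TOKI_APP" -f '$bs|\t$orth$type:$ws\n'
}

# Example: echo "Ala ma kota. Kot ma Alę." | toki_show_sentences
```

Inside double quotes the same string has to be written with \$ in place of each $, as in the text, because the shell would otherwise try to expand the names as its own variables.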
Input and output types¶
maca-analyse may process plain text (default) or simple XML files containing plain text divided into paragraphs.
In the latter case, the XML file structure is output intact, while each text element between XML tags (PCDATA) is fed through the sentencer+analyser pipeline. As a result, each input block of text is replaced with a sequence of sentence nodes. The intended usage is to analyse pre-morph files (as in the IPI PAN Corpus). This input mode is triggered with the -i premorph-stream switch. NOTE: there is also an input mode labelled just premorph, which requires the input to be valid pre_morph. This mode additionally skips empty paragraphs (“chunk” nodes containing nothing but whitespace). It should be preferred when the input is supposed to be valid pre_morph.
Examples:
# the input must be a valid pre_morph file
maca-analyse -q morfeusz-kipi -i premorph -o xces < pre_morph.xml > morph.xml
# XML document divided into paragraphs
maca-analyse -q morfeusz-kipi -i premorph-stream -o xces < document.xml > morph.xml
# as above but don't output sentence boundaries
# (useful if they are already marked in the input document.xml)
maca-analyse -q morfeusz-kipi -i premorph-stream-nosent -o xces < document.xml > morph.xml
To analyse multiple pre-morph files at a time (reduced start-up overhead), use maca-analyse-batch (see its help message, -h).
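For a handful of files, a plain shell loop over maca-analyse may be enough. The sketch below assumes a hypothetical helper named analyse_dir and an overridable MACA_ANALYSE variable, neither of which is part of Maca. Each iteration pays the full start-up cost, which is exactly the overhead maca-analyse-batch is meant to avoid.

```shell
# MACA_ANALYSE is an assumption of this sketch; it defaults to maca-analyse.
MACA_ANALYSE="${MACA_ANALYSE:-maca-analyse}"

# analyse_dir CONFIG IN_DIR OUT_DIR   (hypothetical helper)
# Runs every IN_DIR/*.xml through the premorph-stream input mode and writes
# the XCES result under the same file name in OUT_DIR.
analyse_dir() {
    config=$1 in_dir=$2 out_dir=$3
    mkdir -p "$out_dir" || return 1
    for src in "$in_dir"/*.xml; do
        [ -e "$src" ] || continue   # no matches: the glob stays literal
        "$MACA_ANALYSE" -q "$config" -i premorph-stream -o xces \
            < "$src" > "$out_dir/$(basename "$src")" || return 1
    done
}

# Example: analyse_dir morfeusz-kipi premorph/ morph/
```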
maca-analyse is intended to deal with text (possibly partially segmented). The utilities that read morphologically analysed input (maca-convert) offer more input formats, as defined by the corpus2 I/O API. To see the list of possible input formats, issue maca-convert -h (they are listed under the -i option). For instance:
- xces, the XCES format for morphologically analysed files, as used in the IPI PAN Corpus,
- rft, the simple text format as expected by RFTagger (this format is lossy, e.g. lemmas are not stored),
- ccl, a simplified XCES derivative extended with means of storing chunk annotations.
The format may be parametrised by options, e.g. xces,disamb_only reads only interpretations marked with disamb="1".
There is also a range of available output formats, most of them corresponding to the offered input formats. Some of the output formats are lossy, e.g. -o premorph attempts to reproduce the pre_morph XML file containing only text divided into paragraphs. The output formats may also be parametrised by options, e.g. xces,flat prevents XML indenting (useful for Spejd); rft,mbt outputs in the MBT dialect.
MACA may also be used just to convert file formats. This may be accomplished by using maca-convert with the nop conversion routine.
NOTE on outputting in XCES: when analysing plain text, by default the output is not divided into paragraphs (non-“s” chunk nodes). Most applications will probably want the text divided. This is where the -s (--split) option comes in handy: it automatically generates a paragraph boundary each time whitespace containing multiple newlines is encountered. Note that this option does not affect processing of pre-morph-like files, as they already include paragraph boundaries.
API note: the whole I/O framework is available in the corpus2 library, with two simple factories able to create the desired reader/writer based on a user-provided string (e.g. xces,sorttags).
Using Maca as an analyser shell for Morfeusz SIaT and Morfeusz SGJP¶
There are two versions of Morfeusz: SIaT (non-free) and SGJP (two-clause BSD licence). Both of them are distributed as a shared library hard-coded with data.
Normally those two versions cannot be installed side by side: although the versions behave quite differently, both have the same internal name (soname) and version. Unless the soname/version gets fixed, there are two ways to use Morfeusz SGJP:
- using the configurations that assume that the only Morfeusz installed is the official Morfeusz SGJP (recommended; configs: morfeusz-nkjp-official and morfeusz-nkjp-official-guesser),
- if you need both versions coexisting peacefully, please contact the MACA authors; we've got a tweaked version of Morfeusz SGJP (to be used with the morfeusz-nkjp and morfeusz-nkjp-guesser configs).
In either case, to use data from any version of Morfeusz, you have to compile MACA with the Morfeusz plug-in (if the Morfeusz library is discovered during the CMake run, the plug-in will be installed automatically).
The following configurations use the data from Morfeusz SIaT (assuming this version is installed, no tweaks are needed):
- morfeusz: outputs in the unchanged Morfeusz tagset,
- morfeusz-kipi: outputs in the KIPI (IPIC) tagset as in korpus.pl (fewer genders),
- morfeusz-kipi-guesser: as above, but also uses the guesser from TaKIPI's libcorpus1 (Maca must be compiled with the guesser plugin; libcorpus1 must be present during the CMake run).
The following configurations use the data from Morfeusz SGJP:
- sgjp-official: outputs in the original Morfeusz SGJP tagset (many genders),
- morfeusz-nkjp-official: outputs in the real NKJP tagset (fewer genders, recommended for general-purpose usage),
- morfeusz-nkjp-official-guesser: as above, but also uses the guesser from TaKIPI/libcorpus1 (guesser output is converted into the NKJP tagset),
- morfsgjp-kipi: performs a naive conversion into the KIPI tagset; the guesser is not used.
If you're using the tweaked version of Morfeusz SGJP (PWr repo), use the config variants without the word official. Those versions explicitly specify the library soname to be loaded (this soname has been tweaked).
In case of trouble, first inspect the configuration (INI file) and try removing any custom Morfeusz library name from it, if present. You can copy the system-installed configuration file to any local directory and make changes there; MACA searches the current directory first. If this does not fix the problem, please contact the authors.
Using MACA with TaKIPI¶
TaKIPI is able to tag plain text (using -it TXT). This is achieved by using hard-coded tokenisation and sentence splitting rules, and its own Morfeusz wrapper. There are several reasons to prefer external means of performing these tasks, e.g.:
- when tagging plain text, TaKIPI uses data from Morfeusz SIaT, whose licence is quite restrictive,
- TaKIPI segmentation rules are not flexible,
- sentence segmentation heuristics are rudimentary,
- TaKIPI won't split sentences if there is no explicit punctuation mark (whatever the vertical whitespace amount),
- there is no control over morphological analysis (besides guesser on/off switch).
Fortunately, TaKIPI is able to read already tokenised and morphologically analysed input (using -it CORPUS). This is where Maca comes in.
Note: before version 1.8-2 (revision 534) TaKIPI didn't read the sentence division in the input XCES/XML file (the input was re-segmented anyway with libcorpus1's sentencer). To get the benefit of Marcin Miłkowski's SRX rules that are bundled with Maca, make sure that you have a recent version of TaKIPI.
To tag plain text using the Maca-bundled Morfeusz wrapper, issue the following (optionally, use -q to suppress diagnostic messages):
maca-analyse morfeusz-kipi -s -o xces < INPUT_TEXT > out-mor.xml
takipi -i out-mor.xml -o out-tagged.xml -it CORPUS
NOTE that to comply with TaKIPI, you have to use one of the configurations that output in the KIPI tagset (as a matter of fact, the KIPI tagset is hardcoded in TaKIPI sources).
The -s (--split) option forces Maca to divide the text into paragraphs wherever multiple newline characters occur. This is recommended. Note that without this option, the output text will not contain paragraph division at all (the output XML will consist of sentence chunks only; there will be no paragraph chunks). Some utils (including maca-convert) expect paragraphs in the XML.
To use TaKIPI's morphological guesser, use morfeusz-kipi-guesser instead. Note that this requires Maca built with the guesser plug-in (it is probably already installed if TaKIPI was present in the system when you installed Maca).
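The two steps (analysis with Maca, then tagging with TaKIPI) can be wrapped in one function. This is a sketch under stated assumptions: the function name tag_plain_text and the MACA_ANALYSE and TAKIPI variables are invented for this example; the command-line options are exactly those used in this section.

```shell
# Both variables are assumptions of this sketch; they default to the
# command names used in this guide.
MACA_ANALYSE="${MACA_ANALYSE:-maca-analyse}"
TAKIPI="${TAKIPI:-takipi}"

# tag_plain_text INPUT_TEXT OUTPUT_XML [CONFIG]   (hypothetical helper)
# CONFIG must output in the KIPI tagset; it defaults to morfeusz-kipi.
tag_plain_text() {
    input=$1 output=$2 config=${3:-morfeusz-kipi}
    workdir=$(mktemp -d) || return 1
    analysed=$workdir/out-mor.xml
    # Step 1: tokenise and analyse quietly (-q), splitting paragraphs (-s).
    if "$MACA_ANALYSE" -q "$config" -s -o xces < "$input" > "$analysed"; then
        # Step 2: let TaKIPI tag the already analysed input.
        "$TAKIPI" -i "$analysed" -o "$output" -it CORPUS
        status=$?
    else
        status=1
    fi
    rm -rf "$workdir"
    return "$status"
}

# Example: tag_plain_text input.txt out-tagged.xml morfeusz-kipi-guesser
```

The intermediate file lives in a temporary directory and is removed afterwards; keep it instead if you want to inspect the analyser output before tagging.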
Note: If you are interested in processing multiple files at once, there is a util called maca-analyse-batch (consult its --help message). This is a batch-mode version of maca-analyse, suitable for working in tandem with TaKIPI in its own batch mode (-is). Batch processing is recommended, since it reduces start-up overhead (loading of morphological dictionaries, which is especially important when using the guesser plug-in). TODO: add example usage.
To analyse pre-morph-style XML files (i.e. text divided only into paragraphs by XML tags), use the -i premorph-stream mode. E.g.:
maca-analyse -q morfeusz-kipi -i premorph-stream -o xces < premorph.xml > morph.xml
To use the data resulting from the Morfologik conversion (a free Morfeusz alternative), use the morfo1222-ikipi configuration. Note that this configuration outputs in the intermediate “IKIPI” tagset. To have it converted into the IPIC (KIPI) tagset, use maca-convert:
echo "Zjadłaś dwa śledzie. Znikły bez śladu." | maca-analyse morfo1222-ikipi --split -o xces | maca-convert ikipi2kipi.conv -o xces > out-kipi.xml
Tagsets and tagset converter¶
Maca is able to perform simple tagset conversions.
To see different tagsets, run tagset-tool. For example, to see the differences between the KIPI (IPIC) and IKIPI tagsets, compare the output of tagset-tool kipi with that of tagset-tool ikipi. Tagsets are defined by INI files; you can supply your own definitions. To browse the existing ones, check the tagset-tool search path and browse the directory.
The tagset converter (maca-convert) works on streams. Usually it makes sense to set the output to XCES (-o xces). It is parametrised with tagset conversion routines, which are INI files (note that, currently, you must specify the full name including the extension). For instance, to convert from IKIPI to KIPI, use the following:
maca-convert ikipi2kipi.conv < in-ikipi.xml > out-kipi.xml
maca-convert may also be useful as a tool to convert file formats or even dialects of the same format (e.g. to reformat XCES XML and sort tags, which is useful before calling diff). To use it that way, provide nop as the converter name and specify the tagset name (-t name). E.g.
maca-convert -q nop -t kipi -o xces,sorttags < diag.retagd > diag.mretagd
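Normalising two files in this way before comparing them can be scripted. The sketch below uses the same maca-convert invocation as above; the function name xces_diff and the MACA_CONVERT variable are assumptions made for this example.

```shell
# MACA_CONVERT is an assumption of this sketch; it defaults to maca-convert.
MACA_CONVERT="${MACA_CONVERT:-maca-convert}"

# xces_diff TAGSET FILE_A FILE_B   (hypothetical helper)
# Rewrites both files with the nop converter and sorted tags, then diffs the
# results. Returns 0 only if both conversions succeed and the normalised
# files are identical.
xces_diff() {
    tagset=$1 file_a=$2 file_b=$3
    workdir=$(mktemp -d) || return 2
    if "$MACA_CONVERT" -q nop -t "$tagset" -o xces,sorttags < "$file_a" > "$workdir/a.xml" &&
       "$MACA_CONVERT" -q nop -t "$tagset" -o xces,sorttags < "$file_b" > "$workdir/b.xml" &&
       diff "$workdir/a.xml" "$workdir/b.xml"; then
        status=0
    else
        status=$?
    fi
    rm -rf "$workdir"
    return "$status"
}

# Example: xces_diff kipi tagged-by-A.xml tagged-by-B.xml
```

Because both sides go through the same writer, differences that remain in the diff output should reflect content rather than tag order or indentation.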
We also provide three conversion routines related to the NKJP tagset: nkjp2kipi.conv (NKJP to KIPI conversion), morfsgjp2kipi.conv (the actual tagset of Morfeusz SGJP to KIPI) and sgjp2nkjp.conv (actual Morfeusz SGJP to NKJP). Note that the conversion into KIPI is by no means perfect; some additional comments may be found in the conversion routines themselves.
Generating random samples for training and testing of taggers¶
maca-convert may be used to generate N random splits into train and test files:
maca-convert -c nop -t ... -i .... -I ..... -F NUM_SPLITS -f FILE_PREFIX -r FRAC_TRAIN -R FRAC_TEST
e.g.
maca-convert -t nkjp -c nop -I nkjp-whole.xml -F 10 -f ~/nkjp-folds/ -r 0.9
This will generate ~/nkjp-folds/train01.xml, ~/nkjp-folds/test01.xml etc., up to 10, each split consisting of 90% train and 10% test data.
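To feed the generated folds to a tagger, it is convenient to enumerate the train/test pairs. The helper below is a sketch: list_folds is a made-up name, and the two-digit, zero-padded numbering is inferred from the train01.xml and test01.xml names mentioned above rather than from the maca-convert sources.

```shell
# list_folds PREFIX NUM_SPLITS   (hypothetical helper)
# Prints one "TRAIN TEST" pair of paths per split, one pair per line.
# Assumption: files are named PREFIXtrainNN.xml and PREFIXtestNN.xml.
list_folds() {
    prefix=$1 num_splits=$2
    i=1
    while [ "$i" -le "$num_splits" ]; do
        printf '%strain%02d.xml %stest%02d.xml\n' "$prefix" "$i" "$prefix" "$i"
        i=$((i + 1))
    done
}

# Example (paths must not contain spaces for the read below to work):
# list_folds ~/nkjp-folds/ 10 | while read -r train test; do
#     echo "train on $train, evaluate on $test"
# done
```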