Custom dictionaries

This tutorial explains how to build a custom dictionary, compile it into an SFST transducer and write a Maca configuration that uses it.

It is assumed that Maca is compiled with SFST support; for details, see the installation instructions on the main page. Also note that SFST is licensed under the GNU GPL, hence compiling with SFST support will render the whole Maca package licensed under the GNU GPL (without SFST the licence is more liberal: GNU LGPL 3.0).

SFST (Stuttgart Finite State Transducer Tools) is a library and a set of utilities for handling transducers.
More info: http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html

Using transducers yields compact files, lower memory load and efficient processing. It is recommended to use compiled transducers unless you are prototyping or working with small morphological data files (in such cases, consider using MapAnalyser). Here we will use SFST.

File format of morphological dictionaries

Input files are suitable for both SFSTAnalyser (class=sfst in Maca config files) and MapAnalyser (class=map, map-case, hashmap or hashmap-case). The file format is simple: one entry per line, each line consisting of three fields delimited by white space, preferably a single tab (this may make a difference for some scripts):

form    lemma    tags

The file MUST be encoded in UTF-8.
Form is the orthographic form (possibly inflected) as generated by the tokeniser or read from tokenised input. Lemma is the dictionary base form. Tags is a specification of a set of tags, representing one or many tags. The simplest way is to list exactly one tag per entry and use the tabclean.py script to generate the compacted form automatically. We will follow this approach here.
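To make the three-field format concrete, here is a minimal Python sketch that parses one line of such a file. This is an illustration only, not part of Maca; the helper name is made up.

```python
# Illustrative sketch: parse one line of a TAB-format dictionary into a
# (form, lemma, tags) triple. Fields are separated by a TAB character.

def parse_tab_line(line):
    form, lemma, tags = line.rstrip("\n").split("\t")
    return form, lemma, tags

entry = parse_tab_line("długoogon\tdługoogon\tsubst:sg:nom:m2\n")
print(entry)
# ('długoogon', 'długoogon', 'subst:sg:nom:m2')
```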

In the compacted form, multiple tags are separated by a plus character (+) with no surrounding spaces. Additionally, alternative values for one attribute may be separated by a dot character (.). When all possible values of an attribute (as defined by the tagset) are desired, the list may be shortened to an underscore (_). Examples (assuming the nkjp tagset):

  • subst:sg.pl:nom.acc:f will be expanded to 4 tags:
    • subst:sg:nom:f
    • subst:pl:nom:f
    • subst:sg:acc:f
    • subst:pl:acc:f

The same four tags may be represented as subst:_:nom.acc:f (since sg and pl are all the possible values of number, which is the first attribute for nouns).

Note that this is the same format as may be read by tagset-tool with the -p switch (e.g. tagset-tool -p nkjp).
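The expansion rules above can be sketched in a few lines of Python. This is an illustration only; in Maca the expansion is driven by the tagset definition, whereas here the attribute values for the underscore shorthand are passed in by hand.

```python
from itertools import product

def expand_tag(spec, values_per_attr=None):
    """Expand a compacted tag spec like 'subst:sg.pl:nom.acc:f' into full tags.

    values_per_attr maps attribute position to all its values (for the '_'
    shorthand); in Maca this comes from the tagset definition -- here we
    supply it manually for illustration.
    """
    parts = spec.split(":")
    alternatives = []
    for i, part in enumerate(parts):
        if part == "_":
            alternatives.append(values_per_attr[i])
        else:
            alternatives.append(part.split("."))
    # Cartesian product over the alternative values of each attribute.
    return [":".join(combo) for combo in product(*alternatives)]

print(expand_tag("subst:sg.pl:nom.acc:f"))
# ['subst:sg:nom:f', 'subst:sg:acc:f', 'subst:pl:nom:f', 'subst:pl:acc:f']
```

The underscore shorthand yields the same set: `expand_tag("subst:_:nom.acc:f", {1: ["sg", "pl"]})`.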

For more examples, see the .txt files included in the data directory distributed with Maca sources.

Note on duplicates: if the morphological data contain duplicated tags (specified explicitly or via wildcard representations that evaluate to duplicates), this will result in duplicated tags in the analyser output. To remove such duplicates, use the tabclean.py script.

Note on case sensitivity: if the morphological data are intended to be used in a case-insensitive manner, either prepare them that way (e.g. all forms lower-case) or use the tabclean.py script to convert the data (all the scripts are in the tools directory). Remember to set lower-case=true for the SFSTAnalyser in the config (if using MapAnalyser, choose a class with the desired case sensitivity).
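The clean-up described in the two notes above can be sketched as follows. This illustrates the idea only; it is not the actual tabclean.py code.

```python
# Sketch of the kind of clean-up tabclean.py performs: optional lower-casing
# of forms and removal of duplicate entries (not the actual script).

def clean_entries(entries, lower=False):
    seen = set()
    out = []
    for form, lemma, tag in entries:
        if lower:
            form = form.lower()
        key = (form, lemma, tag)
        if key not in seen:       # drop exact duplicates
            seen.add(key)
            out.append(key)
    return out

raw = [
    ("Pazurogon", "pazurogon", "subst:sg:nom:m2"),
    ("pazurogon", "pazurogon", "subst:sg:nom:m2"),
]
print(clean_entries(raw, lower=True))
# [('pazurogon', 'pazurogon', 'subst:sg:nom:m2')]
```

Note that without lower-casing the two entries above are distinct and both survive; with lower-casing they collapse into one.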

Let's assume we want to add all the inflected forms of two Polish nouns: pazurogon and długoogon. We write a dictionary file myitems.tab with the following contents (the field delimiter is actually a TAB character, which might be rendered incorrectly below):

długoogon    długoogon    subst:sg:nom:m2
długoogona    długoogon    subst:sg:gen:m2
długoogonowi    długoogon    subst:sg:dat:m2
długoogona    długoogon    subst:sg:acc:m2
długoogonie    długoogon    subst:sg:loc:m2
długoogonem    długoogon    subst:sg:inst:m2
długoogonie    długoogon    subst:sg:voc:m2
długoogony    długoogon    subst:pl:nom:m2
długoogonów    długoogon    subst:pl:gen:m2
długoogonom    długoogon    subst:pl:dat:m2
długoogony    długoogon    subst:pl:acc:m2
długoogonach    długoogon    subst:pl:loc:m2
długoogonami    długoogon    subst:pl:inst:m2
długoogony    długoogon    subst:pl:voc:m2
pazurogon    pazurogon    subst:sg:nom:m2
pazurogona    pazurogon    subst:sg:gen:m2
pazurogonowi    pazurogon    subst:sg:dat:m2
pazurogona    pazurogon    subst:sg:acc:m2
pazurogonie    pazurogon    subst:sg:loc:m2
pazurogonem    pazurogon    subst:sg:inst:m2
pazurogonie    pazurogon    subst:sg:voc:m2
pazurogony    pazurogon    subst:pl:nom:m2
pazurogonów    pazurogon    subst:pl:gen:m2
pazurogonom    pazurogon    subst:pl:dat:m2
pazurogony    pazurogon    subst:pl:acc:m2
pazurogonach    pazurogon    subst:pl:loc:m2
pazurogonami    pazurogon    subst:pl:inst:m2
pazurogony    pazurogon    subst:pl:voc:m2

Compacting and cleaning dictionaries

Before the actual compilation into a transducer you may want to clean up your dictionary: remove duplicates, possibly convert letter case and compress to the compact tag representation. All this may be done with the tabclean.py helper script, located in the tools directory of the source distribution (as obtained from the Git repository).

For details, run tabclean.py --help. Here we'll assume that we don't want to change the letter case of forms or lemmas. We will also assume that the input file is not sorted (hence the -u switch). Note that for large files it may be better to sort the dictionary beforehand with the Unix sort command and run the script in its default (sorted) mode, since -u forces the script to read the whole dictionary into memory before processing.

We use the following call:

tabclean.py -u myitems.tab myitems-clean.tab

This will generate the myitems-clean.tab dictionary in the same TAB format. Note how entries are compacted using the dot shorthand, e.g.:

długoogon    długoogon    subst:sg:nom:m2
długoogona    długoogon    subst:sg:acc.gen:m2
…
długoogonie    długoogon    subst:sg:loc.voc:m2
długoogonom    długoogon    subst:pl:dat:m2
długoogonowi    długoogon    subst:sg:dat:m2
długoogony    długoogon    subst:pl:acc.nom.voc:m2
…
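The compaction shown above can be sketched in Python. This illustration assumes the fixed class:number:case:gender layout of the noun tags in our file and merges only the case attribute; tabclean.py handles the general case.

```python
from collections import defaultdict

# Sketch of the dot-shorthand compaction: tags sharing form, lemma, class,
# number and gender are merged into one tag whose case values are joined
# with dots (sorted alphabetically, as in the tabclean.py output above).
# Assumes the class:number:case:gender layout of noun tags.

def compact(entries):
    groups = defaultdict(set)
    for form, lemma, tag in entries:
        cls, number, case, gender = tag.split(":")
        groups[(form, lemma, cls, number, gender)].add(case)
    out = []
    for (form, lemma, cls, number, gender), cases in sorted(groups.items()):
        tag = ":".join([cls, number, ".".join(sorted(cases)), gender])
        out.append((form, lemma, tag))
    return out

entries = [
    ("długoogona", "długoogon", "subst:sg:gen:m2"),
    ("długoogona", "długoogon", "subst:sg:acc:m2"),
]
print(compact(entries))
# [('długoogona', 'długoogon', 'subst:sg:acc.gen:m2')]
```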

Compiling into an SFST transducer

We provide a convenience script, tools/tab-to-sfst-now, which does the compilation:

./tab-to-sfst-now work/myitems-clean.tab

This will generate the binary file myitems-clean.tab.fst containing the transducer.

Note: if you want your compiled transducers to be added to the standard Maca distribution when the user issues make install, use the .fst extension and place them in the data subdir of the repository. The same goes for ini files (Maca configs) and conv files (tagset conversion routines). If you just want to install them system-wide without adding them to the repository, also place the files in the data subdir and run make install. Without system-wide installation Maca will still be able to find the files in your current directory.

Updating the analyser config

Let's assume we want to add this dictionary to the standard morfeusz-nkjp-official config in such a way that the Morfeusz dictionary is consulted first, and if the sought form is not there, Maca falls back to our additional dictionary.

Let's make a copy of the morfeusz-nkjp-official config:

cd data
cp morfeusz-nkjp-official.ini nkjp-with-additions.ini
# make sure the compiled transducer is also here
cp WHEREITWAS/myitems-clean.tab.fst .

Now take a look at the config (open nkjp-with-additions.ini in a text editor).
The INI file defines a [general] section that specifies the tagset and the Toki (tokeniser) config to be used. Sections defining particular analysers follow ([ma:…]); these just define named analysers that may be incorporated into the processing chain.
The processing chain itself is defined in [rule] sections and the [default] section. A [rule] section may be repeated; it defines the processing chain attached to a particular token type as recognised by the employed Toki configuration. Toki's job is to find token and sentence boundaries in the input stream; it can also perform a primitive classification of tokens into token types.

Here, p is the token type for punctuation, and all punctuation tokens are handled by a special processing chain that immediately assigns the interp tag without even consulting Morfeusz. This is achieved by referencing the analyser named interp, whose behaviour is defined in the [ma:interp] section. Note that this section employs a ConstAnalyser (class=const). A similar processing chain is used for URLs: they all receive subst:sg:nom:m3 tags.

The [default] section defines the processing chain for all token types other than those handled by the rules.

[default]
    ma=morfeusz
    ma=unknown

Here two analysers are fired sequentially: if the analyser defined in the [ma:morfeusz] section fails, the one from [ma:unknown] is fired.

For more details on writing configs consult doc/Writing_configs.txt file.

Our task is to let Maca check Morfeusz first and, if that fails, check our transducer; if that also fails, the fallback ([ma:unknown]) should fire. So we'll modify the [default] section to the following content:

[default]
    ma=morfeusz
    ma=mydict
    ma=unknown

We will also add a section dedicated to the transducer-based analyser:

[ma:mydict]
    class=sfst
    tagset=nkjp
    file=myitems-clean.tab.fst
    lower-case=true

Note how we made the transducer case-insensitive (lower-case=true).
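The resulting three-step fallback can be sketched as follows. This is a Python illustration with made-up mini-dictionaries, not Maca's actual implementation; ign is the tag conventionally given to unknown tokens.

```python
# Sketch of the [default] chain semantics: each analyser is tried in order
# and the first one that yields an analysis wins. The dictionaries here are
# tiny stand-ins for Morfeusz and our transducer.

def analyse(form, chain):
    for name, lookup in chain:
        tags = lookup(form)
        if tags:
            return name, tags
    return None

morfeusz_dict = {"owady": ["subst:pl:nom:m3"]}
my_dict = {"pazurogonem": ["subst:sg:inst:m2"]}

chain = [
    ("morfeusz", morfeusz_dict.get),
    ("mydict", my_dict.get),
    ("unknown", lambda form: ["ign"]),  # fallback: mark the token as unknown
]

print(analyse("pazurogonem", chain))
# ('mydict', ['subst:sg:inst:m2'])
```

A form known to Morfeusz never reaches our dictionary, and a form unknown to both falls through to the ign fallback, mirroring the ma=morfeusz, ma=mydict, ma=unknown order in the config.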

The whole config should look like this: nkjp-with-additions.ini

Testing the config

Now you can test the analyser.

echo 'Spotkanie z pazurogonem jest milsze niż jedzone przez długoogona owady.' | maca-analyse nkjp-with-additions -qs

Note the difference in comparison to the standard config:

echo 'Spotkanie z pazurogonem jest milsze niż jedzone przez długoogona owady.' | maca-analyse morfeusz-nkjp-official -qs

If you encounter the following error:

Error: dlopen error while loading maca plugin 'sfst' (libmaca_sfst.so): libmaca_sfst.so: cannot open shared object file: No such file or directory

…it means that the SFST plugin is not installed properly; you'll need to make sure SFST itself (1.2) is installed and then re-install Maca with SFST support turned on (this requires the BUILD_GPL_PLUGINS and SFST_PLUGIN CMake flags). Remember to run ldconfig after installation. Refer to the installation instructions for details.

Using with tagger and chunker

WCRFT (both WCRFT1 and WCRFT2) allows overriding the standard Maca configuration associated with a WCRFT configuration.

For instance (in WCRFT2):

echo 'Spotkanie z pazurogonem jest milsze niż jedzone przez długoogona owady.' | wcrft-app nkjp_e2 -i txt -m nkjp-with-additions -

Note that the Maca config is referenced without the .ini suffix.

If you plan to use the extended dictionary frequently, it is best to install both the transducer (.fst file) and the new Maca config (.ini file) system-wide. If the licence of the data is permissive and the dataset may be generally useful, consider also adding it to the Maca repository. In that case it may also be best to copy the WCRFT config and change its default Maca config to your Maca config there.
For instance, if you open the standard nkjp_e2 config (e.g. gedit /usr/local/share/wcrft/config/nkjp_e2.ini), you'll notice the macacfg property.

[general]
tagset   = nkjp
; all the attrs
attrs = CLASS,nmb,cas,gnd,asp
; acm,dot could be useful for uknown
macacfg = morfeusz-nkjp-official
defaultmodel = model_nkjp10_wcrft_e2

You can copy the config under your own name and point macacfg at your extended Maca configuration. This lets the tagger configuration use your dictionary, which in turn may be used from other applications.

myitems.tab - Example morphological dictionary in TAB format (1,065 KB) Adam Radziszewski, 24 Sep 2014 14:11

myitems-clean.tab - Example morphological dictionary after running tabclean.py (816 B) Adam Radziszewski, 24 Sep 2014 14:25

myitems-clean.tab.fst - Example morphological dictionary compiled into SFST transducer (549 B) Adam Radziszewski, 24 Sep 2014 15:09

nkjp-with-additions.ini - Example Maca config based on morfeusz-nkjp-official using a transducer (725 B) Adam Radziszewski, 24 Sep 2014 15:09