Custom dictionaries
This tutorial explains how to build a custom dictionary, compile it into an SFST transducer and write a Maca configuration that uses it.
It is assumed that Maca is compiled with SFST support; for details, see the installation instructions on the main page. Also note that SFST is licensed under the GNU GPL, hence compiling with SFST support renders the whole Maca package licensed under the GNU GPL (without SFST the licence is more liberal: GNU LGPL 3.0).
SFST (Stuttgart Finite State Transducer Tools) is a library and a set of utilities for handling transducers.
More info: http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html
Using transducers gives compact files, lower memory load and efficient processing. It is recommended to use compiled transducers unless you are prototyping or working with small morphological data files (in such cases, consider using MapAnalyser). Here we will use SFST.
File format of morphological dictionaries
Input files are suitable for both SFSTAnalyser (class=sfst in Maca config files) and MapAnalyser (class=map, map-case, hashmap or hashmap-case). The file format is simple: one entry per line, each line consisting of three fields delimited by white space, preferably a single tab (this may make a difference for some scripts):
form lemma tags
The file MUST be encoded in UTF-8.
Form is the orthographic form (possibly inflected) as generated by the tokeniser or read from tokenised input. Lemma is the dictionary base form. Tags is a specification of a set of tags, possibly representing one or many tags. The simplest way is to list exactly one tag per entry and use the tabclean.py script to generate the compacted form automatically. We will follow this way here.
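The three-field layout can be illustrated with a short Python sketch; the parser below is only an illustration of the format, not code shipped with Maca:

```python
# Parse one line of a tab-separated morphological dictionary:
# form, lemma, tags. Illustrative sketch only.

def parse_entry(line):
    """Split one dictionary line into (form, lemma, tags)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 tab-separated fields, got %d" % len(fields))
    return tuple(fields)

entry = parse_entry("długoogon\tdługoogon\tsubst:sg:nom:m2\n")
print(entry)  # ('długoogon', 'długoogon', 'subst:sg:nom:m2')
```

Note the UTF-8 requirement: any script processing such files must read them as UTF-8.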
In the compacted form, multiple tags are separated by the plus character (+) with no surrounding spaces. Besides, alternative values for one category may be separated by the dot character (.). When all possible values of a category (as defined by the tagset) are desired, the list may be shortened to an underscore (_). Examples (assuming the nkjp tagset):
subst:sg.pl:nom.acc:f
will be expanded to 4 tags:
subst:sg:nom:f
subst:pl:nom:f
subst:sg:acc:f
subst:pl:acc:f
The same four tags may be represented by subst:_:nom.acc:f (as sg and pl are all the possible values of number, the first attribute for nouns).
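The dot and underscore shorthands amount to a Cartesian product over the per-position alternatives. A minimal Python sketch, assuming a toy attribute-value table (a stand-in for a real tagset definition; the real expansion is driven by the tagset loaded by Maca):

```python
from itertools import product

# Toy stand-in for a tagset definition: possible values per tag position
# (position 1 = number, position 2 = case, in nkjp-style noun tags).
ALL_VALUES = {1: ["sg", "pl"],
              2: ["nom", "gen", "dat", "acc", "inst", "loc", "voc"]}

def expand(tag):
    """Expand one compact tag into the list of plain tags it represents."""
    parts = []
    for i, part in enumerate(tag.split(":")):
        if part == "_":
            parts.append(ALL_VALUES[i])        # wildcard: every value of the attribute
        else:
            parts.append(part.split("."))      # dot-separated alternatives
    return [":".join(combo) for combo in product(*parts)]

print(expand("subst:sg.pl:nom.acc:f"))
# 4 tags: subst:sg:nom:f, subst:sg:acc:f, subst:pl:nom:f, subst:pl:acc:f
```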
Note that this is the same format that may be read by tagset-tool with the -p switch (e.g. tagset-tool -p nkjp).
For more examples, see the .txt files included in the data directory distributed with Maca sources.
Note on duplicates: if the morphological data contain duplicated tags (specified explicitly or via wildcard representations that evaluate to duplicates), this will result in duplicated tags in the analyser output. To clean such duplicates, use the tabclean.py script.
Note on case sensitivity: if the morphological data are intended to be used in a case-insensitive manner, either prepare them that way (e.g. all forms lower-case) or use the tabclean.py script to convert the data (all the scripts are inside the tools directory). Remember to set lower-case=true for the SFSTAnalyser in the config (if using MapAnalyser, choose a class with the desired case sensitivity).
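The clean-up described in both notes can be approximated in a few lines of Python. This is only a sketch of the idea (lower-casing forms and dropping duplicate entries); the real tabclean.py has more options and also compacts tags:

```python
def clean(entries, lower_case=True):
    """Lower-case forms and drop duplicate (form, lemma, tag) entries.

    Rough approximation of part of what tabclean.py does;
    illustrative only.
    """
    seen = set()
    result = []
    for form, lemma, tag in entries:
        if lower_case:
            form = form.lower()
        key = (form, lemma, tag)
        if key not in seen:       # keep only the first occurrence
            seen.add(key)
            result.append(key)
    return result

raw = [
    ("Pazurogon", "pazurogon", "subst:sg:nom:m2"),
    ("pazurogon", "pazurogon", "subst:sg:nom:m2"),  # duplicate after lower-casing
]
print(clean(raw))  # [('pazurogon', 'pazurogon', 'subst:sg:nom:m2')]
```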
Let's assume we want to add all the inflected forms of two Polish nouns: pazurogon and długoogon. We write a dictionary file myitems.tab with the following contents (the field delimiter is actually a TAB character, which might be incorrectly rendered below):
długoogon	długoogon	subst:sg:nom:m2
długoogona	długoogon	subst:sg:gen:m2
długoogonowi	długoogon	subst:sg:dat:m2
długoogona	długoogon	subst:sg:acc:m2
długoogonie	długoogon	subst:sg:loc:m2
długoogonem	długoogon	subst:sg:inst:m2
długoogonie	długoogon	subst:sg:voc:m2
długoogony	długoogon	subst:pl:nom:m2
długoogonów	długoogon	subst:pl:gen:m2
długoogonom	długoogon	subst:pl:dat:m2
długoogony	długoogon	subst:pl:acc:m2
długoogonach	długoogon	subst:pl:loc:m2
długoogonami	długoogon	subst:pl:inst:m2
długoogony	długoogon	subst:pl:voc:m2
pazurogon	pazurogon	subst:sg:nom:m2
pazurogona	pazurogon	subst:sg:gen:m2
pazurogonowi	pazurogon	subst:sg:dat:m2
pazurogona	pazurogon	subst:sg:acc:m2
pazurogonie	pazurogon	subst:sg:loc:m2
pazurogonem	pazurogon	subst:sg:inst:m2
pazurogonie	pazurogon	subst:sg:voc:m2
pazurogony	pazurogon	subst:pl:nom:m2
pazurogonów	pazurogon	subst:pl:gen:m2
pazurogonom	pazurogon	subst:pl:dat:m2
pazurogony	pazurogon	subst:pl:acc:m2
pazurogonach	pazurogon	subst:pl:loc:m2
pazurogonami	pazurogon	subst:pl:inst:m2
pazurogony	pazurogon	subst:pl:voc:m2
Compacting and cleaning dictionaries
Before the actual compilation into a transducer you may want to clean up your dictionary: remove duplicates, possibly convert letter case and compress to the compact tag representation. All this may be done with the tabclean.py helper script, located in the tools directory of the source distribution (as obtained from the Git repository).
For details, run tabclean.py --help. Here we'll assume that we don't want to change the letter case of forms or lemmas. We will also assume that the input file is not sorted (hence we'll be using the -u switch). Note that for large files it may be better to sort the dictionary beforehand with the Unix sort command and run the script in the default (sorted) mode, since -u forces the script to read the whole dictionary into memory before the actual processing.
We use the following call:
tabclean.py -u myitems.tab myitems-clean.tab
This will generate the myitems-clean.tab dictionary in the same TAB format. Note the way entries are compacted using the dot shorthand, e.g.:
długoogon	długoogon	subst:sg:nom:m2
długoogona	długoogon	subst:sg:acc.gen:m2
…
długoogonie	długoogon	subst:sg:loc.voc:m2
długoogonom	długoogon	subst:pl:dat:m2
długoogonowi	długoogon	subst:sg:dat:m2
długoogony	długoogon	subst:pl:acc.nom.voc:m2
…
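The dot shorthand above can be produced mechanically: tags sharing the same form and lemma and identical values on all attributes but one are merged, with the differing values joined by dots. A simplified sketch of this merging step (the real tabclean.py logic is more general):

```python
from collections import defaultdict

def compact(tags, position):
    """Merge tags that are identical everywhere except at `position`,
    joining the differing values with dots in sorted order."""
    groups = defaultdict(list)
    for tag in tags:
        parts = tag.split(":")
        # key = the tag with the chosen position removed
        key = tuple(parts[:position] + parts[position + 1:])
        groups[key].append(parts[position])
    merged = []
    for key, values in groups.items():
        parts = list(key)
        parts.insert(position, ".".join(sorted(set(values))))
        merged.append(":".join(parts))
    return sorted(merged)

print(compact(["subst:sg:gen:m2", "subst:sg:acc:m2"], position=2))
# ['subst:sg:acc.gen:m2']
```

This is how subst:sg:gen:m2 and subst:sg:acc:m2 collapse into subst:sg:acc.gen:m2 in the output shown above.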
Compiling into an SFST transducer
We provide a convenience script, tools/tab-to-sfst-now, which does the compilation job.
./tab-to-sfst-now work/myitems-clean.tab
This will generate the binary file myitems-clean.tab.fst containing the transducer.
Note: if you want your compiled transducers added to the standard Maca distribution when the user issues make install, use the .fst extension and place them within the data subdir of the repository. The same goes for ini files (Maca configs) and conv files (tagset conversion routines). If you just want to install them system-wide without adding them to the repository, also place those files in the data subdir and run make install. Without system-wide installation Maca will still be able to find the files in your current directory.
Updating analyser config
Let's assume we want to add this dictionary to the standard morfeusz-nkjp-official config in such a way that the Morfeusz dictionary is consulted first and, if the sought form is not there, Maca resorts to our additional dictionary.
Let's make a copy of the morfeusz-nkjp-official config:
cd data
cp morfeusz-nkjp-official.ini nkjp-with-additions.ini
# make sure the compiled transducer is also here
cp WHEREITWAS/myitems-clean.tab.fst .
Now take a look at the config (open nkjp-with-additions.ini in a text editor).
The INI file defines a [general] section that specifies the tagset and the Toki (tokeniser) config to be used. Sections defining particular analysers follow ([ma:…]). These sections just define named analysers that may be incorporated into the processing chain.
The processing chain definition is contained within [rule] sections and the [default] section. A [rule] section may be repeated; it defines a processing chain attached to a particular token type as recognised by the employed Toki configuration. Toki's job is to find token and sentence boundaries in the input stream. It is also able to perform primitive classification of the tokens using token types.
Here, p is a token type for punctuation and all punctuation tokens are treated with a special processing chain that instantly assigns the interp tag without even calling Morfeusz. This is achieved by referencing the analyser named interp, whose behaviour is defined in the [ma:interp] section. Note that this section employs a ConstAnalyser (class=const). A similar processing chain is used to treat URLs: they all receive subst:sg:nom:m3 tags.
The [default] section defines the processing chain for all token types other than those handled by the rules.
[default]
ma=morfeusz
ma=unknown
Here two analysers are fired sequentially: if the analyser defined in the [ma:morfeusz] section fails, [ma:unknown] is fired.
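The fall-through behaviour of such a chain can be sketched in Python (the dict-backed "analysers" below are hypothetical stand-ins for Maca's real analyser classes):

```python
# Sketch of a Maca-style processing chain: analysers are tried in order
# and the first one returning a non-empty analysis wins.

def analyse(token, chain):
    for analyser in chain:
        result = analyser(token)
        if result:          # analyser succeeded: stop the chain
            return result
    return []

# Toy stand-ins for the analysers named in the config sections.
morfeusz = {"kot": [("kot", "subst:sg:nom:m2")]}.get
mydict = {"pazurogon": [("pazurogon", "subst:sg:nom:m2")]}.get
unknown = lambda token: [(token, "ign")]   # always succeeds (the fallback)

print(analyse("pazurogon", [morfeusz, mydict, unknown]))
# [('pazurogon', 'subst:sg:nom:m2')] -- found by the second analyser
```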
For more details on writing configs, consult the doc/Writing_configs.txt file.
Our task is to let Maca check Morfeusz first and, if it fails, check our transducer. If that fails too, the fallback, [ma:unknown], should fire. So, we'll modify the [default] section to the following content:
[default]
ma=morfeusz
ma=mydict
ma=unknown
Also, we will add a section dedicated for the transducer-based analyser:
[ma:mydict]
class=sfst
tagset=nkjp
file=myitems-clean.tab.fst
lower-case=true
Note how we made the transducer case-insensitive (lower-case=true).
The whole config should look like this: nkjp-with-additions.ini
Testing the config
Now you can test the analyser.
echo 'Spotkanie z pazurogonem jest milsze niż jedzone przez długoogona owady.' | maca-analyse nkjp-with-additions -qs
Note the difference in comparison to the standard config:
echo 'Spotkanie z pazurogonem jest milsze niż jedzone przez długoogona owady.' | maca-analyse morfeusz-nkjp-official -qs
If you encounter the following error:
Error: dlopen error while loading maca plugin 'sfst' (libmaca_sfst.so): libmaca_sfst.so: cannot open shared object file: No such file or directory
…it means that the SFST plugin is not installed properly. You'll need to make sure SFST itself (1.2) is installed and then re-install Maca with SFST support turned on (this requires the BUILD_GPL_PLUGINS and SFST_PLUGIN CMake flags to be turned on). Remember to run ldconfig after installation. Refer to the installation instructions for details.
Using with tagger and chunker
WCRFT (both WCRFT1 and WCRFT2) allows overriding the standard Maca configuration associated with a WCRFT configuration.
For instance (in WCRFT2):
echo 'Spotkanie z pazurogonem jest milsze niż jedzone przez długoogona owady.' | wcrft-app nkjp_e2 -i txt -m nkjp-with-additions -
Note that the Maca config is referenced without the .ini suffix.
If you plan to use the extended dictionary frequently, it is best to install both the transducer (the .fst file) and the new Maca config (the .ini file) system-wide. If the licence of the data is permissive and the dataset may be generally useful, consider also adding it to the Maca repository. It may then also be best to copy the WCRFT config and change its default Maca config to yours.
For instance, if you open the standard nkjp_e2 config (e.g. gedit /usr/local/share/wcrft/config/nkjp_e2.ini), you'll notice the macacfg property.
[general]
tagset = nkjp
; all the attrs
attrs = CLASS,nmb,cas,gnd,asp ; acm,dot could be useful for unknown
macacfg = morfeusz-nkjp-official
defaultmodel = model_nkjp10_wcrft_e2
You can copy the config and save it under your own name. This will also allow you to use your extended Maca configuration within the tagger configuration, which in turn may be used from other applications.