Toki — a configurable tokeniser

Toki is a software package that segments running text into tokens and sentences. The tokeniser is targeted at European languages, especially Polish (the default configuration is for Polish).

Most important features:
1. Configurability. The behaviour of a working tokeniser is defined by a configuration file. The file specifies the rules of tokenisation (defined in terms of processing layers) and the labelling of tokens, and may point to a file with sentence-splitting rules.
2. The output tokens are labelled. This allows re-use of some of the information that was needed to make segmentation decisions. For instance, knowing that a token is punctuation may be useful for subsequent stages of processing (e.g. morphological analysis). The labels are also useful internally, as they allow more sophisticated processing rules to be defined.
3. Support for the SRX standard for sentence splitting. The advantage is that working segmentation rules are already available for many languages. To the best of our knowledge, this is the first open-source C++ implementation of SRX.
4. Unicode support. The ICU library is used for this purpose.
5. C++ library with a simple interface. Toki has been implemented as a C++ library, which facilitates linking from software written in other languages. The API has been kept simple to make such integration easy.
6. Simple command-line util to tokenise text (toki-app).

Toki is bundled with Marcin Miłkowski's SRX rules for sentence splitting. These rules (segment.srx) are licensed under GNU LGPL. You may want to check the LanguageTool repo for updates of these rules (we do not guarantee that the bundled version is the latest).

For a description of project background and typical usage scenarios, please refer to the paper: Adam Radziszewski and Tomasz Śniatowski, “Maca: a configurable tool to integrate Polish morphological data”, FreeRBMT11

@inproceedings{maca,
  author = {Adam Radziszewski and Tomasz \'{S}niatowski},
  title = {Maca --- a configurable tool to integrate {P}olish morphological data},
  booktitle = {Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation},
  year = {2011},
  location = {Barcelona, Spain}
}

Acknowledgement: this work is financed by the Innovative Economy Programme project POIG.01.01.02-14-013/09.

Obtaining the code and installation

Toki has been released under GNU LGPL 3.0. The sources may be obtained from the git repositories:

git clone http://nlp.pwr.wroc.pl/corpus2.git # contains pwrutils library that is needed for building toki
git clone http://nlp.pwr.wroc.pl/toki.git

To build the code you will need CMake 2.8 or later. In addition, you will need:

  • ICU 4.2
  • Boost 1.41 or later (tested with 1.41 and 1.42)
  • Loki (libloki-dev)
  • libxml++2.6 (for SRX support)
  • libpwrutils from corpus2 repository (its build process is based on CMake, see the project site)

First install the above libraries along with their headers, then proceed to the installation of Toki. For details, see the INSTALL file in the repository.
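For instance, a typical out-of-source CMake build of Toki could look as follows (only a sketch assuming no extra options are needed; the INSTALL file is authoritative):

cd toki
mkdir build && cd build
cmake ..
make
sudo make install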

Using toki as a standalone utility

After installation, a simple util called toki-app will be available. It reads an input stream and generates tokenised output. Run toki-app --help for a list of available options and operation modes.

The desired output format can be selected by providing a format string. The format string may be defined in a configuration file ([debug] section, format=…); an example excerpt is shown after the list below. If it is not provided there, a default one is used. In either case, you can override this setting by explicitly providing your own format string with -f. The format string may contain escape sequences such as \n for newline or \t for tab. In addition, the following substitutions are made:
• $orth → the token's orth
• $type → the token's type
• $bs_01 → 1 if the token has the begins_sentence flag set, else 0
• $bs| → ‘|’ if token begins sentence, empty string otherwise
• $bs → ‘bs’ if token begins sentence, empty string otherwise
• $ws → the string name of the preceding whitespace amount, e.g. ‘none’, ‘space’ or ‘newline’
• $ws_id → the numeric code of the preceding whitespace, 0 for no whitespace
• $ws_any → 0 if there was any whitespace preceding the token, else 1
• $ws_ws → the whitespace characters that came before the token, reproduced literally (actual spaces / newlines)
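For instance, the default format could be set in a configuration file roughly like this (a hypothetical excerpt assuming the INI-like syntax of toki configs; see Writing_configs.txt for the actual syntax):

[debug]
format=$orth\t$type\n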

Example using the default configuration (for Polish, using bundled Marcin Miłkowski's sentence splitting rules):

$ echo "Nie chcemy XML-a. Nie dziś." | toki-app -q -f "\$bs|\t\$type\t\$ws\t\$orth\n" 

|      t    newline     Nie
       t    space       chcemy
       th   space       XML-a
       p    none        .
|      t    space       Nie
       t    space       dziś
       p    none        .

Example using toki-app for sentence splitting only:

toki-app -S config/segment.srx -l pl_one --srx-begin-marker="[[[" --srx-end-marker="]]]" < input.txt > sentenced.txt

This call loads the given SRX file, uses the given language id (pl_one here) and marks sentence boundaries with the given markers. If no markers are given, the defaults consist of newlines.
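For instance, for the input from the earlier example (Nie chcemy XML-a. Nie dziś.) the output would look roughly as follows (only an illustration; the exact placement of whitespace around the markers may differ):

[[[Nie chcemy XML-a.]]] [[[Nie dziś.]]]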

Using toki as a shared library (libtoki)

The library defines a simple API; its full documentation is available in Doxygen format. Typical usage scenarios are outlined below.
1. To create a working tokeniser, instantiate Toki::LayerTokenizer. There are several constructors available; the simplest one uses the default configuration (for Polish). To access a named configuration, use Toki::get_named_config(config_name) and pass the acquired object to the Toki::LayerTokenizer constructor.
2. To create a working tokeniser with a sentence splitter, first instantiate a Toki::LayerTokenizer object and then wrap a Toki::SentenceSplitter around it. The sentencer object exposes a convenient has_more / get_next_sentence interface. The default config loads sentence-splitting rules, so it is suitable for this purpose; a code sketch follows the note below.
NOTE: when using a custom config, check whether it contains working sentence-splitting rules. If it doesn't, Toki::SentenceSplitter will buffer all the input and finally produce one enormous sentence containing all the tokens.
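The sketch below puts these pieces together: it tokenises standard input with the default configuration and prints one sentence per line. It is only an illustration; the header paths, the istream-taking constructor, the return type of get_next_sentence and the orth_utf8 accessor are assumptions made here, so consult the Doxygen documentation for the actual signatures.

#include <libtoki/tokenizer/layertokenizer.h>
#include <libtoki/sentencesplitter.h>
#include <iostream>
#include <vector>

int main() {
    // Tokeniser with the default (Polish) configuration, reading from stdin.
    // A named configuration could be obtained with Toki::get_named_config(name)
    // and passed to another constructor, as described above.
    Toki::LayerTokenizer tok(std::cin);

    // Sentence splitter wrapped around the tokeniser; the default config
    // ships with working sentence-splitting rules.
    Toki::SentenceSplitter sentencer(tok);

    while (sentencer.has_more()) {
        // Assumed to return pointers owned by the caller -- check the docs.
        std::vector<Toki::Token*> sentence = sentencer.get_next_sentence();
        for (size_t i = 0; i < sentence.size(); ++i) {
            std::cout << sentence[i]->orth_utf8() << " ";
            delete sentence[i];
        }
        std::cout << "\n";
    }
    return 0;
}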

The available configs are in the ‘config’ subdir of the repository. For reference, see Writing_configs.txt.

Reporting bugs

Please help improve Toki by reporting bugs, missing documentation and feature requests.