Toki is a configurable tokeniser, i.e. a module for segmentation of running text into tokens (word-like units) and sentences.
Some interesting features:
- Unicode support
- C++ implementation (hence low start-up time)
- tokenisation rules are defined in simple INI-style files
- tokens are preclassified with labels, e.g. to recognise numbers, dates, strings containing hyphens
- each token carries a qualitative description of the amount and type of preceding whitespace
- support for SRX sentence segmentation format (probably the first open-source C/C++ implementation)
- configurations for tokenising Polish are supplied
The sources have been released under GNU GPL 3.0. They may be obtained via our git repository:
git clone http://nlp.pwr.wroc.pl/toki.git
To install the library and the test application, you'll need CMake 2.6 and the following libraries:
- ICU (at least 4.2)
- Boost (tested with 1.41 and 1.42)
- libxml++2.6 (for SRX support)
- pwrutils from our corpus2 repository
The pwrutils sources may be obtained from http://nlp.pwr.wroc.pl/corpus2.git. Note that only the libpwrutils subproject is needed. If you plan to use other tools provided by us (e.g. MACA), consider installing the whole corpus2 package with its external CMakeLists.
These libraries must be installed along with their headers.
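Before configuring, it can help to confirm that the build tools used in this README are reachable from the shell. The following is only a minimal sketch: the helper name have_tool is our own illustration and not part of Toki, and it only looks for executables on PATH. It does not verify that the libraries listed above or their headers are installed.

```shell
# have_tool is a hypothetical helper, not part of Toki.
# It succeeds if the named executable can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report on the command-line tools used by the steps in this README.
for tool in git cmake make; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

Any tool reported as missing should be installed through your system's package manager before continuing.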
To install Toki, issue the following:
(or maybe ``CXX=/usr/lib/distcc/g++ CXXFLAGS="-march=i686 -m32" cmake -i'' or similar)
and confirm the proposed default values (when asked whether you want to see the advanced options, answer No). There should be no errors; if there are any, they are most likely caused by missing required libraries. If there are errors and you want to retry the install procedure (having installed the missing libraries), remove the CMakeCache.txt file first, so that CMake does not reuse the previously cached state of your system.
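The retry step can be sketched as a small shell helper. This is an illustration only: reset_cmake_cache is a hypothetical name, not something shipped with Toki, and it assumes that CMakeCache.txt sits in the directory from which CMake was run.

```shell
# reset_cmake_cache is a hypothetical helper, not part of Toki.
# It removes CMakeCache.txt from the given build directory (default: the
# current directory) so that the next CMake run re-detects the libraries.
reset_cmake_cache() {
    build_dir="${1:-.}"
    # -f keeps this silent and successful when no cache file exists yet.
    rm -f "$build_dir/CMakeCache.txt"
}
```

Call it as ``reset_cmake_cache .'' from the directory in which CMake was run, then repeat the same CMake command as before.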
After running CMake successfully, you are ready to proceed with the standard make-and-install procedure:
make
(wait for it to finish and make sure there are no errors)
sudo make install
make test (if you want to run the unit tests)