MACA (Morphological Analysis Converter and Aggregator) is a configurable module for morphological analysis and tagset conversion.
The implementation consists of:
- shared libraries (C++) with a simple API, including the Corpus2 library with Python wrappers (tag and tagset support, corpus I/O)
- a set of command-line utilities performing simple tasks, such as running different morphological analyser configurations or converting tagsets
The utilities support pipeline processing, e.g. first analysing plain text into an annotated corpus, then converting the tagset used in that corpus.
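
The two-stage idea (analyse first, then convert the tagset) can be illustrated with a small Python sketch. It mimics only the staging and is not MACA's API or command-line interface: the names `analyse`, `convert`, `run`, `LEXICON` and `TAG_MAP`, as well as the toy lexicon entries and target tag names, are assumptions made up for this example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# A token is an orthographic form plus the list of tags assigned to it.
Token = Tuple[str, List[str]]

# Toy lexicon and toy conversion table; both are invented for illustration.
LEXICON: Dict[str, List[str]] = {
    "ala": ["subst:sg:nom:f"],
    "ma": ["fin:sg:ter:imperf"],
}
TAG_MAP: Dict[str, str] = {"subst": "NOUN", "fin": "VERB", "ign": "X"}


def analyse(text: str) -> Iterator[Token]:
    """Stage 1: split on whitespace and look each form up in the toy lexicon."""
    for form in text.split():
        # Unknown forms get a single 'ign' tag.
        yield form, LEXICON.get(form.lower(), ["ign"])


def convert(tokens: Iterable[Token]) -> Iterator[Token]:
    """Stage 2: map the word class of every tag to the target tagset."""
    for form, tags in tokens:
        yield form, [TAG_MAP[tag.split(":")[0]] for tag in tags]


def run(text: str) -> List[Token]:
    """Chain both stages, feeding the output of stage 1 into stage 2."""
    return list(convert(analyse(text)))
```

Because each stage consumes and produces a stream of tokens, further stages could be appended without changing the existing ones.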
MACA is essentially a “morphological analyser shell”. It integrates different sources of morphological data, which makes it possible to:
- analyse plain text or text divided into paragraphs; our utility Toki is used for tokenisation and sentence splitting (supporting SRX rules)
- compile a user-supplied dictionary into a working transducer (using the SFST library)
- use Morfeusz SGJP or Morfeusz SIaT (both versions are provided as shared libraries; MACA adds the necessary shell to analyse plain text and get valid corpus output)
- build processing pipelines, e.g. a user-provided dictionary may override some Morfeusz entries
- tie different pipelines to different tokeniser labels (e.g. strings containing hyphens may be analysed with a specialised dictionary)
- convert the tagset of already analysed text, or convert Morfeusz output on the fly (which helps resolve some segmentation ambiguities)
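
The layering described in the list above (a user dictionary overriding a general analyser, and pipelines selected by tokeniser label) can be sketched in Python as follows. This is an illustration under stated assumptions, not MACA's implementation: MACA itself is driven by analyser configurations, and the names `dict_analyser`, `chain`, `unknown`, `PIPELINES` and `analyse_token`, the labels `word` and `hyphen`, and all dictionary entries are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An interpretation is a (lemma, tag) pair. An analyser returns a list of
# interpretations, or None when it does not know the form.
Interp = Tuple[str, str]
Analyser = Callable[[str], Optional[List[Interp]]]


def dict_analyser(entries: Dict[str, List[Interp]]) -> Analyser:
    """Build an analyser backed by a plain dictionary (hypothetical helper)."""
    def analyse(form: str) -> Optional[List[Interp]]:
        return entries.get(form.lower())
    return analyse


def chain(*analysers: Analyser) -> Analyser:
    """Try analysers in order; the first one that knows the form wins."""
    def analyse(form: str) -> Optional[List[Interp]]:
        for analyser in analysers:
            result = analyser(form)
            if result:
                return result
        return None
    return analyse


def unknown(form: str) -> List[Interp]:
    """Last-resort analyser: keep the form as its lemma, tagged 'ign'."""
    return [(form, "ign")]


# Toy data standing in for a general analyser, a user dictionary and a
# specialised dictionary for hyphenated strings.
general = dict_analyser({"kot": [("kot", "subst:sg:nom:m2")]})
user = dict_analyser({"kot": [("kot", "subst:sg:nom:m1")]})
hyphenated = dict_analyser(
    {"polsko-niemiecki": [("polsko-niemiecki", "adj:sg:nom:m3:pos")]}
)

# One pipeline per tokeniser label; the user dictionary is consulted first,
# so its entries override those of the general analyser.
PIPELINES: Dict[str, Analyser] = {
    "word": chain(user, general, unknown),
    "hyphen": chain(hyphenated, unknown),
}


def analyse_token(form: str, label: str) -> Optional[List[Interp]]:
    """Dispatch a token to the pipeline registered for its label."""
    return PIPELINES.get(label, unknown)(form)
```

Placing `unknown` last in every chain guarantees that each token leaves the pipeline with at least one interpretation.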
MACA targets Polish. It should also be usable for other languages, although the following may be obstacles:
- it assumes positional tagsets, i.e. the word class (POS, grammatical class) comes first, followed by the values of the attributes (grammatical categories) valid for that word class
- tags are represented textually as in the IPI PAN Corpus, that is, the atomic symbols are separated by colons
- we support the XCES corpus format and some simple formats, e.g. a variation of the plain text format.
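
To make the positional, colon-separated tag representation concrete, here is a minimal Python sketch. It does not use the Corpus2 API; `TAGSET`, `parse_tag` and `format_tag` are hypothetical names, and the tiny tagset is only loosely modelled on IPI PAN-style tags rather than being a complete or official definition.

```python
from typing import Dict, List

# Toy positional tagset: each word class lists its attributes in the order
# in which their values must appear in a tag.
TAGSET: Dict[str, List[str]] = {
    "subst": ["number", "case", "gender"],
    "adj": ["number", "case", "gender", "degree"],
    "interp": [],
}


def parse_tag(tag: str) -> Dict[str, str]:
    """Split a colon-separated tag and name each position."""
    symbols = tag.split(":")
    pos, values = symbols[0], symbols[1:]
    if pos not in TAGSET:
        raise ValueError(f"unknown word class: {pos!r}")
    attributes = TAGSET[pos]
    if len(values) != len(attributes):
        raise ValueError(
            f"{pos!r} expects {len(attributes)} values, got {len(values)}"
        )
    parsed = {"pos": pos}
    parsed.update(zip(attributes, values))
    return parsed


def format_tag(parsed: Dict[str, str]) -> str:
    """Rebuild the textual tag: word class first, then attribute values."""
    pos = parsed["pos"]
    return ":".join([pos] + [parsed[name] for name in TAGSET[pos]])
```

A tagset whose tags do not follow this "word class first, fixed attribute order" layout would not fit such a scheme without adaptation, which is the obstacle the list above refers to.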
More details and pointers to the source code may be found on the project site.
The source code has been released under the GNU GPL 3.0. Data are generally provided under Creative Commons ShareAlike licences; please read the notes on the project site carefully.