WcrftReader

WcrftReader provides a Corpus2::TokenReader interface to the WCRFT tagger.

Corpus2 is a library that offers a range of corpus readers and writers, data structures and routines for morpho-syntactic processing of annotated corpora, and support for positional tagsets. WCRFT itself is based on the Corpus2 library, as are many other projects, e.g. IOBBER (a chunker made for Polish) and WCCL (a morpho-syntactic feature generator). These applications assume that the input to be processed is an annotated corpus stored in some format, and each format is read by a dedicated type of reader.

These applications use Corpus2's reader factory, which allows a reader to be instantiated by providing its name (also called id, class or just input format). A reader name is a string, e.g. "ccl" designates a reader that is able to read the CCL format (an XML-based format that may store both morphosyntactic tagging and shallow syntactic annotation). The name may also be suffixed with additional options, e.g. "ccl,ign" will create a CCL format reader instance that tolerates unexpected tags and treats them as unknown. The user of an application supplies such a reader name, typically with the -i name switch, e.g. -i ccl.
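To make the naming convention concrete, the following is a minimal sketch of how a string such as "ccl,ign" could be split into a reader name and its options. It is an illustration only and not the Corpus2 factory code: the ReaderSpec type and the parse_reader_spec function are hypothetical names, and the sketch assumes that commas are the only separators.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type for illustration; not part of Corpus2.
struct ReaderSpec {
    std::string name;                  // reader name, e.g. "ccl"
    std::vector<std::string> options;  // trailing options, e.g. {"ign"}
};

// Splits "name,opt1,opt2" on commas. The first field is taken as the
// reader name; the remaining non-empty fields are collected as options.
ReaderSpec parse_reader_spec(const std::string& spec) {
    ReaderSpec result;
    std::istringstream in(spec);
    std::string field;
    bool first = true;
    while (std::getline(in, field, ',')) {
        if (first) {
            result.name = field;
            first = false;
        } else if (!field.empty()) {
            result.options.push_back(field);
        }
    }
    return result;
}
```

With this sketch, parse_reader_spec("ccl,ign") yields the name "ccl" and the single option "ign", while a bare "ccl" yields the name with no options.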

Until now, to run these applications on plain text, the user had to tag the text beforehand and feed the tagger output to the application. If tagging was to be integrated into an application, one had to use the tagger's API. WCRFT 1.0 was written in Python and offered a Python API. It was used in some projects, e.g. IOBBER contains the iobber_txt script, which connects to the WCRFT Python API and tags user-supplied text before chunking (note that this may have already changed, as we plan to let IOBBER use WCRFT 2.0 and WcrftReader). The disadvantage was that adding tagging support to an application required writing interfacing code.

WcrftReader solves this problem. It compiles as a Corpus2 plugin and presents itself as a regular reader instantiated with the "wcrft" name. The whole tagging procedure is transparent to the application. Therefore, any application that can instantiate Corpus2 readers can already use tagging.

Schema depicting WCRFT-as-reader usage

WcrftReader as a Corpus2 plugin

Corpus2 supports reader and writer plugins. Each plugin is built as a separate shared library whose name is always prefixed with corpus2_. After system-wide installation, a plugin is placed in the system library directory, the same place from which corpus2 itself is loaded. When asked to create a reader, Corpus2 first checks its native readers and, if none matches, attempts to load an external library (with a dlopen call). If a compliant library is found, the reader is created, and the whole plugin loading process is transparent to the user.

WcrftReader as a TokenReader

Note that Corpus2 offers two types of reader access: reading files and reading streams. While most of Corpus2's native readers implement both modes, WcrftReader acts only as a file reader.

WcrftReader's behaviour is controlled via its options, which are passed as part of the reader name (via an application's input_format parameter or the reader name/id argument of TokenReader::create_path_reader). Option values should be put after the reader name, e.g. "wcrft,config:nkjp_s2" will create a WcrftReader using the nkjp_s2.ini WCRFT config.

WcrftReader may process various input formats. By default it tags plain text (it uses the Maca morphological analyser for that purpose). You can override this behaviour using the format option, e.g. "wcrft,format:ccl".

By default WcrftReader uses the nkjp_e2 config. This config is optimised for speed and low memory usage, at the cost of lower tagging accuracy. For best tagging accuracy, use the nkjp_s2.ini config: "wcrft,config:nkjp_s2". Note that you will need to download the trained model for this config separately (it is very large; look for it on the main WCRFT wiki page). You should either put the unpacked model into a standard system location for models (see documentation) or keep it in your preferred location and use the "model:your/model/dir" option to tell WcrftReader where it is, e.g. "wcrft,config:nkjp_s2,model:/home/user/data/model_nkjp10_wcrft_s2".
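An application that exposes config and model settings separately may need to assemble such a reader name itself. The following is a minimal sketch of one way to do so. The make_wcrft_reader_id function is a hypothetical helper, not part of Corpus2 or WCRFT, and the sketch assumes that the order of options within the reader name does not matter.

```cpp
#include <string>

// Hypothetical helper for illustration; not part of Corpus2 or WCRFT.
// Empty arguments are omitted from the result, so the reader falls
// back to its own defaults for them.
std::string make_wcrft_reader_id(const std::string& config = "",
                                 const std::string& model_dir = "",
                                 const std::string& format = "") {
    std::string id = "wcrft";
    if (!format.empty()) {
        id += ",format:" + format;
    }
    if (!config.empty()) {
        id += ",config:" + config;
    }
    if (!model_dir.empty()) {
        id += ",model:" + model_dir;
    }
    return id;
}
```

Calling make_wcrft_reader_id("nkjp_s2", "/home/user/data/model_nkjp10_wcrft_s2") produces the reader name quoted above, and calling it with no arguments produces plain "wcrft".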

To test the WcrftReader you can use corpus-get from Corpus2, e.g.:

corpus-get -C -t nkjp -o ccl -i wcrft,config:nkjp_s2,model:/home/user/data/model_nkjp10_wcrft_s2 ~/data/input.txt

Note: all Corpus2 readers require a tagset to be specified upon creation. In the case of WcrftReader, the provided tagset must match the tagset used by the employed config. By default, you will need the nkjp tagset (it is used by the default configs).

Technical documentation

WcrftReader is implemented in the WcrftReader class, which is placed in the Corpus2 namespace.

WcrftReader inherits from Corpus2::BufferedChunkReader.

Class API in wcrftreader.h:


namespace Corpus2 {

/**
 * @brief Wrapper that allows the tagger to be used as a @c Corpus2::TokenReader.
 *
 * Gives a @c Corpus2::TokenReader interface to the WCRFT tagger. The reader may be
 * created using @c create_path_reader with "wcrft" as the reader id.
 * WcrftReader may read various input formats. By default it tags plain text.
 * You can override this by setting "format:formatname" option.
 * Default WCRFT configuration may be overridden using "config:configname" 
 * option. Trained tagger model may be overridden using "model:modeldir".
 */
class WcrftReader: public BufferedChunkReader
{
public:
    /**
     * Create WcrftReader ready to process the given file.
     * @param tagset tagset that matches the tagger config used
     * @param filename file to be tagged or empty string for stdin
     */
    WcrftReader(const Tagset& tagset, const std::string &filename);

    ~WcrftReader();

    /**
     * Sets options specific to WCRFT. These include:
     * * "format:formatname" - assumed input format (default: txt),
     * * "config:configname" - tagger config (default: nkjp_e2),
     * * "model:modeldir" - override model dir (default: empty string).
     */
    void set_option(const std::string& option);

    /**
     * Option inspector. If the option is set, returns the whole string
     * that contains its name and its value (e.g. config:nkjp_e2). If the
     * option is unset, returns an empty string; if the option is invalid,
     * returns "unknown".
     */
    std::string get_option(const std::string& option) const;

    /**
     * Check if the reader is valid, should throw if not. Called after
     * all set_options during factory reader creation.
     */
    virtual void validate();

    static bool registered;

    static const std::string DEFAULT_CONFIG, DEFAULT_FORMAT, DEFAULT_MODEL;

private:
    …
};

}
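The option semantics documented in the header can be illustrated with a small stand-alone mock. This is a sketch under stated assumptions and not the actual WcrftReader implementation: the class name WcrftOptionsSketch is hypothetical, get_option is assumed to take the bare option name (e.g. "config"), an option with an empty value is treated as unset, and malformed or unrecognised input to set_option is silently ignored here, which the real reader may handle differently.

```cpp
#include <map>
#include <string>

// Hypothetical mock of the documented option behaviour; it does not
// derive from any Corpus2 class and performs no tagging.
class WcrftOptionsSketch {
public:
    WcrftOptionsSketch() {
        // Defaults as listed in the header comment.
        values_["format"] = "txt";
        values_["config"] = "nkjp_e2";
        values_["model"] = "";
    }

    // Accepts "name:value". Input without a colon or with an
    // unrecognised name is ignored in this sketch (an assumption).
    void set_option(const std::string& option) {
        const std::string::size_type colon = option.find(':');
        if (colon == std::string::npos) {
            return;
        }
        auto it = values_.find(option.substr(0, colon));
        if (it != values_.end()) {
            it->second = option.substr(colon + 1);
        }
    }

    // Returns "name:value" if set, "" if unset, "unknown" if invalid.
    std::string get_option(const std::string& name) const {
        auto it = values_.find(name);
        if (it == values_.end()) {
            return "unknown";
        }
        if (it->second.empty()) {
            return "";
        }
        return name + ":" + it->second;
    }

private:
    std::map<std::string, std::string> values_;
};
```

In this mock, a freshly constructed object reports "config:nkjp_e2" for the config option and an empty string for the model option until a model directory is set.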

wcrftreader.png - Schema depicting WCRFT-as-reader usage (84,769 KB) Adam Radziszewski, 03 Sep 2014 11:22

wcrftreader.svg - Schema depicting WCRFT-as-reader usage (Inkscape drawing) (24,172 KB) Adam Radziszewski, 03 Sep 2014 11:24