User guide

This guide describes how to use the command-line utilities provided with WCCL.

Technical note: these utilities are simple programs that call the corresponding API functions, so similar functionality can easily be obtained directly from the libraries, using either the native C++ API or the Python wrappers.

It is assumed here that WCCL has been installed.

WCCL provides the following command-line utilities:

  • wccl-run: utility that evaluates functional expressions against the input corpus
  • wccl-features: as above but outputs in Weka ARFF format
  • wccl-rules: utility to process a corpus using a set of disambiguation and/or annotation rules
  • wccl-parser: interactive util to test WCCL expressions

Feature generation with wccl-run

wccl-run reports the values of the given functional expressions for each token of the input. The following examples use the sample XCES file in-xces.xml, which is provided with the sources. This input contains one sentence with morpho-syntactic annotation in the kipi tagset. The file was generated with Maca by the following call:

echo "Rodzina pingwinów obejmuje gatunki morskie (na lądzie pojawiają się jedynie w strefie brzegowej) zamieszkujące zimne morza półkuli południowej." | maca-analyse morfo1222-ikipi -o xces -s | maca-convert ikipi2kipi.conv -o xces > in-xces.xml

To get started, issue the following:

wccl-run in-xces.xml class[0] cas[0]

This generates multi-column output in which each row corresponds to one input token. The sentence number and token number come first, then the word form, followed by the values of the functional expressions provided by the user. The header may be turned off if desired, as may particular columns; see wccl-run for details.
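A practical note on the call above: expressions such as class[0] contain square brackets, which POSIX-style shells (sh, bash) treat as glob patterns. If a file named class0 happens to exist in the working directory, the unquoted expression is silently replaced by that file name, and some shells (zsh with default settings) report an error when nothing matches. The sketch below only demonstrates the shell behaviour with printf in a scratch directory; it does not call WCCL, and the scratch file name is made up for the illustration.

```shell
# Scratch directory containing a file whose name happens to match class[0].
tmp=$(mktemp -d)
touch "$tmp/class0"

# Unquoted: the shell treats [0] as a glob pattern and substitutes the file name.
unquoted=$(cd "$tmp" && printf '%s\n' class[0])
# Quoted: the expression reaches the program unchanged.
quoted=$(cd "$tmp" && printf '%s\n' 'class[0]')

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"

# With WCCL installed, the quoted form of the call above would be:
#   wccl-run in-xces.xml 'class[0]' 'cas[0]'
rm -r "$tmp"
```

Single-quoting each expression is therefore the safer habit, especially in scripts.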

Note that for some tokens the value of cas[0] is an empty set, while for some others it contains multiple elements. This is because the input was ambiguous (e.g. multiple tag/lemma pairs attached to single tokens). WCCL is suitable for processing both ambiguous and disambiguated input. In the latter case, if the selected tags are marked as “disamb”, use -i xces,disamb_only (WCCL itself doesn't look at the “disamb” markers, so you have to use an input reader configuration that loads only “disamb” interpretations for further processing).
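To make the difference between ambiguous and disambiguated input concrete, the sketch below writes a single hand-made token to a temporary file and counts its interpretations. The fragment is an illustrative assumption about how such a token may look in XCES (a lex element per interpretation, with disamb="1" on the selected one); it is not taken from in-xces.xml, and the grep-based counting only approximates what a reader configured with disamb_only would keep.

```shell
# Hypothetical XCES-style token with three interpretations, one marked "disamb".
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tok>
<orth>gatunki</orth>
<lex><base>gatunek</base><ctag>subst:pl:nom:m3</ctag></lex>
<lex disamb="1"><base>gatunek</base><ctag>subst:pl:acc:m3</ctag></lex>
<lex><base>gatunek</base><ctag>subst:pl:voc:m3</ctag></lex>
</tok>
EOF

# Every interpretation: roughly what the default reader loads.
all=$(grep -c '<lex' "$sample")
# Only the marked interpretations: roughly what disamb_only would load.
chosen=$(grep -c '<lex disamb="1"' "$sample")

echo "interpretations: $all, marked disamb: $chosen"
rm "$sample"
```

With all three interpretations loaded, cas[0] for this token would be expected to contain several case values; with only the marked one loaded, a single value.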

For complex expressions it is recommended to prepare separate file(s) with WCCL expressions. Such files should conform to the WCCL syntax for files. This also allows for loading of external lexicons. For details, consult the language specification as well as the example .ccl files. For demonstration, use the following simple-ops.ccl file:

@s:"orths" (
   orth[-1]; orth[0]; orth[1]
)

@t:"wclass" (
   class[-1]; class[0]; class[1]
)

@b:"agr2" (
   agr(-1,1,{nmb,gnd,cas})
)

and the following call:

wccl-run in-xces.xml simple-ops.ccl
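The two steps above can be combined into one short script. This is only a sketch: the scratch directory and the availability check are additions made for the illustration, so the script does nothing harmful on a machine where WCCL or the sample input is missing.

```shell
# Write the definitions shown above to a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/simple-ops.ccl" <<'EOF'
@s:"orths" (
   orth[-1]; orth[0]; orth[1]
)

@t:"wclass" (
   class[-1]; class[0]; class[1]
)

@b:"agr2" (
   agr(-1,1,{nmb,gnd,cas})
)
EOF

# Run only when wccl-run is installed and the sample input is present.
if command -v wccl-run >/dev/null 2>&1 && [ -f in-xces.xml ]; then
    wccl-run in-xces.xml "$workdir/simple-ops.ccl"
else
    echo "wccl-run or in-xces.xml not available; wrote $workdir/simple-ops.ccl only"
fi
```

The quoted heredoc delimiter ('EOF') keeps the shell from expanding anything inside the WCCL definitions.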

To see simple lexicon usage, try wccl-run in2-xces.xml indecl-ops.ccl
Where simple-ops.ccl consists of the following definitions

TODO: wccl-features: ARFF format

Processing text with wccl-rules

wccl-rules applies:
  • disambiguation rules (syntax similar to JOSKIPI rules as used in TaKIPI, position-centric),
  • chunk annotation rules (“match rules”, matching over whole sentences).

Note that both rule types must be written in the general WCCL file syntax. This also makes it possible to put both rule types in one file. In that case, disambiguation rules are fired first.

As wccl-rules may introduce new chunk annotations (or alter existing ones), its default output format is the “ccl” format. A different format can be selected with the -o option (e.g. to output plain XCES with no annotations, use -o xces).

To apply the tagging rules converted from TaKIPI, issue the following:

wccl-rules takipi_rules.ccl in-xces.xml -o xces > out.xml

Note that the discarded interpretations (“lexemes” in XML) are physically removed from the input; the “disamb” markers are not used. Similarly, the default input mode loads all interpretations and does not respect the “disamb” markers in the input. Some additional options for handling “disamb” in the output may be introduced in the future.

To see an example usage of match rules, issue the following:

wccl-rules in2-xces.xml np-match.ccl