KPWr (Polish Corpus of Wrocław University of Technology; Polish: Korpus Języka Polskiego Politechniki Wrocławskiej) is a corpus of written and spoken documents available under a Creative Commons license. The texts are divided into 15 subcorpora (blogs, science, stenographic recordings, etc.). The documents are annotated at the level of chunks and selected predicate-argument relations, named entities, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords, and semantic roles within nominal and adjectival phrases.

 

Statistics of the latest corpus release

Subcorpus                        Documents          Tokens
                                 count       %      count       %
blogs                              171  10.48%      52793  11.80%
dap                                132   8.09%      41181   9.20%
dialogue                            91   5.58%      30070   6.72%
kap                                221  13.55%      34284   7.66%
science                             87   5.33%      28269   6.32%
popular science and textbooks       73   4.48%      22463   5.02%
old prose                           86   5.27%      36094   8.06%
contemporary prose                  42   2.58%      19101   4.27%
religious texts                      9   0.55%       5357   1.20%
stenographic recordings             79   4.84%      32297   7.22%
technical texts                     17   1.04%       4373   0.98%
official documents                  62   3.80%      18890   4.22%
legal acts                          80   4.90%      31620   7.06%
wikinews                           123   7.54%      28264   6.31%
wikipedia                          358  21.95%      62520  13.97%
Total                             1631             447576

Availability

This work is licensed under a Creative Commons Attribution 3.0 Unported Licence.
The corresponding licence agreement can be found at http://creativecommons.org/licenses/by/3.0/legalcode

Content

The corpus is manually annotated on the following layers:

  • shallow syntax: syntactic chunking and selected inter-chunk syntactic relations,
  • named entities and selected semantic relations between them,
  • anaphora (limited to the identity-of-reference type),
  • word senses (for selected lexemes).

The corpus is stored in CCL (*.xml) and REL (*.rel.xml) files. For the format specification, please consult the following site: CCL_format.
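As a rough illustration of how a CCL file might be read, the sketch below parses a small inline sample with Python's standard xml.etree.ElementTree module. The element and attribute names (tok, orth, ann, chan), the sample text, and the channel name nam_per are assumptions made for illustration only; the CCL_format specification is the authoritative description. The function name tokens_with_channels is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real layout is defined by the CCL_format specification.
SAMPLE_CCL = """<chunkList>
  <chunk id="ch1" type="p">
    <sentence id="s1">
      <tok>
        <orth>Jan</orth>
        <lex disamb="1"><base>Jan</base><ctag>subst:sg:nom:m1</ctag></lex>
        <ann chan="nam_per">1</ann>
      </tok>
      <tok>
        <orth>śpi</orth>
        <lex disamb="1"><base>spać</base><ctag>fin:sg:ter:imperf</ctag></lex>
        <ann chan="nam_per">0</ann>
      </tok>
    </sentence>
  </chunk>
</chunkList>"""


def tokens_with_channels(xml_text):
    """Return (orth, {channel: value}) pairs, keeping only non-zero channel values."""
    root = ET.fromstring(xml_text)
    result = []
    for tok in root.iter("tok"):
        orth = tok.findtext("orth", default="")
        active = {}
        for ann in tok.findall("ann"):
            # Assumption: a value of 0 means the token is outside any annotation
            # in that channel.
            value = int((ann.text or "0").strip())
            if value != 0:
                active[ann.get("chan")] = value
        result.append((orth, active))
    return result


tokens = tokens_with_channels(SAMPLE_CCL)
```

For a real corpus file, the same function could be given the contents of a *.xml document read with UTF-8 encoding, provided the assumed layout matches.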

Note: the progress of annotation is different across the layers. Not all documents have been annotated on all layers.

The following files contain a list of documents annotated on the corresponding annotation layer:

index_chunks.txt        syntactic chunking,
index_chunks_rel.txt    inter-chunk syntactic relations,
index_names.txt         named entities,
index_names_rel.txt     semantic relations between named entities,
index_anaphora.txt      anaphora,
index_wsd.txt           word senses.
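Because not every document is annotated on every layer, the index files can be combined to select documents that carry several layers at once. The sketch below assumes that each index file lists one document per line; this layout is not stated here and should be checked against the actual files. The function names read_index and documents_on_all_layers are hypothetical.

```python
def read_index(path):
    """Return the set of non-empty lines in an index file.

    Assumption: each line names exactly one document.
    """
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def documents_on_all_layers(index_paths):
    """Return the documents listed in every one of the given index files."""
    sets = [read_index(p) for p in index_paths]
    if not sets:
        return set()
    return set.intersection(*sets)
```

For example, documents_on_all_layers(["index_names.txt", "index_names_rel.txt"]) would return the documents that have both named entities and relations between them annotated.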

Releases

KPWr 1.2 (25.04.2016)

Latest release

KPWr 1.1 (26.01.2013)

  • Includes only clean (verified) documents.
  • Increased number of semantic relations: the rare semantic relations are now also included.
  • Changes in relation names:
    • "Anaphora" to "Coreference"
    • "ref: nw – nw" to "coreference_pn"
    • "ref: agp – nw (bez zaimków osobowych)" to "coreference_agp"
    • "ref: podmiot zerowy – nw" to "coreference_zero"
    • "ref: zaimki osobowe – nw" to "coreference_pron"
  • Includes semantic relations between "wyznacznik" and names (*_coref relations).
  • The annotations of syntactic chunk heads were converted to token attributes (following CCL specification).
  • 'index_names_rel.txt' changed to 'index_name_rel.txt'
  • 'index_anaphora.txt' changed to 'index_coref.txt'
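Code written against KPWr 1.0 may need to translate the old names. The renames listed above can be captured in lookup tables, as in the sketch below. The mappings are copied from the list; the dictionary and function names are hypothetical.

```python
# Relation names renamed between KPWr 1.0 and KPWr 1.1 (taken from the list above).
RENAMED_RELATIONS = {
    "Anaphora": "Coreference",
    "ref: nw – nw": "coreference_pn",
    "ref: agp – nw (bez zaimków osobowych)": "coreference_agp",
    "ref: podmiot zerowy – nw": "coreference_zero",
    "ref: zaimki osobowe – nw": "coreference_pron",
}

# Index files renamed in the same release.
RENAMED_INDEX_FILES = {
    "index_names_rel.txt": "index_name_rel.txt",
    "index_anaphora.txt": "index_coref.txt",
}


def to_kpwr_1_1(name):
    """Map a KPWr 1.0 relation or index file name to its 1.1 name.

    Names that were not renamed are returned unchanged.
    """
    if name in RENAMED_RELATIONS:
        return RENAMED_RELATIONS[name]
    return RENAMED_INDEX_FILES.get(name, name)
```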

KPWr 1.0 (26.11.2012)

First official release

Publications