KPWr (Polish Corpus of Wrocław University of Technology; Polish: Korpus Języka Polskiego Politechniki Wrocławskiej) is a corpus of written and spoken documents available under a Creative Commons license. The texts are divided into 15 subcorpora (blogs, science, stenographic recordings, etc.). The documents are annotated at the level of chunks and selected predicate-argument relations, named entities, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords, and semantic roles within nominal and adjectival phrases.
Statistics of the latest version of the corpus
Subcorpus | Documents (count) | Documents (%) | Tokens (count) | Tokens (%) |
blogs | 171 | 10.48% | 52793 | 11.80% |
dap | 132 | 8.09% | 41181 | 9.20% |
dialogue | 91 | 5.58% | 30070 | 6.72% |
kap | 221 | 13.55% | 34284 | 7.66% |
science | 87 | 5.33% | 28269 | 6.32% |
popular science and textbooks | 73 | 4.48% | 22463 | 5.02% |
early prose | 86 | 5.27% | 36094 | 8.06% |
contemporary prose | 42 | 2.58% | 19101 | 4.27% |
religious | 9 | 0.55% | 5357 | 1.20% |
stenographic recordings | 79 | 4.84% | 32297 | 7.22% |
technical | 17 | 1.04% | 4373 | 0.98% |
official documents | 62 | 3.80% | 18890 | 4.22% |
legal acts | 80 | 4.90% | 31620 | 7.06% |
wikinews | 123 | 7.54% | 28264 | 6.31% |
wikipedia | 358 | 21.95% | 62520 | 13.97% |
Total | 1631 | 100.00% | 447576 | 100.00% |
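The percentage columns are each subcorpus's share of the corpus totals (1631 documents, 447576 tokens). A minimal sketch of that arithmetic, using three rows copied from the table; the `share` helper is a name introduced here for illustration only:

```python
# Document and token counts copied from three rows of the statistics table.
counts = {
    "blogs": (171, 52793),
    "religious": (9, 5357),
    "wikipedia": (358, 62520),
}
TOTAL_DOCS, TOTAL_TOKENS = 1631, 447576  # totals from the last table row


def share(part, total):
    """Return part/total as a percentage rounded to two decimal places."""
    return round(100 * part / total, 2)


for name, (docs, tokens) in counts.items():
    print(
        f"{name}: {share(docs, TOTAL_DOCS):.2f}% of documents, "
        f"{share(tokens, TOTAL_TOKENS):.2f}% of tokens"
    )
```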
Availability
This work is licensed under a Creative Commons Attribution 3.0 Unported Licence.
The corresponding licence agreement can be found at http://creativecommons.org/licenses/by/3.0/legalcode
Content
The corpus is manually annotated on the following layers:
- shallow syntax: syntactic chunking and selected inter-chunk syntactic relations,
- named entities and selected semantic relations between them,
- anaphora (limited to the identity-of-reference type),
- word senses (for selected lexemes).
The corpus is stored in CCL (*.xml) and REL (*.rel.xml) files. For the format specification, see the CCL_format page.
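As a rough illustration of reading a CCL file, the sketch below parses a small inline sample with Python's standard library. The element and attribute names (`chunkList`, `chunk`, `sentence`, `tok`, `orth`, `lex`, `ann`, `chan`) and the convention that `0` marks a token outside any annotation are assumptions about the format, and the `nam_org` channel and the sample sentence are invented for the example. The CCL_format specification remains the authoritative reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real files should be checked against the CCL specification.
SAMPLE_CCL = """<chunkList>
  <chunk id="ch1" type="p">
    <sentence id="s1">
      <tok>
        <orth>Politechnika</orth>
        <lex disamb="1"><base>politechnika</base><ctag>subst:sg:nom:f</ctag></lex>
        <ann chan="nam_org">1</ann>
      </tok>
      <tok>
        <orth>Wrocławska</orth>
        <lex disamb="1"><base>wrocławski</base><ctag>adj:sg:nom:f:pos</ctag></lex>
        <ann chan="nam_org">1</ann>
      </tok>
      <tok>
        <orth>działa</orth>
        <lex disamb="1"><base>działać</base><ctag>fin:sg:ter:imperf</ctag></lex>
        <ann chan="nam_org">0</ann>
      </tok>
    </sentence>
  </chunk>
</chunkList>"""


def annotations(sentence):
    """Group token forms by (channel, annotation number); 0 is treated as 'no annotation'."""
    spans = {}
    for tok in sentence.iter("tok"):
        orth = tok.findtext("orth")
        for ann in tok.iter("ann"):
            number = int(ann.text)
            if number > 0:
                spans.setdefault((ann.get("chan"), number), []).append(orth)
    return spans


root = ET.fromstring(SAMPLE_CCL)
for sentence in root.iter("sentence"):
    for (chan, number), words in annotations(sentence).items():
        print(sentence.get("id"), chan, number, " ".join(words))
```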
Note: annotation progress differs across the layers, and not all documents have been annotated on all layers.
The following files contain a list of documents annotated on the corresponding annotation layer:
- index_chunks.txt: syntactic chunking,
- index_chunks_rel.txt: inter-chunk syntactic relations,
- index_names.txt: named entities,
- index_names_rel.txt: semantic relations between named entities,
- index_anaphora.txt: anaphora,
- index_wsd.txt: word senses.
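Because not every document is annotated on every layer, a common first step is to intersect the index files for the layers you need. The sketch below assumes each index file holds one document path per line (the page does not state the file layout) and uses a hypothetical corpus directory `kpwr`. The file names follow the list above; the 1.1 release notes rename two of them, so the mapping may need adjusting for that release.

```python
from pathlib import Path

# Layer name -> index file, following the list above (adjust for renamed files).
LAYER_INDEXES = {
    "chunks": "index_chunks.txt",
    "chunks_rel": "index_chunks_rel.txt",
    "names": "index_names.txt",
    "names_rel": "index_names_rel.txt",
    "anaphora": "index_anaphora.txt",
    "wsd": "index_wsd.txt",
}


def load_index(path):
    """Return the set of non-empty lines of an index file (assumed: one document per line)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def documents_by_layer(corpus_dir, indexes=LAYER_INDEXES):
    """Map each layer to its documents, skipping index files that are absent."""
    corpus_dir = Path(corpus_dir)
    result = {}
    for layer, filename in indexes.items():
        path = corpus_dir / filename
        if path.exists():
            result[layer] = load_index(path)
    return result


def annotated_on_all(by_layer, layers):
    """Return the documents that appear in the index of every requested layer."""
    sets = [by_layer.get(layer, set()) for layer in layers]
    return set.intersection(*sets) if sets else set()


# "kpwr" is a hypothetical directory; if it is missing, the result is simply empty.
by_layer = documents_by_layer("kpwr")
print(sorted(annotated_on_all(by_layer, ["chunks", "names"])))
```

A layer whose index file is missing is treated as having no annotated documents, so any intersection that includes it is empty.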
Releases
KPWr 1.2 (25.04.2016)
Latest release
KPWr 1.1 (26.01.2013)
- Includes only clean (verified) documents.
- Increased number of semantic relations: rare semantic relations are now included as well.
- Changes in relation names:
  - "Anaphora" to "Coreference"
  - "ref: nw – nw" to "coreference_pn"
  - "ref: agp – nw (bez zaimków osobowych)" to "coreference_agp"
  - "ref: podmiot zerowy – nw" to "coreference_zero"
  - "ref: zaimki osobowe – nw" to "coreference_pron"
- Includes semantic relations between "wyznacznik" and names (*_coref relations).
- The annotations of syntactic chunk heads were converted to token attributes (following CCL specification).
- 'index_names_rel.txt' changed to 'index_name_rel.txt'
- 'index_anaphora.txt' changed to 'index_coref.txt'
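Scripts written against KPWr 1.0 may need the renames listed above. One way to handle them is a pair of lookup tables, sketched below. The table contents are taken from the release notes, while the `to_v11` helper and the table names are hypothetical names introduced for this example.

```python
# Old name -> new name, as listed in the KPWr 1.1 release notes.
RELATION_RENAMES = {
    "Anaphora": "Coreference",
    "ref: nw – nw": "coreference_pn",
    "ref: agp – nw (bez zaimków osobowych)": "coreference_agp",
    "ref: podmiot zerowy – nw": "coreference_zero",
    "ref: zaimki osobowe – nw": "coreference_pron",
}

INDEX_RENAMES = {
    "index_names_rel.txt": "index_name_rel.txt",
    "index_anaphora.txt": "index_coref.txt",
}


def to_v11(name, table):
    """Return the 1.1 name for a 1.0 name; names without a rename pass through unchanged."""
    return table.get(name, name)


print(to_v11("ref: nw – nw", RELATION_RENAMES))
print(to_v11("index_anaphora.txt", INDEX_RENAMES))
print(to_v11("index_wsd.txt", INDEX_RENAMES))
```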
KPWr 1.0 (26.11.2012)
First official release
Publications
- Broda, B., Marcińczuk, M., Maziarz, M., Radziszewski, A., & Wardyński, A. (2012). KPWr: Towards a Free Corpus of Polish. LREC 2012.
- Radziszewski, A., Maziarz, M., & Wieczorek, J. (2012). Shallow syntactic annotation in the Corpus of Wrocław University of Technology. Cognitive Studies.
- Marcińczuk, M., Oleksy, M., Bernaś, T., Kocoń, J., & Wolski, M. (2015). Towards an event annotated corpus of Polish. Cognitive Studies | Études cognitives, (15), 253-267.
- Kocoń, J., Marcińczuk, M., Oleksy, M., Bernaś, T., & Wolski, M. (2015). Temporal Expressions in Polish Corpus KPWr. Cognitive Studies | Études cognitives, (15), 293-317.