“Czy wiesz” Question Answering dataset

“Czy wiesz” (Polish for “Did you know”) is a set of 4721 questions, each linked to a Wikipedia article that contains the answer.

For 250 of the questions a detailed manual analysis has been performed: each of those selected questions has manually checked answer-bearing fragments attached to it. Some questions are assigned multiple fragments.

The data set was obtained from the Polish “Did you know” wikiproject. It is intended to facilitate the evaluation and development of Polish QA systems.

The dataset includes:

  1. Pre-processed dump of the entire Polish Wikipedia from 22 Jan 2013 (Czywiesz.tar).
  2. A list of correct answers obtained from the “Did you know” site (source/questions.txt).
  3. A list of the questions paired with Wikipedia articles (source/questions_links.txt).
  4. A list of the questions paired with answer-bearing documents (source/czywiesz.csv).
  5. A list of questions that have been judged as incorrect/unwanted and hence got rejected (source/removed.txt).
  6. The main set (annotations/results/czywiesz-eva-I-250-approved.json), based on 250 randomly drawn questions. Each of these was fed through our development QA system, which output a ranked answer list for each question. The first 200 positions were checked manually. The JSON file contains the manually approved answers for each of the questions. An answer is a fragment 1–5 sentences long.
  7. Logs tracing the process of question acquisition (logs/*).
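A minimal Python sketch of working with the approved-answers file. Note that the exact JSON layout is an assumption here (a mapping from question identifiers to lists of approved fragments); inspect the real file before relying on these key names.

```python
import json

def count_fragments(approved):
    """Return (number of questions, total number of approved fragments).

    `approved` is assumed to map question IDs to lists of answer fragments,
    e.g. the result of:
        approved = json.load(open(
            "annotations/results/czywiesz-eva-I-250-approved.json",
            encoding="utf-8"))
    """
    total = sum(len(frags) for frags in approved.values())
    return len(approved), total

# Synthetic stand-in for the loaded JSON (invented data, for illustration only).
example = {
    "q1": ["fragment A", "fragment B"],
    "q2": ["fragment C"],
}
print(count_fragments(example))  # (2, 3)
```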

We hope that sharing both the questions and the whole test collection (the entire Wikipedia dump) will facilitate running various experiments in a reproducible manner and make the obtained results comparable.

The dataset is published under Creative Commons Attribution ShareAlike 3.0 licence (CC-BY-SA 3.0).

Download: czywieszki.zip (718 MB).

Institute of Informatics, Wrocław University of Technology, 2013

Contributors on the technical side

Łukasz Burdka
Michał Marcińczuk
Dominik Piasecki
Maciej Piasecki
Marcin Ptak
Adam Radziszewski
Paweł Rychlikowski
Tomasz Zięba

Changelog

2.0
- 250 questions were drawn randomly and subjected to manual answer inspection (annotations/results/czywiesz-eva-I-250-approved.json)
- these 250 questions were fed through our QA system; human annotators assessed 200 system answers for each question and these decisions are stored in the JSON file
- additionally, 1347 questions from the development set (disjoint with the 250-question main set) were subjected to manual answer inspection along the same lines, but only the first 10 system answers were checked
- the development (dev) and final evaluation (eva) sets were swapped; this was unavoidable, as we had already started annotating the sets under wrong initial assumptions

1.1

- added sentence ID in czywiesz.csv
- changed column order in czywiesz.csv (desc. -> ReadMe.txt: l. 48)
- updated ReadMe.txt
- additional division into development and final evaluation set (CSV files in source subdir)

1.0

- first version


 

Frequency lists

(Note: page under construction)

This page offers frequency lists extracted from large text corpora. The texts include, among others, the IPI PAN Corpus, the Rzeczpospolita Corpus, Wikipedia (a dump from the beginning of 2010) and a collection of large documents downloaded from the Internet. Together the corpora contain about 1.8 billion tokens. The frequency lists were generated using tools from the SuperMatrix system (Broda and Piasecki 2011).


The frequency list is available in two forms:

  • frequency_list_orth.txt - contains the grammatical class, base form, text form and corpus frequency
  • frequency_list_base.txt - contains the base forms of words and their corpus frequencies
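The base-form list can be read with a few lines of Python. This is only a sketch: the column order (base form first, then frequency) and whitespace delimiter are assumptions to be verified against the actual file.

```python
def parse_base_list(lines):
    """Parse frequency-list lines of the assumed form '<base form> <frequency>'."""
    freqs = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed layout
        base, freq = parts
        freqs[base] = int(freq)
    return freqs

# Invented sample lines, for illustration only.
sample = ["kot 12345", "pies 6789"]
print(parse_base_list(sample))  # {'kot': 12345, 'pies': 6789}
```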

Note: the data are released under a Creative Commons licence (CC BY-NC-SA 3.0).


Attachments:

  • frequency_list_base.7z (6330 KB)
  • frequency_list_orth.7z (20277 KB)

plWordNet in a nutshell

plWordNet is the largest Polish wordnet. A wordnet is a semantic network whose nodes are lexical units and whose links are semantic relations between those units. Such semantic relations include, for instance:

  • hypernymy/hyponymy - a relation which links a word with a more general meaning (kot – 'cat') to a word with a more specific meaning (tygrys – 'tiger') (every tiger belongs to the feline family);
  • meronymy/holonymy - a relation between a part and a whole, e.g. zderzak – 'bumper' – samochód – 'car' (cars have bumpers);
  • antonymy - a relation of semantic opposition, e.g. wejść – 'enter' – wyjść – 'leave', żonaty – 'married' – kawaler – 'bachelor' (a detailed list of lexical relations is given here [link]).

Lexical units which enter the same lexico-semantic relations (but not the same derivational relations) are treated as synonyms and grouped into synsets, that is, synonym sets. The currently available version, plWordNet 1.6, comprises 94523 synsets, 133071 lexical units and almost 150000 lexical relations.
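As a toy illustration of the structure described above, hypernymy links can be modelled as edges pointing from a word to its more general counterpart. The entries below are invented examples, not actual plWordNet data.

```python
# Hypernymy as a word -> more-general-word mapping (invented toy data).
hypernym_of = {
    "tygrys": "kot",   # 'tiger' -> 'cat' (more general)
    "kot": "ssak",     # 'cat' -> 'mammal'
}

def hypernym_chain(word):
    """Walk up the hypernymy hierarchy starting from a word."""
    chain = [word]
    while chain[-1] in hypernym_of:
        chain.append(hypernym_of[chain[-1]])
    return chain

print(hypernym_chain("tygrys"))  # ['tygrys', 'kot', 'ssak']
```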

The first ever wordnet (Princeton WordNet) was built in the 1980s at Princeton University. Since then, hundreds of research teams have started constructing wordnets for other languages, among them our Language Technologies Research Group G.4.19 at Wrocław University of Technology. The first version of plWordNet was made available on the Internet in 2009. Like its predecessor, Princeton WordNet, the current version 1.6 of plWordNet is available under an open licence [link].

Browsing plWordNet

plWordNet is available online as a web service. The available version will be updated as work progresses.

Using plWordNet for scientific and commercial purposes

We are willing to cooperate, both scientifically and commercially. The authorities of Wrocław University of Technology have decided to make plWordNet available free of charge for any applications (including commercial ones), under a licence modelled on that of Princeton WordNet.
To acquire plWordNet source files, please fill in the registration form given here [link].

Web service access 

Within the part of the Clarin project carried out at Wrocław University of Technology, a web service was designed that allows access to plWordNet via a simple API. Details are given on the project homepage.

Application for expanding plWordNet

The application is developed by a team of software designers and is subject to frequent change. Those interested in using the application are asked to contact us.

Distributional semantic similarity lists

(Note: page under construction)

Distributional semantic similarity (Polish: Dystrybucyjne Podobieństwo Semantyczne, DPS; a measure of semantic relatedness) captures the similarity between pairs of words based on an analysis of their co-occurrence in text corpora. The general extraction procedure can be summarised as follows. First, all contexts of the words of interest are analysed for co-occurrence with other words. Based on the co-occurrence frequencies, a matrix M is built, in which the rows correspond to the words for which similarity is computed. The columns express features of the words, which in the simplest formulation are the words appearing in the contexts of the row words. M is a sparse matrix of very large size (tens of thousands of rows, hundreds of thousands of columns). In the next step the values in the matrix are filtered and weighted. This step removes accidental co-occurrences and helps distinguish the significant information in the matrix from the incidental. One weighting that performs well in this task is pointwise mutual information. The rows of the transformed matrix can then be compared using, for instance, the cosine measure.

Co-occurrence can be understood in various ways: from simply noting words within a text window of fixed size, through checking syntactic constraints between words (e.g. agreement between a noun and an adjective), to using syntactic relations obtained from dependency parsers. The lists available on this page use the approach based on morpho-syntactic constraints. A more detailed description of the approach can be found in: (Piasecki, Szpakowicz and Broda 2007), (Broda et al. 2008), (Piasecki, Szpakowicz and Broda 2009) and (Broda and Piasecki 2011).
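The pipeline described above (raw co-occurrence counts, PMI weighting, cosine comparison of the rows) can be sketched in Python on toy data. The counts below are invented for illustration only, and PMI is clipped at zero (positive PMI), one common variant.

```python
import math

# Invented word-feature co-occurrence counts (the sparse matrix M).
counts = {
    ("truskawka", "jeść"): 8, ("truskawka", "czerwony"): 6,
    ("pomidor", "jeść"): 7, ("pomidor", "czerwony"): 9,
    ("samochód", "jechać"): 10,
}
total = sum(counts.values())
row, col = {}, {}  # marginal counts per word and per feature
for (w, f), c in counts.items():
    row[w] = row.get(w, 0) + c
    col[f] = col.get(f, 0) + c

def pmi(w, f):
    """Positive pointwise mutual information for a word-feature pair."""
    c = counts.get((w, f), 0)
    if c == 0:
        return 0.0
    return max(0.0, math.log(c * total / (row[w] * col[f])))

def cosine(w1, w2, feats):
    """Cosine similarity between the PMI-weighted rows of two words."""
    v1 = [pmi(w1, f) for f in feats]
    v2 = [pmi(w2, f) for f in feats]
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

feats = sorted(col)
print(cosine("truskawka", "pomidor", feats))  # a value between 0 and 1
```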


On a distributional semantic similarity list, each described word is followed by the k words most similar to it. Such lists can be generated with the SuperMatrix system. At the bottom of this page two similarity lists can be downloaded: for nouns (kgr4_pmi_cos_filtered_TF100_20best.7z) and for verbs (kgr3_verbsim_lincos_TF100_20best.7z). After unpacking, the lists are in a simple text format. For example:

subst:truskawka
    0.396929       subst:pomidor
    0.374989       subst:winogrono
    0.36221       subst:brzoskwinia
    0.359661       subst:ananas
    0.358338       subst:czereśnia
    0.347417       subst:porzeczka
    0.343161       subst:jabłko
    0.340363       subst:wiśnia
    0.333139       subst:śliwka
    0.321351       subst:filogeneza
    0.314859       subst:malina
    0.313577       subst:seler
    0.308124       subst:papryka
    0.30514       subst:warzywo
    0.302994       subst:melon
    0.301603       subst:figa
    0.301409       subst:kalafior
    0.299205       subst:marchew
    0.298587       subst:kukurydza
    0.297907       subst:pomarańcza


The listing above shows the 20 words most similar to the word truskawka ('strawberry'). The numbers on the left denote similarity: the higher the number, the more similar the word is to truskawka.
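The format can be parsed with a few lines of Python; this is a sketch assuming exactly the layout shown above (an unindented header line naming the target word, followed by indented score-neighbour pairs).

```python
def parse_sim_list(lines):
    """Parse similarity lists: header lines name a word, indented lines
    carry (score, neighbour) pairs for the most recent header."""
    lists = {}
    current = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if not line[0].isspace():  # header, e.g. "subst:truskawka"
            current = line.strip()
            lists[current] = []
        else:
            score, word = line.split()
            lists[current].append((float(score), word))
    return lists

sample = [
    "subst:truskawka",
    "    0.396929       subst:pomidor",
    "    0.374989       subst:winogrono",
]
print(parse_sim_list(sample))
```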


Note: the data are released under a Creative Commons licence (CC BY-NC-SA 3.0). If you use the similarity lists, please cite: (Broda and Piasecki 2011).




Corpus of manually lemmatised Polish noun and adjective phrases

Licence and credits

The dataset is licensed under Creative Commons Attribution 3.0.

The corpus is based on a random subset of syntactically annotated documents taken from the Polish Corpus of Wrocław University of Technology (KPWr) version 1.1 (http://nlp.pwr.wroc.pl/kpwr).

Annotators: Marcin Oleksy and Jan Wieczorek.

Project coordinator: Adam Radziszewski (name dot surname at pwr.wroc.pl).

Some rights reserved. Wrocław University of Technology, 2013.

Assumptions

The corpus consists of those documents from the KPWr corpus that had already been annotated with syntactic chunks. Two chunk types were considered: NP and AdjP. NP chunks cover both actual noun phrases and prepositional phrases (preposition + NP). AdjP chunks are adjective phrases, which are annotated only when they are not part of a larger NP.
For details on the assumed syntactic annotation principles, please consult the paper available at the KPWr website.

AdjP are infrequent, thus the number of instances may be too small to perform reliable experiments.

Phrase lemmatisation is understood as assigning to each phrase instance (of NP or AdjP type) its base form (lemma). A phrase lemma is an instance of a same-type phrase that could appear in a dictionary or as a keyphrase.

(With some exceptions regarding proper names) lemmatisation requires that the syntactic head of the phrase be in the nominative case. Often the number is changed to singular and the case to nominative. Correct lemmatisation often requires changing more word forms than just the head, e.g. the head's adjective modifiers.

In the case of prepositional phrases, lemmatisation requires removal of phrase-initial prepositions, thus prepositional phrases are “lemmatised to real noun phrases”.

Data format

The corpus consists of a number of documents. This package preserves our division into development (dev) and evaluation (eva) data. Both directories contain a few subdirectories corresponding to the original directory structure of KPWr (e.g. a subdirectory named blogi contains documents belonging to the blogs subcorpus of KPWr). The documents are stored in XML files.

Each XML file is stored in the CCL format (specs here) and contains the unchanged original KPWr annotation enhanced with lemmatisation information. This package uses the same file naming scheme as in KPWr 1.1, thus filenames (document ids) may be mapped to the original KPWr documents 1:1. The original annotation that was present in the CCL files is kept intact. Even if we spotted mislabelled chunk boundaries, we did not correct this at this stage. Instead, the annotators were told to come up with a lemma that would correspond to the actual boundaries.
Note: the original KPWr 1.1 package also contains .rel.xml files that describe inter-chunk and inter-NE relations. We did not copy the files here, although if you need the relation-level annotation, just copy the original .rel.xml files — they will still be valid.

Lemmatisation information is stored in token-level properties. Phrase lemmas are assigned to those tokens that are marked as chunk heads. E.g., in the following fragment, the token “otwarciu” is marked as an NP head (<ann chan="chunk_np" head="1">1</ann>) and is assigned an NP lemma (<prop key="chunk_np:lemma">otwarcie WTZ</prop>).

   <tok>
    <orth>W</orth>
    <lex disamb="1"><base>w</base><ctag>prep:loc:nwok</ctag></lex>
    <ann chan="chunk_agp">1</ann>
    <ann chan="chunk_np">1</ann>
    <ann chan="chunk_vp">0</ann>
    <prop key="lem_pattern">p</prop>
   </tok>
   <tok>
    <orth>otwarciu</orth>
    <lex disamb="1"><base>otwarcie</base><ctag>subst:sg:loc:n</ctag></lex>
    <ann chan="chunk_agp" head="1">1</ann>
    <ann chan="chunk_np" head="1">1</ann>
    <ann chan="chunk_vp">0</ann>
    <prop key="chunk_np:lemma">otwarcie WTZ</prop>
    <prop key="lem_pattern">cas=nom</prop>
   </tok>
   <tok>
    <orth>WTZ</orth>
    <lex disamb="1"><base>wtz</base><ctag>subst:sg:nom:n</ctag></lex>
    <ann chan="chunk_agp">1</ann>
    <ann chan="chunk_np">1</ann>
    <ann chan="chunk_vp">0</ann>
    <prop key="lem_pattern">=</prop>
   </tok>
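The lemma properties can be extracted programmatically; the sketch below uses Python's standard ElementTree and runs on a trimmed version of the fragment above (a real CCL file would be loaded with ET.parse instead of the inline string).

```python
import xml.etree.ElementTree as ET

# Trimmed CCL fragment mirroring the example above, wrapped in a root element.
xml = """<sent>
 <tok>
  <orth>otwarciu</orth>
  <lex disamb="1"><base>otwarcie</base><ctag>subst:sg:loc:n</ctag></lex>
  <ann chan="chunk_np" head="1">1</ann>
  <prop key="chunk_np:lemma">otwarcie WTZ</prop>
 </tok>
</sent>"""

root = ET.fromstring(xml)
lemmas = []
for tok in root.iter("tok"):
    for ann in tok.findall("ann"):
        # NP heads carry the phrase lemma in the chunk_np:lemma property.
        if ann.get("chan") == "chunk_np" and ann.get("head") == "1":
            prop = tok.find("prop[@key='chunk_np:lemma']")
            if prop is not None:
                lemmas.append((tok.findtext("orth"), prop.text))
print(lemmas)  # [('otwarciu', 'otwarcie WTZ')]
```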

Besides the lemmatisation information, we decided to keep the automatically induced transformations (from the better-performing configuration, i.e. without the ‘lem’ transformation). The transformations for both NP and AdjP chunks are stored under the lem_pattern key (they should be prefixed with the chunk/channel name, but are not in this version, sorry for that). Note: the transformations induced in the dev part were subjected to manual correction where the induction procedure failed completely (this does not guarantee that they are always valid). No manual intervention took place in the evaluation data.