“Czy wiesz” Question Answering dataset
“Czy wiesz” (pol. “Did you know”) is a set of 4721 questions, each linked to a Wikipedia article that contains the answer.
For 250 of the questions a detailed manual analysis has been performed: each of these questions has been paired with manually checked answer-bearing fragments, and some questions are assigned multiple fragments.
The data set was obtained from the Polish “Did you know” wikiproject and is intended to facilitate the evaluation and development of Polish QA systems.
The dataset includes:
- Pre-processed dump of the entire Polish Wikipedia from 22 Jan 2013 (Czywiesz.tar).
- A list of correct answers obtained from the “Did you know” site (source/questions.txt).
- A list of the questions paired with Wikipedia articles (source/questions_links.txt).
- A list of the questions paired with answer-bearing documents (source/czywiesz.csv).
- A list of questions that have been judged as incorrect/unwanted and hence got rejected (source/removed.txt).
- The main set (annotations/results/czywiesz-eva-I-250-approved.json)
is based on 250 randomly drawn questions. Each of these was fed through our development QA system, which produced a ranked answer list for each question. The first 200 positions were manually checked. The JSON file contains the manually approved answers for each question; an answer is a fragment 1–5 sentences long.
- Logs tracing the process of question acquisition (logs/*).
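As a starting point, the answer-bearing pairs and the approved-answer file can be read with standard tools. The sketch below is purely illustrative: the actual column layout of czywiesz.csv and the exact JSON schema are assumptions here (the authoritative description is in the ReadMe.txt shipped with the dataset); the sample data and field names are hypothetical.

```python
import csv
import io
import json

# Hypothetical sample row; the real column order of czywiesz.csv is
# documented in the dataset's ReadMe.txt. Here we assume four columns:
# question ID, question text, answer-bearing document, sentence ID.
sample_csv = "1;Czy wiesz, kto napisal Pana Tadeusza?;Adam_Mickiewicz;3\n"

rows = list(csv.reader(io.StringIO(sample_csv), delimiter=";"))
for question_id, question, document, sentence_id in rows:
    print(question_id, document, sentence_id)

# The main set is a JSON file of manually approved answers per question;
# the structure below (question ID -> list of answer fragments) is an
# assumption for illustration only.
sample_json = '{"1": ["Adam Mickiewicz napisal Pana Tadeusza."]}'
approved = json.loads(sample_json)
for qid, fragments in approved.items():
    print(qid, len(fragments))
```

A real loader would replace the in-memory samples with `open("source/czywiesz.csv", encoding="utf-8")` and the corresponding JSON path, after checking the column order against ReadMe.txt.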
We hope that sharing both the questions and the whole test collection (the entire Wikipedia) will make it easier to run various experiments in a reproducible manner and to compare the results obtained.
The dataset is published under the Creative Commons Attribution-ShareAlike 3.0 licence (CC BY-SA 3.0).
Download: czywieszki.zip (718 MB).
Institute of Informatics, Wrocław University of Technology, 2013
Contributors on the technical side
- 250 questions were drawn randomly and subjected to manual answer inspection (annotations/results/czywiesz-eva-I-250-approved.json)
- these 250 questions were fed through our QA system; human annotators assessed 200 system answers per question, and these decisions are stored in the JSON file
- additionally, 1347 questions from the development set (disjoint from the 250-question main set) were subjected to manual answer inspection along the same lines, but only the first 10 system answers were checked
- the development (dev) and final evaluation (eva) sets were swapped; this was unavoidable, as we had already started annotating the sets under wrong initial assumptions
- added sentence ID in czywiesz.csv
- changed column order in czywiesz.csv (described in ReadMe.txt, l. 48)
- updated ReadMe.txt
- additional division into development and final evaluation set (CSV files in source subdir)
- first version