Treatment of unknown words
|Status:||Closed||Start date:||14 Dec 2011|
|Assignee:||Adam Radziszewski||% Done:|
Two options here:
1. Find hapaxes, train a model for them, and guess completely new tags instead of igns.
2. Gather all tags from the training data and use them as the analyser for igns.
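A minimal sketch of the data gathering behind both options, assuming the training data is available as a list of (form, tag) pairs. The function names, the lower-casing of forms, and the toy corpus are illustrative assumptions, not the tagger's actual code.

```python
from collections import Counter


def find_hapaxes(tagged_tokens):
    """Option 1: return word forms that occur exactly once in the training data."""
    # Assumption: forms are compared case-insensitively.
    counts = Counter(form.lower() for form, _ in tagged_tokens)
    return {form for form, n in counts.items() if n == 1}


def gather_tags(tagged_tokens):
    """Option 2: return every distinct tag seen in the training data."""
    return sorted({tag for _, tag in tagged_tokens})


# Toy corpus, invented for illustration only.
corpus = [
    ("kot", "subst:sg:nom:m2"),
    ("ma", "fin:sg:ter:imperf"),
    ("kota", "subst:sg:acc:m2"),
    ("kot", "subst:sg:nom:m2"),
]

hapaxes = find_hapaxes(corpus)   # forms a guesser model could be trained on
all_tags = gather_tags(corpus)   # tag list to offer for every ign token
```

Option 1 needs an extra model trained on the hapaxes, while option 2 only needs the tag list, at the cost of a much larger set of candidates per unknown word.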
#2 Updated by Adam Radziszewski almost 12 years ago
- % Done changed from 0 to 60
Guessing seems done. New default config (unktagfreq=3) yields
AVG weak corr lower bound 89.4234% AVG weak corr upper bound 89.7481%
AVG weak corr lower bound 88.5507% AVG weak corr upper bound 88.8754%
Still to do: test with -A, train model on the whole m.a. NKJP and put on site.
#4 Updated by Adam Radziszewski over 11 years ago
- Status changed from New to Closed
Final solution: gathering tags from tokens marked "unknown" (+ign), using this closed list first for unknown words when tagging, then using separate case bases for such tokens.
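The first half of this solution could look roughly like the sketch below. It assumes training tokens come as (form, tag, is_unknown) triples and that the analyser marks unknown words with the single tag "ign". The names collect_unknown_tags, candidate_tags, and min_freq are hypothetical, and min_freq is not claimed to be the same thing as the unktagfreq setting. The separate case bases are not shown.

```python
from collections import Counter


def collect_unknown_tags(training_tokens, min_freq=1):
    """Build the closed tag list from tokens marked as unknown in training data."""
    counts = Counter(tag for _, tag, is_unknown in training_tokens if is_unknown)
    # Hypothetical frequency cut-off: drop tags seen fewer than min_freq times.
    return sorted(tag for tag, n in counts.items() if n >= min_freq)


def candidate_tags(analyses, unknown_tags, ign_tag="ign"):
    """Replace a bare ign analysis with the closed list; keep known analyses."""
    if analyses == [ign_tag]:
        return list(unknown_tags)
    return analyses


# Toy training data, invented for illustration only.
training = [
    ("Xerox", "subst:sg:nom:m3", True),
    ("iPad", "subst:sg:nom:m3", True),
    ("kot", "subst:sg:nom:m2", False),
    ("googlować", "inf:imperf", True),
]

closed_list = collect_unknown_tags(training, min_freq=2)
print(candidate_tags(["ign"], closed_list))
print(candidate_tags(["subst:sg:nom:m2"], closed_list))
```

Restricting the list to tags actually observed on unknown tokens keeps the candidate set small compared with offering the full tagset.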
Significant improvement noticed.