Task #3353
Treatment of unknown words
Status: | Closed | Start date: | 14 Dec 2011 |
---|---|---|---|
Priority: | Normal | Due date: | |
Assignee: | Adam Radziszewski | % Done: | 80% |
Category: | - | | |
Target version: | - | | |
Description
Two options here:
1. Find hapaxes, train a model on them, and guess completely new tags instead of ign.
2. Gather all tags from the training data and use them as the analyser for ign tokens.
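A minimal sketch of the data-gathering step behind both options, assuming the training data is available as a list of `(form, tag)` pairs. The function names and the toy corpus are hypothetical and are not taken from the tagger's code.

```python
from collections import Counter

def find_hapaxes(tagged_tokens):
    """Return the set of case-folded word forms that occur exactly once."""
    counts = Counter(form.lower() for form, _ in tagged_tokens)
    return {form for form, n in counts.items() if n == 1}

def hapax_training_examples(tagged_tokens):
    """Option 1: (form, tag) pairs of hapaxes, to train a tag guesser on."""
    hapaxes = find_hapaxes(tagged_tokens)
    return [(form, tag) for form, tag in tagged_tokens
            if form.lower() in hapaxes]

def all_training_tags(tagged_tokens):
    """Option 2: the closed set of every tag seen in the training data."""
    return {tag for _, tag in tagged_tokens}

# Toy corpus, for illustration only.
corpus = [
    ("Ala", "subst:sg:nom:f"),
    ("ma", "fin:sg:ter:imperf"),
    ("kota", "subst:sg:acc:m2"),
    ("i", "conj"),
    ("ma", "fin:sg:ter:imperf"),
    ("psa", "subst:sg:acc:m2"),
]

hapax_examples = hapax_training_examples(corpus)
closed_tagset = all_training_tags(corpus)
```

Rare words are used here as a stand-in for unknown words: a form seen only once in training is assumed to behave much like a form never seen at all.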
History
#1 Updated by Adam Radziszewski almost 12 years ago
The first option is more universal; it makes sense to use affixes and regexes as features to get a sort of morphological analysis.
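A rough sketch of what affix and regex features could look like. The feature names, the maximum affix length and the chosen patterns are assumptions made for illustration, not the tagger's actual feature set.

```python
import re

# Hypothetical shape patterns; the real configuration may differ.
SHAPE_PATTERNS = {
    "has_digit": re.compile(r"\d"),
    "has_hyphen": re.compile(r"-"),
}

def guesser_features(form, max_affix=3):
    """Build a feature dict from prefixes, suffixes and surface shape."""
    feats = {}
    lowered = form.lower()
    for n in range(1, max_affix + 1):
        # Only emit an affix when it is shorter than the whole form.
        if len(lowered) > n:
            feats[f"prefix{n}"] = lowered[:n]
            feats[f"suffix{n}"] = lowered[-n:]
    for name, pattern in SHAPE_PATTERNS.items():
        feats[name] = bool(pattern.search(form))
    feats["init_upper"] = form[:1].isupper()
    return feats

features = guesser_features("Kowalskiego")
```

Suffixes carry most of the inflectional information in Polish, so a suffix such as `ego` already narrows down case and gender considerably.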
#2 Updated by Adam Radziszewski almost 12 years ago
- % Done changed from 0 to 60
Guessing seems done. New default config (unktagfreq=3) yields
AVG weak corr lower bound 89.4234% AVG weak corr upper bound 89.7481%
Previously:
AVG weak corr lower bound 88.5507% AVG weak corr upper bound 88.8754%
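For reference, the gain can be computed directly from the four figures above; both bounds move by the same amount, roughly 0.87 percentage points. The variable names are illustrative only.

```python
# Figures copied from the results above (weak correctness, in percent).
old = {"lower": 88.5507, "upper": 88.8754}
new = {"lower": 89.4234, "upper": 89.7481}

# Absolute gain in percentage points, rounded to the reported precision.
gain = {bound: round(new[bound] - old[bound], 4) for bound in old}
print(gain)
```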
Still to do: test with -A, train a model on the whole m.a. NKJP and put it on the site.
#3 Updated by Adam Radziszewski almost 12 years ago
- % Done changed from 60 to 80
-A works more-or-less ok (outputs loads of possible tags, but is correct), training in progress
#4 Updated by Adam Radziszewski over 11 years ago
- Status changed from New to Closed
Final solution: gather tags from tokens marked "unknown" (+ign), use this closed list first for unknown words when tagging, then use separate case bases for such tokens.
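The final approach might be sketched as follows, assuming training tokens come as `(form, tag, is_ign)` triples. The `min_freq` parameter is only a guess at what a threshold such as `unktagfreq=3` could mean, and the function names, the lexicon dict and the case-base labels are hypothetical.

```python
from collections import Counter

def collect_unknown_tags(tokens, min_freq=3):
    """Closed tag list: tags borne by ign tokens at least min_freq times."""
    counts = Counter(tag for _, tag, is_ign in tokens if is_ign)
    return {tag for tag, n in counts.items() if n >= min_freq}

def candidate_tags(form, lexicon, unknown_tags):
    """Return (candidate tags, name of the case base to consult)."""
    if form in lexicon:
        return lexicon[form], "known"
    # Unknown word: fall back to the closed list and a separate case base.
    return unknown_tags, "unknown"

# Toy training data, for illustration only.
training = (
    [("xyz%d" % i, "subst", True) for i in range(3)]
    + [("qwe", "adj", True)]
    + [("ma", "fin", False)] * 5
)
unknown_tags = collect_unknown_tags(training, min_freq=3)
lexicon = {"ma": {"fin"}}

known_result = candidate_tags("ma", lexicon, unknown_tags)
unknown_result = candidate_tags("blorp", lexicon, unknown_tags)
```

Keeping a separate case base for unknown tokens means they are compared only against other unknown tokens, whose contexts and features tend to differ from those of words the analyser recognises.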
Significant improvement noticed.