Task #3353

Treatment of unknown words

Added by Adam Radziszewski almost 12 years ago. Updated over 11 years ago.

Status: Closed
Start date: 14 Dec 2011
Priority: Normal
Due date:
Assignee: Adam Radziszewski
% Done: 80%
Category: -
Target version: -

Description

Two options here:
1. Find hapaxes, train a model for them, and guess completely new tags in place of ign.
2. Gather all tags from the training data and use that set as the analyser output for ign tokens.
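Both options start from simple statistics over the training data. The following is a minimal sketch, assuming the training data is available as (form, tag) pairs; the function names and the toy tags are illustrative only and are not taken from the tagger's code.

```python
from collections import Counter

def find_hapaxes(tagged_tokens):
    """Option 1 input: word forms that occur exactly once in training."""
    counts = Counter(form.lower() for form, _tag in tagged_tokens)
    return {form for form, n in counts.items() if n == 1}

def collect_tagset(tagged_tokens):
    """Option 2: every tag seen in training becomes a candidate for ign tokens."""
    return sorted({tag for _form, tag in tagged_tokens})

# Toy training data (hypothetical, for illustration only).
training = [
    ("kot", "subst:sg:nom:m2"),
    ("kot", "subst:sg:nom:m2"),
    ("biega", "fin:sg:ter:imperf"),
    ("szybko", "adv:pos"),
]

hapaxes = find_hapaxes(training)
tagset = collect_tagset(training)
```

Hapaxes are a convenient stand-in for unknown words because, like true unknowns, they are rare and mostly open-class.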

History

#1 Updated by Adam Radziszewski almost 12 years ago

The first option is more universal. It makes sense to use affixes and regexes as features to get a sort of morphological analysis.
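A rough sketch of what such affix and regex features could look like is given below. The feature names, the maximum affix length and the patterns are assumptions made for illustration; they are not the tagger's actual feature set.

```python
import re

# Hypothetical word-shape patterns; the real configuration may differ.
SHAPE_PATTERNS = {
    "has_digit": re.compile(r"\d"),
    "has_hyphen": re.compile(r"-"),
    "init_upper": re.compile(r"^[A-ZĄĆĘŁŃÓŚŹŻ]"),
}

def guesser_features(form, max_affix=3):
    """Build prefix/suffix and regex-based shape features for one word form."""
    feats = {}
    lower = form.lower()
    for n in range(1, max_affix + 1):
        # Skip affixes that would cover the whole word.
        if len(lower) > n:
            feats[f"prefix{n}"] = lower[:n]
            feats[f"suffix{n}"] = lower[-n:]
    for name, pattern in SHAPE_PATTERNS.items():
        feats[name] = bool(pattern.search(form))
    return feats

features = guesser_features("Kowalskiego")
```

Suffixes carry most of the inflectional information in Polish, so a classifier trained on such features approximates a morphological guesser.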

#2 Updated by Adam Radziszewski almost 12 years ago

  • % Done changed from 0 to 60

Guessing seems to be done. The new default config (unktagfreq=3) yields:

AVG weak corr lower bound       89.4234%
AVG weak corr upper bound       89.7481%

Previously:

AVG weak corr lower bound       88.5507%
AVG weak corr upper bound       88.8754%
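For reference, the difference between the two runs can be computed directly from the figures above. The relative error reduction is an extra derived measure, not something reported in the ticket.

```python
# Figures copied from the two runs above (percent).
new_lower, new_upper = 89.4234, 89.7481
old_lower, old_upper = 88.5507, 88.8754

# Absolute gain in percentage points for each bound.
gain_lower = new_lower - old_lower
gain_upper = new_upper - old_upper

# Relative error reduction on the lower bound, as a fraction of the old error.
error_reduction = gain_lower / (100.0 - old_lower)

print(f"gain (lower bound): {gain_lower:.4f} pp")
print(f"gain (upper bound): {gain_upper:.4f} pp")
print(f"relative error reduction: {error_reduction:.2%}")
```

Both bounds move by the same amount, about 0.87 percentage points.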

Still to do: test with -A, train a model on the whole m.a. NKJP and put it on the site.

#3 Updated by Adam Radziszewski almost 12 years ago

  • % Done changed from 60 to 80

-A works more or less OK (it outputs loads of possible tags, but the result is correct); training in progress.

#4 Updated by Adam Radziszewski over 11 years ago

  • Status changed from New to Closed

Final solution: gather tags from tokens marked "unknown" (+ign), use this closed list first for unknown words when tagging, then use separate case bases for such tokens.
Significant improvement noticed.
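The first half of that solution could be sketched roughly as follows. This assumes training tokens come as (form, gold tag, is_ign) triples; the function names and the `min_freq` threshold are hypothetical, and the separate case bases for unknown tokens are not modelled here.

```python
from collections import Counter

def collect_unknown_tags(training_tokens, min_freq=1):
    """Build the closed list of tags seen on tokens marked ign in training."""
    counts = Counter(tag for _form, tag, is_ign in training_tokens if is_ign)
    return sorted(tag for tag, n in counts.items() if n >= min_freq)

def candidate_tags(analyser_tags, unknown_tags):
    """Known words keep the analyser's tags; unknown words get the closed list."""
    return list(analyser_tags) if analyser_tags else list(unknown_tags)

# Toy data (hypothetical): three tokens were unknown to the analyser.
training = [
    ("kot", "subst", False),
    ("xyz", "subst", True),
    ("qwe", "adj", True),
    ("abc", "subst", True),
]

unknown_tags = collect_unknown_tags(training)
frequent_unknown_tags = collect_unknown_tags(training, min_freq=2)
```

A disambiguation step would then choose among `candidate_tags(...)`, using a model trained only on unknown tokens.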
