|
The text was then tagged with four taggers: fnTBL; MXPOST (Ratnaparkhi, 1996); TriTagger, which is part of the IceNLP software and is a re-implementation of the Hidden Markov Model (HMM) tagger TnT (Brants, 2000); and IceTagger (Hrafn Loftsson, 2008), a rule-based tagger that is also part of the IceNLP software. The taggers fnTBL, MXPOST and TriTagger are data-driven taggers and were trained on the texts of the Icelandic Frequency Dictionary (Íslensk orðtíðnibók, IFD).
|
|
The corpus was tagged automatically. The software used, CorpusTagger, was developed for the work on the MIM-GOLD corpus (Hrafn Loftsson et al., 2010). The text was first segmented into sentences and tokenized with the IceNLP software. It was then tagged with four taggers: fnTBL; MXPOST (Ratnaparkhi, 1996); TriTagger, which is part of the IceNLP software and is a re-implementation of the well-known Hidden Markov Model (HMM) tagger TnT (Brants, 2000); and IceTagger (Hrafn Loftsson, 2008), a rule-based tagger that is also part of the IceNLP software. The taggers fnTBL, MXPOST and TriTagger are all data-driven taggers that were trained on the IFD corpus. The IFD corpus was also used in the development of the rule-based tagger IceTagger. Finally, the software CombiTagger was used to vote between the tags proposed by the four taggers. The MÍM corpus is thus tagged with the tagset of the IFD corpus, with the exception that proper names are not subclassified into personal names, place names and other proper names. The text was lemmatized with the tool Lemmald (Anton Ingason et al., 2008), which is also part of the IceNLP software. The automatic morphosyntactic tagging accuracy has been estimated at 88.1-95.1%, depending on text type (Hrafn Loftsson et al., 2010), and the lemmatization accuracy is estimated at approximately 90%.
|
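The voting step described above can be illustrated with a minimal sketch of simple majority voting over the outputs of several taggers. This is not CombiTagger's actual implementation, and the combination scheme it applied to this corpus may differ. The function names `vote` and `combine`, the tie-breaking rule (the tag proposed earliest in the list wins) and the placeholder tag strings are all assumptions made for illustration.

```python
from collections import Counter


def vote(tags):
    """Return the most frequent tag among the proposals for one token.

    Assumption: on a tie, the tag that appears earliest in `tags` wins,
    i.e. the taggers are listed in order of preference.
    """
    counts = Counter(tags)
    best = max(counts.values())
    for tag in tags:
        if counts[tag] == best:
            return tag


def combine(tagger_outputs):
    """Combine several tag sequences (one per tagger) token by token.

    Every sequence must cover the same tokens in the same order.
    """
    return [vote(list(token_tags)) for token_tags in zip(*tagger_outputs)]


# Hypothetical outputs of four taggers for a two-token sentence
# (placeholder tag strings, not real tagger output).
outputs = [
    ["nken", "sfg3en"],
    ["nkeo", "sfg3en"],
    ["nken", "sfg3en"],
    ["nken", "sng"],
]
combined = combine(outputs)
```

With four taggers a two-against-two tie is possible, so any voting scheme of this kind needs an explicit tie-breaking rule. Preferring the first-listed tagger is only one option.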