[Corpora-List] R: tagset for latin in tree-tagger

Passarotti Marco Carlo marco.passarotti at unicatt.it
Tue Mar 19 12:54:00 UTC 2013


Hi Eva,

the following paper reports the results of an experiment on PoS-tagging Latin with TreeTagger:

Bamman, D. & Crane, G. (2008). Building a Dynamic Lexicon from a Digital Library. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2008).

The authors use the tagset of the Perseus Digital Library.
The training set features Classical Latin texts.

However, this is not the training set that was used to build the Latin parameter file available on the TreeTagger website.
No tagset documentation for Latin is available on the TreeTagger homepage. Judging from the parameter file, it seems to use the tagset of William Whitaker's Words, but I am not sure.
The Latin TreeTagger was trained on resources (treebanks) that share the same syntactic annotation style but use different morphological tagsets. Moreover, the language of the three Latin resources used to train the tagger differs considerably (Classical, Late and Medieval Latin; prose and poetry; different authors).
In our experience, genre, author and era strongly affect the performance of PoS taggers (at least for ancient languages). Thus, it may be better to train a tool on less data that is more homogeneous.
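
Since the tagset is undocumented, one practical way to see which tags the parameter file actually produces is to tag a sample and collect the tag inventory from the output. The following is only a minimal sketch: it assumes the usual tab-separated TreeTagger output (token, tag, optional lemma), and the function name and the sample tags (TAG_A, TAG_B) are placeholders, not real Latin tags.

```python
from collections import Counter


def tag_inventory(lines):
    """Count the tags found in tab-separated tagger output.

    Assumed line format: token<TAB>tag[<TAB>lemma].
    Lines with fewer than two fields (blank lines, markup passed
    through unchanged) are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        counts[fields[1]] += 1
    return counts


if __name__ == "__main__":
    # Placeholder sample; TAG_A and TAG_B are NOT real tags.
    sample = [
        "Gallia\tTAG_A\tGallia\n",
        "est\tTAG_B\tsum\n",
        "omnis\tTAG_A\tomnis\n",
        "\n",
    ]
    for tag, n in tag_inventory(sample).most_common():
        print(tag, n)
```

Comparing the resulting list of tags with the Whitaker's Words part-of-speech codes would show whether the guess about the tagset holds.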

It depends on which kind of Latin you want to tag.
If you are interested in tagging Medieval Latin, I can provide you with the Index Thomisticus Treebank, and you can train the HunPos tagger yourself (it works very well with our data).
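
For training HunPos on a treebank, the data first has to be reduced to word/tag pairs. The sketch below assumes the treebank is available in a CoNLL-X-style tab-separated format (word form in the second column, PoS tag in the fifth) and writes the layout HunPos expects for training, to the best of my knowledge: one "word<TAB>tag" pair per line, with an empty line after each sentence. The function name, the column defaults and the sample tags are assumptions for illustration and should be checked against the actual files.

```python
def conll_to_hunpos(conll_lines, form_col=1, tag_col=4):
    """Convert CoNLL-style lines to 'word<TAB>tag' training lines.

    form_col and tag_col are zero-based column indices (assumed
    defaults: FORM and POSTAG of CoNLL-X). Sentences are separated
    by blank lines in the input and by empty strings in the output.
    """
    out = []
    for line in conll_lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Sentence boundary: emit one empty line, never two in a row.
            if out and out[-1] != "":
                out.append("")
            continue
        if line.startswith("#"):
            continue  # skip comment lines, if the format has any
        cols = line.split("\t")
        out.append(f"{cols[form_col]}\t{cols[tag_col]}")
    # Make sure the last sentence is terminated as well.
    if out and out[-1] != "":
        out.append("")
    return out


if __name__ == "__main__":
    # Placeholder sample; TAG1..TAG3 are NOT real tags.
    sample = [
        "1\tin\tin\tX\tTAG1\t_\t0\tROOT\n",
        "2\tprincipio\tprincipium\tX\tTAG2\t_\t1\tDEP\n",
        "\n",
        "1\terat\tsum\tX\tTAG3\t_\t0\tROOT\n",
    ]
    print("\n".join(conll_to_hunpos(sample)))
```

The resulting file can then be passed to the HunPos trainer (as far as I know, `hunpos-train` takes the model file name as its argument and reads the training data from standard input).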

Hope it helps.

Best,

Marco

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On behalf of BOFÍAS ALBERCH, EVA
Sent: Tuesday, 19 March 2013 13:06
To: corpora at uib.no
Subject: [Corpora-List] tagset for latin in tree-tagger

Hi,
I am using the Tree-Tagger to tag a Latin corpus. I haven't been able to find the tagset. Does anyone have it or know where to find documentation on the tags it uses for Latin?

Thanks
Eva Bofias