[Corpora-List] Looking for Free Manually-verified Morphologically Segmented and Tagged Corpora

Martin Mueller martin.mueller at mac.com
Sat Oct 15 14:46:45 UTC 2011


If you're looking for morphologically rich data, how about ancient Greek?
You can have my TEI-encoded and morphologically annotated (and
disambiguated) corpus of Early Greek epic. The Perseus corpus contains a
lot of ancient Greek with partial morphological analysis, and a lot of
work is underway to disambiguate ambiguous surface forms.

MM

On 10/15/11 8:22 AM, "Kais Dukes" <sckd at leeds.ac.uk> wrote:

>Hi,
>
>I'm doing some work on evaluating algorithms for segmenting individual
>words in a text corpus, and then tagging each word-segment with a
>part-of-speech tag and multiple morphological features (e.g. lemma,
>person, gender, number, etc.). I'm looking for tagged corpora for
>morphologically rich languages. Right now, the algorithms I'm looking to
>develop/evaluate would be for Arabic, but as part of the research, it
>would be great to see how such algorithms perform on other
>morphologically rich languages (e.g. there are many European languages
>which are considered to be morphologically rich).
>
>For training and testing of statistical algorithms, I'm looking for
>*free* corpora available for download and offline analysis, that have
>been segmented into morpheme groups, and have had each segment tagged.
>Any pointers to resources of this nature would be much appreciated.
>Please note that I'm not after lexicons, dictionaries, analyzers or
>untagged text. I'm looking for annotated textual data of sentences with
>morphological segmentation and tagging.
>
>After a quick search, I could only find the following resource (which
>I myself have been involved in):
>
>The Quranic Arabic Corpus
>(http://corpus.quran.com/documentation/morphologicalfeatures.jsp) -
>contains 77,430 words of Quranic Arabic, each divided into morphological
>segments and tagged with multiple features. Freely available.
>
>Surely there must be more for other languages - or are all such
>resources closed / non-free only?
>
>Any suggestions?
>
>It would be great if there were a few links to such downloadable
>resources that could be used to train and evaluate statistical
>morphological segmentation and tagging algorithms. Suggestions for any
>languages would be much appreciated.
>
>Kind Regards,
>
>Kais Dukes (sckd at leeds.ac.uk)
>Institute for Artificial Intelligence
>University of Leeds
>United Kingdom
>http://www.kaisdukes.com
>
>_______________________________________________
>UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>Corpora mailing list
>Corpora at uib.no
>http://mailman.uib.no/listinfo/corpora
