[Corpora-List] Looking for Free Manually-verified Morphologically Segmented and Tagged Corpora

DJamé Seddah djame.seddah at free.fr
Mon Oct 17 12:16:34 UTC 2011


Hi Kais,
The Modern Hebrew Treebank is freely available (http://www.mila.cs.technion.ac.il/mila/eng/resources_treebank.html), as is the Sejong Korean treebank
(upon request, I think), and I believe the Turkish treebank is also freely available. All of these contain rich morphological annotation at the morpheme level (writing this, I have a doubt
about the Hebrew treebank, but I'm sure someone will correct me if I'm wrong).


Best,
Djamé


 
On 15 Oct 2011, at 15:22, Kais Dukes wrote:

> Hi,
> 
> I'm doing some work on evaluating algorithms for segmenting individual words in a text corpus, and then tagging each word-segment with a part-of-speech tag and multiple morphological features (e.g. lemma, person, gender, number, etc). I’m looking for tagged corpora for morphologically rich languages. Right now, the algorithms I'm looking to develop/evaluate would be for Arabic, but as part of the research, it would be great to see how such algorithms perform on other morphologically rich languages (e.g. there are many European languages which are considered to be morphologically rich).
> 
> For training and testing of statistical algorithms, I’m looking for *free* corpora available for download and offline analysis, that have been segmented into morpheme groups, and have had each segment tagged. Any pointers to resources of this nature would be very appreciated. Please note that I’m not after lexicons, dictionaries, analyzers or untagged text. I’m looking for annotated textual data of sentences with morphological segmentation and tagging.
> 
> After a quick search, I could only find the following resource (which I myself have been involved in):
> 
> The Quranic Arabic Corpus (http://corpus.quran.com/documentation/morphologicalfeatures.jsp) – contains 77,430 words of Quranic Arabic, each divided into morphological segments and tagged with multiple features. Freely available.
> 
> Surely there must be more for other languages – or are all such resources closed / non-free only?
> 
> Any suggestions?
> 
> It would be great if there were a few links to such downloadable resources that could be used to train and evaluate statistical morphological segmentation and tagging algorithms. Suggestions for any languages would be very appreciated.
> 
> Kind Regards,
> 
> Kais Dukes (sckd at leeds.ac.uk)
> Institute for Artificial Intelligence
> University of Leeds
> United Kingdom
> http://www.kaisdukes.com
> 
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

