[Corpora-List] Looking for Free Manually-verified Morphologically Segmented and Tagged Corpora

Kiril Simov kivs at bultreebank.org
Sat Oct 15 16:35:24 UTC 2011


Dear Kais,

Our morphologically annotated corpus (manually annotated by two or more 
annotators) is freely available. You only have to fill in the user agreement 
for it.
The link is:

http://www.bultreebank.org/btbmorf/

The current size of the corpus that we distribute is nearly 600 000 words.

With best regards,

Kiril

----- Original Message ----- 
From: "Kais Dukes" <sckd at leeds.ac.uk>
To: <corpora at uib.no>
Sent: Saturday, October 15, 2011 4:22 PM
Subject: [Corpora-List] Looking for Free Manually-verified Morphologically 
Segmented and Tagged Corpora


Hi,

I'm doing some work on evaluating algorithms for segmenting individual words 
in a text corpus, and then tagging each word-segment with a part-of-speech 
tag and multiple morphological features (e.g. lemma, person, gender, number, 
etc). I’m looking for tagged corpora for morphologically rich languages. 
Right now, the algorithms I'm looking to develop/evaluate would be for 
Arabic, but as part of the research, it would be great to see how such 
algorithms perform on other morphologically rich languages (e.g. there are 
many European languages which are considered to be morphologically rich).

For training and testing of statistical algorithms, I’m looking for *free* 
corpora available for download and offline analysis, that have been 
segmented into morpheme groups, and have had each segment tagged. Any 
pointers to resources of this nature would be much appreciated. Please note 
that I’m not after lexicons, dictionaries, analyzers or untagged text. I’m 
looking for annotated textual data of sentences with morphological 
segmentation and tagging.
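
To illustrate the kind of annotated data meant here, a minimal sketch in 
Python follows. The class names, field names, tag labels and the helper 
function are illustrative assumptions only; they do not reflect the format 
of any particular corpus or any specific evaluation method.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    """One morphological segment of a word (hypothetical structure)."""
    form: str                       # surface form of the segment
    pos: str                        # part-of-speech tag (labels are illustrative)
    features: Dict[str, str] = field(default_factory=dict)  # e.g. lemma, person


@dataclass
class Word:
    """A word token divided into tagged segments."""
    segments: List[Segment]

    def surface(self) -> str:
        # Rebuild the word by concatenating its segment forms.
        return "".join(s.form for s in self.segments)

    def key(self) -> List[tuple]:
        # Segmentation plus POS tags, ignoring the extra features.
        return [(s.form, s.pos) for s in self.segments]


def word_accuracy(gold: List[Word], predicted: List[Word]) -> float:
    """Fraction of words whose segmentation and POS tags match gold exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of words")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g.key() == p.key())
    return correct / len(gold)
```

A corpus of the kind requested would supply the gold Word objects, sentence 
by sentence, and the output of a segmentation and tagging algorithm could 
then be scored against them with a function like word_accuracy.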

After a quick search, I could only find the following resource (which I 
myself have been involved in):

The Quranic Arabic Corpus 
(http://corpus.quran.com/documentation/morphologicalfeatures.jsp) – contains 
77,430 words of Quranic Arabic, each divided into morphological segments and 
tagged with multiple features. Freely available.

Surely there must be more for other languages – or are all such resources 
closed / non-free only?

Any suggestions?

It would be great if there were a few links to such downloadable resources 
that could be used to train and evaluate statistical morphological 
segmentation and tagging algorithms. Suggestions for any languages would be 
much appreciated.

Kind Regards,

Kais Dukes (sckd at leeds.ac.uk)
Institute for Artificial Intelligence
University of Leeds
United Kingdom
http://www.kaisdukes.com

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

