[Corpora-List] Looking for Free Manually-verified Morphologically Segmented and Tagged Corpora
Kiril Simov
kivs at bultreebank.org
Sat Oct 15 16:35:24 UTC 2011
Dear Kais,
Our morphologically annotated corpus (manually annotated by two or more
annotators) is freely available. You only need to fill in the user agreement
for it.
The link is:
http://www.bultreebank.org/btbmorf/
The current size of the corpus that we distribute is nearly 600 000 words.
With best regards,
Kiril
----- Original Message -----
From: "Kais Dukes" <sckd at leeds.ac.uk>
To: <corpora at uib.no>
Sent: Saturday, October 15, 2011 4:22 PM
Subject: [Corpora-List] Looking for Free Manually-verified Morphologically
Segmented and Tagged Corpora
Hi,
I'm doing some work on evaluating algorithms for segmenting individual words
in a text corpus, and then tagging each word-segment with a part-of-speech
tag and multiple morphological features (e.g. lemma, person, gender, number,
etc). I’m looking for tagged corpora for morphologically rich languages.
Right now, the algorithms I'm looking to develop/evaluate would be for
Arabic, but as part of the research, it would be great to see how such
algorithms perform on other morphologically rich languages (e.g. there are
many European languages which are considered to be morphologically rich).
For training and testing of statistical algorithms, I’m looking for *free*
corpora available for download and offline analysis, that have been
segmented into morpheme groups, and have had each segment tagged. Any
pointers to resources of this nature would be very appreciated. Please note
that I’m not after lexicons, dictionaries, analyzers or untagged text. I’m
looking for annotated textual data of sentences with morphological
segmentation and tagging.
After a quick search, I could only find the following resource (which I
myself have been involved in):
The Quranic Arabic Corpus
(http://corpus.quran.com/documentation/morphologicalfeatures.jsp) – contains
77,430 words of Quranic Arabic, each divided into morphological segments and
tagged with multiple features. Freely available.
Surely there must be more for other languages – or are all such resources closed
/ non-free only?
Any suggestions?
It would be great if there were a few links to such downloadable resources
that could be used to train and evaluate statistical morphological
segmentation and tagging algorithms. Suggestions for any languages would be
very appreciated.
Kind Regards,
Kais Dukes (sckd at leeds.ac.uk)
Institute for Artificial Intelligence
University of Leeds
United Kingdom
http://www.kaisdukes.com
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora