[Corpora-List] a learner corpus of Czech on line
Alexandr Rosen
alexandr.rosen at gmail.com
Tue May 27 14:24:16 UTC 2014
Dear Colleagues,
We are happy to announce the release of the CzeSL-SGT corpus ("Czech as a Second Language with Spelling, Grammar and Tags"). The 1 mil. corpus includes 8,617 short essays, written by nearly 1,965 foreign students of Czech with 54 different first languages.
Most texts are equipped with metadata about the author and the text (30 items). Word forms are tagged with word class, morphological categories, and lemmas. Some forms are corrected by an automatic proofreader, and the resulting texts are tagged again. The original and corrected forms are then compared, and error labels are assigned. All annotation is done automatically.
The corpus is available for on-line searching using a web interface (https://kontext.korpus.cz/run.cgi/first?corpname=czesl-sgt) and for download as the entire data set (http://hdl.handle.net/11858/00-097C-0000-0023-95B1-E). See http://utkl.ff.cuni.cz/learncorp/ for more details and links.
Please let us know about any issues. We'll be happy to answer questions and grateful for any comments.
On behalf of the team
Alexandr Rosen, Charles University, Prague