[Corpora-List] a learner corpus of Czech on line

Alexandr Rosen alexandr.rosen at gmail.com
Tue May 27 14:24:16 UTC 2014


Dear Colleagues,

We are happy to announce the release of the CzeSL-SGT corpus ("Czech as a Second Language with Spelling, Grammar and Tags"). The corpus of about 1 million words includes 8,617 short essays written by 1,965 foreign students of Czech with 54 different first languages.

Most texts are equipped with metadata about the author and the text (30 items). Word forms are tagged with word class, morphological categories and lemmas. Some forms are corrected by an automatic proofreader, and the resulting texts are tagged again. Original and corrected forms are then compared and error labels are assigned. All annotation is done automatically.
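
To illustrate the last step (comparing original and corrected forms and assigning an error label), here is a minimal sketch in Python. It is not the tool used for CzeSL-SGT: the Token class, the function names and the labels "Case", "Diacritics" and "Other" are hypothetical and chosen only for this example; the real error labels in the corpus may differ.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """One word form as written by the learner, with its automatic correction."""
    original: str
    corrected: str


def strip_diacritics(form: str) -> str:
    """Remove combining marks, e.g. 'čtu' becomes 'ctu'."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def error_label(token: Token) -> Optional[str]:
    """Return a hypothetical error label, or None if the form was not changed."""
    if token.original == token.corrected:
        return None
    if token.original.lower() == token.corrected.lower():
        return "Case"          # forms differ only in capitalization
    if strip_diacritics(token.original) == strip_diacritics(token.corrected):
        return "Diacritics"    # forms differ only in diacritic marks
    return "Other"             # any other difference


if __name__ == "__main__":
    for tok in [Token("praha", "Praha"), Token("ctu", "čtu"), Token("sem", "jsem")]:
        print(tok.original, tok.corrected, error_label(tok))
```

A real pipeline would also compare the tags and lemmas of both versions; this sketch looks at the word forms only.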

The corpus is available for on-line searching using a web interface (https://kontext.korpus.cz/run.cgi/first?corpname=czesl-sgt) and for download as the entire data set (http://hdl.handle.net/11858/00-097C-0000-0023-95B1-E). See http://utkl.ff.cuni.cz/learncorp/ for more details and links.

Please let us know about any issues. We'll be happy to answer questions and grateful for any comments.

On behalf of the team

Alexandr Rosen, Charles University, Prague