Corpora: Morphologically Analyzed and Disambiguated Turkish News Text Available

Kemal Oflazer ko at cs.bilkent.edu.tr
Thu Apr 27 08:25:34 UTC 2000


Dear All,

We have made morphologically analyzed and disambiguated Turkish news text
available for download.  The disambiguation was performed with a statistical
disambiguator, and no manual corrections have been attempted.

A morphological parse is represented as a sequence of features, with
derivations marked by the symbol ^DB.  Morphological analysis was performed
by the Turkish analyzer developed using the XRCE Finite State Tools.  Unknown
words were analyzed with an unknown-word processor, and the resulting
candidate parses for them were also disambiguated.
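
As a rough illustration of this notation (a sketch only, not part of the
distributed tools), an analysis string could be split at each ^DB marker into
a root and one feature group per derivation.  The function name
split_analysis is hypothetical, and the sketch assumes that "+" never occurs
inside a root or feature name.

```python
def split_analysis(analysis):
    """Split one analysis string into its root and ^DB-separated feature groups."""
    # Each ^DB marks a derivation; the first segment begins with the root.
    segments = analysis.split("^DB")
    root, *first_features = segments[0].split("+")
    groups = [first_features]
    for segment in segments[1:]:
        # Later segments begin with '+', so drop the empty leading field.
        groups.append([feature for feature in segment.split("+") if feature])
    return root, groups


# Analysis string taken from the sample sentence in this message.
root, groups = split_analysis("ol+Verb+Pos^DB+Noun+Inf+A3sg+P3sg+Nom")
print(root)    # ol
print(groups)  # [['Verb', 'Pos'], ['Noun', 'Inf', 'A3sg', 'P3sg', 'Nom']]
```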

A typical sentence is tagged as follows.  The first token on each line is
the word, and the rest of the line is its disambiguated morphological
analysis.
  
<S> <S>+BSTag
Eğitim eğitim+Noun+A3sg+Pnon+Nom
hizmetlerinin hizmet+Noun+A3pl+P3sg+Gen
ülkenin ülke+Noun+A3sg+Pnon+Gen
her her+Det
kişisine kişi+Noun+A3sg+P3sg+Dat
ve ve+Conj
köşesine köşe+Noun+A3sg+P3sg+Dat
ulaştırılmış ulaş+Verb^DB+Verb+Caus^DB+Verb+Pass+Pos+Narr+A3sg
olması ol+Verb+Pos^DB+Noun+Inf+A3sg+P3sg+Nom
bunlardan bu+Pron+DemonsP+A3pl+Pnon+Abl
birisidir biri+Pron+A3sg+P3sg+Nom^DB+Verb+Zero+Pres+Cop+A3sg
</S> </S>+ESTag
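
A file in this layout could be read back into sentences roughly as follows.
This is a minimal sketch under assumptions not confirmed here: every line
holds exactly two whitespace-separated fields, and sentences are always
bracketed by the <S> and </S> lines shown above.  The name read_sentences is
hypothetical, and the sample is an abbreviated version of the sentence above.

```python
def read_sentences(lines):
    """Yield each sentence as a list of (word, analysis) pairs."""
    sentence = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        word, analysis = parts
        if word == "<S>":
            sentence = []          # start of a new sentence
        elif word == "</S>":
            yield sentence         # end of sentence: hand it to the caller
            sentence = []
        else:
            sentence.append((word, analysis))


sample = """<S> <S>+BSTag
her her+Det
ve ve+Conj
</S> </S>+ESTag
""".splitlines()

for sent in read_sentences(sample):
    print(sent)  # [('her', 'her+Det'), ('ve', 've+Conj')]
```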

  

  

CAVEAT: On small test sets we have seen an accuracy of 94% (over 95% if one
ignores some semantic markers).  We expect similar accuracy on this corpus,
but we have not measured it.  Before disambiguation, the text had about 2
morphological parses per token.  If you notice any errors, please let us
know and we will update the copies on the server.

The Turkish text is encoded in ISO Latin-5 (ISO 8859-9).  The text, about 1M
words, can be retrieved either as a single file or as a batch of shorter
files.  For an explanation of the morphological symbols used and for
download details, see

http://www/nlp.cs.bilkent.edu.tr/Center/Corpus/
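
Since the files are in ISO Latin-5, they need to be decoded with that
encoding to see the Turkish letters correctly.  A minimal sketch in Python,
whose standard codec for this encoding is named "iso8859_9"; the helper name
decode_latin5 is hypothetical, and the bytes below are typed in by hand
rather than read from the corpus.

```python
def decode_latin5(raw_bytes):
    """Decode ISO Latin-5 (ISO 8859-9) bytes into a Python string."""
    return raw_bytes.decode("iso8859_9")


# The word "Eğitim" as Latin-5 bytes; 0xF0 is the letter ğ in this encoding.
raw = bytes([0x45, 0xF0, 0x69, 0x74, 0x69, 0x6D])
print(decode_latin5(raw))     # Eğitim

# Decoding the same bytes as Latin-1 gives the wrong letter.
print(raw.decode("latin-1"))  # Eðitim
```

When reading a whole file, the same codec name can be passed as the encoding
argument of open().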


Please let us know of any problems.
-- 
Kemal Oflazer                   e-mail: ko at cs.bilkent.edu.tr
                                http://www.cs.bilkent.edu.tr/~ko/ko.html
Bilkent University              tel: (90-312) 266-4133 (Sec)
Dept. of Computer Engineering                 290-1258 (Office)
Bilkent, ANKARA, 06533 TURKEY        (90-532) 447-8978 (Mobile)
                                fax: (90-312) 266-4126        


