[Corpora-List] traditional chinese

Alexander Yeh asy at mitre.org
Fri May 6 22:50:48 UTC 2011


Daniel Zeman wrote:
> Hi Stefan,
>
> the Academia Sinica treebank (used also in CoNLL-X and CoNLL 2007 shared
> tasks data sets) comes from Taiwan and thus it contains the traditional
> characters.

Hong Kong also uses traditional characters,
and one corpus that is used for machine translation is the Hong Kong
legislature proceedings, which are bilingual in Chinese and English.
I have seen pieces of this.
I do not recall exactly where it came from, but I think it was LDC.

Also, I have seen some free electronic Chinese dictionaries on the web
(I forget where) that have both the simplified and traditional versions
of the characters (in UTF-8) side by side.
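
As a rough illustration of how such a side-by-side dictionary could
serve as the word list Stefan asks about, here is a minimal Python
sketch. Everything in it is an assumption for illustration only: the
tab-separated "traditional<TAB>simplified" layout, the four sample
entries, and the helper names load_pairs and segment are hypothetical
and not taken from any particular dictionary or tool. It uses greedy
forward maximum matching, a simple baseline and not necessarily what
IK Analyzer or any other tool does.

```python
def load_pairs(lines):
    """Parse hypothetical 'traditional<TAB>simplified' lines into pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        trad, simp = line.split("\t")[:2]
        pairs.append((trad, simp))
    return pairs


def segment(text, words):
    """Greedy forward maximum matching against a set of known words.

    At each position, take the longest dictionary word that matches;
    fall back to a single character when nothing matches.
    """
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in words:
                tokens.append(piece)
                i += n
                break
    return tokens


# Made-up sample entries standing in for a real dictionary file.
SAMPLE = [
    "# traditional<TAB>simplified",
    "香港\t香港",
    "立法會\t立法会",
    "會議\t会议",
    "紀錄\t纪录",
]

pairs = load_pairs(SAMPLE)
trad_words = {trad for trad, _ in pairs}
trad_to_simp = dict(pairs)

tokens = segment("香港立法會會議紀錄", trad_words)
simplified = [trad_to_simp.get(tok, tok) for tok in tokens]
print(tokens)
print(simplified)
```

The traditional column alone gives a word list for segmenting
traditional text; the mapping to the simplified column is only needed
if the tokens are then passed on to a tool that expects simplified
characters. A real dictionary will need its own parsing, and greedy
matching will make mistakes on ambiguous strings.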

Hope this helps
-Alex

>
> Hope this helps
> Dan
>
> On 6.5.2011 16:12, Stefan Bordag wrote:
>> Dear all,
>>
>> I have been doing the corresponding Google searches, but nothing clear
>> comes out of the murky waters of the internet... Is there some corpus
>> of traditional Chinese to be had, be it under a commercial or free
>> license?
>> Or, for lack of one, at least a tool that can tokenize traditional
>> Chinese into words? I am aware of the existing tools for simplified
>> Chinese such as IK Analyzer - and I know that they would likely work
>> for traditional Chinese as well, provided some word lists - which
>> leads me to the first question.
>>
>> Thank you in advance,
>> Stefan Bordag
>>
>



