[Corpora-List] Mandarin corpus with Pinyin and sentence contexts?

Alexander Yeh asy at mitre.org
Mon Sep 1 20:00:26 UTC 2014


Stephen Politzer-Ahles wrote:
> Hello all,
>
> I am looking for a corpus that meets the following criteria:
>
> 1) includes the actual raw sentences (not just frequency counts)
> 2) has Pinyin as well as characters
> 3) can be downloaded in full (not just queried via a web interface)
>
> So far I'm only aware of the Lancaster corpus; some other corpora, like
> the Academia Sinica corpus and the HKUST telephone corpus, might also
> meet my needs but they're not free so I don't know what they're like.

For Pinyin, corpora from mainland China are probably the best bet.
* Taiwan has traditionally used a system other than Pinyin (Zhuyin, 
also known as Bopomofo) for pronunciation training, though I believe 
that Pinyin is now the official romanization there.
* Hong Kong primarily uses Cantonese rather than Mandarin, and Pinyin, 
which transcribes Mandarin, will give the wrong pronunciation for 
Cantonese.
* Not sure about Singapore.

Another possible complication: mainland China uses simplified 
characters, while Taiwan and Hong Kong use traditional characters.
The two versions of a character generally mean the same thing (with 
possibly some exceptions) but are encoded as different Unicode code 
points (as opposed to being just different fonts for the same code 
point).
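
To illustrate the code point difference, here is a minimal Python 
sketch. The character pair (the simplified and traditional forms of 
the "Han" in "Hanyu") and the helper name describe() are assumptions 
chosen for illustration; they do not come from any particular corpus.

```python
import unicodedata

# Illustrative pair (an assumption, not taken from any corpus):
# simplified and traditional forms of the same character.
simplified = "\u6c49"   # 汉
traditional = "\u6f22"  # 漢

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe(simplified))
print(describe(traditional))

# Unicode normalization does not map one form onto the other, so a
# simplified-text query will not match traditional text on its own.
print(unicodedata.normalize("NFKC", simplified) == traditional)
```

So a search over a mixed corpus would probably need an explicit 
simplified/traditional conversion step rather than relying on fonts 
or Unicode normalization.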

Hope this helps
-Alex Yeh

>
> Any suggestions would be greatly appreciated!
>
> Best,
> Steve
>
>
> Stephen Politzer-Ahles
> New York University, Abu Dhabi
> Neuroscience of Language Lab
> http://www.nyu.edu/projects/politzer-ahles/
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>


