25.2587, FYI: Sinica Chinese Core Vocabulary (version 1.0)

The LINGUIST List linguist at linguistlist.org
Tue Jun 17 09:16:36 UTC 2014


LINGUIST List: Vol-25-2587. Tue Jun 17 2014. ISSN: 1069 - 4875.

Subject: 25.2587, FYI: Sinica Chinese Core Vocabulary (version 1.0)

Moderators: Damir Cavar, Eastern Michigan U <damir at linguistlist.org>
            Malgorzata E. Cavar, Eastern Michigan U <gosia at linguistlist.org>

Reviews: reviews at linguistlist.org
Anthony Aristar <aristar at linguistlist.org>
Helen Aristar-Dry <hdry at linguistlist.org>
Mateja Schuck, U of Wisconsin Madison

Homepage: http://linguistlist.org

Editor for this issue: Uliana Kazagasheva <uliana at linguistlist.org>
================================================================  


Date: Tue, 17 Jun 2014 05:15:46
From: Shu-Chuan Tseng [tsengsc at gate.sinica.edu.tw]
Subject: Sinica Chinese Core Vocabulary (version 1.0)

The Sinica Chinese Core Vocabulary (version 1.0) consists of 1,121 Chinese words that are derived from the intersection of the top 2000 (most frequently used) words in the Sinica Balanced Corpus and in the Taiwan Mandarin Conversational Corpus. 
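The intersection step described above can be sketched roughly as follows. This is only an illustration on toy data, not the procedure actually used to build the resource: the function names `top_n` and `core_vocabulary`, the frequency tables, and the cutoff `n=3` are all assumptions made for the example.

```python
from collections import Counter

def top_n(freq, n):
    """Return the set of the n most frequent entries in a frequency table."""
    return {word for word, _ in Counter(freq).most_common(n)}

def core_vocabulary(written_freq, spoken_freq, n=2000):
    """Split two top-n lists into core, written-only and spoken-only sets."""
    written_top = top_n(written_freq, n)
    spoken_top = top_n(spoken_freq, n)
    core = written_top & spoken_top          # high-frequency in both corpora
    written_only = written_top - spoken_top  # high-frequency in text only
    spoken_only = spoken_top - written_top   # high-frequency in conversation only
    return core, written_only, spoken_only

# Toy frequency tables (invented counts, for illustration only).
written = {"的": 100, "是": 80, "政府": 40, "我": 30, "對": 5}
spoken = {"我": 90, "對": 70, "的": 60, "是": 50, "政府": 1}

core, written_only, spoken_only = core_vocabulary(written, spoken, n=3)
print(core, written_only, spoken_only)
```

Keying the tables on (word, POS) pairs instead of bare strings would keep identically written words with different POS tags apart.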

The Sinica Balanced Corpus contains mainly Chinese texts, approximately 4.7 million Chinese words after some minor modifications to the original data, whereas the Taiwan Mandarin Conversational Corpus contains free conversations and task- and topic-oriented dialogues, approximately 500,000 transcribed Chinese words.

The Sinica Chinese Core Vocabulary was produced from two sources: the “Word List with Accumulated Word Frequency in Sinica Balanced Corpus 3.0”, released by the Chinese Knowledge and Information Processing Group (CKIP) via the Association for Computational Linguistics and Chinese Language Processing (ACLCLP), and the “Chinese Spoken Wordlist”, released by Dr. Shu-Chuan Tseng. Words were segmented and POS-tagged by the CKIP automatic word segmentation and tagging system. The Sinica Chinese Core Vocabulary brings together the most frequently used Chinese words that appear in both written and spoken form. It covers 57.6% of word tokens in the Sinica Balanced Corpus and 86.1% in the Taiwan Mandarin Conversational Corpus.
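A coverage figure of this kind is the share of a corpus's word tokens that belong to a given word list. A minimal sketch, assuming a simple word-to-count table; the function name `token_coverage` and the counts are hypothetical and do not reproduce the published figures:

```python
def token_coverage(freq, vocabulary):
    """Fraction of all word tokens in `freq` whose word is in `vocabulary`."""
    total = sum(freq.values())
    covered = sum(count for word, count in freq.items() if word in vocabulary)
    return covered / total

# Invented counts, for illustration only.
corpus_freq = {"的": 50, "是": 30, "政府": 20}
word_list = {"的", "是"}

coverage = token_coverage(corpus_freq, word_list)
print(f"{coverage:.1%}")
```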

For each word, the Sinica Chinese Core Vocabulary provides the part of speech, the frequency and the ranking in both corpora, as well as the corresponding English glosses with Chinese examples and English translations. All Chinese characters are transcribed in Pinyin. Words written with identical characters but assigned different POS tags, as well as words that have multiple writing conventions, are treated as different lexical units. Users can also find a list containing the subset of the top 2000 words of the Sinica Balanced Corpus that do not appear in the core vocabulary. This list contains 879 words that are frequently used in the written language only, covering 13.1% of word tokens in the Sinica Balanced Corpus. Another list contains the subset of the top 2000 words of the Taiwan Mandarin Conversational Corpus that do not appear in the core vocabulary. These 699 conversation-only high-frequency words make up 7.6% of the Taiwan Mandarin Conversational Corpus.

Please note that, because of the scenarios in which the conversational corpus was collected, some proper nouns in it are corpus-specific and should not be regarded as high-frequency words in conversation. For this reason, 180 words were excluded from the final conversation-only list. In addition, a set of 1,235 basic Chinese characters, covering the core, text-only, and conversation-only vocabulary lists, is derived from the aforementioned three wordlists.
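The list sizes quoted in this announcement fit together arithmetically. The short check below assumes that the 180 excluded words were taken from the conversational words left over after the intersection, which the announcement implies but does not state outright; the variable names are illustrative.

```python
TOP_N = 2000                  # size of each high-frequency list
CORE = 1121                   # words shared by both top-2000 lists
EXCLUDED_PROPER_NOUNS = 180   # corpus-specific words dropped from the spoken list

text_only = TOP_N - CORE
conversation_only = TOP_N - CORE - EXCLUDED_PROPER_NOUNS

print(text_only, conversation_only)  # 879 699
```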

To access the Sinica Chinese Core Vocabulary (version 1.0), please see:
http://www.aclclp.org.tw/use_sccv.php
http://mmc.sinica.edu.tw/resources_e_01.html 



Linguistic Field(s): Computational Linguistics
                     Language Acquisition

Subject Language(s): Chinese, Mandarin (cmn)