Language codes

Brian MacWhinney macw at cmu.edu
Tue Jul 20 19:27:17 UTC 2010


Dear Chi-Bolts,
   We have recently shifted the form of the language codes from the two-letter standard to the fuller three-letter standard. Specifically, we are now following the ISO 639-3 standard. For the languages in the database, these codes can be found in a file called ISO-639.cut in the lib/fixes folder of CLAN. We did this to be more compatible with various standards, including the IMDI system from the MPI at Nijmegen. Although there is still no really final word on this standard, ISO 639-3 appears to be getting closer to one, so we felt a need to stick with what is becoming more standard.
  In this regard, we have also refined our material on language marking for multilingual corpora in section 5.2 of the CHAT manual.  The following is the newer material.

-- Brian MacWhinney

@Languages:

This is the second visible header; it tells the programs which language is being used in the dialogues. Here is an example of this line for a bilingual transcript using Swedish and Portuguese.

@Languages: swe, por
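As a rough illustration, a header line like the one above could be read with a few lines of Python. This is only a sketch, not part of CLAN: the function name parse_languages_header and the pattern CODE_RE are hypothetical, and the pattern simply assumes that a code is three lowercase letters with an optional hyphenated extension (as in zho-yue).

```python
import re

# Assumed shape of a code: three lowercase letters, optionally followed
# by a hyphenated extension such as "zho-yue". This is an illustration,
# not the check that CLAN itself performs.
CODE_RE = re.compile(r"^[a-z]{3}(-[a-z]+)?$")


def parse_languages_header(line):
    """Return the list of language codes found on an @Languages line."""
    prefix = "@Languages:"
    if not line.startswith(prefix):
        raise ValueError("not an @Languages header: %r" % line)
    # Codes are separated by commas; surrounding whitespace is ignored.
    codes = [c.strip() for c in line[len(prefix):].split(",") if c.strip()]
    for code in codes:
        if not CODE_RE.match(code):
            raise ValueError("unexpected language code: %r" % code)
    return codes


print(parse_languages_header("@Languages: swe, por"))  # ['swe', 'por']
```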

The language codes come from the international ISO 639-3 standard. For the languages currently in the database, these three-letter codes and extended codes are used:

Table 1: ISO Language Codes

Language    Code      Language     Code   Language     Code
Afrikaans   afr       German       deu    Polish       pol
Arabic      ara       Greek        ell
Basque      eus       Hebrew       heb    Portuguese   por
Cantonese   zho-yue   Hungarian    hun    Punjabi      pan
Catalan     cat       Icelandic    isl    Romanian     ron
Chinese     zho       Indonesian   ind    Russian      rus
                      Irish        gle    Spanish      spa
Croatian    hrv       Italian      ita    Swahili      swa
Czech       ces       Japanese     jpn    Swedish      swe
Danish      dan       Javanese     jav    Tagalog      tag
Dutch       nld       Kannada      kan    Taiwanese    zho-min
English     eng       Kikuyu       kik    Tamil        tam
Estonian    est       Korean       kor    Thai         tha
Farsi       fas       Lithuanian   lit    Turkish      tur
Finnish     sun       Norwegian    nor    Vietnamese   vie
French      fra                           Welsh        cym
Galician    glg                           Yiddish      yid

We continually update this list, and CLAN relies on a file in the lib/fixes directory called ISO-639.cut that lists the current languages. In multilingual corpora, several codes can be combined on the @Languages line. It is assumed, by default, that the first code given is for the primary language of the transcript and that deviations from this language are marked by the @New Language header described below. Individual utterances in a second or third language can be marked with precodes as in this example:

*CHI:     [- eng] this is my juguete@s.

In this example, Spanish is the default language, but the particular sentence is marked as English.  The @Languages header lists spa for Spanish, and then eng for English.  Within this sentence, the use of a Spanish word is then marked as @s.  When the @s is used in the main body of the transcript without the [- eng], then it indicates a shift to English, rather than to Spanish.
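The interaction between the header, the precode, and @s can be sketched in Python as follows. This is an illustration under stated assumptions, not CLAN code: the function word_languages is hypothetical, it handles only transcripts with exactly two languages on the @Languages line, it expects the utterance text without the speaker tier (*CHI:), and it treats any word ending in @s as belonging to whichever language is not the current one for that utterance.

```python
import re

# A precode such as "[- eng]" at the start of an utterance.
PRECODE_RE = re.compile(r"^\[-\s*([a-z]{3}(?:-[a-z]+)?)\]\s*")


def word_languages(languages, utterance):
    """Pair each word of a bilingual utterance with a language code."""
    if len(languages) != 2:
        raise ValueError("this sketch only handles two languages")
    primary, secondary = languages
    match = PRECODE_RE.match(utterance)
    if match:
        # The precode overrides the default language for this utterance.
        base = match.group(1)
        utterance = utterance[match.end():]
    else:
        base = primary
    if base not in languages:
        raise ValueError("precode %r is not on the @Languages line" % base)
    other = secondary if base == primary else primary
    result = []
    for token in utterance.split():
        word = token.rstrip(".?!")  # drop utterance-final punctuation
        if not word:
            continue
        if word.endswith("@s"):
            result.append((word[:-2], other))  # switched word
        else:
            result.append((word, base))
    return result


# The example above: Spanish is primary, the utterance is marked English.
print(word_languages(["spa", "eng"], "[- eng] this is my juguete@s."))
# [('this', 'eng'), ('is', 'eng'), ('my', 'eng'), ('juguete', 'spa')]

# A made-up utterance without a precode: @s now marks a shift to English.
print(word_languages(["spa", "eng"], "quiero el toy@s."))
# [('quiero', 'spa'), ('el', 'spa'), ('toy', 'eng')]
```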

The @s code may also be used to explicitly mark the use of a particular language, even if it is not included in the @Languages header. For example, the code schlep@s:yid can be used to mark the inclusion of the Yiddish word “schlep” in any text. The @s code can also be further elaborated to mark code-blended words. The form well@s:eng&cym indicates that the word “well” could be either an English or a Welsh word. The combination of a stem from one language with an inflection from another can be marked using the plus sign as in swallowni@s:eng+hun for an English stem with a Hungarian infinitival marking. All of these codes can be followed by a code with the $ to explicitly mark the parts of speech. Thus, the form recordar@s$v:inf indicates that this Spanish word is an infinitive.
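The word-level forms described in this paragraph can be pulled apart with a single regular expression. The Python sketch below is an illustration only: parse_marker, MARKER_RE, and the labels "switch", "explicit", "blend", and "stem+inflection" are hypothetical names chosen here, not CHAT or CLAN terminology, and the pattern covers only the forms mentioned above.

```python
import re

# word@s, optionally followed by ":codes" and then by "$part-of-speech".
MARKER_RE = re.compile(
    r"^(?P<word>[^@\s]+)@s"
    r"(?::(?P<langs>[a-z&+-]+))?"
    r"(?:\$(?P<pos>\S+))?$"
)


def parse_marker(token):
    """Split a token carrying an @s marker into its parts, or return None."""
    m = MARKER_RE.match(token)
    if m is None:
        return None
    langs = m.group("langs")
    if langs is None:
        kind, codes = "switch", []            # bare @s
    elif "&" in langs:
        kind, codes = "blend", langs.split("&")
    elif "+" in langs:
        kind, codes = "stem+inflection", langs.split("+")
    else:
        kind, codes = "explicit", [langs]
    return {
        "word": m.group("word"),
        "kind": kind,
        "languages": codes,
        "pos": m.group("pos"),
    }


for token in ["schlep@s:yid", "well@s:eng&cym",
              "swallowni@s:eng+hun", "recordar@s$v:inf"]:
    print(parse_marker(token))
```

For recordar@s$v:inf the sketch returns an empty language list and "v:inf" as the part of speech; a plain word without @s returns None.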

Tone languages like Cantonese, Mandarin, and Thai are allowed to have word forms that include tones and numbers for polysemes.
