Language codes
Brian MacWhinney
macw at cmu.edu
Tue Jul 20 19:27:17 UTC 2010
Dear Chi-Bolts,
We have recently changed the form of the language codes from the two-letter standard to the fuller three-letter standard. Specifically, we now follow the ISO 639-3 standard. For the languages in the database, these codes can be found in a file called ISO-639.cut in the /lib/fixes folder of CLAN. We made this change to be more compatible with other standards, including the IMDI system from the MPI at Nijmegen. Although there is still no final word on this standard, ISO 639-3 appears to be getting closer to one, so we decided to adopt what is becoming the more widely used convention.
In this regard, we have also refined our material on language marking for multilingual corpora in section 5.2 of the CHAT manual. The following is the newer material.
-- Brian MacWhinney
@Languages:
This is the second visible header; it tells the programs which language is being used in the dialogues. Here is an example of this line for a bilingual transcript using Swedish and Portuguese.
@Languages: swe, por
The language codes come from the international ISO 639-3 standard. For the languages currently in the database, these three-letter codes and extended codes are used:
Table 1: ISO Language Codes

Language   Code      Language    Code   Language    Code
Afrikaans  afr       German      deu    Polish      pol
Arabic     ara       Greek       ell    Portuguese  por
Basque     eus       Hebrew      heb    Punjabi     pan
Cantonese  zho-yue   Hungarian   hun    Romanian    ron
Catalan    cat       Icelandic   isl    Russian     rus
Chinese    zho       Indonesian  ind    Spanish     spa
Croatian   hrv       Irish       gle    Swahili     swa
Czech      ces       Italian     ita    Swedish     swe
Danish     dan       Japanese    jpn    Tagalog     tgl
Dutch      nld       Javanese    jav    Taiwanese   zho-min
English    eng       Kannada     kan    Tamil       tam
Estonian   est       Kikuyu      kik    Thai        tha
Farsi      fas       Korean      kor    Turkish     tur
Finnish    fin       Lithuanian  lit    Vietnamese  vie
French     fra       Norwegian   nor    Welsh       cym
Galician   glg                          Yiddish     yid
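As a rough illustration of how a program might read this header, here is a minimal Python sketch. It is not CLAN's own code. The function name parse_languages_header and the assumed code shape (three lowercase letters, optionally followed by a hyphen and a three-letter extension, as in zho-yue) are illustrative assumptions, and the sketch does not consult ISO-639.cut.

```python
import re

# Assumption: a code is three lowercase letters, optionally extended with a
# hyphen and three more letters (for example zho-yue or zho-min).
CODE_SHAPE = re.compile(r"^[a-z]{3}(-[a-z]{3})?$")


def parse_languages_header(line):
    """Return the codes on an @Languages line, primary language first."""
    prefix = "@Languages:"
    if not line.startswith(prefix):
        raise ValueError("not an @Languages header: %r" % line)
    # Codes are comma-separated; surrounding spaces or tabs are ignored.
    codes = [c.strip() for c in line[len(prefix):].split(",") if c.strip()]
    if not codes:
        raise ValueError("no language codes given")
    for code in codes:
        if not CODE_SHAPE.match(code):
            raise ValueError("unexpected code shape: %r" % code)
    return codes


codes = parse_languages_header("@Languages: swe, por")
primary = codes[0]  # the first code is treated as the primary language
```

A fuller check would compare each code against the list that CLAN keeps in ISO-639.cut rather than testing only its shape.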
We continually update this list, and CLAN relies on a file called ISO-639.cut in the lib/fixes directory that lists the current languages. In multilingual corpora, several codes can be combined on the @Languages line. By default, the first code is assumed to be the primary language of the transcript, and deviations from this language are marked by the @New Language header described below. Individual utterances in a second or third language can be marked with precodes, as in this example:
*CHI: [- eng] this is my juguete@s.
In this example, Spanish is the default language, but this particular sentence is marked as English. The @Languages header lists spa for Spanish and then eng for English. Within this sentence, the Spanish word is marked with @s. When @s is used in the main body of the transcript without the [- eng] precode, it indicates a shift to English rather than to Spanish.
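The interaction between the precode and @s can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name word_languages is hypothetical, the input is the utterance text after the speaker code, exactly two languages are listed on @Languages, and a bare @s always means "the other language" relative to the language of the utterance.

```python
import re

# Matches an utterance-initial precode such as "[- eng] ".
PRECODE = re.compile(r"^\[- ([a-z]{3}(?:-[a-z]{3})?)\]\s*")


def word_languages(utterance, languages):
    """Pair each word with a language code (two-language sketch only)."""
    if len(languages) != 2:
        raise ValueError("this sketch handles exactly two languages")
    first, second = languages
    base = first  # default: the primary language of the transcript
    match = PRECODE.match(utterance)
    if match:
        base = match.group(1)
        if base not in languages:
            raise ValueError("precode language not on @Languages: %r" % base)
        utterance = utterance[match.end():]
    # A bare @s switches to whichever language is not the utterance language.
    other = second if base == first else first
    result = []
    # Drop a final utterance terminator before splitting into words.
    for token in utterance.rstrip(" .?!").split():
        if token.endswith("@s"):
            result.append((token[:-2], other))
        else:
            result.append((token, base))
    return result


pairs = word_languages("[- eng] this is my juguete@s.", ["spa", "eng"])
# "this", "is", "my" are tagged eng; "juguete" is tagged spa
```

Without the precode, the same function tags plain words as spa and any word carrying @s as eng, which mirrors the shift described above.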
The @s code may also be used to explicitly mark the use of a particular language, even if it is not included in the @Languages header. For example, the code schlep@s:yid can be used to mark the inclusion of the Yiddish word “schlep” in any text. The @s code can also be further elaborated to mark code-blended words. The form well@s:eng&cym indicates that the word “well” could be either an English or a Welsh word. The combination of a stem from one language with an inflection from another can be marked using the plus sign, as in swallowni@s:eng+hun for an English stem with a Hungarian infinitival marking. All of these codes can be followed by a code with the $ to explicitly mark the parts of speech. Thus, the form recordar@s$v:inf indicates that this Spanish word is an infinitive.
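The forms in the preceding paragraph can be taken apart with a short Python sketch. The function name parse_s_marker, the returned dictionary keys, and the limit of at most two language codes per marker are assumptions made for illustration. They are not part of CHAT or CLAN.

```python
import re

# Assumed code shape: three letters plus an optional hyphenated extension.
CODE = r"[a-z]{3}(?:-[a-z]{3})?"

# word@s, optionally ":lang" or ":lang&lang" or ":lang+lang", optionally "$pos".
MARKER = re.compile(
    r"^(?P<word>[^@\s]+)@s"
    r"(?::(?P<first>" + CODE + r")(?:(?P<sep>[&+])(?P<second>" + CODE + r"))?)?"
    r"(?:\$(?P<pos>\S+))?$"
)

RELATIONS = {"&": "either", "+": "stem+inflection"}


def parse_s_marker(token):
    """Split a word carrying an @s marker into its parts, or return None."""
    match = MARKER.match(token)
    if match is None:
        return None
    codes = [c for c in (match.group("first"), match.group("second")) if c]
    return {
        "word": match.group("word"),
        "languages": codes,  # empty when the language is left implicit
        "relation": RELATIONS.get(match.group("sep")),
        "pos": match.group("pos"),  # text after "$", if any
    }


for form in ("schlep@s:yid", "well@s:eng&cym",
             "swallowni@s:eng+hun", "recordar@s$v:inf"):
    print(parse_s_marker(form))
```

A token without @s, such as a plain word, yields None, so the function can be applied to every word on a main line.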
Tone languages like Cantonese, Mandarin, and Thai are allowed to have word forms that include tones and numbers for polysemes.