[sw-l] Re: Country and Language Codes, and standardization with ISO

Valerie Sutton sutton at SIGNWRITING.ORG
Mon Sep 27 20:19:43 UTC 2004


SignWriting List
September 27, 2004

Dear SW List:
Here is a wonderful article on the web by Joe Clark. I do not know Joe. 
You will notice he mentions the SGN code for sign languages, and says 
there is no way to specify a sign language at this time; he just 
doesn't realize that more has been done. He also mentions the coding of 
regional dialects. The article is below, or go to his web page: 
http://www.joeclark.org/book/sashay/serialization/AppendixB.html


----------------------

AUTHOR’S NOTE – You’re reading the HTML version of a chapter from the 
book Building Accessible Websites  (ISBN 0-7357-1150-X). Copyright © 
Joe Clark, 2002. All rights reserved.

Under the Web Content Accessibility Guidelines, you are required to 
specify changes in the “natural” or human language used in documents. 
You do this by adding the lang="languagecode" attribute to virtually 
any tag (like <p></p>, <span></span>, <cite></cite>, or <hx></hx>). 
Also, in order to specify a change in language, you must already have 
declared the default, base, or original language, which you do by 
adding lang="languagecode" to the <body> or (preferably) <html> tags, 
like so:
	• 	<body lang="en">
	• 	<html lang="fr-ca">

So just what are those language codes? They’re two-letter 
abbreviations, optionally followed by a hyphen and some other 
qualifier. In the second example above, French is specified (fr), but 
of the Canadian variety (ca).
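
  As a rough illustration of that structure, here is a minimal Python sketch that splits a language code at its hyphens. The function name split_language_tag is invented for this example and is not part of any standard or library.

```python
def split_language_tag(tag):
    """Split a tag such as 'fr-ca' into its primary code and any qualifiers.

    Hypothetical helper for illustration only; it does not validate
    the codes against ISO 639-1 or any other registry.
    """
    parts = tag.strip().lower().split("-")
    primary, qualifiers = parts[0], parts[1:]
    return primary, qualifiers


print(split_language_tag("en"))     # ('en', [])
print(split_language_tag("fr-ca"))  # ('fr', ['ca'])
```

The primary code is always the first piece; everything after the first hyphen is a qualifier of some kind.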

  The exact specification is ISO 639-1, “Codes for the Representation of 
Names of Languages,” whose homepage resides at the Library of Congress: 
lcweb.loc.gov/standards/iso639-2/. (Yes, that URL says “iso639-2”; you 
have to hunt around at the site to find the 639-1 section, which is a 
bit outdated.)

  Note that the companion standard, ISO 639-2, provides three-letter 
codes for languages – and for a vastly wider range of languages, at 
that. Online, however, we must stick with the two-letter codes. At 
least, this is my interpretation. A page at the World Wide Web 
Consortium Internationalization site tells us:
According to RFC 3066, for languages with both a two-letter and a 
three-letter code, the two-letter code must be used. This also solves 
the problem of those languages that have two different three-letter 
codes, because all of them also have a two-letter code.

  So this “solves the problem,” does it? I don’t see a lot of problems 
that are actually “solved” here. The RFC (Request for Comments) 
mentioned in this citation merely refers back to ISO 639-1 and tells 
us, in effect, that the only three-letter language codes we may use are 
those that do not have a two-letter code. But there are somewhat 
complex rules in place governing when a three-letter code may be coined 
without creating a corresponding two-letter code.
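
  The rule described above (use the two-letter code whenever one exists, and fall back to the three-letter code only when it does not) could be sketched in Python roughly as follows. The table THREE_TO_TWO and the function preferred_code are assumptions made up for this example, and the table holds only a handful of entries; a real implementation would need the full ISO 639 code lists.

```python
# Illustrative subset only. Each key is an ISO 639-2 three-letter code and
# each value is the matching ISO 639-1 two-letter code. French and German
# each have two three-letter codes, which is the "problem" the W3C page
# mentions.
THREE_TO_TWO = {
    "fra": "fr", "fre": "fr",
    "deu": "de", "ger": "de",
    "jpn": "ja",
    "fao": "fo",
}


def preferred_code(code):
    """Return the two-letter code if this (partial) table knows one.

    Codes missing from the table, such as 'sgn' for sign languages,
    are returned unchanged. With a partial table that also means an
    unlisted three-letter code passes through even if a two-letter
    equivalent exists.
    """
    code = code.lower()
    if len(code) == 3:
        return THREE_TO_TWO.get(code, code)
    return code


print(preferred_code("fre"))  # fr
print(preferred_code("sgn"))  # sgn
```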

  From an accessibility perspective, this restriction will eventually 
have to be lifted. Textual media are not the only kind available on the 
Web, and as more and more video becomes available, more and more sign 
languages will be available, and all sign-language names exist in the 
three-character specification (under sgn). It is technically impossible 
to specify a sign language on a Website as the standards currently 
exist.

  My recommendation? Damn the torpedoes! If you have to specify a 
language with a three-letter code because you cannot find a two-letter 
code, do it. Such a practice appears to be permitted anyway and is the 
only one that makes sense.

  Let’s start with the two-letter codes. Now, hundreds of languages have 
been defined, and I’m not going to list every single one of them here 
because the super-obscure language codes have no practical value to my 
audience. (It’s nice to know that Faroese has its own language code, 
but how many readers of a book on Web accessibility will have cause to 
design Websites in Faroese? And won’t such designers already know that 
Faroese’s language code is fo?) Besides, the ISO 639-1 specs are all 
online and provide all the codes for you.

  I have not found a truly reliable source for the Top Ten languages 
used online (after English – the Top Eleven, really). I have 
synthesized various lists into the following somewhat longer 
compilation – not quite Top Forty, but close.

  Very-widely-used languages online

  Language      Code
  Japanese      ja
  German        de
  Chinese       zh
  French        fr
  Spanish       es
  Italian       it
  Dutch         nl
  Portuguese    pt
  Finnish       fi
  Swedish       sv
  Norwegian     no
  Danish        da
  Korean        ko
  Polish        pl
  Russian       ru
  Hebrew        he
  Hungarian     hu
  Greek         el
  Turkish       tr
  Czech         cs
  Thai          th
  Arabic        ar
  Icelandic     is

Confusable codes

Note that country codes and language codes are often just different 
enough to get you into trouble if you’re not eagle-eyed.
	• 	Japan is jp, but Japanese is ja.
	• 	China is cn, but Chinese is zh.
	• 	The Netherlands is nl, and so is Dutch.
	• 	Sweden is se, but Swedish is sv.
	• 	Denmark is dk, but Danish is da.
	• 	Greece is gr, but Greek is el.

Dialects

Some dialect names are standardized under ISO 639-1, while others, 
usually of a more fanciful nature (Cockney, Newfoundland, joual) are 
not. Both types are permitted; it is up to the browser or device to 
interpret the codes correctly.

  It is possible and legal, for example, to specify all these variants 
of English:
	• 	en (English: No specified variant)
	• 	en-us (United States English)
	• 	en-au-tas (Tasmanian English, Australia)
	• 	en-in (Indian English)
	• 	en-uk-Cockney-Rhyming-Slang

You must not assume, however, that browsers or devices will be able to 
understand or represent anything beyond the first dash.
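
  To make that caveat concrete, here is a sketch of how a device might cope with a long code: it trims qualifiers from the right until it reaches something it supports. The functions fallback_chain and best_match are hypothetical names for this example, and real browsers are not guaranteed to behave this way.

```python
def fallback_chain(tag):
    """List a tag and its progressively shorter prefixes, longest first."""
    parts = tag.lower().split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def best_match(tag, supported):
    """Return the longest prefix of tag found in supported, or None."""
    for candidate in fallback_chain(tag):
        if candidate in supported:
            return candidate
    return None


print(fallback_chain("en-au-tas"))
# ['en-au-tas', 'en-au', 'en']

# A device that knows only plain English and French still finds a match.
print(best_match("en-uk-Cockney-Rhyming-Slang", {"en", "fr"}))  # en
```

A device that stops at the first dash is simply the case where only the last entry in the chain is ever supported.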

  In rather more important cases, like the two variations of Norwegian, 
Bokmål and Nynorsk, enough social importance is given to the dialects 
that they have their own codes.
	• 	no (Norwegian: No specified variant)
	• 	nb (Norwegian Bokmål)
	• 	nn (Norwegian Nynorsk)

Authors writing in Norwegian will likely know which dialect they are 
using and can cite it appropriately. Authors who merely quote Norwegian 
text or make some other casual use of it may not know which is which; 
that’s what the generic no tag is for.

  If you’re wondering about Chinese (no doubt you are), Mandarin and 
Cantonese are not the only recognized dialects, but all of them are 
subsumed under zh. You must use dialect codes for Mandarin (zh-guoyu) 
and Cantonese (zh-yue) if you wish to differentiate them. (The 
distinction is nearly meaningless on Websites that do not use voice 
given that the two dialects use the same writing system.) There is no 
difference in language code between Traditional and Simplified Chinese; 
arguably there should be.

  Take my word for this as a linguist and an accessibility obsessif: 
This stuff is more detailed and pedantic than trainspotting, and almost 
as addictive to susceptible personalities. Just keep in mind that 
dinner-party guests are never really as interested in this topic as we 
are.
