[sw-l] Re: Country and Language Codes, and standardization with ISO
sutton at SIGNWRITING.ORG
Mon Sep 27 20:19:43 UTC 2004
September 27, 2004
Dear SW List:
Here is a wonderful article on the web by Joe Clark. I do not know Joe.
You will notice he mentions the SGN code for sign languages and says
there is no way to specify a sign language at this time; he just
doesn't realize that more has been done. He also discusses the coding
of regional dialects below. Or go to his web page:
AUTHOR’S NOTE – You’re reading the HTML version of a chapter from the
book Building Accessible Websites (ISBN 0-7357-1150-X). Copyright ©
Joe Clark, 2002. All rights reserved.
Under the Web Content Accessibility Guidelines, you are required to
specify changes in the “natural” or human language used in documents.
You do this by adding the lang="languagecode" attribute to virtually
any tag (like <p></p>, <span></span>, <cite></cite>, or <hx></hx>).
Also, in order to specify a change in language, you must already have
declared the default, base, or original language, which you do by
adding lang="languagecode" to the <body> or (preferably) <html> tags,
• <body lang="en">
• <html lang="fr-ca">
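
As a minimal sketch of how a base language and a later change of language look to software, the following uses Python's standard html.parser module to list every lang attribute in a page. The sample markup and the names LangCollector and collect_lang are made up for illustration; they do not come from the chapter.

```python
from html.parser import HTMLParser

# Hypothetical sample page: French (Canada) as the base language,
# with one English phrase marked as a change of language.
SAMPLE = """<html lang="fr-ca">
<body>
<p>Bonjour. <span lang="en">Hello.</span></p>
</body>
</html>"""


class LangCollector(HTMLParser):
    """Record (tag, lang value) for every start tag that carries lang."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "lang":
                self.found.append((tag, value))


def collect_lang(markup):
    """Return the lang declarations in document order."""
    parser = LangCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


print(collect_lang(SAMPLE))
```

The first entry is the base language declared on <html>; any later entry is a change of language inside the document.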
So just what are those language codes? They’re two-letter
abbreviations, optionally followed by a hyphen and some other
qualifier. In the second example above, French is specified (fr), but
of the Canadian variety (ca).
The exact specification is ISO 639-1, “Codes for the Representation of
Names of Languages,” whose homepage resides at the Library of Congress:
lcweb.loc.gov/standards/iso639-2/. (Yes, that URL says “iso639-2”; you
have to hunt around at the site to find the 639-1 section, which is a
Note that the companion standard, ISO 639-2, provides three-letter
codes for languages – and for a vastly wider range of languages, at
that. Online, however, we must stick with the two-letter codes. At
least, this is my interpretation. A page at the World Wide Web
Consortium Internationalization site tells us:
According to RFC 3066, for languages with both a two-letter and a
three-letter code, the two-letter code must be used. This also solves
the problem of those languages that have two different three-letter
codes, because all of them also have a two-letter code.
So this “solves the problem,” does it? I don’t see a lot of problems
that are actually “solved” here. The RFC (Request for Comments)
mentioned in this citation merely refers back to ISO 639-1 and tells
us, in effect, that the only three-letter language codes we may use are
those that do not have a two-letter code. But there are somewhat
complex rules in place governing when a three-letter code may be coined
without creating a corresponding two-letter code.
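
The rule quoted above can be sketched as a small lookup: use the two-letter code when one exists, and fall back to the three-letter code only when it does not. The table below is a hand-picked subset chosen for illustration, not the ISO 639 registry, and preferred_code is a hypothetical name.

```python
# Hand-picked subset for illustration only; the real ISO 639 tables
# are far larger. French and German each have two three-letter codes
# in ISO 639-2, and both map to a single two-letter code.
THREE_TO_TWO = {
    "eng": "en",
    "fra": "fr",
    "fre": "fr",
    "deu": "de",
    "ger": "de",
}


def preferred_code(code):
    """Return the two-letter code if this table has one, else the input."""
    code = code.lower()
    return THREE_TO_TWO.get(code, code)


for sample in ("fre", "fra", "ger", "sgn"):
    print(sample, "->", preferred_code(sample))
```

A code such as sgn, which has no two-letter counterpart, passes through unchanged.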
From an accessibility perspective, this restriction will eventually
have to be lifted. Textual media are not the only kind available on the
Web, and as more and more video becomes available, more and more sign
languages will be available, and all sign-language names exist in the
three-character specification (under sgn). It is technically impossible
to specify a sign language on a Website as the standards currently
My recommendation? Damn the torpedoes! If you have to specify a
language with a three-letter code because you cannot find a two-letter
code, do it. Such a practice appears to be permitted anyway and is the
only one that makes sense.
Let’s start with the two-letter codes. Now, hundreds of languages have
been defined, and I’m not going to list every single one of them here
because the super-obscure language codes have no practical value to my
audience. (It’s nice to know that Faroese has its own language code,
but how many readers of a book on Web accessibility will have cause to
design Websites in Faroese? And won’t such designers already know that
Faroese’s language code is fo?) Besides, the ISO 639-1 specs are all
online and provide all the codes for you.
I have not found a truly reliable source for the Top Ten languages
used online (after English – the Top Eleven, really). I have
synthesized various lists into the following somewhat longer
compilation – not quite Top Forty, but close.
Very-widely-used languages online
Note that country codes and language codes are often just different
enough to get you into trouble if you’re not eagle-eyed.
• Japan is jp, but Japanese is ja.
• China is cn, but Chinese is zh.
• The Netherlands is nl, and so is Dutch.
• Sweden is se, but Swedish is sv.
• Denmark is dk, but Danish is da.
• Greece is gr, but Greek is el.
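
One way to catch this kind of slip is a small check that hints at the language code when a lang value starts with one of the country codes listed above. This is only a sketch built from the six pairs in the list; language_code_hint is a hypothetical name, and a match is a hint rather than proof of an error, since a string that matches a country code here may still be a legitimate code for some other language.

```python
# (country, country code, language, language code), copied from the list above.
PAIRS = [
    ("Japan", "jp", "Japanese", "ja"),
    ("China", "cn", "Chinese", "zh"),
    ("The Netherlands", "nl", "Dutch", "nl"),
    ("Sweden", "se", "Swedish", "sv"),
    ("Denmark", "dk", "Danish", "da"),
    ("Greece", "gr", "Greek", "el"),
]

# Keep only the pairs where the two codes differ.
COUNTRY_TO_LANGUAGE = {
    country_code: language_code
    for _, country_code, _, language_code in PAIRS
    if country_code != language_code
}


def language_code_hint(value):
    """Return the language code probably intended, or None if no hint applies."""
    primary = value.lower().split("-")[0]
    return COUNTRY_TO_LANGUAGE.get(primary)


for value in ("jp", "nl", "gr", "ja"):
    print(value, "->", language_code_hint(value))
```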
Some dialect names are standardized under ISO 639-1, while others,
usually of a more fanciful nature (Cockney, Newfoundland, joual) are
not. Both types are permitted; it is up to the browser or device to
interpret the codes correctly.
It is possible and legal, for example, to specify all these variants:
• en (English: No specified variant)
• en-us (United States English)
• en-au-tas (Tasmanian English, Australia)
• en-in (Indian English)
You must not assume, however, that browsers or devices will be able to
understand or represent anything beyond the first dash.
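
A device that cannot handle a full code can still fall back toward the part before the first dash. The sketch below shows one plausible way to do that, by dropping qualifiers from the right until something matches; it is an assumption about how such software might behave, not a description of any particular browser, and fallback_chain and best_match are hypothetical names.

```python
def fallback_chain(code):
    """List the code and its shorter forms, most specific first."""
    parts = code.lower().split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def best_match(code, supported):
    """Return the most specific form of code found in supported, or None."""
    for candidate in fallback_chain(code):
        if candidate in supported:
            return candidate
    return None


print(fallback_chain("en-au-tas"))
print(best_match("en-au-tas", {"en", "fr"}))
```

With a supported set that contains only en and fr, the Tasmanian English code falls back to plain en.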
In rather more important cases, like the two variations of Norwegian,
Bokmål and Nynorsk, enough social importance is given to the dialects
that they have their own codes.
• no (Norwegian: No specified variant)
• nb (Norwegian Bokmål)
• nn (Norwegian Nynorsk)
Authors writing in Norwegian will likely know which dialect they are
using and can cite it appropriately. Authors who merely quote Norwegian
text or make some other casual use of it may not know which is which;
that’s what the generic no tag is for.
If you’re wondering about Chinese (no doubt you are), Mandarin and
Cantonese are not the only recognized dialects, but all of them are
subsumed under zh. You must use dialect codes for Mandarin (zh-guoyu)
and Cantonese (zh-yue) if you wish to differentiate them. (The
distinction is nearly meaningless on Websites that do not use voice,
given that the two dialects use the same writing system.) There is no
difference in language code between Traditional and Simplified Chinese;
arguably there should be.
Take my word for this as a linguist and an accessibility obsessif:
This stuff is more detailed and pedantic than trainspotting, and almost
as addictive to susceptible personalities. Just keep in mind that
dinner-party guests are never really as interested in this topic as we