entity names for VN o+ and u+

Sun Apr 29 21:07:33 UTC 2001

>I understand that there is a version of iso8859 that can be
>used for Vietnamese...

>So my best guess is that the entity names
>would probably be &ohorn; and &uhorn;. Does this ring a
>bell with anybody, either as being correct or totally wrong?

Why not just make your encoding UTF-8 and forget about entity names
altogether?

>I think the biggest problem in HTML is that they followed the
>principle of MS Word, where every combination of basic character
>plus diacritic is an entity on its own.

No, that's not due to the design of HTML. HTML simply relies on a
character set, and it is the character set that determines whether
precomposed or decomposed characters are used. HTML uses the Unicode
character set (though it allows characters to be encoded in a legacy
encoding such as iso8859-x), and Unicode allows decomposed
representations of accented Latin characters and, for certain
combinations, precomposed characters as well. In the case of Vietnamese, I
believe it includes precomposed characters for all the combinations that
are needed. Where both precomposed and decomposed representations are
provided in Unicode, the W3C recommendation (read "requirement") is to use
the precomposed representation. (This is done for a number of reasons; for
details see the Character Model draft at www.w3c.org.)
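
To make the precomposed/decomposed distinction concrete, here is a minimal
sketch in Python using the standard unicodedata module. It assumes that
Normalization Form C (NFC) is the precomposed form meant above; the helper
name to_precomposed is only illustrative.

```python
import unicodedata

# Precomposed characters:
O_HORN = "\u01a1"        # LATIN SMALL LETTER O WITH HORN
U_HORN = "\u01b0"        # LATIN SMALL LETTER U WITH HORN
O_HORN_GRAVE = "\u1edd"  # LATIN SMALL LETTER O WITH HORN AND GRAVE

# Decomposed equivalents: base letter followed by combining marks.
COMBINING_HORN = "\u031b"
COMBINING_GRAVE = "\u0300"
decomposed_o_horn = "o" + COMBINING_HORN
decomposed_u_horn = "u" + COMBINING_HORN
decomposed_o_horn_grave = "o" + COMBINING_HORN + COMBINING_GRAVE


def to_precomposed(text):
    """Illustrative helper: compose base + combining marks via NFC."""
    return unicodedata.normalize("NFC", text)


# The two spellings are different code point sequences...
assert decomposed_o_horn != O_HORN
# ...but NFC maps the decomposed one to the single precomposed character.
assert to_precomposed(decomposed_o_horn) == O_HORN
assert to_precomposed(decomposed_u_horn) == U_HORN
assert to_precomposed(decomposed_o_horn_grave) == O_HORN_GRAVE
```

Running text through a step like this before publishing is one way to make
sure only the precomposed forms end up in the page.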

If you specify your charset to be UTF-8, then you don't need to use
entities for any character at all (apart from markup characters such as
< and &). Any software that is compliant with HTML 4.0 is supposed to be
able to handle it.
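
As a rough illustration, the following Python sketch writes o-horn and
u-horn directly into a page declared as UTF-8, with no entity names
involved. The helpers render_page and to_char_refs are hypothetical names
made up for this example; the numeric character references are shown only
as a fallback for channels that can carry nothing but ASCII.

```python
O_HORN = "\u01a1"  # LATIN SMALL LETTER O WITH HORN
U_HORN = "\u01b0"  # LATIN SMALL LETTER U WITH HORN


def render_page(body_text):
    """Hypothetical helper: a minimal HTML page, returned as UTF-8 bytes."""
    page = (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        "</head><body><p>" + body_text + "</p></body></html>"
    )
    return page.encode("utf-8")


def to_char_refs(text):
    """Hypothetical fallback: replace non-ASCII with numeric references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Each letter is stored as two bytes in UTF-8; no entity name is required.
assert O_HORN.encode("utf-8") == b"\xc6\xa1"
assert U_HORN.encode("utf-8") == b"\xc6\xb0"

page_bytes = render_page("th" + U_HORN + " m" + O_HORN)

# ASCII-only alternative using decimal character references.
ascii_text = to_char_refs("th" + U_HORN + " m" + O_HORN)
assert ascii_text == "th&#432; m&#417;"
```

The charset should also be declared in the HTTP Content-Type header where
you control the server; the meta element here is only the in-document
declaration.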

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable at sil.org>