[Corpora-List] Encoding of apostrophes and quotes

Ron Artstein artstein at essex.ac.uk
Fri Jun 30 16:58:32 UTC 2006


> As someone who has always taken the above statements to be true, 
> I have been amazed and disappointed to learn that Unicode advise 
> the encoding of apostrophes and right single quotes as the same 
> character (U+2019).

My understanding is that Unicode tends to unify characters that 
always look the same. Since an apostrophe and a closing quote use 
identical glyphs whatever the font, they get the same character; 
in contrast, a comma and a baseline quote may have identical glyphs 
in some fonts but distinct glyphs in other fonts, so they get 
separate characters.
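
As a rough illustration (a sketch using Python's standard unicodedata 
module; the helper name describe is made up for this example), the 
character database can be queried to see that U+2019 is a single entry 
serving both uses, while the comma and the baseline quote are listed 
separately:

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# The apostrophe and the closing single quote share U+2019, so there is
# only one entry to look up; the comma and the baseline (low-9) quote
# are separate characters with separate entries.
for ch in ("\u2019", ",", "\u201A"):
    print(*describe(ch))
```

This only shows what the database records; it says nothing about why 
the characters were unified or kept apart.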

One thing that has always baffled me is why Unicode decided to 
assign the two characters U+05F3 Hebrew punctuation geresh and 
U+05F4 Hebrew punctuation gershayim. Geresh (dual: gershayim) is 
the Hebrew name for a punctuation mark similar to an apostrophe 
which is used for marking abbreviations; in modern usage these have 
identical glyphs to single and double quotes. I haven't found an 
explanation of why U+05F3 and U+05F4 are distinct from standard 
punctuation marks, or of whether they're intended just for 
abbreviations or also for quotes.
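
For what it's worth, the separation can be checked mechanically. The 
following is a minimal sketch with Python's unicodedata module (PAIRS 
and same_after_nfkc are names invented here): it prints the names of 
the Hebrew marks next to the ASCII apostrophe and quotation mark, and 
tests whether compatibility normalization (NFKC) folds each pair 
together.

```python
import unicodedata

# Each Hebrew punctuation mark paired with its ASCII look-alike.
PAIRS = [("\u05F3", "'"), ("\u05F4", '"')]

def same_after_nfkc(a, b):
    """True if the two strings become identical under NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

for hebrew, ascii_mark in PAIRS:
    print(unicodedata.name(hebrew), "|",
          unicodedata.name(ascii_mark), "|",
          same_after_nfkc(hebrew, ascii_mark))
```

A False in the last column means that normalization treats the Hebrew 
mark and the ASCII mark as unrelated characters, so text that mixes 
the two will not match unless a corpus tool maps them explicitly.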

My guess is that separate code points were needed because Hebrew 
apostrophes and quotes are quite distinct in shape from Latin ones; 
a mixed font could share code points (and glyphs) for most 
punctuation marks, but using the Latin glyphs for quotes and 
apostrophes in Hebrew would look very odd. If this is indeed the 
rationale behind the code points U+05F3 and U+05F4, then these 
characters should be used for both apostrophes and quotes in 
Hebrew. 

-Ron.


