[Corpora-List] Encoding of apostrophes and quotes

Merle Tenney merlet at microsoft.com
Sat Jul 1 00:47:43 UTC 2006


My, Ciarán, what a thread you have spawned!  :-)

Seth, your comment about the back-slanted apostrophe (actually the noncombining grave accent) being used as an open quote touches on an interesting point in the development of Latin character encodings.  Why is it that ASCII includes the grave, circumflex, and tilde diacritics, but not the acute, dieresis, or cedilla diacritics?  The character repertoire of ASCII seems completely idiosyncratic, but it is actually motivated by a reasonable attempt to support more than just English, considering the world of computing when ASCII was created.

The story is a bit long, so you are free to leave now and not waste any more of your valuable time, but it does give you an interesting piece of trivia to share if the subject of the historical origins of modern character encoding idiosyncrasies comes up at your next cocktail party.  :-)  Here we go....

ASCII was designed first and foremost to support English.  It is the American Standard Code for Information Interchange, after all.  Nevertheless, English has incorporated loanwords from a number of languages, many of which retain their original accents, such as 'sauté' and 'piñata'.  Furthermore, people who use English also often need to use other languages, and some of the more commonly used ones come from Western Europe: French, German, Spanish, Italian, and Portuguese, for starters.

These languages need a wide range of diacritical marks: acute, grave, circumflex, tilde, dieresis, and cedilla.  So, repeating my original question, why are the grave, circumflex, and tilde the only diacritics encoded in ASCII?  Don't forget, of course, that ASCII begat Latin-1, Latin-1 begat Unicode, Unicode begat UTF-8, and so on.  A thousand years from now, Brits may drive on the right side of the road and Americans may use the metric system, but you can be sure that the first 128 characters of whatever character encoding we are using then will look exactly like ASCII!

The answer actually has three parts.  First, if you don't have enough code points for combined letters and diacritics, you have to be content with separate letters and diacritics.  Then you have to get them together on the printed page.  That is what the BACKSPACE character in ASCII does: ` BACKSPACE e or e BACKSPACE ` gets you è (used in native English words, by the way, such as learnèd).  Grave, circumflex, and tilde are used in precisely this way to get a number of letter and diacritical combinations.  That is the first part.

The second part is that certain code points in ASCII were overloaded, i.e., they had two meanings.  According to the original ASCII specification, which I actually read once, certain characters had one meaning if they were either preceded or followed by a BACKSPACE and another meaning everywhere else.  A comma, for example, was considered a comma everywhere, unless it was preceded or followed by a BACKSPACE, and then it became (TA DA!) a cedilla.  So , BACKSPACE c or c BACKSPACE , becomes ç.
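For anyone who has to process old text that still contains such overstrike sequences, here is a minimal Python sketch of how they might be resolved into modern precomposed characters.  It is only an illustration, not anything taken from the ASCII specification: the table OVERSTRIKE_MARKS and the function resolve_overstrikes are names made up for this example, the table covers only the four marks discussed so far, and the sketch assumes the other character in the pair is a letter.

```python
import unicodedata

BACKSPACE = "\b"

# Hypothetical table for this sketch: ASCII characters that doubled as
# diacritics, mapped to the Unicode combining marks they stood in for.
OVERSTRIKE_MARKS = {
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
    ",": "\u0327",  # comma overstruck on a letter, read as a cedilla
}


def resolve_overstrikes(text: str) -> str:
    """Replace 'mark BACKSPACE letter' or 'letter BACKSPACE mark' with one character."""
    out = []
    i = 0
    while i < len(text):
        # Look for a three-character window with a BACKSPACE in the middle.
        if i + 2 < len(text) and text[i + 1] == BACKSPACE:
            first, second = text[i], text[i + 2]
            if first in OVERSTRIKE_MARKS and second.isalpha():
                letter, mark = second, OVERSTRIKE_MARKS[first]
            elif second in OVERSTRIKE_MARKS and first.isalpha():
                letter, mark = first, OVERSTRIKE_MARKS[second]
            else:
                letter = mark = None
            if letter is not None:
                # NFC composes letter + combining mark where a precomposed form exists.
                out.append(unicodedata.normalize("NFC", letter + mark))
                i += 3
                continue
        # Anything else, including a plain comma, passes through unchanged.
        out.append(text[i])
        i += 1
    return "".join(out)
```

Calling resolve_overstrikes("c\b,") or resolve_overstrikes(",\bc") should give ç either way, while a comma with no BACKSPACE next to it stays a comma, which is the overloading described above.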

Now, you and I both know that typing a comma on top of a c doesn't look exactly like a ç, but it is reasonably close.  Bear in mind that this all happened decades before WYSIWYG, laser printers, and desktop publishing.  Characters had one and only one shape back then.  Think teletype machines and line printers, and, later on, daisy wheel and dot matrix printers.  A little context is helpful in thinking about the early ASCII days.  Typists were sometimes cautioned not to type a lowercase 'el' if they intended the numeral 'one'.  They were coming from an era when some typewriters actually left off the 'one' key and had you use the 'el' key for both, so the goal for output was far from typesetting quality.

And that brings us to the third part of the answer.  Since some characters were serving double and triple duty, they had to be designed to work, kinda, for all their intended uses.  So we get the "hyphus" character (hyphen-minus), which is too skinny for a true hyphen and too short for a minus sign or an en dash or an em dash.

And here is where acute and dieresis enter the picture.  The double quote character looks a bit like a dieresis character, so " BACKSPACE u or u BACKSPACE " per the ASCII specification gives you ü.  To make this work, you have to make sure that the double quote that is output is short and stubby.  Also, since the double quote is being used for open and close quotations, it has to be neutral, not curved to the left or the right.  And that exactly describes the double quote as it appeared in output devices of the time and as it appears in ASCII tables to this day.  And that, by the way, is why Smart Quotes had to be invented.

Likewise, the single quote/apostrophe was drafted to serve as the acute accent.  So, ' BACKSPACE e or e BACKSPACE ' became é.  Since a single character was serving as apostrophe, single prime, and acute accent, it had to look a little like all of them, and it ended up being a high, short, straight line.  In fact, it was the mirror image of the grave character.  It looked fine as an acute accent, and it worked, more or less, as an apostrophe.

But it wasn't great as an apostrophe, and people use apostrophes a lot.  With the advent of desktop publishing, users wanted correctly shaped apostrophes.  So this led, in turn, to two other developments.  First, as I mentioned, Smart Quotes were invented, and people came to believe that hitting one key for both open and close quotes was the only natural way to enter these characters, unlike, say, using a shift key to differentiate lowercase and uppercase letters.

Second, with the advent of Smart Quotes, the shape usually associated with the apostrophe became vertical, and the corresponding ASCII code point was assumed to be a "neutral" quote, the meaning it has for most of us today.  Chances are that the ASCII apostrophe you saw two paragraphs back was vertical, not slanted to the right, as it would have been with earlier output devices.

And now, Seth, we come full circle.  To this day you will occasionally see the ASCII code points for grave and apostrophe used for their primary original purpose, as open and close quotes, especially with legacy protocols.  They can be single or double quotes, and they always look a bit odd:  `single' and ``double''.  For what it's worth, they look a little less odd with the monospacing fonts, which were also standard at the time.  Who remembers pica and elite and print pitch?
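If you need to clean up text written in this convention, a rough Python sketch along the following lines might do.  The function name convert_legacy_quotes and the two regular expressions are assumptions made for this example, not part of any standard tool.  The sketch only rewrites a grave that is actually closed by an apostrophe on the same line, so a shell command substitution such as the one in Seth's message is left alone.

```python
import re


def convert_legacy_quotes(text: str) -> str:
    """Turn `single' and ``double'' ASCII quoting into curly quotes."""
    # Handle doubles first, so ``x'' is not read as nested single quotes.
    text = re.sub(r"``(.+?)''", "\u201c\\1\u201d", text)
    # A single grave counts as an open quote only if an apostrophe closes it
    # before another grave, apostrophe, or line break turns up.
    text = re.sub(r"`([^`'\n]+)'", "\u2018\\1\u2019", text)
    return text
```

This is deliberately conservative and far from complete.  A quoted phrase that itself contains an apostrophe, for instance, will be closed too early, so treat it as a starting point rather than a finished converter.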

The fact that you might use two single quotes instead of a double quote is really not that unusual in itself.  Ignoring the obvious explanation that ASCII doesn't have opening and closing single or double quotes, typographers have traditionally typeset two single quotes, open or close, for a double quote.  They don't usually do that nowadays, with modern encodings and typesetting technologies, but that was certainly the state of affairs in their craft when ASCII was invented.

If you stuck with me to the end, I hope you weren't disappointed.  We have evolved a lot in our character encodings and document creation technologies since the early days of ASCII.  You can think of the missing diacritic marks and the convoluted system of quote marks that remain as a sort of vestigial typographical tail.  :-)

Merle

-----Original Message-----

2006/6/30, Seth Grimes <grimes at altaplana.com>:
> This may not concern any of you, but for what it's worth --
>
> In certain computer-programming shells (command-line interfaces), the
> back-slanted apostrophe, `, is used to contain a command fragment for
> execution.  Here's a usage example:
>
>         a=`ls -l`
>
> sets the value of the shell variable "a" to a directory listing produced
> by the command "ls -l".  So if you're parsing certain texts and see a
> back-slanted apostrophe (left single quote), don't assume it starts a
> quotation that will be terminated by a forward-slanted apostrophe (right
> single quote).
>
>                                                 Seth
>
> --
> Seth Grimes   Alta Plana Corp, analytical computing & data management
>               Intelligent Enterprise magazine (CMP), Contributing Editor
> grimes at altaplana.com       http://altaplana.com    301-270-0795


