[Corpora-List] Encoding of apostrophes and quotes

Mike Maxwell maxwell at ldc.upenn.edu
Fri Jun 30 11:50:41 UTC 2006


Ciarán Ó Duibhín wrote:

> 1. Even though they look the same, apostrophe and single right quote behave
> as different characters and require different encoding.

Similarly, the period character (full stop for you British types :-)) 
has at least the following uses in English:

1) end of declarative sentence

2) end of abbreviation

3) decimal point

4) character in ellipsis (...)

Sometimes a single period has more than one of the above functions, e.g. 
when an abbreviation ends a sentence.  This is very common with the 
abbreviation "etc."

Only (4) has a separate representation in Unicode (and some other 
encodings), namely as the horizontal ellipsis, U+2026 (i.e. all three 
dots as a single character).
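
A minimal Python sketch of point (4), using only the standard unicodedata 
module.  The names ELLIPSIS, THREE_PERIODS and describe are made up for 
illustration.  It shows that the ellipsis character is a single code point 
distinct from three periods, and that NFKC compatibility normalization folds 
it back into three periods, which is one way a corpus pipeline might treat 
the two spellings as equivalent.

```python
import unicodedata

ELLIPSIS = "\u2026"      # one precomposed character
THREE_PERIODS = "..."    # three separate U+002E characters

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

print(describe(ELLIPSIS))
print(describe(THREE_PERIODS))

# The two spellings are different strings as stored...
print(ELLIPSIS == THREE_PERIODS)

# ...but compatibility normalization maps the ellipsis to three periods.
print(unicodedata.normalize("NFKC", ELLIPSIS) == THREE_PERIODS)
```

Whether to normalize this way is a design choice: it loses the distinction 
the writer (or the writer's software) made, but it spares later processing 
from handling two forms of the same thing.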

But I can't imagine people having to use a separate character for the 
other three functions (and perhaps still another character for when the 
period has more than one function).

The characters are for the benefit of the reader, not for corpus 
linguists.  We have to make do with whatever the readers do.

   Mike Maxwell
   CASL/ U MD


