[Corpora-List] Encoding of apostrophes and quotes

Geoffrey Sampson grs2 at sussex.ac.uk
Mon Jul 3 09:26:55 UTC 2006

Elision and indication of possession are not really separate uses for
the apostrophe.  I have always understood, and it sounds plausible, that
the reason we write "John's" as the genitive of "John" is that in
centuries past, when less was known than today about language history,
people mistakenly believed that the genitive form "John's" had arisen as
a reduction of "John his" (and it was sometimes written out like that in
full).  -- No, I don't know how they explained "Mary's" either.

The question of tokenization and encoding seems to me not to be an issue
for which there is one "right answer"; surely it is a matter for
different researchers to answer differently in terms of their particular
needs.  So far as I am aware the apostrophe and single right inverted
comma are _never_ distinguished graphically, so it seems quite reasonable
to me for Unicode to assign them the same code.  They are logically
distinct, but it isn't Unicode's job to delve into the logic of written
symbols -- I don't think it would be practical to require that.
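
As a rough illustration of the encoding point, here is a minimal Python
sketch. It relies on two code points from the Unicode character database:
U+0027 (the ASCII apostrophe) and U+2019 (the right single quotation mark,
which is also used for the typographic apostrophe). The helper names
describe and normalize_apostrophes are hypothetical, chosen only for this
example, and folding everything to U+0027 is just one possible choice that
a researcher might make for tokenization, not a recommendation.

```python
import unicodedata

ASCII_APOSTROPHE = "\u0027"    # typewriter-style apostrophe
RIGHT_SINGLE_QUOTE = "\u2019"  # closing single quote / typographic apostrophe


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


def normalize_apostrophes(text):
    """Fold U+2019 to U+0027 (an assumed, project-specific convention).

    This loses the typographic distinction and cannot tell a closing
    quote from an apostrophe, since both share the same code point.
    """
    return text.replace(RIGHT_SINGLE_QUOTE, ASCII_APOSTROPHE)


if __name__ == "__main__":
    print(describe(ASCII_APOSTROPHE))
    print(describe(RIGHT_SINGLE_QUOTE))
    print(normalize_apostrophes("John\u2019s book"))
```

Whether to fold the two characters together, or keep them apart, depends
on what the corpus is going to be used for.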

Geoffrey Sampson

     Prof. Geoffrey Sampson  MA PhD MBCS CITP ILTM

     author of "The 'Language Instinct' Debate"

     Department of Informatics, University of Sussex
     Falmer, Brighton BN1 9QH, England

     www.grsampson.net     +44 1273 678525
