[Corpora-List] Encoding of apostrophes and quotes
Thierry Fontenelle
thierryf at microsoft.com
Fri Jun 30 18:15:36 UTC 2006
I fully agree with Lou that elision is by no means the only use of the apostrophe. It's also used in Irish names like "O'Connors", "O'Hara"... Cases like "rock 'n roll" are also interesting... In French, it's indeed sometimes a marker of an elision ("l'école"), but it's also sometimes part of the token ("aujourd'hui", "prud'homme"...). We've even noticed that some people were using it to replace accents when they don't have a French keyboard (especially in instant messages: Ren'e instead of René). The decision to treat apostrophes as breaking or non-breaking characters has interesting implications for tools like spell-checkers (the same is true of hyphens, of course) and, like Marco Baroni yesterday, I'm glad to see that these crucial issues are discussed here and taken seriously... I wrote something about that on our blog a few months ago, for those of you who are interested...
http://blogs.msdn.com/correcteurorthographiqueoffice/archive/2005/12/07/500807.aspx
Thierry
Thierry Fontenelle
Microsoft Speech & Natural Language
-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On Behalf Of Lou Burnard
Sent: Friday, June 30, 2006 12:52 AM
To: corpora at uib.no
Subject: Re: [Corpora-List] Encoding of apostrophes and quotes
wrote:
> Would list members agree with the following statements:
>
> 1. Even though they look the same, apostrophe and single right quote
> behave as different characters and require different encoding.
>
I would say rather that the same graphic symbol has multiple applications. There *is* a different character available for representing "single right quote", of course, the one that looks like a curly "smart quote".
> 2. An apostrophe is generally used to indicate elision or (in English)
> possession:
> don't, 'tis, sayin', John's, James', c'est, geht's.
This is true, in English, certainly. But by no means the only use.
Consider the (infamous) use of the apostrophe to indicate plurals for example ("PC's") or its use in French to indicate something about pronunciation ("pin's") or its use in Italian to double up for an accent ("Forli'").
Historically, I think, the apostrophe has the semantics of elision: we use it in genitive forms in English because of a (possibly mistaken) etymological assumption (e.g. "man's" standing for "mannes").
> In tokenization, the
> apostrophe is not to be dropped, but is retained as part of the token;
> and a token break may be considered somewhere in its vicinity.
>
Probably. In BNC our practice is to regard things like "That's" as two tokens "That" and "'s" so yes, we would certainly consider the apostrophe to be part of the second token. But others might treat this differently. We have exactly the same set of issues with the hyphen, of course.
a) it is sometimes used in place of the mdash
b) If "tea-pot" is treated as two tokens (rather than as a variant form of "teapot"), to which one does the hyphen belong?
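The tokenization choice described above (treating "That's" as "That" plus "'s", while keeping forms such as "aujourd'hui" or "pin's" whole) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BNC's actual tokenizer: the function name `tokenize`, the clitic list, and the `NON_BREAKING` exception set are all invented for the example, and only the plain ASCII apostrophe (U+0027) is handled.

```python
import re

# Hypothetical exception list: forms in which the apostrophe is part of the token.
# Of these, only "pin's" would otherwise be split by the clitic rule below.
NON_BREAKING = {"aujourd'hui", "prud'homme", "pin's"}

# Hypothetical list of English clitics that become a second token.
# The apostrophe stays with the clitic, as in "That" + "'s".
CLITIC = re.compile(r"^(.+?)('s|'re|'ve|'ll|'d|'m|n't)$", re.IGNORECASE)


def tokenize(text):
    """Split on whitespace, then split off a trailing clitic unless the word is an exception."""
    tokens = []
    for word in text.split():
        if word.lower() in NON_BREAKING:
            tokens.append(word)
            continue
        match = CLITIC.match(word)
        if match:
            tokens.extend(match.groups())
        else:
            tokens.append(word)
    return tokens


print(tokenize("That's fine"))
print(tokenize("un pin's"))
```

The sketch ignores surrounding punctuation and quotation marks, which is exactly where the apostrophe/closing-quote ambiguity discussed in this thread would start to matter.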
> 3. A right single quote is used, in conjunction with a left single quote, to
> delimit a stretch of text. In tokenization, such marks (like punctuation
> in general) become separate tokens, and in many applications (such as
> word-lists) they are simply dropped.
>
>
Yes, but this is a different usage of the punctuation mark -- and one
which some (partly because of the ambiguity introduced) would castigate
as mistaken!
> As someone who has always taken the above statements to be true, I have been
> amazed and disappointed to learn that Unicode advise the encoding of
> apostrophes and right single quotes as the same character (U+2019). Their
> explanation is that people in general will find it too difficult to
> understand the difference.
>
>
Well, I am amazed and disappointed to learn that you would expect
Unicode (who or whatever you mean by that) to legislate for such usage
rules. It's no part of their brief to tell us how to use glyphs which
have a long and (dis)honourable tradition of ambiguous usage!
> If I had followed this advice and used U+2019 for both apostrophe and right
> single quote, all the corpus analysis which I have successfully undertaken
> would have been made impossibly difficult.
Indeed, but then you've constructed the entirely accurate observation
that the apostrophe is often used ambiguously into a recommendation that
it should be!
I would say that the kind of usage you're talking about here (e.g. to
mark titles) ought to be carried out by proper descriptive markup. But
then I would, wouldn't I.
> In fact, even the simplest text
> processing exercise becomes impossible, see
> http://www.smo.uhi.ac.uk/~oduibhin/apostrophe.htm .
>
> I would be interested to know what people think of Unicode's advice, and how
> they deal with this situation in practice.
>
> Ciarán Ó Duibhín.
>
> For completeness, though it doesn't affect the point above, I ought to add
> that Unicode *do* make a distinction between what they call "punctuation
> apostrophes" (the kind I have been talking about), and "letter apostrophes".
> They assign a character (U+02BC) to the latter, to be used in cases where an
> apostrophe look-alike is used to represent a sound (often, the glottal
> stop).
>
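For reference, the characters mentioned in this thread can be inspected with Python's standard `unicodedata` module. The helper name `describe` is an assumption made for this sketch; the character names and general categories come from the Unicode character database shipped with Python. The last two lines illustrate one practical consequence: U+02BC is classed as a letter, so a word-character pattern keeps it inside the token, whereas U+2019 is punctuation and splits the word.

```python
import re
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


# ASCII apostrophe, right single quotation mark, modifier letter apostrophe
for ch in ("\u0027", "\u2019", "\u02BC"):
    print(describe(ch))

# \w matches letters, so the "letter apostrophe" stays inside the word,
# while the right single quotation mark acts as a token boundary.
print(re.findall(r"\w+", "don\u2019t"))
print(re.findall(r"\w+", "don\u02BCt"))
```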