<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv=Content-Type content="text/html; charset=iso-8859-1">

<meta name=Generator content="Microsoft Word 12 (filtered medium)">

<style>

<!--

 /* Font Definitions */

 @font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Consolas;

        panose-1:2 11 6 9 2 2 4 3 2 4;}

 /* Style Definitions */

 p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        margin-bottom:.0001pt;

        font-size:11.0pt;

        font-family:"Calibri","sans-serif";}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

p.MsoPlainText, li.MsoPlainText, div.MsoPlainText

        {mso-style-priority:99;

        mso-style-link:"Plain Text Char";

        margin:0in;

        margin-bottom:.0001pt;

        font-size:10.5pt;

        font-family:Consolas;}

span.PlainTextChar

        {mso-style-name:"Plain Text Char";

        mso-style-priority:99;

        mso-style-link:"Plain Text";

        font-family:Consolas;}

.MsoChpDefault

        {mso-style-type:export-only;}

@page Section1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.Section1

        {page:Section1;}

-->

</style>

<!--[if gte mso 9]><xml>

 <o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

 <o:shapelayout v:ext="edit">

  <o:idmap v:ext="edit" data="1" />

 </o:shapelayout></xml><![endif]-->

</head>

<body lang=EN-US link=blue vlink=purple>

<div class=Section1>

<p class=MsoPlainText>I fully agree with Lou that elision is by no means the
only use of the apostrophe. It's also used in Irish names like "O'Connor",
"O'Hara"... Cases like "rock 'n roll" are also
interesting... In French, it's indeed sometimes a marker of elision ("l'école"),
but it's also sometimes part of the token ("aujourd'hui", "prud'homme"...).
We've even noticed that some people use it to replace accents when they
don't have a French keyboard (especially in instant messages: "Ren'e" instead of
"René"). The decision to treat apostrophes as breaking or non-breaking characters
has interesting implications for tools like spell-checkers (the same is true of
hyphens, of course) and, like Marco Baroni yesterday, I'm glad to see that
these crucial issues are discussed here and taken seriously... I wrote
something about this on our blog a few months ago, for those of you who are
interested...<o:p></o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText><a

href="http://blogs.msdn.com/correcteurorthographiqueoffice/archive/2005/12/07/500807.aspx">http://blogs.msdn.com/correcteurorthographiqueoffice/archive/2005/12/07/500807.aspx</a>

<o:p></o:p></p>
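<p class=MsoPlainText>To make the breaking versus non-breaking decision concrete, here is a minimal Python sketch, not a description of any shipping spell-checker. The function name and the two-word exception list are hypothetical; a real tool would presumably need a much larger lexicon.<o:p></o:p></p>

```python
# Hypothetical exception list: words in which the apostrophe is part of the token.
LEXICAL_APOSTROPHE_WORDS = {"aujourd'hui", "prud'homme"}


def split_french_elision(word):
    """Split an elided clitic from its host unless the word is a known exception."""
    if word.lower() in LEXICAL_APOSTROPHE_WORDS:
        return [word]
    # Break after the first apostrophe, keeping it with the clitic.
    head, sep, tail = word.partition("'")
    if sep and head and tail:
        return [head + sep, tail]
    return [word]


for sample in ["l'école", "aujourd'hui", "Ren'e"]:
    print(sample, split_french_elision(sample))
```

<p class=MsoPlainText>Under these assumptions "l'école" is broken after the apostrophe and "aujourd'hui" is kept whole, but "Ren'e" is also broken (into "Ren'" and "e"), which is the kind of case that makes the decision awkward for a spell-checker.<o:p></o:p></p>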

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>Thierry<o:p></o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>Thierry Fontenelle<o:p></o:p></p>

<p class=MsoPlainText>Microsoft <i>Speech &amp; Natural Language</i><o:p></o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>


<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>-----Original Message-----<br>

From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On Behalf

Of Lou Burnard<br>

Sent: Friday, June 30, 2006 12:52 AM<br>

To: corpora@uib.no<br>

Subject: Re: [Corpora-List] Encoding of apostrophes and quotes<o:p></o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText> wrote:<o:p></o:p></p>

<p class=MsoPlainText>> Would list members agree with the following

statements:<o:p></o:p></p>

<p class=MsoPlainText>><o:p> </o:p></p>

<p class=MsoPlainText>> 1. Even though they look the same, apostrophe and

single right quote <o:p></o:p></p>

<p class=MsoPlainText>> behave as different characters and require different

encoding.<o:p></o:p></p>

<p class=MsoPlainText>>   <o:p></o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>I would say rather that the same graphic symbol has

multiple applications. There *is* a different character available for

representing "single right quote", of course, the one that looks like

a curly "smart quote".<o:p></o:p></p>
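<p class=MsoPlainText>For reference, the characters at issue in this thread can be inspected with Python's standard unicodedata module. This is only an illustrative sketch; the helper name describe is an assumption, and the third code point is the "letter apostrophe" mentioned at the end of the quoted message.<o:p></o:p></p>

```python
import unicodedata

# Typewriter apostrophe, curly right single quote, and modifier letter apostrophe.
CANDIDATES = ["\u0027", "\u2019", "\u02bc"]


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))


for ch in CANDIDATES:
    print(describe(ch))
```

<p class=MsoPlainText>The general categories differ: U+0027 and U+2019 are classed as punctuation, while U+02BC is classed as a letter, so a tokenizer that relies on character classes will treat them differently.<o:p></o:p></p>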

<p class=MsoPlainText>> 2. An apostrophe is generally used to indicate

elision or (in English)<o:p></o:p></p>

<p class=MsoPlainText>> possession:<o:p></o:p></p>

<p class=MsoPlainText>> don't, 'tis, sayin', John's, James', c'est, geht's. <o:p></o:p></p>

<p class=MsoPlainText>This is true, in English, certainly. But by no means the

only use. <o:p></o:p></p>

<p class=MsoPlainText>Consider the (infamous) use of the apostrophe to indicate

plurals for example ("PC's") or its use in French to indicate

something about pronunciation ("pin's") or its use in Italian to

double up for an accent ("Forli'").<o:p></o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>Historically, I think, the apostrophe has the semantics
of elision: we use it in genitive forms in English because of a (possibly
mistaken) etymological assumption ("man's" standing for
"mannes", e.g.)<o:p></o:p></p>

<p class=MsoPlainText>>  In tokenization, the<o:p></o:p></p>

<p class=MsoPlainText>> apostrophe is not to be dropped, but is retained as

part of the token; <o:p></o:p></p>

<p class=MsoPlainText>> and a token break may be considered somewhere in its

vicinity.<o:p></o:p></p>

<p class=MsoPlainText>>   <o:p></o:p></p>

<p class=MsoPlainText>Probably. In BNC our practice is to regard things like
"That's" as two tokens, "That" and "'s", so yes,
we would certainly consider the apostrophe to be part of the second token. But
others might treat this differently. We have exactly the same set of issues with
the hyphen, of course.<o:p></o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>a) It is sometimes used in place of the em dash<o:p></o:p></p>

<p class=MsoPlainText>b) If "tea-pot" is treated as two tokens

(rather than as a variant form of "teapot"), to which one does the

hyphen belong?<o:p></o:p></p>
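<p class=MsoPlainText>The two tokenization questions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the BNC's actual tokenizer: both function names and the hyphen_goes_with parameter are hypothetical, and only the "'s" clitic is handled.<o:p></o:p></p>

```python
def split_clitic_s(word):
    """Split a trailing 's so that the apostrophe stays with the second token."""
    if len(word) > 2 and word.lower().endswith("'s"):
        return [word[:-2], word[-2:]]
    return [word]


def split_hyphen(word, hyphen_goes_with="left"):
    """Split at the first hyphen; the caller decides which token keeps it."""
    left, sep, right = word.partition("-")
    if not (sep and left and right):
        return [word]
    if hyphen_goes_with == "left":
        return [left + sep, right]
    if hyphen_goes_with == "right":
        return [left, sep + right]
    # Any other value: treat the hyphen as a token of its own.
    return [left, sep, right]


print(split_clitic_s("That's"))
print(split_hyphen("tea-pot", "left"))
print(split_hyphen("tea-pot", "right"))
print(split_hyphen("tea-pot", "separate"))
```

<p class=MsoPlainText>Making the hyphen's owner an explicit parameter does not answer question (b); it only makes visible that some policy has to be chosen and documented.<o:p></o:p></p>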

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>> 3. A right single quote is used, in conjunction with

a left single quote, to<o:p></o:p></p>

<p class=MsoPlainText>> delimit a stretch of text.   In tokenization, such

marks (like punctuation<o:p></o:p></p>

<p class=MsoPlainText>> in general) become separate tokens, and in many

applications (such as<o:p></o:p></p>

<p class=MsoPlainText>> word-lists) they are simply dropped.<o:p></o:p></p>

<p class=MsoPlainText>><o:p> </o:p></p>

<p class=MsoPlainText>>   <o:p></o:p></p>

<p class=MsoPlainText>Yes, but this is a different usage of the punctuation
mark -- and one which some (partly because of the ambiguity introduced)
would castigate as mistaken!<o:p></o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>> As someone who has always taken the above statements

to be true, I have been<o:p></o:p></p>

<p class=MsoPlainText>> amazed and disappointed to learn that Unicode advise

the encoding of<o:p></o:p></p>

<p class=MsoPlainText>> apostrophes and right single quotes as the same

character (U+2019).  Their<o:p></o:p></p>

<p class=MsoPlainText>> explanation is that people in general will find it

too difficult to<o:p></o:p></p>

<p class=MsoPlainText>> understand the difference.<o:p></o:p></p>

<p class=MsoPlainText>><o:p> </o:p></p>

<p class=MsoPlainText>>   <o:p></o:p></p>

<p class=MsoPlainText>Well, I am amazed and disappointed to learn that you
would expect Unicode (who or whatever you mean by that) to legislate
for such usage rules. It's no part of their brief to tell us how to use
glyphs which have a long and (dis)honourable tradition of ambiguous
usage!<o:p></o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>> If I had followed this advice and used U+2019 for

both apostrophe and right<o:p></o:p></p>

<p class=MsoPlainText>> single quote, all the corpus analysis which I have

successfully undertaken<o:p></o:p></p>

<p class=MsoPlainText>> would have been made impossibly difficult.<o:p></o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>Indeed, but then you've turned the entirely accurate
observation that the apostrophe is often used ambiguously into a
recommendation that it should be!<o:p></o:p></p>
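<p class=MsoPlainText>To illustrate how far context alone can resolve the ambiguity when both uses are encoded as U+2019, here is a rough Python heuristic. It is a sketch under a single assumption, namely that a mark with a letter on each side is an apostrophe; the function name classify_marks is hypothetical, and nothing here is a recommended procedure.<o:p></o:p></p>

```python
RIGHT_QUOTE = "\u2019"


def classify_marks(text):
    """Label each U+2019 as 'apostrophe' (letters on both sides) or 'ambiguous'."""
    labels = []
    for i, ch in enumerate(text):
        if ch != RIGHT_QUOTE:
            continue
        # Slicing yields an empty string at either end of the text.
        before = text[max(i - 1, 0):i]
        after = text[i + 1:i + 2]
        if before.isalpha() and after.isalpha():
            labels.append((i, "apostrophe"))
        else:
            labels.append((i, "ambiguous"))
    return labels


sample = "\u2018John\u2019s sayin\u2019 it\u2019"
print(classify_marks(sample))
```

<p class=MsoPlainText>Word-internal cases such as "John's" are easy, but a word-final mark ("sayin'", "James'") looks exactly like a closing quote to this heuristic, so it stays ambiguous without a lexicon or explicit markup.<o:p></o:p></p>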

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>I would say that the kind of usage you're talking about
here (e.g. to mark titles) ought to be carried out by proper
descriptive markup. But then I would, wouldn't I.<o:p></o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText>>   In fact, even the simplest text<o:p></o:p></p>

<p class=MsoPlainText>> processing exercise becomes impossible, see<o:p></o:p></p>

<p class=MsoPlainText>> <a

href="http://www.smo.uhi.ac.uk/~oduibhin/apostrophe.htm"><span

style='color:windowtext;text-decoration:none'>http://www.smo.uhi.ac.uk/~oduibhin/apostrophe.htm</span></a>.<o:p></o:p></p>

<p class=MsoPlainText>><o:p> </o:p></p>

<p class=MsoPlainText>> I would be interested to know what people think of

Unicode's advice, and how<o:p></o:p></p>

<p class=MsoPlainText>> they deal with this situation in practice.<o:p></o:p></p>

<p class=MsoPlainText>><o:p> </o:p></p>

<p class=MsoPlainText>> Ciarán Ó Duibhín.<o:p></o:p></p>

<p class=MsoPlainText>><o:p> </o:p></p>

<p class=MsoPlainText>> For completeness, though it doesn't affect the point

above, I ought to add<o:p></o:p></p>

<p class=MsoPlainText>> that Unicode *do* make a distinction between what

they call "punctuation<o:p></o:p></p>

<p class=MsoPlainText>> apostrophes" (the kind I have been talking about),

and "letter apostrophes".<o:p></o:p></p>

<p class=MsoPlainText>> They assign a character (U+02BC) to the latter, to

be used in cases where an<o:p></o:p></p>

<p class=MsoPlainText>> apostrophe look-alike is used to represent a sound

(often, the glottal<o:p></o:p></p>

<p class=MsoPlainText>> stop).<o:p></o:p></p>


<p class=MsoPlainText><o:p> </o:p></p>

<p class=MsoPlainText><o:p> </o:p></p>

</div>

</body>

</html>