[Corpora-List] Encoding of apostrophes and quotes

Fri Jun 30 06:55:08 UTC 2006

Hi there.

First of all, I am really glad that for once we are discussing the kind of
"low-level" processing issues that are so fundamental to getting
high-quality language data, but that are often not taken seriously as
dignified research topics...

> As someone who has always taken the above statements to be true, I have been
> amazed and disappointed to learn that Unicode advise the encoding of
> apostrophes and right single quotes as the same character (U+2019).  Their
> explanation is that people in general will find it too difficult to
> understand the difference.

I think that, if the people who produce the texts we parse do not make the
distinction consistently, we might as well forget about it, as it will just
create more noise (I myself have only just found out how to produce a single
quote on my keyboard -- I had never typed a single quote character before...)

If I get a text to tokenize, unless I have a lot of reliable information 
about how it was produced (which in my experience is never the case), I 
just merge all single quote/apostrophe-like characters, and then use 
various heuristics to decide which ones are apostrophes, which ones are 
single quotes, and which ones mark an accent on the previous vowel (since 
this is another way in which the apostrophe is used in electronic Italian).
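
To make the merge-then-guess approach concrete, here is a minimal sketch in Python. The character set in APOSTROPHE_LIKE, the function names, and the three rules are illustrative assumptions, not the heuristics I actually use; real rules would need to be tuned per language and per corpus.

```python
# Characters often used interchangeably for apostrophes and single quotes.
# This set is an assumption for illustration, not an exhaustive list.
APOSTROPHE_LIKE = "\u2018\u2019\u02bc\u0060\u00b4"

# Map every apostrophe-like character to the plain ASCII apostrophe.
_MERGE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def merge_apostrophes(text):
    """Collapse all apostrophe-like characters into a single ASCII one."""
    return text.translate(_MERGE_TABLE)


def classify_marks(text):
    """Return (index, label) pairs for every mark found after merging.

    Hypothetical rules:
      - between two letters           -> "apostrophe" (l'amico, don't)
      - after a vowel, not before one -> "accent"     (citta' for città)
      - anything else                 -> "quote"
    """
    text = merge_apostrophes(text)
    labels = []
    for i, ch in enumerate(text):
        if ch != "'":
            continue
        prev_ch = text[i - 1] if i > 0 else " "
        next_ch = text[i + 1] if i + 1 < len(text) else " "
        if prev_ch.isalpha() and next_ch.isalpha():
            label = "apostrophe"
        elif prev_ch.lower() in "aeiou" and not next_ch.isalpha():
            label = "accent"
        else:
            label = "quote"
        labels.append((i, label))
    return labels
```

Note that the "accent" rule will also fire on a closing single quote that follows a word ending in a vowel, which is exactly the kind of ambiguity that makes a lexicon lookup or other contextual evidence necessary in practice.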

On top of that, a lot of standard tools for processing Western European text
(such as the IMS TreeTagger) expect Latin-1 input, and thus will not be able
to make the distinction anyway (last time I checked, at least...)
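
The Latin-1 limitation is easy to check: U+2019 has no code point in ISO-8859-1, so it has to be replaced or dropped before the text reaches such a tool. A small sketch in Python (the helper name survives_latin1 is made up for illustration):

```python
def survives_latin1(text):
    """Return True if every character in text can be encoded as Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# The plain ASCII apostrophe and accented vowels fit in Latin-1;
# the typographic right single quote (U+2019) does not.
print(survives_latin1("l'amico"))        # ASCII apostrophe
print(survives_latin1("citt\u00e0"))     # a with grave accent
print(survives_latin1("l\u2019amico"))   # right single quotation mark
```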

My pessimistic 2 cents.

Regards,

Marco

-- 
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni

Leadership is a form of evil. No one needs to lead you to do something
that is obviously good for you.

(Scott Adams)