[Corpora-List] Encoding of apostrophes and quotes

Ciarán Ó Duibhín ciaran at oduibhin.freeserve.co.uk
Fri Jul 7 19:03:44 UTC 2006


In reply to my questions, a number of people have said they think it is
reasonable that Unicode should assign the same codepoint to apostrophe and
right single quote, on the grounds that many people will be unwilling to
make the distinction.

The reason I asked is that Unicode differentiates between characters and
glyphs, and describes itself as a coded character set, not a coded glyph
list.  But outside of symbols which are clearly alphabetic, Unicode seems
ready to encode glyphs not characters, on "practical" grounds.  In
particular, where a glyph is ambiguous between a lexical and a non-lexical
function (apostrophe vs right single quote), Unicode encodes the glyph, not
the characters.  What this means is that such a basic processing operation
as tokenization is not possible on Unicode-encoded text (without markup).
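
A minimal sketch of the ambiguity, in Python.  The two sentences, the
tokenize helper and its flag are hypothetical illustrations, not part of any
library.  Both sentences contain the same codepoint sequence after "dogs"
(U+2019 followed by a space), yet in the first it is a possessive apostrophe
and in the second a closing quote, so a rule that looks only at codepoints
has to get one of them wrong:

```python
import re

RSQ = "\u2019"  # RIGHT SINGLE QUOTATION MARK, also used for the apostrophe

# Hypothetical example sentences: same codepoints after "dogs", different functions.
possessive = "\u2018Those are the dogs\u2019 bowls,\u2019 she said."
quotation = "\u2018Feed the dogs\u2019 she said."


def tokenize(text, trailing_mark_is_lexical):
    """Toy tokenizer: words, plus each other non-space character on its own."""
    # A word-internal U+2019 (as in don’t) is always kept inside the word.
    word = r"\w+(?:\u2019\w+)*"
    if trailing_mark_is_lexical:
        # Treat a word-final U+2019 as an apostrophe belonging to the word.
        word += r"\u2019?"
    return re.findall(word + r"|\S", text)


for rule in (True, False):
    print(rule, tokenize(possessive, rule))
    print(rule, tokenize(quotation, rule))
```

With the flag set, "dogs" keeps its mark in both sentences, which is wrong
for the quotation; with the flag unset, the mark is split off in both, which
is wrong for the possessive.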

An attraction of Unicode for me is the reduction in the need for
character-level markup, thanks to the greatly-increased character
repertoire.  I'm concerned that Unicode is not living up to its promise for
text processing here, with its readiness to deviate from the character-glyph
model at the least difficulty.  I just thought I'd see if this view had any
support among the corpus community, who would be among the most likely (I
thought) to benefit from a more consistent encoding of characters rather
than glyphs in Unicode, but it seems not.

I accept that encoders in general may have limited willingness to
distinguish characters with similar appearances, even when they have very
different functions, but I don't see that as an argument for denying the use
of an encoding distinction to those who are prepared to take the trouble
over it in preparing their corpus, in return for the processing benefits.
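
As a rough illustration of the processing benefit, here is a sketch in
Python of what tokenization looks like once the two functions carry
different codepoints.  The choice of U+02BC MODIFIER LETTER APOSTROPHE as
the lexical apostrophe is purely an assumption of this sketch, as is the
example sentence; it is one existing codepoint with letter properties that a
corpus preparer might use for the purpose, not a recommendation:

```python
import re
import unicodedata

LEXICAL_APOSTROPHE = "\u02bc"  # MODIFIER LETTER APOSTROPHE (assumed stand-in)
RIGHT_SINGLE_QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# Hypothetical corpus line in which the preparer has made the distinction.
text = "\u2018Those are the dogs\u02bc bowls,\u2019 she said."

# U+02BC is a letter (category Lm) and U+2019 is punctuation (category Pf),
# so a plain "word or single non-space character" pattern separates them.
print(unicodedata.category(LEXICAL_APOSTROPHE))
print(unicodedata.category(RIGHT_SINGLE_QUOTE))

tokens = re.findall(r"\w+|\S", text)
print(tokens)
```

Here the apostrophe stays attached to "dogs" and the closing quote becomes a
token of its own, with no markup and no guessing from context.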

Ciarán Ó Duibhín.


