[Corpora-List] Encoding of apostrophes and quotes

John F. Sowa sowa at bestweb.net
Fri Jul 7 19:56:02 UTC 2006


That is true:

 > I accept that encoders in general may have limited willingness
 > to distinguish characters with similar appearances, even when
 > they have very different functions, but I don't see that as
 > an argument for denying the use of an encoding distinction to
 > those who are prepared to take the trouble over it in preparing
 > their corpus, in return for the processing benefits.

I have no objections to having such a distinction available.
But those of us who discussed many of the problems with badly
coded data were simply making the point that *all* data encoded
by humans would tend to be highly error prone (even when coded
by highly trained people).

And by the way, the users' familiarity with something is no
guarantee of accuracy.  For the TLG (Thesaurus Linguae Graecae),
the coding done in Greece by people who were familiar with the
alphabet was the most error prone.  The coding done in Korea
and Taiwan was much more accurate.

Therefore, some kind of automated or semi-automated tools will
be necessary to check and enforce any system of encoding.
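
As a rough illustration of what a semi-automated check could look
like, here is a minimal Python sketch. It is not a tool used by the
TLG or by anyone in this thread. The function name flag_marks and
the letter-on-both-sides heuristic are assumptions made for the
example. It marks a character as a probable apostrophe only when it
sits between two letters and sends every other case to a human
reviewer.

```python
import re

# Characters commonly used for both apostrophes and single quotes:
# U+0027 (ASCII apostrophe) and U+2019 (right single quotation mark).
AMBIGUOUS = "'\u2019"


def flag_marks(text):
    """Return (offset, char, guess) for every ambiguous mark in text.

    guess is 'apostrophe' when the mark has a letter on both sides
    (as in "don't"); otherwise it is 'review', meaning a human
    should decide. This is a heuristic, not a full classifier.
    """
    results = []
    for m in re.finditer("[" + AMBIGUOUS + "]", text):
        i = m.start()
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        if before.isalpha() and after.isalpha():
            guess = "apostrophe"
        else:
            guess = "review"
        results.append((i, m.group(), guess))
    return results


if __name__ == "__main__":
    for offset, char, guess in flag_marks("don't say 'hi'"):
        print(offset, repr(char), guess)
```

Cases such as a plural possessive (users') fall into the 'review'
bucket, which is the semi-automated part: the tool narrows down
what a person has to look at rather than deciding everything itself.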

John Sowa


