[Corpora-List] Encoding of apostrophes and quotes

Niels Ott niels at drni.de
Wed Jul 5 08:10:32 UTC 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

E Tonkin wrote:
> Throughout this drama, people have been ordering Beck's Bier with fine
> disregard of any neue deutsche Rechtschreibung!

Even worse, there exist plural forms that cannot even be motivated by an
English-style standard: What about Ampel'n (traffic lights)?

> Of some relevance to this discussion, though I don't know how accurate it
> is, is the note on Wikipedia suggesting that a common side-effect of
> Apostrophitis is the use of a diacritical mark in place of the apostrophe
> itself.

Plus, on the web, there is the problem of pages using the diacritics
from the windows-1252 character set while declaring iso-8859-1.
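
To make the mislabelling concrete, here is a minimal Python sketch of how
a crawler might cope with it. The function name decode_web_text and the
rule "fall back to cp1252 when bytes in the 0x80-0x9F range appear" are my
own assumptions for illustration, not an established recipe; in true
iso-8859-1 that range holds only control characters, so printable text
there is a hint that the label is wrong.

```python
def decode_web_text(raw: bytes, declared: str = "iso-8859-1") -> str:
    """Decode page bytes, distrusting an iso-8859-1 label when the
    0x80-0x9F range is in use (hypothetical helper, for illustration)."""
    normalized = declared.strip().lower()
    has_c1_bytes = any(0x80 <= b <= 0x9F for b in raw)
    if normalized in {"iso-8859-1", "latin-1", "latin1"} and has_c1_bytes:
        # windows-1252 leaves a few bytes in this range undefined;
        # errors="replace" turns those into U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace")
    return raw.decode(normalized, errors="replace")


# 0x92 is the right single quotation mark in windows-1252.
page = b"Beck\x92s Bier"
print(repr(page.decode("iso-8859-1")))  # control character U+0092
print(repr(decode_web_text(page)))      # typographic apostrophe U+2019
```

Whether such a fallback is safe depends on the crawl; it is a heuristic
and will not help with pages that are mislabelled in other ways.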

These are things that concern people who create corpora from the WWW.
Spelling can be very "generic" out there.

Maybe corpus exploration software should take this into account by
offering fuzzy-matching options. (If one can't correct the errors, one
can possibly work around them as a corpus user.)
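
A rough Python sketch of what such an option could look like, assuming a
corpus that is already tokenized into a list of strings. The names
normalize and fuzzy_find, and the particular set of apostrophe
look-alikes, are hypothetical choices of mine and certainly not complete.

```python
# Characters that turn up where an apostrophe was meant (assumed set):
# acute accent, grave accent, left/right single quotation marks,
# modifier letter apostrophe, prime.
APOSTROPHE_LIKE = "\u00b4\u0060\u2018\u2019\u02bc\u2032"
_TO_APOSTROPHE = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def normalize(token: str, ignore_apostrophes: bool = False) -> str:
    """Map apostrophe look-alikes to "'" and optionally drop them all."""
    token = token.translate(_TO_APOSTROPHE)
    if ignore_apostrophes:
        token = token.replace("'", "")
    return token


def fuzzy_find(query, tokens, ignore_apostrophes=False):
    """Return the corpus tokens that equal the query after normalization."""
    key = normalize(query, ignore_apostrophes)
    return [t for t in tokens if normalize(t, ignore_apostrophes) == key]


tokens = ["Beck\u00b4s", "Beck\u2019s", "Beck's", "Becks", "Ampel'n", "Ampeln"]
print(fuzzy_find("Beck's", tokens))
print(fuzzy_find("Ampeln", tokens, ignore_apostrophes=True))
```

The first query finds all three apostrophe variants of "Beck's" but not
"Becks"; the second, with apostrophes ignored, finds both "Ampel'n" and
"Ampeln". The original tokens are returned untouched, so the corpus user
still sees what was actually written.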

Best,

   Niels

- --
Me & Myself & All The Rest: http://www.drni.de/
On the tree there sits a woodpecker; the tree is tall, the woodpecker
feels sick.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFEq3P3bosnVosUgx0RAmomAJ9GifNAhqIyRFmOl8sd6K+rvTlm/gCgmLqd
+Ikh7Esf7I7mxnX2F9fwZfA=
=uwUF
-----END PGP SIGNATURE-----


