Corpora: control chars

Tom Emerson tree at basistech.com
Fri Jun 7 14:00:54 UTC 2002


Gil Graf writes:
> is there any encoding, except utf16, which uses the
> control range (0-31) in a way different than ASCII ?
> more specifically, is it safe to cut off text at 10
> (normally newline) or 32 (normally space) bytes?

The question presumes you are looking at characters in terms of 8-bit
bytes instead of abstract character units consisting of one or more
bytes.

There are some C0 code points you may want to keep:

0x09  Horizontal Tab
0x0A  Line Feed
0x0D  Carriage Return
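
As a rough sketch of what keeping only those three looks like once the
data has been decoded to characters (Python here; the names KEEP and
strip_c0_controls are made up for illustration and are not from the
original question):

```python
# Code points from the C0 range (0x00-0x1F) that we choose to keep.
KEEP = {0x09, 0x0A, 0x0D}  # Horizontal Tab, Line Feed, Carriage Return


def strip_c0_controls(text: str) -> str:
    """Drop every code point in 0x00-0x1F except those listed in KEEP.

    Operates on decoded text (code points), not on raw bytes.
    """
    return "".join(ch for ch in text if ord(ch) > 0x1F or ord(ch) in KEEP)


if __name__ == "__main__":
    sample = "a\x00b\tc\x1b\nd\r"
    print(repr(strip_c0_controls(sample)))
```

This assumes the text is already decoded; applying the same filter to
raw bytes is only reasonable for encodings that share the C0 range at
the byte level.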

I presume you are using a multibyte character encoding in your data:
in that case all instances I can think of (including UTF-8) share the
C0 range. The two- and four-byte encodings of Unicode also have the C0
code points, but at the byte level these carry leading or trailing
0x00 bytes, depending on the byte order of the data (which usually
follows the endianness of the machine that wrote it).
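
To make the byte-level point concrete, here is a small sketch (Python;
the helper name hex_bytes is hypothetical) that prints how Line Feed
comes out in several encodings, and shows that a byte with value 10 in
UTF-16 data is not necessarily a newline:

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))


if __name__ == "__main__":
    for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        print(f"LF in {enc}: {hex_bytes(chr(0x0A), enc)}")

    # U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) contains the byte
    # 0x0a in UTF-16, although it has nothing to do with a line break.
    print("U+010A in utf-16-le:", hex_bytes("\u010a", "utf-16-le"))
    print("U+010A in utf-8:    ", hex_bytes("\u010a", "utf-8"))
```

So cutting at byte 10 is fine for UTF-8 and other encodings that share
the C0 range, but on UTF-16 or UTF-32 data it can split a character in
the middle.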

If you are working with C and are using the wchar_t type, then it is
possible that the system is using UTF-32/UCS-4 as the underlying
character type, in which case the encoding is less of an issue and you
can think purely in terms of code points.
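
Whether wchar_t is wide enough for that depends on the platform. A
quick way to check, sketched here with Python's ctypes module rather
than a C compiler (the function names are made up for illustration):

```python
import ctypes


def wchar_width_bits() -> int:
    """Width of the platform's C wchar_t in bits, as seen by ctypes."""
    return ctypes.sizeof(ctypes.c_wchar) * 8


def wchar_holds_any_code_point() -> bool:
    """True if one wchar_t can hold any Unicode code point.

    Code points run up to U+10FFFF, which needs 21 bits.
    """
    return wchar_width_bits() >= 21


if __name__ == "__main__":
    print("wchar_t width in bits:", wchar_width_bits())
    print("one wchar_t per code point:", wchar_holds_any_code_point())
```

If the width comes back as 16 bits, wchar_t is too narrow for one unit
per code point and the byte-order caveats above still apply.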

HTH(tm),

    -tree

--
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


