Corpora: control chars

Tom Emerson
Fri Jun 7 14:00:54 UTC 2002

Gil Graf writes:
> is there any encoding, except utf16, which uses the
> control range (0-31) in a way different than ASCII ?
> more specifically, is it safe to cut off text at 10
> (normally newline) or 32 (normally space) bytes?

The question presumes you are looking at characters in terms of 8-bit
bytes instead of abstract character units consisting of one or more
bytes.

There are some C0 code points you may want to keep:

0x09  Horizontal Tab
0x0A  Line Feed
0x0D  Carriage Return

I presume you are using a multibyte character encoding in your data;
in that case every encoding I can think of (including UTF-8) shares
the C0 range with ASCII. The two- and four-byte encodings of Unicode
(UTF-16/UCS-2 and UTF-32/UCS-4) also have the C0 code points, but at
the byte level each one carries leading or trailing 0x00 bytes,
depending on the byte order of the data (usually the endianness of the
machine that wrote it).
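To make the byte-level picture concrete, here is a small Python sketch
that encodes a line feed in a few encodings and then checks whether
the byte 0x0A can turn up in text that contains no newline at all. The
variable names and the sample string are my own; the codec names are
Python's standard ones.

```python
LF = "\n"  # U+000A LINE FEED
ENCODINGS = ("ascii", "utf-8", "utf-16-le", "utf-16-be",
             "utf-32-le", "utf-32-be")

# Same code point, different byte sequences.
lf_bytes = {enc: LF.encode(enc) for enc in ENCODINGS}
for enc, raw in lf_bytes.items():
    print(f"{enc:10} {raw.hex(' ')}")

# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) is not a control
# character, yet one of its two UTF-16 bytes is 0x0A.
text = "a\u010Ab"
utf16 = text.encode("utf-16-le")
utf8 = text.encode("utf-8")
print(b"\x0a" in utf16)  # byte 0x0A present although there is no newline
print(b"\x0a" in utf8)   # UTF-8 uses bytes below 0x80 only for ASCII
```

So cutting at byte 10 or 32 is fine for ASCII-compatible encodings
such as UTF-8, but with the two- and four-byte forms you need to
decode first and cut on code points.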

If you are working with C and are using the wchar_t type, then it is
possible that the system is using UTF-32/UCS-4 as the underlying
character type, in which case the encoding is less of an issue and you
can think purely in terms of code points.



Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist               
  "Beware the lollipop of mediocrity: lick it once and you suck forever"
