[sw-l] challenge for programmers - SSS-ID mapping onto Unicode
Tomas Klapka
tomas.klapka at RUCE.CZ
Wed Jun 22 23:33:12 UTC 2005
Hi all ;)
I like this discussion, it is very interesting.
If I say "IMWA finished" I don't mean the IMWA finished finished :)
I think it could be mapped onto Unicode before it is "finished" because the
term "finished" is very relative.
If I understand it well, IMWA is meant to be an alphabet for all possible
movements. It will take ages to finish it.
I agree that we can take some "stable" part of the IMWA and map that part
onto Unicode.
There can be gaps, which can be mapped later; in Unicode these are called
"reserved".
Well, I think it is possible to map the IMWA onto Unicode soon, and it could be
the part of IMWA which is stable and is supposed never to change.
Every new Unicode standard comes with additions and new characters.
Val, I don't think it is necessary to map characters by frequency in languages.
My opinion is that they should be mapped by stability of the symbol. If there
is a group of stable symbols, it is possible to map them onto Unicode, and it
doesn't matter whether they are in SSS order or frequency order, because an
SSS-ID from/to Unicode convert table has to be used anyway if only 16 bits are
supposed to be available for IMWA now. 16 bits is not enough to map the SSS-ID
directly.
Is 16 bits the maximum given to IMWA? I think Unicode has a mechanism to
encode more bits.
But if there are only 16 bits, there is no way IMHO.
I don't think it is necessary to map IMWA in SSS order. Plenty of characters
aren't mapped in any particular order in Unicode. The Czech alphabet isn't
either: special characters with diacritical marks are in the EXTENDED LATIN
chart, and there is a Czech letter 'ch' which is ordered between 'h' and 'i',
not somewhere close to the letter 'c' (between 'cg' and 'ci'). I think it is
not mapped onto Unicode at all, because it is just a linear sequence of two
existing Unicode symbols, 'c' and 'h'.
Well, I have more ideas and opinions which came to my mind.
Now, I don't think rotation should be mapped onto Unicode, because the
purpose of Unicode is to give a unique number to a symbol. If you mirror the
symbol, or if you rotate it by 90 degrees, it is still the same symbol and the
same font can be used (it is easy if fonts are vector). I can rotate Latin
text to whatever angle in a standard word processor.
If x, y coordinates aren't supposed to be in Unicode, why should rotation be
there?
Well if we have 65,536 values for SignWriting in the Unicode to map, we can
count...
There are 6 Fills which need to be mapped:
65,536 / 6 = 10,922 values for base-symbols (Category-Group-Symbol-Variation)
6 different Fills can be mapped onto 3 bits, which can hold 8 values
(so 2 values stay unused).
I think those 2 values can be reserved for any additional Fill invented in the
future, or they could be used for special chars, or some other script could be
adopted there.
If we use those 3 bits for Fill, there are 13 bits left for base-symbols.
13 bits can hold 8,192 values (2^13 or 65,536 / 8).
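To make the 13 + 3 bit split concrete, here is a small Python sketch. The
function names, putting Fill into the lowest 3 bits, and storing Fill 1..8 as
0..7 are only my assumptions for illustration. The results are offsets inside
a 16-bit block, not real Unicode code points.

```python
FILL_BITS = 3    # 6 Fills in use, 8 possible values
BASE_BITS = 13   # what is left of 16 bits for the base-symbol


def pack(base_index, fill):
    """Combine a base-symbol index (0..8191) and a Fill (1..8) into 16 bits."""
    if not 0 <= base_index < (1 << BASE_BITS):
        raise ValueError("base-symbol index does not fit in 13 bits")
    if not 1 <= fill <= (1 << FILL_BITS):
        raise ValueError("Fill must be in 1..8")
    return (base_index << FILL_BITS) | (fill - 1)


def unpack(value):
    """Split a 16-bit value back into (base_index, fill)."""
    return value >> FILL_BITS, (value & ((1 << FILL_BITS) - 1)) + 1


print(1 << BASE_BITS)   # 8192 possible base-symbols
print(pack(424, 6))     # 3397
print(unpack(3397))     # (424, 6)
```

With this layout the Fill can be read or changed without any table, only the
13-bit base-symbol part needs a lookup.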
Now there are 425 base-symbols used in IMWA 2004.
I think there will never be more than 8,192 base-symbols in IMWA, and if there
are, it will be in the far future and those symbols could be mapped in another
Unicode layer. Or the reserved space I mentioned when giving 3 bits to Fills
could be used: 2 of the Fill values in the 3 bits aren't used (now, but maybe
later?), and if we use those 2 values to indicate that 3 other bits hold the
Fill, 2,048 more values could be stored (1,024 for each of the unused Fills,
7 and 8). That gives 10,240 values for base-symbols, and 682 values which can
be reserved for special purposes.
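The counting for the three possible layouts can be checked with a few lines of
Python. The variable names are mine, and I assume the 682 is simply the
difference between the "divide by 6" count and the extended count.

```python
TOTAL = 1 << 16          # 65,536 values in a 16-bit block
FILLS = 6                # Fills used now

# Layout A: divide the block evenly among the 6 Fills.
dense = TOTAL // FILLS                    # 10922

# Layout B: 3 bits for Fill, 13 bits for the base-symbol.
fixed = 1 << 13                           # 8192

# Layout C: Fill values 7 and 8 act as an escape; 3 of the 13 bits
# then hold the real Fill, leaving 10 bits for the base-symbol.
extra_per_escape = 1 << (13 - 3)          # 1024
extended = fixed + 2 * extra_per_escape   # 10240

leftover = dense - extended               # 682

print(dense, fixed, extended, leftover)   # 10922 8192 10240 682
```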
Now if it is used as I say... an SSS-ID from/to Unicode convert table is
needed.
We can have a table of 10,922 values (or 8,192, or 10,240, depending on the
mapping).
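A minimal sketch of such a convert table in Python, just to show that the
order of the rows does not matter once the table exists. The sample rows are
invented for illustration (they are not real IMWA entries), and the names
to_unicode and to_sss are my own.

```python
# Hypothetical "stable" base-symbols as (category, group, symbol, variation).
# They are listed in arbitrary order on purpose.
stable_symbols = [
    (1, 2, 24, 1),
    (1, 1, 1, 1),
    (2, 5, 7, 3),
]

# Offset inside the block -> SSS-ID, and the reverse direction.
to_sss = dict(enumerate(stable_symbols))
to_unicode = {sss: offset for offset, sss in to_sss.items()}

print(to_unicode[(1, 1, 1, 1)])   # 1
print(to_sss[2])                  # (2, 5, 7, 3)
```

A new stable symbol is simply appended, so symbols already mapped never move.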
If the SSS-ID without rotation (which I don't suppose to be in Unicode) is
represented by the mask xx-xx-xxx-xx-xx, the highest value is 99-99-999-99-99,
so there are 100-100-1000-100-100 possible values (counting zero).
100 values can be saved in 7 bits and 1000 values can be saved in 10 bits.
It is 7+7+10+7+7 bits = 38 bits, which needs 5 bytes (40 bits).
Our Unicode value has 16 bits = 2 bytes.
Well, a row of the convert table can be stored in 5 bytes.
And 10,922 rows * 5 bytes = 54,610 bytes (about 53 kB) for the convert table.
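Here is a sketch of how one row could be packed into 5 bytes with the
7-7-10-7-7 split. The function names and the big-endian byte order are my
assumptions; the field names follow the SSS-ID parts (the last one is Fill).

```python
# (name, number of decimal values, bit width)
FIELDS = (
    ("category", 100, 7),
    ("group", 100, 7),
    ("symbol", 1000, 10),
    ("variation", 100, 7),
    ("fill", 100, 7),
)


def pack_sss(*parts):
    """Pack the five SSS-ID parts (without rotation) into 5 bytes."""
    if len(parts) != len(FIELDS):
        raise ValueError("expected 5 parts")
    value = 0
    for (name, count, width), part in zip(FIELDS, parts):
        if not 0 <= part < count:
            raise ValueError(f"{name} out of range: {part}")
        value = (value << width) | part
    return value.to_bytes(5, "big")   # 38 bits used, top 2 bits stay zero


def unpack_sss(row):
    """Reverse of pack_sss: 5 bytes back into the five parts."""
    value = int.from_bytes(row, "big")
    parts = []
    for _name, _count, width in reversed(FIELDS):
        parts.append(value & ((1 << width) - 1))
        value >>= width
    return tuple(reversed(parts))


row = pack_sss(1, 2, 24, 1, 3)
print(len(row), unpack_sss(row))   # 5 (1, 2, 24, 1, 3)
print(10922 * len(row))            # 54610
```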
It can be saved more economically if the table is saved at the bit level (not
at the byte level as I've written above).
There are two blank bits in the SSS-ID representation (without rotation): we
need 38 bits, but to fill up whole bytes 40 bits are needed.
If the last 3 bits are for Fill, those are the same 3 bits as in Unicode,
so there is no need to convert these 3 bits. So it is 31 bits for the SSS-ID
(38 minus the 7 Fill bits) and 13 bits for Unicode, which is 44 bits.
44 bits * 8,192 values = 360,448 bits, which is 45,056 bytes (44 kB).
It is a minimal space saving, so I think a bit-level table is useless, because
it could be slower to manipulate at the bit level.
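To try these sizes with other row counts, a tiny helper is enough. The helper
name is mine; the row widths (40 bits for a 5-byte SSS-ID row, 44 bits for
31 SSS-ID bits plus 13 Unicode bits) are the ones from the text.

```python
def table_bytes(rows, bits_per_row):
    """Bytes needed when rows are packed back to back, rounded up."""
    return (rows * bits_per_row + 7) // 8


for rows in (8192, 10922):
    print(rows, table_bytes(rows, 40), table_bytes(rows, 44))
# 8192 40960 45056
# 10922 54610 60071
```

For the same number of rows the 44-bit row is not smaller than the 5-byte row,
so the difference between 54,610 bytes and about 45,000 bytes comes mostly
from the different row counts, which I think supports keeping the simple
byte-level table.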
If we use a textual representation for the SSS-ID (still without rotation), I
mean "ccggsssvvf" where cc is for category, gg is for group, etc. (for example
0102024013 is 01-02-024-01-03), it has 10 bytes plus 2 bytes for Unicode. For
10,922 rows it is 131,064 bytes (128 kB) for the table.
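A short sketch of reading the "ccggsssvvf" form in Python; the function names
are mine and the dashed output format is only for display.

```python
def parse_text_id(text):
    """Split a 10-digit 'ccggsssvvf' string into its five numeric parts."""
    if len(text) != 10 or not (text.isascii() and text.isdigit()):
        raise ValueError("expected exactly 10 ASCII digits")
    return (
        int(text[0:2]),   # cc  category
        int(text[2:4]),   # gg  group
        int(text[4:7]),   # sss symbol
        int(text[7:9]),   # vv  variation
        int(text[9]),     # f   fill
    )


def format_dashed(parts):
    """Show the parts in the dashed xx-xx-xxx-xx-xx style."""
    c, g, s, v, f = parts
    return f"{c:02d}-{g:02d}-{s:03d}-{v:02d}-{f:02d}"


print(format_dashed(parse_text_id("0102024013")))   # 01-02-024-01-03
print((10 + 2) * 10922)                             # 131064 bytes for the table
```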
Well, it is up to the renderer how it converts SSS-ID from/to Unicode. I think
54,610 bytes is a fine convert table, which could be fast to search/seek/manage.
But maybe I am wrong ;-)
Some base-symbols have fewer than 6 Fills... those unused Fills would be blank
(as is sometimes seen in the current Unicode).
I think it is important to map IMWA onto Unicode because of the
standardization and implementation.
Sure, the Unicode is not the only solution for SW expansion, but I just feel
it as I feel it ;o)
Thanks,
Tomas