[sw-l] challenge for programmers - SSS-ID mapping onto unicode
Tomáš Klapka
Tomas.Klapka at RUCE.CZ
Wed Jun 22 16:17:07 UTC 2005
Hi all,
If there are 8 categories, 10 groups, 50 symbols, 5 variations, 6 fills
and 16 rotations, that makes 8*10*50*5*6*16 = 1,920,000 (about 1.9
million) possible combinations. But the IMWA does not use that many.
Let's look at the number of combinations for each number of bits...
16 bits has 2^16 = 65,536 combinations
HERE is SSS-ID without rotation = 120,000 combinations
17 bits has 2^17 = 131,072 combinations
18 bits has 2^18 = 262,144 combinations
19 bits has 2^19 = 524,288 combinations
20 bits has 2^20 = 1,048,576 combinations
HERE is complete SSS-ID = 1,920,000 combinations
21 bits has 2^21 = 2,097,152 combinations
Well, it is impossible to put all 1,920,000 combinations into 16 bits.
We need 21 bits.
But if we leave out the rotation, there are only 120,000 combinations,
which fit in 17 bits. That is very close.
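As a quick check of the arithmetic above, here is a small Python sketch. The names MAX_VALUES, combinations and bits_needed are only illustrative; the highest values are the ones quoted in this thread. The exact product is 1,920,000.

```python
import math

# Highest value of each SSS-ID part in the current IMWA (from this thread).
MAX_VALUES = {"category": 8, "group": 10, "symbol": 50,
              "variation": 5, "fill": 6, "rotation": 16}

def combinations(parts):
    """Product of the highest values of the given SSS-ID parts."""
    return math.prod(MAX_VALUES[p] for p in parts)

def bits_needed(n):
    """Smallest number of bits b such that 2**b >= n."""
    return (n - 1).bit_length()

all_parts = list(MAX_VALUES)
full = combinations(all_parts)               # complete SSS-ID
no_rotation = combinations(all_parts[:-1])   # everything except rotation

print(full, bits_needed(full))               # 1920000 21
print(no_rotation, bits_needed(no_rotation)) # 120000 17
```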
OPTIMIZATION
Some SSS-ID combinations are never used, for example SSS-ID
01-01-042-02-03-14. It is one of the 1,920,000 combinations, but there
are no more than 13 symbols in category 1, group 1.
We can optimize if we know how many groups there are in each
category...
category 1 has 10 groups
category 2 has 10 groups
category 3 has 10 groups
category 4 has 5 groups
category 5 has 5 groups
category 6 has 2 groups
category 7 has 4 groups
category 8 has 4 groups
Well, that is 50 groups in all, not 80 (8 categories * 10 groups).
This information is a small table with 8 rows, so it is easy to
implement and to use.
That gives 50 (groups in 8 categories)*50*5*6*16 = 1,200,000 combinations.
Without rotation it is 50*50*5*6 = 75,000 combinations.
That is still too many.
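The 8-row table could be used roughly like this. This is only a sketch in Python; GROUPS_PER_CATEGORY holds the counts listed above, and group_index is a made-up helper that turns (category, group) into one running number from 0 to 49.

```python
# Number of groups in each category, as listed above.
GROUPS_PER_CATEGORY = {1: 10, 2: 10, 3: 10, 4: 5, 5: 5, 6: 2, 7: 4, 8: 4}

def group_index(category, group):
    """Zero-based running index of (category, group) over all categories."""
    if not 1 <= group <= GROUPS_PER_CATEGORY[category]:
        raise ValueError("no such group in this category")
    # Count the groups of all earlier categories, then add the offset.
    offset = sum(n for c, n in GROUPS_PER_CATEGORY.items() if c < category)
    return offset + group - 1

total_groups = sum(GROUPS_PER_CATEGORY.values())
with_rotation = total_groups * 50 * 5 * 6 * 16
without_rotation = total_groups * 50 * 5 * 6

print(total_groups, with_rotation, without_rotation)  # 50 1200000 75000
print(group_index(4, 1))                               # 30
```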
If you know how many symbols there are in each of these 50 groups, you
can make a conversion table:
group 1 has 13 symbols
group 2 has 12 symbols
group 3 has 21 symbols
group 4 has 7 symbols
...and so on...
It is
13+12+21+07+50+21+13+14+33+11+
01+11+15+04+15+14+12+15+15+13+
01+01+01+01+01+01+01+01+01+01+
01+01+02+02+01+
01+01+01+04+04+
05+04+
01+02+01+02+
01+02+03+02 =
361 symbols in all.
361*5*6*16 = 173,280 combinations if we have a conversion table with 50 rows.
361*5*6 = 10,830 combinations without rotation!
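A sketch of the 50-row conversion table in Python, reading each line of the sum above as one category (the line lengths match the groups per category). The helper symbol_index is hypothetical; it gives each (category, group, symbol) a row number from 0 to 360 and rejects combinations that do not exist, such as symbol 042 in category 1, group 1.

```python
# Symbols per group, one row per category, copied from the sum above.
SYMBOLS_PER_GROUP = [
    [13, 12, 21, 7, 50, 21, 13, 14, 33, 11],  # category 1
    [1, 11, 15, 4, 15, 14, 12, 15, 15, 13],   # category 2
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],           # category 3
    [1, 1, 2, 2, 1],                          # category 4
    [1, 1, 1, 4, 4],                          # category 5
    [5, 4],                                   # category 6
    [1, 2, 1, 2],                             # category 7
    [1, 2, 3, 2],                             # category 8
]

def symbol_index(category, group, symbol):
    """Zero-based row of (category, group, symbol) in a flat symbol table."""
    row = SYMBOLS_PER_GROUP[category - 1]
    if not 1 <= group <= len(row) or not 1 <= symbol <= row[group - 1]:
        raise ValueError("no such symbol")
    # Symbols in all earlier categories, plus earlier groups of this one.
    before = sum(sum(r) for r in SYMBOLS_PER_GROUP[:category - 1])
    before += sum(row[:group - 1])
    return before + symbol - 1

total_symbols = sum(sum(row) for row in SYMBOLS_PER_GROUP)
print(total_symbols)                # 361
print(total_symbols * 5 * 6 * 16)   # 173280
print(total_symbols * 5 * 6)        # 10830
```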
Well, let's make a table of symbols with 361 rows (it is still usable).
In categories 1, 3, 4, 5 and 8, no symbol has more than 1
variety (it is )
In category 2 there are 01+11+00+00+00+00+15+00+13 symbols with only one
variety, 00+00+10+04+09+05+00+05+00 with 2 varieties,
00+00+07+04+05+04+00+00 with 3 varieties and 00+00+01+00+01+00+00+00
with 4 varieties. That is 33 + 20 + 2 = 55 more variations.
In category 6 there is only one symbol with more than one variety, and
it has 5 varieties, which is 4 more variations.
In category 7 there are 3 symbols with 2 varieties and 2 symbols with 3
varieties, which is 5 more variations.
That is 55 + 4 + 5 = 64 more variations, so there are 361 + 64 = 425
variations in all. Only 425? Didn't I make a mistake somewhere?
Well, 425*6*16 is 40,800 combinations (2,550 without rotation)!
425 rows is still not such a large conversion table, so it could be
optimized in one more step.
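The variation count can be re-added in a few lines of Python. This sketch assumes that each list counts the symbols having at least that many varieties, so every entry adds exactly one variation; that reading is an assumption, but it is the one under which the sums come to 55 and 5. The list names are made up.

```python
# Category 2, per-group counts as listed above (assumed to mean
# "symbols with at least N varieties").
CAT2_WITH_2 = [0, 0, 10, 4, 9, 5, 0, 5, 0]
CAT2_WITH_3 = [0, 0, 7, 4, 5, 4, 0, 0]
CAT2_WITH_4 = [0, 0, 1, 0, 1, 0, 0, 0]

extra_cat2 = sum(CAT2_WITH_2) + sum(CAT2_WITH_3) + sum(CAT2_WITH_4)
extra_cat6 = 5 - 1   # one symbol with 5 varieties
extra_cat7 = 3 + 2   # 3 symbols with 2 varieties, 2 of them with a 3rd

total_symbols = 361
total_variations = total_symbols + extra_cat2 + extra_cat6 + extra_cat7

print(extra_cat2, total_variations)   # 55 425
print(total_variations * 6 * 16)      # 40800
print(total_variations * 6)           # 2550
```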
If I list all the SSS-IDs I can find, there are only 1884 of them with
rotation 01 (1867 with rotation 02, and so on):
rotation   number of SSS-IDs
01         1884
02         1867
03         1781
04         1835
05         1730
06         1724
07         1660
08         1711
09         1481
10         1480
11         1465
12         1474
13         1471
14         1471
15         1462
16         1477
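Adding up this table gives the number of SSS-IDs actually in use. A small Python check (the list name is only illustrative) comes to 25,973, which is in line with the "around 26 thousand symbols" in Steve's message quoted below. A plain running number over the existing symbols would fit in 15 bits, but it would leave no predictable place for new symbols.

```python
# SSS-IDs found per rotation 01..16, copied from the table above.
ROTATION_COUNTS = [1884, 1867, 1781, 1835, 1730, 1724, 1660, 1711,
                   1481, 1480, 1465, 1474, 1471, 1471, 1462, 1477]

total_in_use = sum(ROTATION_COUNTS)
bits = (total_in_use - 1).bit_length()  # bits for a plain running number

print(total_in_use, bits)  # 25973 15
```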
LET'S GO BACKWARDS
Now I see it would be better to work backwards from the SSS-ID.
I have 65,536 combinations in 16 bits.
All 16 rotations are frequently used, so it is not economical to
optimize them.
65,536/16 leaves 4096 symbols with all variations and fills.
Fill is also used often, so without optimizing it (if 6 is the
highest value)...
4096/6 leaves 682 possible variations.
Right now there are 425 variations in the IMWA (if I counted right).
It seems you are right, Steve, that the full SSS-ID numbers cannot be
properly mapped onto Unicode.
But they can be mapped onto Unicode with a shortened SSS-ID xxx-x-xx,
which is Variation-Fill-Rotation with highest values 682-6-16.
And we can have a table with 682 variations, where variation
001 has value 01-01-001-01 (which is category, group, symbol and
variation in SSS-ID)
002 has value 01-01-002-01
.
.
.
207 has value 02-02-011-01
208 has value 02-03-001-01
209 has value 02-03-001-02
.
.
.
424 has value 08-04-001-01
425 has value 08-04-002-01
426 is the first free value, and there are 256 more free codes after it.
682 rows is a pretty small table for remapping 65,536 possible symbols
onto Unicode. This table has to be made when the IMWA is finished,
because of the order of the sequence.
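Here is how the shortened Variation-Fill-Rotation number could be packed into 16 bits. This is only a Python sketch: the functions encode, decode and sss_id are made up for illustration, the table holds just the few rows named above, and how the resulting number would be placed in a real Unicode range is left open.

```python
MAX_VARIATION, MAX_FILL, MAX_ROTATION = 682, 6, 16

# A few rows of the variation table, taken from the examples above.
VARIATION_TABLE = {
    1: "01-01-001-01", 2: "01-01-002-01", 207: "02-02-011-01",
    208: "02-03-001-01", 209: "02-03-001-02",
    424: "08-04-001-01", 425: "08-04-002-01",
}

def encode(variation, fill, rotation):
    """Pack 1-based (variation, fill, rotation) into one 16-bit number."""
    for value, limit in ((variation, MAX_VARIATION), (fill, MAX_FILL),
                         (rotation, MAX_ROTATION)):
        if not 1 <= value <= limit:
            raise ValueError("value out of range")
    return ((variation - 1) * MAX_FILL + (fill - 1)) * MAX_ROTATION + (rotation - 1)

def decode(code):
    """Inverse of encode: return 1-based (variation, fill, rotation)."""
    rest, rotation = divmod(code, MAX_ROTATION)
    variation, fill = divmod(rest, MAX_FILL)
    return variation + 1, fill + 1, rotation + 1

def sss_id(code):
    """Rebuild the full SSS-ID string using the variation table."""
    variation, fill, rotation = decode(code)
    return f"{VARIATION_TABLE[variation]}-{fill:02d}-{rotation:02d}"

print(encode(682, 6, 16))          # 65471, still below 65536
print(sss_id(encode(209, 2, 5)))   # 02-03-001-02-02-05
```

Rotation sits in the lowest 4 bits, so the 16 rotations of one symbol always get neighbouring numbers.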
But if I imagine a sequence of Unicode... it is just a linear sequence
of IMWA symbols. There must be a more complicated format that uses
Unicode (because of the encoding and font compatibility) together with
the position of each symbol and control signs (end of sign, space,
color, etc.).
There is no space left to map the x, y position of a symbol onto
Unicode, and I don't think that is the purpose of Unicode. Rendering of
the sign is up to the rendering software.
The question is whether 682 variations is enough (if 425 variations are
used now).
Well, I hope my mathematics is useful :)
Tomas
Steve Slevinski wrote:
> Hi Val,
>
> My 2 cents, SVG is the next step. It is required for quality
> publishing of SignWriting documents with the IMWA. However, if done
> right, it will be compatible with our current work so there is no
> hurry. Unless someone else works on it first, I will get to it
> eventually.
>
> For SVG we will need to convert every IMWA symbol from a static
> graphic into a vector graphic. There are applications that may be
> able to do the conversion automatically. However, we would need to
> verify every symbol. We would then need to verify the current IMWA
> based signs. Since the IMWA has around 26 thousand symbols, this
> could take a while.
>
> We do not need Unicode. I believe that Unicode could harm the IMWA if
> done too soon.
>
> If we are concerned about document size, we need binary. Binary will
> change the SSS-IDs from an 18 character string into a binary
> equivalent using 1/6 the amount of data.
>
> Since this is a challenge for programmers, I'll get right down to the
> bits and the SSS-ID numbers and explain why the SSS-ID numbers can not
> be properly mapped onto Unicode.
>
> **** Warning, this is entirely too much information! ****
>
> What is an SSS-ID number?
> The SSS-ID number is a unique character string for every symbol of the
> IMWA. The SSS-ID number has the format of "xx-xx-xxx-xx-xx-xx" where
> x is a number from 0 to 9. The SSS-ID number has 6 parts (Category -
> Group - Symbol - Variation - Fill - Rotation). If we look at the
> first symbol of the IMWA this should make more sense.
>
>
> 01-01-001-01-01-01
>
> Category 01
> Group 01
> Symbol 001
> Variation 01
> Fill 01
> Rotation 01
>
> What is a bit?
> A bit is 1 or 0. It is the smallest value a computer can work with.
> It is called an on / off switch.
>
> 1 bit can represent 2 values
> -------------------
> 0
> 1
>
> 2 bits can represent 4 values
> --------------------
> 00
> 01
> 10
> 11
>
> 3 bits can represent 8 values
> --------------------
> 000
> 001
> 010
> 011
> 100
> 101
> 110
> 111
>
> Basic ASCII uses 7 bits. 7 bits can represent 128 values (2^7 or
> 2*2*2*2*2*2*2). The letter A is 65, or "1000001" in binary.
>
> Unicode was designed with 16-bits. 16 bits can represent over 65
> thousand values. Originally this was thought to be enough. It was
> not. Unicode was extended to have multiple layers (planes), but each
> layer still only has 16 bits.
>
> The IMWA has around 26 thousand symbols. This should be able to fit
> on one layer of Unicode (layer 3 would be perfect), however the IMWA
> is still growing so this is a problem for encoding. If we squeeze the
> symbols too close, we won't be able to add new symbols. If we don't
> squeeze them close enough, we run out of room.
>
> Let's take a specific example to help clear this up.
>
> Here is the first symbol of the IMWA again.
>
> 01-01-001-01-01-01
>
> If this symbol could be placed in Unicode, it would use 16 bits.
> Since it is first in the alphabet, it would have the value of 1 or
> "0000000000000001" in binary.
>
> If we store this symbol using the SSS-ID, we would use 18 characters
> (01-01-001-01-01-01). Since each character uses 8 bits, we would be
> using 144 bits. This is much bigger than 16 bits, but it is very clear.
>
> So we need a mapping from SSS-ID number to a specific number of bits.
> Since the SSS-ID number is very regular, we can state a maximum number
> of bits possible.
>
> Category - Group - Symbol - Variation - Fill - Rotation
>
> Every part of the SSS-ID uses 2 numbers except for the Symbol part
> which uses 3 numbers. 99 is the largest value for 2 numbers which
> would be covered by 7 bits (2^7 = 128). 999 is the largest value for
> 3 numbers which would be covered by 10 bits (2^10 = 1024). So....
> 7 bits - 7 bits - 10 bits - 7 bits - 7 bits - 7 bits = 45 bits.
>
> If we analyze the current IMWA, we can get a max number for each
> position in the SSS-ID number.
>
> Highest values in the current IMWA.
> Category - 8
> Group - 10
> Symbol - 50
> Variation - 5
> Fill - 6
> Rotation - 16
>
> So a bit number optimized for the current IMWA would be...
> 3 bits - 4 bits - 6 bits - 3 bits - 3 bits - 4 bits = 23 bits.
>
> So if we used 45 bits, we would never have a problem with new symbols
> being added to the IMWA. And we could save half the space again if we
> optimized the bits for the current IMWA.
>
> Unicode uses 16 bits so we would need an additional optimization to
> squeeze the IMWA number system from 23 bits into 16 bits. However,
> since the IMWA is still growing, we don't know where the new symbols
> will show up. Since Unicode is not allowed to change once it has been
> defined, any optimization could lead to potential problems. For that
> reason, I think the 45 bit option would be preferred.
>
> And that's just for the symbols themselves. We still have the XY
> coordinates and color for each symbol. But that's enough for now.
>
> -Steve
>
>
> Valerie Sutton wrote:
>
>> SignWriting List
>> June 21, 2005
>>
>>> On Jun 21, 2005, at 4:40 PM, Stuart Thiessen wrote:
>>> A clarification on this: I completely agree that SWML is a valuable
>>> step to making SW searchable and easily transported. However, SWML
>>> as such does not handle the display of SW, only the storage. So
>>> computer software that reads SWML will have to use some kind of
>>> display process to make the SW data visual. This display process
>>> could use SVG images, PNG images, or a Unicode font to provide the
>>> displayed images depending on the program. So, we need to separate
>>> the roles of SWML and display. SWML only has to do with storage
>>> and retrieval of data, but not display.
>>
>>
>> I see. Thanks for explaining this to me! So when Steve is using SWML
>> to store data in SignPuddle, he is using PNGs to do the visual
>> display of what the SWML says should be displayed? I wasn't aware of
>> this...I am glad to know this...
>>
>>> Until SW is finally in Unicode, SW is just graphics because that is
>>> the only display mechanism we have for SW. The value of SWML is
>>> that we are now able to search it with a variety of programs. SW-
>>> DOS by comparison probably could have been equally as searchable
>>> but because of its binary format, that made it much more difficult
>>> compared to SWML. But search capability and display capabilities
>>> are two different "animals". The value of Unicode is simply this:
>>> hearing people will probably not fully appreciate SW until it is
>>> available in Unicode and it is able to be composed just like spoken
>>> languages (in a manner of speaking). This is simply because it
>>> takes much less room to store Unicode symbols than it does to store
>>> graphic images. The display happens either way, but I'm talking
>>> here more about "political" respect or the perceived reality of
>>> SW's status as a genuine writing system.
>>
>>
>> OK. What about SVG? I remember years ago, Antonio Carlos came to
>> visit me from Brazil, and was eager to explain both SWML and SVG to
>> me...I remember feeling amazed at the possibilities when he showed
>> me a SignWriting symbol being drawn on the web in front of my eyes
>> in SVG...Now that we see that SWML is really becoming important, I
>> wonder if SVG isn't next?
>>
>> That does not mean that I don't think Unicode is a terrific
>> idea...it is just that Unicode takes money and time, and if PNG
>> display is the only alternative right now, then maybe SVG could be
>> another alternative until Unicode is available for SignWriting?
>>
>> Did you know that the French have interest in developing a way to
>> apply SignWriting to Unicode? I wonder if Mr. Dalle and Mr. Aznar
>> from France wouldn't be interested in working with SIL on the
>> Unicode project? Do you think SIL could be interested?...
>>
>>> Also, the use of Unicode will not make SWML obsolete. In fact, I
>>> think that SWML will be even more useful because instead of having
>>> special code numbers in the markup, we can actually embed the
>>> Unicode character for that SW symbol. This will make SWML files
>>> more compact and more easily read and further enhance its
>>> usefulness. But that is a little more down the road until funding
>>> and resources become available. Once funding is available, we can
>>> certainly begin work on it and then just wait on a final submission
>>> until we feel the IMWA is more stable.
>>
>>
>> I see. Very interesting, Stuart! You know so much! ;-)
>>
>> Thanks for your patience with me and all those symbols in the
>> IMWA!...I actually am not necessarily in favor of placing the whole
>> IMWA into Unicode. I think we should do a Symbol-Frequency test on
>> dictionaries to pin down the symbols that you really are using, and
>> then use the Language-specific symbolset to be the first SignWriting
>> Unicode...in other words...Unicode US, Unicode NO, etc...based on
>> only those SignWriting symbols used in one language...why slow down
>> the Unicode development for SignWriting, just because DanceWriting
>> has not been entered into the IMWA yet? And is there really a
>> Unicode for music sounds? No. So why should DanceWriting be in
>> Unicode?...Unicode should be for SignWriting specific to one sign
>> language...
>>
>> Just a thought. I will leave Unicode development to you and the next
>> generation!
>>
>> Val ;-)
>>
>>
>