[sw-l] challenge for programmers - SSS-ID mapping onto unicode

Wed Jun 22 21:55:31 UTC 2005

Hi Tomas,

Sorry about my previous email.  Your math is correct.  Your conversion
idea would work just fine.  Very elegant.

However, I don't like the idea of closing the IMWA, because once the
IMWA is entered into Unicode it is not allowed to change.

In the future, if someone wanted to use the IMWA to document unusual
handshapes used in Tibetan meditation and prayer, I would hate to have to
tell them "Sorry, you're out of luck.  Just last week we closed the IMWA
because we wanted to assign a unique number from 1 to 65,536 for the
existing symbols."

If we keep using the SSS ID numbers to identify the symbols, we'll never
have to close the IMWA.  While that idea may make Val groan, the
flexibility it offers is worth it.

-Steve

Tomáš Klapka wrote:

> Hi all,
>
> If there are 8 categories, 10 groups, 50 symbols, 5 variations, 6 fills
> and 16 rotations, that makes 8*10*50*5*6*16 = 1,920,000 (about 1,900,000)
> possible combinations. But the IMWA does not use that many combinations.
>
> Let's look at the number of combinations for each number of bits...
>
> 16 bits has 2^16 = 65,536 combinations
> HERE is SSS-ID without rotation = 120,000 combinations
> 17 bits has 2^17 = 131,072 combinations
> 18 bits has 2^18 = 262,144 combinations
> 19 bits has 2^19 = 524,288 combinations
> 20 bits has 2^20 = 1,048,576 combinations
> HERE is complete SSS-ID = 1,900,000 combinations
> 21 bits has 2^21 = 2,097,152 combinations
>
> Well, it is impossible to put all 1,900,000 combinations into 16 bits.
> We need 21 bits.
> But if we don't use rotation, there are only 120,000 combinations, which
> fit in 17 bits, and that is very close to 16.
>
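The counts above can be checked with a short Python sketch. The field maxima are the ones quoted in the message; the names FIELD_MAXIMA and bits_needed are only illustrative.

```python
from math import prod

# Highest value of each SSS-ID field, as quoted in the message.
FIELD_MAXIMA = {
    "category": 8,
    "group": 10,
    "symbol": 50,
    "variation": 5,
    "fill": 6,
    "rotation": 16,
}


def bits_needed(combinations: int) -> int:
    """Smallest number of bits b such that 2**b >= combinations."""
    return (combinations - 1).bit_length()


full = prod(FIELD_MAXIMA.values())
without_rotation = full // FIELD_MAXIMA["rotation"]

print(full, bits_needed(full))
print(without_rotation, bits_needed(without_rotation))
```

The exact product is 1,920,000, which the message rounds to 1,900,000; the bit counts are unaffected by the rounding.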
> OPTIMIZATION
> There are SSS-ID combinations which are never used, for example
> SSS-ID 01-01-042-02-03-14. It is one of the 1,900,000 combinations,
> but there are no more than 13 symbols in category 1, group 1.
>
> There is a way to optimize if you know how many groups are
> in each category...
>
> category 1 has 10 groups
> category 2 has 10 groups
> category 3 has 10 groups
> category 4 has 5 groups
> category 5 has 5 groups
> category 6 has 2 groups
> category 7 has 4 groups
> category 8 has 4 groups
>
> Well, that is 50 groups in total and not 80 (8 categories * 10 groups).
>
> This information is a small table with 8 rows, so it is easy to implement
> and to use.
>
> So it is 50 (groups in 8 categories)*50*5*6*16 = 1,200,000 combinations.
> Without rotation it is 50*50*5*6 = 75,000 combinations.
>
> It is still too much.
>
> If you know how many symbols are in each of these 50 groups, you can
> make a conversion table:
> group 1 has 13 symbols
> group 2 has 12 symbols
> group 3 has 21 symbols
> group 4 has 7 symbols
> ...and so on...
>
> It is
> 13+12+21+07+50+21+13+14+33+11+
> 01+11+15+04+15+14+12+15+15+13+
> 01+01+01+01+01+01+01+01+01+01+
> 01+01+02+02+01+
> 01+01+01+04+04+
> 05+04+
> 01+02+01+02+
> 01+02+03+02 =
>   361 symbols in total
>
> 361*5*6*16 = 173,280 combinations if we have a conversion table with 50 rows.
> 361*5*6 = 10,830 combinations without rotation!
>
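A minimal Python sketch of the 50-row conversion table described above. The per-group counts are copied from the sum in the message; assigning each row of that sum to a category (by matching the group counts 10, 10, 10, 5, 5, 2, 4, 4) is an assumption, and dense_index is a hypothetical helper name.

```python
# Symbols per group, one row per category (rows assigned by group count).
SYMBOLS_PER_GROUP = [
    [13, 12, 21, 7, 50, 21, 13, 14, 33, 11],  # category 1
    [1, 11, 15, 4, 15, 14, 12, 15, 15, 13],   # category 2
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],           # category 3
    [1, 1, 2, 2, 1],                          # category 4
    [1, 1, 1, 4, 4],                          # category 5
    [5, 4],                                   # category 6
    [1, 2, 1, 2],                             # category 7
    [1, 2, 3, 2],                             # category 8
]

total_groups = sum(len(row) for row in SYMBOLS_PER_GROUP)
total_symbols = sum(sum(row) for row in SYMBOLS_PER_GROUP)


def dense_index(category: int, group: int, symbol: int) -> int:
    """Map a 1-based (category, group, symbol) to a number 1..total_symbols."""
    if not 1 <= category <= len(SYMBOLS_PER_GROUP):
        raise ValueError("no such category")
    counts = SYMBOLS_PER_GROUP[category - 1]
    if not 1 <= group <= len(counts):
        raise ValueError("no such group in this category")
    if not 1 <= symbol <= counts[group - 1]:
        raise ValueError("no such symbol in this group")
    # Count every symbol in earlier categories, then in earlier groups.
    before = sum(sum(row) for row in SYMBOLS_PER_GROUP[:category - 1])
    before += sum(counts[:group - 1])
    return before + symbol


print(total_groups, total_symbols)
print(total_symbols * 5 * 6 * 16, total_symbols * 5 * 6)
```

With this table, an unused combination such as symbol 042 in category 1, group 1 is rejected, because that group has only 13 symbols.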
> Well, let's make a table of symbols with 361 rows (it is still usable).
> In categories 1, 3, 4, 5 and 8, no symbol has more than 1
> variation.
> In category 2 there are 01+11+00+00+00+00+15+00+13 symbols with only one
> variation.
>
> In category 2 there are 00+00+10+04+09+05+00+05+00 symbols with 2 variations,
> 00+00+07+04+05+04+00+00 with 3 variations, and 00+00+01+00+01+00+00+00 with
> 4 variations. = 55 more variations.
> In category 6 there is only one symbol with more than one variation, and
> it has 5 variations, which is 4 more variations.
> In category 7 there are 3 symbols with 2 variations and 2 symbols with
> 3 variations, which is 5 more variations.
>
> That is 64 more variations, so it is 361 + 64 variations in total, which
> is 425!!! Is it only 425? Did I make a mistake somewhere?
>
> well, 425*6*16 is 40,800 combinations (2550 without rotation)!
>
> A 425-row conversion table is still not very large, so it could be
> optimized in one more step.
>
> If I list all the SSS-IDs I can find, there are only 1884 SSS-IDs with
> rotation 01, 1867 with rotation 02, and so on:
> rotation   number of SSS-IDs
> 01   1884
> 02   1867
> 03   1781
> 04   1835
> 05   1730
> 06   1724
> 07   1660
> 08   1711
> 09   1481
> 10   1480
> 11   1465
> 12   1474
> 13   1471
> 14   1471
> 15   1462
> 16   1477
>
> LET'S GO BACKWARDS
> Now I see it would be better to go backwards through the SSS-ID.
> I have 65,536 combinations in 16 bits.
> All 16 rotations are frequently used, so it is not economical to
> optimize them away.
> 65,536/16 is 4096 symbols with all variations and fills.
> Fill is also often used, so without optimization (if 6 is the
> highest value)...
> 4096/6 is 682 possible variations.
>
> Now there are 425 variations in the IMWA (if I count right).
>
> It seems you are right, Steve, that SSS-ID numbers can not be properly
> mapped onto Unicode.
>
> It can be mapped onto Unicode by a shortened SSS-ID xxx-x-xx, which is
> Variation-Fill-Rotation, with highest values 682-6-16.
> And we can have a table with 682 variations, where variation
> 001 has value 01-01-001-01 (which is category, group, symbol and
> variation in SSS-ID)
> 002 has value 01-01-002-01
> .
> .
> .
> 207 has value 02-02-011-01
> 208 has value 02-03-001-01
> 209 has value 02-03-001-02
> .
> .
> .
> 424 has value 08-04-001-01
> 425 has value 08-04-002-01
> 426 is the first free value and there are 256 more free codes.
>
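A minimal Python sketch of the shortened Variation-Fill-Rotation mapping described above. The zero-based code layout and the names encode and decode are assumptions made for illustration; the table holds only the rows quoted in the message.

```python
CODE_SPACE = 2 ** 16
ROTATIONS = 16
FILLS = 6
# Going backwards: 65,536 / 16 rotations / 6 fills, rounded down.
MAX_VARIATIONS = CODE_SPACE // ROTATIONS // FILLS

# Variation number -> category-group-symbol-variation (quoted rows only).
VARIATION_TABLE = {
    1: "01-01-001-01",
    2: "01-01-002-01",
    207: "02-02-011-01",
    208: "02-03-001-01",
    209: "02-03-001-02",
    424: "08-04-001-01",
    425: "08-04-002-01",
}


def encode(variation: int, fill: int, rotation: int) -> int:
    """Pack 1-based variation, fill and rotation into one 16-bit code."""
    if not 1 <= variation <= MAX_VARIATIONS:
        raise ValueError("variation out of range")
    if not 1 <= fill <= FILLS:
        raise ValueError("fill out of range")
    if not 1 <= rotation <= ROTATIONS:
        raise ValueError("rotation out of range")
    return ((variation - 1) * FILLS + (fill - 1)) * ROTATIONS + (rotation - 1)


def decode(code: int) -> tuple:
    """Inverse of encode: return (variation, fill, rotation), all 1-based."""
    if not 0 <= code < MAX_VARIATIONS * FILLS * ROTATIONS:
        raise ValueError("code out of range")
    rest, rotation = divmod(code, ROTATIONS)
    variation, fill = divmod(rest, FILLS)
    return variation + 1, fill + 1, rotation + 1


code = encode(209, 3, 14)
variation, fill, rotation = decode(code)
print(code, VARIATION_TABLE[variation], fill, rotation)
```

The largest code produced this way is 65,471, so the scheme stays inside 16 bits and leaves 64 codes unused at the top.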
> 682 rows is a pretty small table for remapping 65,536 possible symbols
> onto Unicode. This table has to be made when the IMWA is finished, because
> of the order sequence.
> But if I imagine a sequence of Unicode... it is just a linear sequence
> of IMWA symbols. There must be a more complicated format which uses Unicode
> (because of the encoding and font compatibility) together with the position
> of the symbol and control signs (end of sign, space, color, etc.).
>
> There is no more space to map the x, y position of the symbol onto Unicode,
> and I don't think that is the purpose of Unicode. Rendering of
> the sign is up to the rendering software.
>
> The question is whether 682 variations are enough (if 425 variations are
> used now)?
>
> Well, I hope my math is useful :)
>
> Tomas
>
> Steve Slevinski wrote:
>
>> Hi Val,
>>
>> My 2 cents, SVG is the next step.  It is required for quality
>> publishing of SignWriting documents with the IMWA.  However, if done
>> right, it will be compatible with our current work so there is no
>> hurry.  Unless someone else works on it first, I will get to it
>> eventually.
>>
>> For SVG we will need to convert every IMWA symbol from a static
>> graphic into a vector graphic.  There are applications that may be
>> able to do the conversion automatically.  However, we would need to
>> verify every symbol.  We would then need to verify the current IMWA
>> based signs.  Since the IMWA has around 26 thousand symbols, this
>> could take a while.
>>
>> We do not need Unicode.  I believe that Unicode could harm the IMWA
>> if done too soon.
>>
>> If we are concerned about document size, we need binary.  Binary will
>> change the SSS-IDs from an 18 character string into a binary
>> equivalent using 1/6 the amount of data.
>>
>> Since this is a challenge for programmers, I'll get right down to the
>> bits and the SSS-ID numbers and explain why the SSS-ID numbers can
>> not be properly mapped onto Unicode.
>>
>> **** Warning, this is entirely too much information!  ****
>>
>> What is an SSS-ID number?
>> The SSS-ID number is a unique character string for every symbol of
>> the IMWA.  The SSS-ID number has the format of "xx-xx-xxx-xx-xx-xx"
>> where x is a number from 0 to 9.  The SSS-ID number has 6 parts
>> (Category - Group - Symbol - Variation - Fill - Rotation).  If we
>> look at the first symbol of the IMWA this should make more sense.
>>
>>
>> 01-01-001-01-01-01
>>
>> Category 01
>> Group  01
>> Symbol 001
>> Variation 01
>> Fill 01
>> Rotation 01
>>
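A minimal Python sketch of splitting an SSS-ID string into its six parts, assuming the "xx-xx-xxx-xx-xx-xx" format described above. The names SssId, parse_sss_id and format_sss_id are hypothetical.

```python
import re
from typing import NamedTuple


class SssId(NamedTuple):
    category: int
    group: int
    symbol: int
    variation: int
    fill: int
    rotation: int


# Two digits per part, except three digits for the symbol part.
_PATTERN = re.compile(r"(\d{2})-(\d{2})-(\d{3})-(\d{2})-(\d{2})-(\d{2})")


def parse_sss_id(text: str) -> SssId:
    """Parse an 18-character SSS-ID string into its numeric parts."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError("not an SSS-ID: " + repr(text))
    return SssId(*(int(part) for part in match.groups()))


def format_sss_id(sss_id: SssId) -> str:
    """Turn the numeric parts back into the zero-padded string form."""
    return "{:02d}-{:02d}-{:03d}-{:02d}-{:02d}-{:02d}".format(*sss_id)


first = parse_sss_id("01-01-001-01-01-01")
print(first)
print(format_sss_id(first))
```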
>> What is a bit?
>> A bit is 1 or 0.  It is the smallest value a computer can work with.
>> It is called an on / off switch.
>>
>> 1 bit can represent 2 values
>> -------------------
>> 0
>> 1
>>
>> 2 bits can represent 4 values
>> --------------------
>> 00
>> 01
>> 10
>> 11
>>
>> 3 bits can represent 8 values
>> --------------------
>> 000
>> 001
>> 010
>> 011
>> 100
>> 101
>> 110
>> 111
>>
>> Basic ASCII uses 7 bits.  7 bits can represent 128 values (2^7 or
>> 2*2*2*2*2*2*2).  The letter A is 65, or "1000001" in binary.
>>
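The rule behind these lists is that n bits give 2**n values. A small Python sketch to check the figures above (the variable name letter_a is only illustrative):

```python
# n bits can represent 2**n distinct values.
for n in (1, 2, 3, 7, 16):
    print(n, "bits ->", 2 ** n, "values")

# The letter A has the code 65; written with 7 binary digits it is 1000001.
letter_a = format(ord("A"), "07b")
print(letter_a)
```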
>> Unicode was designed with 16-bits.  16 bits can represent over 65
>> thousand values.  Originally this was thought to be enough.  It was
>> not.  Unicode was extended to have multiple layers, but each layer
>> still only has 16-bits.
>>
>> The IMWA has around 26 thousand symbols.  This should be able to fit
>> on one layer of Unicode (layer 3 would be perfect), however the IMWA
>> is still growing so this is a problem for encoding.  If we squeeze
>> the symbols too close, we won't be able to add new symbols.  If we
>> don't squeeze them close enough, we run out of room.
>>
>> Let's take a specific example to help clear this up.
>>
>> Here is the first symbol of the IMWA again.
>>
>> 01-01-001-01-01-01
>>
>> If this symbol could be placed in Unicode, it would use 16 bits.
>> Since it is first in the alphabet, it would have the value of 1 or
>> "0000000000000001" in binary.
>>
>> If we store this symbol using the SSS-ID, we would use 18 characters
>> (01-01-001-01-01-01).  Since each character uses 8 bits, we would be
>> using 144 bits.  This is much bigger than 16 bits, but it is very clear.
>>
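The size comparison above can be reproduced with a short Python sketch, assuming the SSS-ID is stored as plain ASCII text with one byte per character:

```python
sss_id = "01-01-001-01-01-01"

# 18 characters at 8 bits each when stored as ASCII text.
text_bits = len(sss_id.encode("ascii")) * 8

# The same symbol stored as the number 1 in a fixed 16-bit field.
as_16_bits = format(1, "016b")

print(len(sss_id), text_bits)
print(as_16_bits)
```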
>> So we need a mapping from SSS-ID number to a specific number of
>> bits.  Since the SSS-ID number is very regular, we can state a
>> maximum number of bits possible.
>>
>> Category - Group - Symbol - Variation - Fill - Rotation
>>
>> Every part of the SSS-ID uses 2 numbers except for the Symbol part
>> which uses 3 numbers.  99 is the largest value for 2 numbers which
>> would be covered by 7 bits (2^7 = 128).  999 is the largest value for
>> 3 numbers which would be covered by 10 bits (2^10 = 1024).  So....
>> 7 bits - 7 bits - 10 bits - 7 bits - 7 bits - 7 bits = 45 bits.
>>
>> If we analyze the current IMWA, we can get a max number for each
>> position in the SSS-ID number.
>>
>> Highest values in the current IMWA.
>> Category - 8
>> Group - 10
>> Symbol - 50
>> Variation - 5
>> Fill - 6
>> Rotation - 16
>>
>> So a bit number optimized for the current IMWA would be...
>> 3 bits - 4 bits - 6 bits - 3 bits - 3 bits - 4 bits = 23 bits.
>>
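A minimal Python sketch of both bit layouts, the general 45-bit one and the 23-bit one optimized for the current IMWA. It assumes every field starts at 01 and is stored as value minus one, which is what lets a category of 8 fit in 3 bits and a rotation of 16 fit in 4 bits. The names pack and unpack are hypothetical.

```python
# Bits per field: category, group, symbol, variation, fill, rotation.
WIDTHS_GENERAL = (7, 7, 10, 7, 7, 7)   # 45 bits, covers 99 / 999
WIDTHS_CURRENT = (3, 4, 6, 3, 3, 4)    # 23 bits, current IMWA maxima


def pack(fields, widths):
    """Pack 1-based field values into one integer, first field highest."""
    word = 0
    for value, width in zip(fields, widths):
        stored = value - 1  # assumption: fields start at 01, never 00
        if not 0 <= stored < (1 << width):
            raise ValueError("value %d does not fit in %d bits" % (value, width))
        word = (word << width) | stored
    return word


def unpack(word, widths):
    """Inverse of pack: recover the 1-based field values."""
    fields = []
    for width in reversed(widths):
        fields.append((word & ((1 << width) - 1)) + 1)
        word >>= width
    return tuple(reversed(fields))


highest = (8, 10, 50, 5, 6, 16)
packed = pack(highest, WIDTHS_CURRENT)
print(sum(WIDTHS_GENERAL), sum(WIDTHS_CURRENT))
print(packed.bit_length(), unpack(packed, WIDTHS_CURRENT))
```

A symbol added later with, say, category 9 would raise an error under the 23-bit layout but still fit the 45-bit one, which is the trade-off the message describes.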
>> So if we used 45 bits, we would never have a problem with new
>> symbols being added to the IMWA.  And we could save half the space
>> again if we optimized the bits for the current IMWA.
>>
>> Unicode uses 16 bits so we would need an additional optimization to
>> squeeze the IMWA number system from 23 bits into 16 bits.  However,
>> since the IMWA is still growing, we don't know where the new symbols
>> will show up.  Since Unicode is not allowed to change once it has
>> been defined, any optimization could lead to potential problems.  For
>> that reason, I think the 45 bit option would be preferred.
>>
>> And that's just for the symbols themselves.  We still have the XY
>> coordinates and color for each symbol.  But that's enough for now.
>>
>> -Steve
>>
>>
>> Valerie Sutton wrote:
>>
>>> SignWriting List
>>> June 21, 2005
>>>
>>>> On Jun 21, 2005, at 4:40 PM, Stuart Thiessen wrote:
>>>> A clarification on this: I completely agree that SWML is a
>>>> valuable step to making SW searchable and easily transported.
>>>> However, SWML as such does not handle the display of SW, only the
>>>> storage.  So computer software that reads SWML will have to use
>>>> some kind of display process to make the SW data visual.  This
>>>> display process could use SVG images, PNG images, or a Unicode
>>>> font to provide the displayed images depending on the program.
>>>> So, we need to separate the roles of SWML and display.  SWML only
>>>> has to do with storage and retrieval of data, but not display.
>>>
>>>
>>> I see. Thanks for explaining this to me! So when Steve is using
>>> SWML to store data in SignPuddle, he is using PNGs to do the
>>> visual display of what the SWML says should be displayed? I wasn't
>>> aware of this...I am glad to know this...
>>>
>>>> Until SW is finally in Unicode, SW is just graphics because that
>>>> is the only display mechanism we have for SW.  The value of SWML
>>>> is that we are now able to search it with a variety of programs.
>>>> SW-DOS by comparison probably could have been equally as
>>>> searchable but because of its binary format, that made it much
>>>> more difficult compared to SWML. But search capability and display
>>>> capabilities are two different "animals".  The value of Unicode is
>>>> simply this: hearing people will probably not fully appreciate SW
>>>> until it is available in Unicode and it is able to be composed
>>>> just like spoken languages (in a manner of speaking). This is
>>>> simply because it takes much less room to store Unicode symbols
>>>> than it does to store graphic images.  The display happens either
>>>> way, but I'm talking here more about "political" respect or the
>>>> perceived reality of SW's status as a genuine writing system.
>>>
>>>
>>> OK. What about SVG? I remember years ago, Antonio Carlos came to
>>> visit me from Brazil, and was eager to explain both SWML and SVG to
>>> me...I remember feeling amazed at the possibilities when he showed
>>> me a SignWriting symbol being drawn on the web in front of my eyes
>>> in SVG...Now that we see that SWML is really becoming important, I
>>> wonder if SVG isn't next?
>>>
>>> That does not mean that I don't think Unicode is a terrific
>>> idea...it is just that Unicode takes money and time, and if PNG
>>> display is the only alternative right now, then maybe SVG could be
>>> another alternative until Unicode is available for SignWriting?
>>>
>>> Did you know that the French have interest in developing a way to
>>> apply SignWriting to Unicode? I wonder if Mr. Dalle and Mr. Aznar
>>> from France wouldn't be interested in working with SIL on the
>>> Unicode project? Do you think SIL could be interested?...
>>>
>>>> Also, the use of Unicode will not make SWML obsolete.  In fact, I
>>>> think that SWML will be even more useful because instead of having
>>>> special code numbers in the markup, we can actually embed the
>>>> Unicode character for that SW symbol. This will make SWML files
>>>> more compact and more easily read and further enhance its
>>>> usefulness.  But that is a little more down the road until funding
>>>> and resources become available.  Once funding is available, we can
>>>> certainly begin work on it and then just wait on a final
>>>> submission until we feel the IMWA is more stable.
>>>
>>>
>>> I see. Very interesting, Stuart! You know so much! ;-)
>>>
>>> Thanks for your patience with me and all those symbols in the
>>> IMWA!...I actually am not necessarily in favor of placing the whole
>>> IMWA into Unicode. I think we should do a Symbol-Frequency test on
>>> dictionaries to pin down the symbols that you really are using, and
>>> then use the Language-specific symbolset to be the first
>>> SignWriting Unicode...in other words...Unicode US, Unicode NO,
>>> etc...based on only those SignWriting symbols used in one
>>> language...why slow down the Unicode development for SignWriting,
>>> just because DanceWriting has not been entered into the IMWA yet?
>>> And is there really a Unicode for music sounds? No. So why should
>>> DanceWriting be in Unicode?...Unicode should be for SignWriting
>>> specific to one sign language...
>>>
>>> Just a thought. I will leave Unicode development to you and the
>>> next generation!
>>>
>>> Val ;-)
>>>
>>>
>>