[sw-l] challenge for programmers - SSS-ID mapping onto unicode
Steve Slevinski
slevin at SIGNPUDDLE.NET
Wed Jun 22 20:04:02 UTC 2005
Hi Tomas,
Love the math. We must really be driving people crazy.
There are a few problems with your numbers. Not all base symbols use
all 6 fills and 16 rotations.
You are starting along the same lines of reasoning that I started with
SignMaker. SignMaker is able to use the entire IMWA because of a key
file. This key file is specific for each IMWA version. When Val
releases a new version of the IMWA, I run a script to make a new key
file. I am able to use any IMWA version within minutes.
You can look at this key file yourself to help you understand the IMWA.
It is Javascript.
http://signbank.org/signpuddle/sgn-US/keyIMWA.js
One thing you will notice is that I am not directly mapping fills or
rotations.
keys = new Array(50)
...
keys[0] = new Array(13)
keys[1] = new Array(12)
keys[2] = new Array(21)
keys[3] = new Array(7)
...
keys[0][0] = "01-01-001-01"
keys[0][1] = "01-01-002-01"
keys[0][2] = "01-01-003-01"
keys[0][3] = "01-01-004-01"
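
A minimal sketch of how a table shaped like this can be turned into a linear index of base symbols. Only the four entries quoted above come from the key file; the helper names flattenKeys and buildIndex are hypothetical and are not part of keyIMWA.js.

```javascript
// Excerpt shaped like keyIMWA.js. Only these four entries are quoted
// in this message; the rest of the real file is omitted here.
const keys = [];
keys[0] = [
  "01-01-001-01",
  "01-01-002-01",
  "01-01-003-01",
  "01-01-004-01",
];

// Flatten the two-level table (group -> base symbols) into one ordered list.
function flattenKeys(table) {
  const flat = [];
  for (const group of table) {
    if (!group) continue; // skip groups that are not filled in
    for (const id of group) flat.push(id);
  }
  return flat;
}

// Reverse lookup: base symbol ID -> position in the flat list.
function buildIndex(flat) {
  const index = new Map();
  flat.forEach((id, i) => index.set(id, i));
  return index;
}

const flat = flattenKeys(keys);
const index = buildIndex(flat);
console.log(index.get("01-01-003-01")); // 2
```

Because the list is rebuilt from the key file, a new IMWA version only needs a new key file, not new code.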
However, there is a little bit of magic that Val helped me with. We
decided that for any base symbol, every fill will have the same
rotations. Then I used some binary math and SignMaker works magic.
If you look at base symbol 02-08-001-01 you will notice the following...
Notice that only the first 3 fills (columns) are used. Also notice that
rotations (rows) are skipped. However, every fill that is used has the
same rotations.
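
One way to picture the binary math, as a sketch only: assume each base symbol carries one bitmask for its fills and one for its rotations. This storage layout is an assumption for illustration and may not match what SignMaker actually does. The fill mask follows the description above (first 3 fills used); the rotation mask is made up.

```javascript
// Assumed encoding: bit (n - 1) set means fill or rotation n is used.
const example = {
  id: "02-08-001-01",
  fillMask: 0b000111,               // fills 1-3 used, as described above
  rotationMask: 0b0000000001010101, // hypothetical: rotations 1, 3, 5, 7
};

// List the 1-based positions of the set bits in a mask.
function setPositions(mask, width) {
  const out = [];
  for (let i = 0; i < width; i++) {
    if (mask & (1 << i)) out.push(i + 1);
  }
  return out;
}

// Every used fill shares the same rotations, so the valid symbols are
// the full cross product of the two lists.
function validCombos(sym) {
  const fills = setPositions(sym.fillMask, 6);
  const rotations = setPositions(sym.rotationMask, 16);
  const combos = [];
  for (const f of fills) {
    for (const r of rotations) combos.push([f, r]);
  }
  return combos;
}

console.log(validCombos(example).length); // 12 (3 fills x 4 rotations)
```

The rule that every fill has the same rotations is what lets two small masks describe a base symbol instead of a full 6 x 16 grid.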
Anyway, take a look at the key file if you're interested. Make
statements or ask questions and I'll respond.
Enjoy,
-Steve
Tomáš Klapka wrote:
> Hi all,
>
> if there are 8 categories, 10 groups, 50 symbols, 5 variations, 6 fills
> and 16 rotations, that makes 8*10*50*5*6*16 = 1,920,000 (about
> 1,900,000) possible combinations. But the IMWA does not use nearly
> that many combinations.
>
> Let's see on the number of combinations depending on number of bits...
>
> 16 bits has 2^16 = 65,536 combinations
> HERE is SSS-ID without rotation = 120,000 combinations
> 17 bits has 2^17 = 131,072 combinations
> 18 bits has 2^18 = 262,144 combinations
> 19 bits has 2^19 = 524,288 combinations
> 20 bits has 2^20 = 1,048,576 combinations
> HERE is complete SSS-ID = 1,900,000 combinations
> 21 bits has 2^21 = 2,097,152 combinations
>
> Well, it is impossible to put all 1,900,000 combinations into 16 bits.
> We need 21 bits.
> But if we leave out rotation, there are only 120,000 combinations,
> which fit in 17 bits. That is very close to 16.
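
The bit counts above can be checked with a short sketch. The field maxima are the ones given in the message; the exact product is 1,920,000, which the message rounds to 1,900,000. The helper name bitsFor is hypothetical.

```javascript
// Maximum value per SSS-ID field, as given in the message.
const maxima = {
  category: 8,
  group: 10,
  symbol: 50,
  variation: 5,
  fill: 6,
  rotation: 16,
};

// Smallest number of bits b such that 2^b >= count.
function bitsFor(count) {
  let bits = 0;
  while (2 ** bits < count) bits++;
  return bits;
}

const withRotation = Object.values(maxima).reduce((a, b) => a * b, 1);
const withoutRotation = withRotation / maxima.rotation;

console.log(withRotation, bitsFor(withRotation));       // 1920000 21
console.log(withoutRotation, bitsFor(withoutRotation)); // 120000 17
```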
>
> OPTIMIZATION
> There are SSS-ID combinations which are never used, for example
> SSS-ID 01-01-042-02-03-14. It is one of the 1,900,000 combinations,
> but there are no more than 13 symbols in category 1, group 1.
>
> There is a way to optimize if you know how many groups are
> in each category...
>
> category 1 has 10 groups
> category 2 has 10 groups
> category 3 has 10 groups
> category 4 has 5 groups
> category 5 has 5 groups
> category 6 has 2 groups
> category 7 has 4 groups
> category 8 has 4 groups
>
> So there are 50 groups in all, not 80 (8 categories * 10 groups).
>
> This information is a small table with 8 rows, so it is easy to
> implement and to use.
>
> That gives 50 (groups in 8 categories)*50*5*6*16 = 1,200,000 combinations.
> Without rotation it is 50*50*5*6 = 75,000 combinations.
>
> It is still too much.
>
> If you know how many symbols are in each of these 50 groups, you can
> make a conversion table:
> group 1 has 13 symbols
> group 2 has 12 symbols
> group 3 has 21 symbols
> group 4 has 7 symbols
> ...and so on...
>
> It is
> 13+12+21+07+50+21+13+14+33+11+
> 01+11+15+04+15+14+12+15+15+13+
> 01+01+01+01+01+01+01+01+01+01+
> 01+01+02+02+01+
> 01+01+01+04+04+
> 05+04+
> 01+02+01+02+
> 01+02+03+02 =
> 361 symbols in all
>
> 361*5*6*16 = 173,280 combinations if we have a conversion table with 50 rows.
> 361*5*6 = 10,830 combinations without rotation!
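
The totals above can be reproduced from the per-group counts. This sketch copies the numbers from the sum in the message, one row per category; the variable names are hypothetical.

```javascript
// Symbols per group, one row per category, copied from the sum above.
const symbolsPerGroup = [
  [13, 12, 21, 7, 50, 21, 13, 14, 33, 11],
  [1, 11, 15, 4, 15, 14, 12, 15, 15, 13],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 2, 2, 1],
  [1, 1, 1, 4, 4],
  [5, 4],
  [1, 2, 1, 2],
  [1, 2, 3, 2],
];

const sum = (xs) => xs.reduce((a, b) => a + b, 0);

const groupCount = sum(symbolsPerGroup.map((row) => row.length));
const symbolCount = sum(symbolsPerGroup.map((row) => sum(row)));

console.log(groupCount);               // 50
console.log(symbolCount);              // 361
console.log(symbolCount * 5 * 6 * 16); // 173280
console.log(symbolCount * 5 * 6);      // 10830
```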
>
> well, lets make a table of symbols with 361 rows (it is still usable)
> in categories 1, 3, 4, 5 and 8, there is no symbol with more than 1
> variety (it is )
> in category 2 there is 01+11+00+00+00+00+15+00+13 with the only one
> variety
>
> in category 2 there is 00+00+10+04+09+05+00+05+00 with 2 varieties
> 00+00+07+04+05+04+00+00 with 3 varieties, 00+00+01+00+01+00+00+00 with
> 4 varieties. = 55 more variations.
> in category 6 there is the only symbol with more than one variety and
> it is 5 varieties which is 4 more variations.
> in category 7 there are 3 symbols with 2 varieties and 2 symbols with
> 3 varieties which is 5 more variations.
>
> That is 64 more variations, so there are 361 + 64 = 425 variations in
> all. Only 425? Didn't I make a mistake somewhere?
>
> well, 425*6*16 is 40,800 combinations (2550 without rotation)!
>
> A conversion table with 425 rows is still not very large, so it could
> be optimized in one more step.
>
> If I list all the SSS-IDs I can find, there are only 1884 SSS-IDs with
> rotation 01, 1867 with rotation 02, and so on:
>
> rotation   number of SSS-IDs
> 01         1884
> 02         1867
> 03         1781
> 04         1835
> 05         1730
> 06         1724
> 07         1660
> 08         1711
> 09         1481
> 10         1480
> 11         1465
> 12         1474
> 13         1471
> 14         1471
> 15         1462
> 16         1477
>
> LET'S GO BACKWARDS
> Now I see, it would be better to go backwards on the SSS-ID.
> I have 65536 combinations in 16 bits.
> all 16 rotations are frequently used, so it is not economical to
> optimize it.
> 65536/16 is 4096 symbols with all variations and fills.
> Fill is also used often, and without optimizing it (if 6 is the
> highest value)...
> 4096/6 gives 682 possible variations.
>
> Now there is 425 variations in the IMWA (if I count right).
>
> It seems you are right, Steve, that SSS-ID numbers can not be properly
> mapped onto Unicode.
>
> It can be mapped onto unicode by shortened SSS-ID xxx-x-xx which is
> Variation-Fill-Rotation which is with highest values 682-6-16
> And we can have table with 682 variations where variation
> 001 has value 01-01-001-01 (which is category, group, symbol and
> variation in SSS-ID)
> 002 has value 01-01-002-01
> .
> .
> .
> 207 has value 02-02-011-01
> 208 has value 02-03-001-01
> 209 has value 02-03-001-02
> .
> .
> .
> 424 has value 08-04-001-01
> 425 has value 08-04-002-01
> 426 is the first free value and there is 256 more free codes.
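
A sketch of the shortened Variation-Fill-Rotation mapping onto a 16-bit value. The specific layout (variation as the most significant part, all fields stored zero-based) is an assumption for illustration; the message only gives the maxima 682-6-16.

```javascript
const MAX_VARIATIONS = 682;
const FILLS = 6;
const ROTATIONS = 16;

// Map a 1-based (variation, fill, rotation) triple to a code below 65536.
function encode(variation, fill, rotation) {
  if (variation < 1 || variation > MAX_VARIATIONS) throw new RangeError("variation");
  if (fill < 1 || fill > FILLS) throw new RangeError("fill");
  if (rotation < 1 || rotation > ROTATIONS) throw new RangeError("rotation");
  return ((variation - 1) * FILLS + (fill - 1)) * ROTATIONS + (rotation - 1);
}

// Inverse of encode: recover the 1-based triple from the code.
function decode(code) {
  const rotation = (code % ROTATIONS) + 1;
  const rest = Math.floor(code / ROTATIONS);
  const fill = (rest % FILLS) + 1;
  const variation = Math.floor(rest / FILLS) + 1;
  return { variation, fill, rotation };
}

console.log(encode(1, 1, 1));     // 0
console.log(encode(682, 6, 16));  // 65471
console.log(decode(encode(208, 3, 14)));
```

The largest code is 65471, so 682 variations leave 64 of the 65536 values unused. The variation number itself still has to be looked up in the table (001 is 01-01-001-01, and so on).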
>
> 682 rows is pretty small table for remapping of 65536 possible symbols
> onto unicode. This table has to be made when IMWA is finished, because
> of the order sequence.
> But if I imagine a sequence of unicode... it is just linear sequence
> of IMWA symbols. It must be more complicated format which uses unicode
> (because of the encoding and font compatibility) with position of the
> symbol and control signs (end of sign, space, color, etc.).
>
> There is no more space to map the x, y position of the symbol onto
> Unicode, and I don't think that is the purpose of Unicode. Rendering
> of the sign is up to the rendering software.
>
> The question is whether 682 variations are enough (given that 425
> variations are used now).
>
> Well, I hope my math is useful :)
>
> Tomas
>
> Steve Slevinski wrote:
>
>> Hi Val,
>>
>> My 2 cents, SVG is the next step. It is required for quality
>> publishing of SignWriting documents with the IMWA. However, if done
>> right, it will be compatible with our current work so there is no
>> hurry. Unless someone else works on it first, I will get to it
>> eventually.
>>
>> For SVG we will need to convert every IMWA symbol from a static
>> graphic into a vector graphic. There are applications that may be
>> able to do the conversion automatically. However, we would need to
>> verify every symbol. We would then need to verify the current IMWA
>> based signs. Since the IMWA has around 26 thousand symbols, this
>> could take a while.
>>
>> We do not need Unicode. I believe that Unicode could harm the IMWA
>> if done too soon.
>>
>> If we are concerned about document size, we need binary. Binary will
>> change the SSS-IDs from an 18 character string into a binary
>> equivalent using 1/6 the amount of data.
>>
>> Since this is a challenge for programmers, I'll get right down to the
>> bits and the SSS-ID numbers and explain why the SSS-ID numbers can
>> not be properly mapped onto Unicode.
>>
>> **** Warning, this is entirely too much information! ****
>>
>> What is an SSS-ID number?
>> The SSS-ID number is a unique character string for every symbol of
>> the IMWA. The SSS-ID number has the format of "xx-xx-xxx-xx-xx-xx"
>> where x is a number from 0 to 9. The SSS-ID number has 6 parts
>> (Category - Group - Symbol - Variation - Fill - Rotation). If we
>> look at the first symbol of the IMWA this should make more sense.
>>
>>
>> 01-01-001-01-01-01
>>
>> Category 01
>> Group 01
>> Symbol 001
>> Variation 01
>> Fill 01
>> Rotation 01
>>
>> What is a bit?
>> A bit is 1 or 0. It is the smallest value a computer can work with.
>> It is called an on / off switch.
>>
>> 1 bit can represent 2 values
>> -------------------
>> 0
>> 1
>>
>> 2 bits can represent 4 values
>> --------------------
>> 00
>> 01
>> 10
>> 11
>>
>> 3 bits can represent 8 values
>> --------------------
>> 000
>> 001
>> 010
>> 011
>> 100
>> 101
>> 110
>> 111
>>
>> Basic ASCII uses 7 bits. 7 bits can represent 128 values (2^7 or
>> 2*2*2*2*2*2*2). The letter A is 65, or "1000001" in binary.
>>
>> Unicode was designed with 16 bits. 16 bits can represent over 65
>> thousand values. Originally this was thought to be enough. It was
>> not. Unicode was extended to have multiple layers, but each layer
>> still only has 16 bits.
>>
>> The IMWA has around 26 thousand symbols. This should be able to fit
>> on one layer of Unicode (layer 3 would be perfect), however the IMWA
>> is still growing so this is a problem for encoding. If we squeeze
>> the symbols too close, we won't be able to add new symbols. If we
>> don't squeeze them close enough, we run out of room.
>>
>> Let's take a specific example to help clear this up.
>>
>> Here is the first symbol of the IMWA again.
>>
>> 01-01-001-01-01-01
>>
>> If this symbol could be placed in Unicode, it would use 16 bits.
>> Since it is first in the alphabet, it would have the value of 1 or
>> "0000000000000001" in binary.
>>
>> If we store this symbol using the SSS-ID, we would use 18 characters
>> (01-01-001-01-01-01). Since each character uses 8 bits, we would be
>> using 144 bits. This is much bigger than 16 bits, but it is very clear.
>>
>> So we need a mapping from SSS-ID number to a specific number of
>> bits. Since the SSS-ID number is very regular, we can state a
>> maximum number of bits possible.
>>
>> Category - Group - Symbol - Variation - Fill - Rotation
>>
>> Every part of the SSS-ID uses 2 digits except for the Symbol part
>> which uses 3 digits. 99 is the largest value for 2 digits which
>> would be covered by 7 bits (2^7 = 128). 999 is the largest value for
>> 3 digits which would be covered by 10 bits (2^10 = 1024). So....
>> 7 bits - 7 bits - 10 bits - 7 bits - 7 bits - 7 bits = 45 bits.
>>
>> If we analyze the current IMWA, we can get a max number for each
>> position in the SSS-ID number.
>>
>> Highest values in the current IMWA.
>> Category - 8
>> Group - 10
>> Symbol - 50
>> Variation - 5
>> Fill - 6
>> Rotation - 16
>>
>> So a bit number optimized for the current IMWA would be...
>> 3 bits - 4 bits - 6 bits - 3 bits - 3 bits - 4 bits = 23 bits.
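
A sketch of packing an SSS-ID into the 23-bit layout above. It assumes each field is stored as its value minus 1, since values 1 to 8 fit in 3 bits and 1 to 16 fit in 4 bits only when stored zero-based. The function names pack and unpack are hypothetical.

```javascript
// Bit widths for category, group, symbol, variation, fill, rotation.
const WIDTHS = [3, 4, 6, 3, 3, 4];
// Digit counts of each field in the "xx-xx-xxx-xx-xx-xx" string form.
const DIGITS = [2, 2, 3, 2, 2, 2];

// Pack an SSS-ID string into a single integer below 2^23.
function pack(sssId) {
  const parts = sssId.split("-").map(Number);
  if (parts.length !== WIDTHS.length) throw new RangeError("expected 6 fields");
  let packed = 0;
  parts.forEach((value, i) => {
    const stored = value - 1; // assumption: fields are stored zero-based
    if (!(stored >= 0 && stored < (1 << WIDTHS[i]))) {
      throw new RangeError("field " + i + " out of range");
    }
    packed = (packed << WIDTHS[i]) | stored;
  });
  return packed;
}

// Unpack the integer back into the SSS-ID string form.
function unpack(packed) {
  const parts = [];
  for (let i = WIDTHS.length - 1; i >= 0; i--) {
    const stored = packed & ((1 << WIDTHS[i]) - 1);
    parts.unshift(String(stored + 1).padStart(DIGITS[i], "0"));
    packed >>>= WIDTHS[i];
  }
  return parts.join("-");
}

console.log(pack("01-01-001-01-01-01"));         // 0
console.log(unpack(pack("08-10-050-05-06-16"))); // 08-10-050-05-06-16
```

A packed value needs 3 bytes instead of the 18 bytes of the string form, which matches the 1/6 figure mentioned earlier.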
>>
>> So if we used 45 bits, we would never have a problem with new
>> symbols being added to the IMWA. And we could save half the space
>> again if we optimized the bits for the current IMWA.
>>
>> Unicode uses 16 bits so we would need an additional optimization to
>> squeeze the IMWA number system from 23 bits into 16 bits. However,
>> since the IMWA is still growing, we don't know where the new symbols
>> will show up. Since Unicode is not allowed to change once it has
>> been defined, any optimization could lead to potential problems. For
>> that reason, I think the 45 bit option would be preferred.
>>
>> And that's just for the symbols themselves. We still have the XY
>> coordinates and color for each symbol. But that's enough for now.
>>
>> -Steve
>>
>>
>> Valerie Sutton wrote:
>>
>>> SignWriting List
>>> June 21, 2005
>>>
>>>> On Jun 21, 2005, at 4:40 PM, Stuart Thiessen wrote:
>>>> A clarification on this: I completely agree that SWML is a
>>>> valuable step to making SW searchable and easily transported.
>>>> However, SWML as such does not handle the display of SW, only the
>>>> storage. So computer software that reads SWML will have to use
>>>> some kind of display process to make the SW data visual. This
>>>> display process could use SVG images, PNG images, or a Unicode
>>>> font to provide the displayed images depending on the program.
>>>> So, we need to separate the roles of SWML and display. SWML only
>>>> has to do with storage and retrieval of data, but not display.
>>>
>>>
>>> I see. Thanks for explaining this to me! So when Steve is using
>>> SWML to store data in SignPuddle, he is using PNGs to do the
>>> visual display of what the SWML says should be displayed? I wasn't
>>> aware of this...I am glad to know this...
>>>
>>>> Until SW is finally in Unicode, SW is just graphics because that
>>>> is the only display mechanism we have for SW. The value of SWML
>>>> is that we are now able to search it with a variety of programs.
>>>> SW-DOS by comparison probably could have been equally as
>>>> searchable but because of its binary format, that made it much
>>>> more difficult compared to SWML. But search capability and display
>>>> capabilities are two different "animals". The value of Unicode is
>>>> simply this: hearing people will probably not fully appreciate SW
>>>> until it is available in Unicode and it is able to be composed
>>>> just like spoken languages (in a manner of speaking). This is
>>>> simply because it takes much less room to store Unicode symbols
>>>> than it does to store graphic images. The display happens either
>>>> way, but I'm talking here more about "political" respect or the
>>>> perceived reality of SW's status as a genuine writing system.
>>>
>>>
>>> OK. What about SVG? I remember years ago, Antonio Carlos came to
>>> visit me from Brazil, and was eager to explain both SWML and SVG to
>>> me...I remember feeling amazed at the possibilities when he showed
>>> me a SignWriting symbol being drawn on the web in front of my eyes
>>> in SVG...Now that we see that SWML is really becoming important, I
>>> wonder if SVG isn't next?
>>>
>>> That does not mean that I don't think Unicode is a terrific
>>> idea...it is just that Unicode takes money and time, and if PNG
>>> display is the only alternative right now, then maybe SVG could be
>>> another alternative until Unicode is available for SignWriting?
>>>
>>> Did you know that the French have interest in developing a way to
>>> apply SignWriting to Unicode? I wonder if Mr. Dalle and Mr. Aznar
>>> from France wouldn't be interested in working with SIL on the
>>> Unicode project? Do you think SIL could be interested?...
>>>
>>>> Also, the use of Unicode will not make SWML obsolete. In fact, I
>>>> think that SWML will be even more useful because instead of having
>>>> special code numbers in the markup, we can actually embed the
>>>> Unicode character for that SW symbol. This will make SWML files
>>>> more compact and more easily read and further enhance its
>>>> usefulness. But that is a little more down the road until funding
>>>> and resources become available. Once funding is available, we can
>>>> certainly begin work on it and then just wait on a final
>>>> submission until we feel the IMWA is more stable.
>>>
>>>
>>> I see. Very interesting, Stuart! You know so much! ;-)
>>>
>>> Thanks for your patience with me and all those symbols in the
>>> IMWA!...I actually am not necessarily in favor of placing the whole
>>> IMWA into Unicode. I think we should do a Symbol-Frequency test on
>>> dictionaries to pin down the symbols that you really are using, and
>>> then use the Language-specific symbolset to be the first
>>> SignWriting Unicode...in other words...Unicode US, Unicode NO,
>>> etc...based on only those SignWriting symbols used in one
>>> language...why slow down the Unicode development for SignWriting,
>>> just because DanceWriting has not been entered into the IMWA yet?
>>> And is there really a Unicode for music sounds? No. So why should
>>> DanceWriting be in Unicode?...Unicode should be for SignWriting
>>> specific to one sign language...
>>>
>>> Just a thought. I will leave Unicode development to you and the
>>> next generation!
>>>
>>> Val ;-)
>>>
>>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: moz-screenshot-10.jpg
Type: image/jpeg
Size: 18423 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/sw-l/attachments/20050622/d76d1559/attachment.jpg>