[sw-l] challenge for programmers
Steve Slevinski
slevin at SIGNPUDDLE.NET
Wed Jun 22 07:33:29 UTC 2005
Hi Val,
My 2 cents, SVG is the next step. It is required for quality publishing
of SignWriting documents with the IMWA. However, if done right, it will
be compatible with our current work so there is no hurry. Unless
someone else works on it first, I will get to it eventually.
For SVG we will need to convert every IMWA symbol from a static graphic
into a vector graphic. There are applications that may be able to do
the conversion automatically. However, we would need to verify every
symbol. We would then need to verify the current IMWA based signs.
Since the IMWA has around 26 thousand symbols, this could take a while.
We do not need Unicode. I believe that Unicode could harm the IMWA if
done too soon.
If we are concerned about document size, we need binary. Binary will
change the SSS-IDs from an 18 character string into a binary equivalent
using 1/6 the amount of data.
Since this is a challenge for programmers, I'll get right down to the
bits and the SSS-ID numbers and explain why the SSS-ID numbers cannot
be properly mapped onto Unicode.
**** Warning, this is entirely too much information! ****
What is an SSS-ID number?
The SSS-ID number is a unique character string for every symbol of the
IMWA. The SSS-ID number has the format of "xx-xx-xxx-xx-xx-xx" where x
is a digit from 0 to 9. The SSS-ID number has 6 parts (Category -
Group - Symbol - Variation - Fill - Rotation). If we look at the first
symbol of the IMWA this should make more sense.
01-01-001-01-01-01
Category 01
Group 01
Symbol 001
Variation 01
Fill 01
Rotation 01
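For the programmers, here is a rough sketch of splitting an SSS-ID string into its 6 parts. The names SSSID, parse_sss_id and format_sss_id are made up for this example and are not part of any existing SignWriting software.

```python
import re
from typing import NamedTuple


class SSSID(NamedTuple):
    """The six numeric parts of an SSS-ID (hypothetical helper type)."""
    category: int
    group: int
    symbol: int
    variation: int
    fill: int
    rotation: int


# xx-xx-xxx-xx-xx-xx, where x is a digit from 0 to 9
_SSS_PATTERN = re.compile(
    r"([0-9]{2})-([0-9]{2})-([0-9]{3})-([0-9]{2})-([0-9]{2})-([0-9]{2})"
)


def parse_sss_id(text: str) -> SSSID:
    """Split an 18 character SSS-ID string into six integers."""
    match = _SSS_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not an SSS-ID: {text!r}")
    return SSSID(*(int(part) for part in match.groups()))


def format_sss_id(sss: SSSID) -> str:
    """Turn the six integers back into the 18 character string."""
    return "{:02d}-{:02d}-{:03d}-{:02d}-{:02d}-{:02d}".format(*sss)


first = parse_sss_id("01-01-001-01-01-01")
print(first)
print(format_sss_id(first))
```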
What is a bit?
A bit is 1 or 0. It is the smallest value a computer can work with. It
is often described as an on / off switch.
1 bit can represent 2 values
-------------------
0
1
2 bits can represent 4 values
--------------------
00
01
10
11
3 bits can represent 8 values
--------------------
000
001
010
011
100
101
110
111
Basic ASCII uses 7 bits. 7 bits can represent 128 values (2^7 or
2*2*2*2*2*2*2). The letter A is 65, or "1000001" in binary.
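If you want to check the counting yourself, a few lines of Python will do it. This is only an illustration; the helper names values_for_bits and to_binary are invented for the example.

```python
def values_for_bits(bits: int) -> int:
    """Number of distinct values that fit in the given number of bits."""
    return 2 ** bits


def to_binary(value: int, bits: int) -> str:
    """Write a number as a fixed-width string of 1s and 0s."""
    return format(value, f"0{bits}b")


for bits in (1, 2, 3, 7, 16):
    print(bits, "bits can represent", values_for_bits(bits), "values")

# The letter A in 7-bit ASCII
print(ord("A"), to_binary(ord("A"), 7))
```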
Unicode was originally designed with 16 bits. 16 bits can represent
just over 65 thousand values (2^16 = 65,536). Originally this was
thought to be enough. It was not. Unicode was extended to have multiple
layers (called planes), but each layer still only has 16 bits.
The IMWA has around 26 thousand symbols. This should be able to fit on
one layer of Unicode (layer 3 would be perfect), however the IMWA is
still growing so this is a problem for encoding. If we squeeze the
symbols too close, we won't be able to add new symbols. If we don't
squeeze them close enough, we run out of room.
Let's take a specific example to help clear this up.
Here is the first symbol of the IMWA again.
01-01-001-01-01-01
If this symbol could be placed in Unicode, it would use 16 bits. Since
it is first in the alphabet, it would have the value of 1 or
"0000000000000001" in binary.
If we store this symbol using the SSS-ID, we would use 18 characters
(01-01-001-01-01-01). Since each character uses 8 bits, we would be
using 144 bits. This is much bigger than 16 bits, but it is very clear.
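The size comparison is simple arithmetic, sketched here under the assumption that the SSS-ID is stored as plain ASCII text at 8 bits per character:

```python
sss_id = "01-01-001-01-01-01"

# 8 bits for every character of the text form
text_bits = len(sss_id.encode("ascii")) * 8

# a single 16-bit value, if the symbol could be placed in Unicode
unicode_bits = 16

print(len(sss_id), "characters =", text_bits, "bits")
print("text form is", text_bits / unicode_bits, "times larger than 16 bits")
```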
So we need a mapping from SSS-ID number to a specific number of bits.
Since the SSS-ID number is very regular, we can state a maximum number
of bits possible.
Category - Group - Symbol - Variation - Fill - Rotation
Every part of the SSS-ID uses 2 digits except for the Symbol part which
uses 3 digits. 99 is the largest value for 2 digits which would be
covered by 7 bits (2^7 = 128). 999 is the largest value for 3 digits
which would be covered by 10 bits (2^10 = 1024). So....
7 bits - 7 bits - 10 bits - 7 bits - 7 bits - 7 bits = 45 bits.
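The same maximum can be worked out in code. This sketch just asks how many bits the largest number with a given count of decimal digits needs; bits_for_digits and the DIGITS table are names invented for the example.

```python
def bits_for_digits(digits: int) -> int:
    """Bits needed to hold the largest number with this many decimal digits."""
    largest = 10 ** digits - 1  # 99 for 2 digits, 999 for 3 digits
    return largest.bit_length()


# Number of decimal digits in each part of the SSS-ID
DIGITS = {
    "category": 2,
    "group": 2,
    "symbol": 3,
    "variation": 2,
    "fill": 2,
    "rotation": 2,
}

widths = {name: bits_for_digits(count) for name, count in DIGITS.items()}
total_bits = sum(widths.values())

print(widths)
print("total:", total_bits, "bits")
```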
If we analyze the current IMWA, we can get a max number for each
position in the SSS-ID number.
Highest values in the current IMWA.
Category - 8
Group - 10
Symbol - 50
Variation - 5
Fill - 6
Rotation - 16
So a bit number optimized for the current IMWA, counting each part from
zero so that 8 categories fit in 3 bits and 16 rotations fit in 4 bits,
would be...
3 bits - 4 bits - 6 bits - 3 bits - 3 bits - 4 bits = 23 bits.
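Here is a sketch of how the 23 bits come out. It assumes each part is stored as its value minus one (the numbering starts at 01), which is what lets 8 categories fit in 3 bits and 16 rotations fit in 4 bits. The names CURRENT_MAX and bits_for_count are made up for the example.

```python
# Highest values in the current IMWA, as listed above
CURRENT_MAX = {
    "category": 8,
    "group": 10,
    "symbol": 50,
    "variation": 5,
    "fill": 6,
    "rotation": 16,
}


def bits_for_count(count: int) -> int:
    """Bits needed for `count` different values, stored as 0 .. count-1."""
    return max(1, (count - 1).bit_length())


optimized = {name: bits_for_count(top) for name, top in CURRENT_MAX.items()}
optimized_total = sum(optimized.values())

print(optimized)
print("total:", optimized_total, "bits")
```

Without the minus-one assumption, the values 8 and 16 would each need one more bit, giving 25 bits instead of 23.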
So if we used 45 bits, we would never have a problem with new symbols
being added to the IMWA. And we could save half the space again if we
optimized the bits for the current IMWA.
Unicode uses 16 bits so we would need an additional optimization to
squeeze the IMWA number system from 23 bits into 16 bits. However,
since the IMWA is still growing, we don't know where the new symbols
will show up. Since Unicode is not allowed to change once it has been
defined, any optimization could lead to potential problems. For that
reason, I think the 45 bit option would be preferred.
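To make the 45 bit option concrete, here is one possible way to pack the 6 parts into a single number with bit shifts. This is only a sketch, not an agreed format: the field order, the 6-byte big-endian storage, and the names WIDTHS, pack and unpack are all assumptions for the example.

```python
# Bit widths for Category, Group, Symbol, Variation, Fill, Rotation
WIDTHS = (7, 7, 10, 7, 7, 7)  # 45 bits in total


def pack(parts):
    """Pack six integers into one 45-bit integer, Category in the high bits."""
    if len(parts) != len(WIDTHS):
        raise ValueError("expected six parts")
    value = 0
    for part, width in zip(parts, WIDTHS):
        if not 0 <= part < (1 << width):
            raise ValueError(f"{part} does not fit in {width} bits")
        value = (value << width) | part
    return value


def unpack(value):
    """Reverse of pack: recover the six integers from the packed number."""
    parts = []
    for width in reversed(WIDTHS):
        parts.append(value & ((1 << width) - 1))
        value >>= width
    return tuple(reversed(parts))


packed = pack((1, 1, 1, 1, 1, 1))     # 01-01-001-01-01-01
as_bytes = packed.to_bytes(6, "big")  # 45 bits round up to 6 bytes

print(format(packed, "045b"))
print(len(as_bytes), "bytes instead of 18")
print(unpack(packed))
```

Because every part keeps its full decimal range, a new symbol added anywhere in the IMWA still packs without changing the layout.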
And that's just for the symbols themselves. We still have the XY
coordinates and color for each symbol. But that's enough for now.
-Steve
Valerie Sutton wrote:
> SignWriting List
> June 21, 2005
>
>> On Jun 21, 2005, at 4:40 PM, Stuart Thiessen wrote:
>> A clarification on this: I completely agree that SWML is a valuable
>> step to making SW searchable and easily transported. However, SWML
>> as such does not handle the display of SW, only the storage. So
>> computer software that reads SWML will have to use some kind of
>> display process to make the SW data visual. This display process
>> could use SVG images, PNG images, or a Unicode font to provide the
>> displayed images depending on the program. So, we need to separate
>> the roles of SWML and display. SWML only has to do with storage and
>> retrieval of data, but not display.
>
>
> I see. Thanks for explaining this to me! So when Steve is using SWML
> to store data in SignPuddle, he is using PNGs to do the visual
> display of what the SWML says should be displayed? I wasn't aware of
> this...I am glad to know this...
>
>> Until SW is finally in Unicode, SW is just graphics because that is
>> the only display mechanism we have for SW. The value of SWML is
>> that we are now able to search it with a variety of programs. SW-DOS
>> by comparison probably could have been equally as searchable but
>> because of its binary format, that made it much more difficult
>> compared to SWML. But search capability and display capabilities are
>> two different "animals". The value of Unicode is simply this:
>> hearing people will probably not fully appreciate SW until it is
>> available in Unicode and it is able to be composed just like spoken
>> languages (in a manner of speaking). This is simply because it takes
>> much less room to store Unicode symbols than it does to store
>> graphic images. The display happens either way, but I'm talking
>> here more about "political" respect or the perceived reality of SW's
>> status as a genuine writing system.
>
>
> OK. What about SVG? I remember years ago, Antonio Carlos came to
> visit me from Brazil, and was eager to explain both SWML and SVG to
> me...I remember feeling amazed at the possibilities when he showed me
> a SignWriting symbol being drawn on the web in front of my eyes in
> SVG...Now that we see that SWML is really becoming important, I
> wonder if SVG isn't next?
>
> That does not mean that I don't think Unicode is a terrific idea...it
> is just that Unicode takes money and time, and if PNG display is the
> only alternative right now, then maybe SVG could be another
> alternative until Unicode is available for SignWriting?
>
> Did you know that the French have interest in developing a way to
> apply SignWriting to Unicode? I wonder if Mr. Dalle and Mr. Aznar
> from France wouldn't be interested in working with SIL on the Unicode
> project? Do you think SIL could be interested?...
>
>> Also, the use of Unicode will not make SWML obsolete. In fact, I
>> think that SWML will be even more useful because instead of having
>> special code numbers in the markup, we can actually embed the
>> Unicode character for that SW symbol. This will make SWML files more
>> compact and more easily read and further enhance its usefulness.
>> But that is a little more down the road until funding and resources
>> become available. Once funding is available, we can certainly begin
>> work on it and then just wait on a final submission until we feel
>> the IMWA is more stable.
>
>
> I see. Very interesting, Stuart! You know so much! ;-)
>
> Thanks for your patience with me and all those symbols in the
> IMWA!...I actually am not necessarily in favor of placing the whole
> IMWA into Unicode. I think we should do a Symbol-Frequency test on
> dictionaries to pin down the symbols that you really are using, and
> then use the Language-specific symbolset to be the first SignWriting
> Unicode...in other words...Unicode US, Unicode NO, etc...based on
> only those SignWriting symbols used in one language...why slow down
> the Unicode development for SignWriting, just because DanceWriting
> has not been entered into the IMWA yet? And is there really a Unicode
> for music sounds? No. So why should DanceWriting be in
> Unicode?...Unicode should be for SignWriting specific to one sign
> language...
>
> Just a thought. I will leave Unicode development to you and the next
> generation!
>
> Val ;-)
>
>