[Sw-l] Announcing version 2 of the Sutton SignWriting Core package for JavaScript

Steve Slevinski slevin at signpuddle.net
Sat Nov 30 22:19:55 UTC 2024


Hi SignWriting List.

I'm happy to announce that version 2 of the |@sutton-signwriting/core| 
package is now available on GitHub and npm. This update introduces two 
major features:

 1. *SignWriting Null Symbol (S00000 / U+40000)* for enhanced sorting
    and advanced sequence strategies.
 2. *Tokenizer Functions* for machine learning applications using 1180
    SignWriting tokens with numerical encoding and decoding.

GitHub: https://github.com/sutton-signwriting/core
npm: https://www.npmjs.com/package/@sutton-signwriting/core

------------------------------------------------------------------------


      Breaking Change: The Null Symbol

Version 2 adds support for the SignWriting null symbol as S00000 for 
Formal SignWriting in ASCII (FSW) and U+40000 for SignWriting in Unicode 
(SWU). This is a breaking change because signs using the null symbol are 
not recognized by current tools and libraries. Although its use is 
limited, the null symbol introduces a range of possibilities for sorting 
and linguistic analysis.

The null symbol was first published in January 2022 and is detailed in 
Appendix C of the Formal SignWriting draft specification.

Formal SignWriting draft specification: 
https://www.ietf.org/archive/id/draft-slevinski-formal-signwriting-09.html#appendix-C
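
As an illustration of how the two encodings of the null symbol line up, here is a minimal sketch that maps an FSW symbol key to its SWU code point. The null mapping (S00000 to U+40000) comes from this announcement; the offset arithmetic for all other symbols is an assumption based on a reading of the draft specification, and the function names are hypothetical. The package's own conversion utilities should be preferred in real code.

```javascript
// Hypothetical helpers for illustration; not the package's API.
const NULL_KEY = 'S00000';
const SYMBOL_KEY = /^S[123][0-9a-f]{2}[0-5][0-9a-f]$/;

function keyToCodePoint(key) {
  // The null symbol sits directly before the first regular symbol.
  if (key === NULL_KEY) return 0x40000;
  if (!SYMBOL_KEY.test(key)) {
    throw new RangeError(`not a symbol key: ${key}`);
  }
  const base = parseInt(key.slice(1, 4), 16);     // 0x100 and up
  const fill = parseInt(key.slice(4, 5), 16);     // 0 to 5
  const rotation = parseInt(key.slice(5, 6), 16); // 0 to f
  // Assumption: 96 code points (6 fills x 16 rotations) per base symbol.
  return 0x40001 + (base - 0x100) * 96 + fill * 16 + rotation;
}

function keyToSwu(key) {
  return String.fromCodePoint(keyToCodePoint(key));
}
```

Under these assumptions, |keyToSwu('S00000')| yields the single character U+40000, and |keyToSwu('S10000')| yields U+40001.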

Formal SignWriting now includes four types of symbols:

  * *Null Symbol*: For sorting and custom processing in sequences.
  * *Writing Symbols*: For standard sign representation.
  * *Detailed Location Symbols*: For enhanced spatial details.
  * *Punctuation Symbols*: For text-like structuring.

A sign in Formal SignWriting is a two-part word:

  * *Sequence* (one-dimensional): An optional prefix of writing symbols,
    detailed location symbols, and the null symbol.
  * *Signbox* (two-dimensional): Contains writing symbols only; null and
    detailed location symbols are not permitted here.
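
The two-part structure can be sketched with a regular expression that accepts the null symbol in the sequence but rejects it in the signbox. This is a simplified sketch under stated assumptions: the patterns follow a reading of the draft specification, |splitSign| is a hypothetical helper, and the sketch does not distinguish detailed location symbols from writing symbols or handle punctuation.

```javascript
// Hypothetical helper for illustration; not the package's parser.
const NULL_KEY = 'S00000';
const SYMBOL = 'S[123][0-9a-f]{2}[0-5][0-9a-f]';
const COORD = '[0-9]{3}x[0-9]{3}';

// Sequence: 'A' followed by symbol keys; the null symbol is allowed here.
const SEQUENCE = `A(?:${NULL_KEY}|${SYMBOL})+`;
// Signbox: marker (B, L, M, or R), a coordinate, then positioned symbols.
// The null symbol is not part of this pattern.
const SIGNBOX = `[BLMR]${COORD}(?:${SYMBOL}${COORD})*`;
const SIGN = new RegExp(`^(${SEQUENCE})?(${SIGNBOX})$`);

function splitSign(fswSign) {
  const match = SIGN.exec(fswSign);
  if (!match) return null; // not a sign this sketch recognizes
  return { sequence: match[1] || '', signbox: match[2] };
}
```

For example, |splitSign('AS10e00S00000M507x515S10e00492x485')| separates the sequence |AS10e00S00000| from the signbox, while a string that places S00000 inside the signbox is rejected.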

The null symbol supports sorting strategies like placing one-handed 
signs before two-handed ones. It also enables advanced strategies by 
filling sequence positions (e.g., torso, arm, hand) with the null symbol 
if a location is absent.
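
A minimal sketch of the slot-filling idea follows. The two-slot layout (dominant hand, other hand) and the |buildSequence| helper are hypothetical, chosen only to show why a null-filled slot sorts first: the key S00000 compares lower than every other symbol key in a plain string comparison.

```javascript
const NULL_KEY = 'S00000';

// Hypothetical slot layout for illustration: [dominant hand, other hand].
// Any empty slot is filled with the null symbol.
function buildSequence(slots) {
  return 'A' + slots.map((key) => key || NULL_KEY).join('');
}

const signs = [
  { gloss: 'two-handed', sequence: buildSequence(['S10000', 'S10008']) },
  { gloss: 'one-handed', sequence: buildSequence(['S10000', undefined]) },
];

// Plain string comparison on the sequence prefix.
signs.sort((a, b) => {
  if (a.sequence < b.sequence) return -1;
  if (a.sequence > b.sequence) return 1;
  return 0;
});

const sortedGlosses = signs.map((sign) => sign.gloss);
```

With this layout the one-handed entry sorts ahead of the two-handed one, because its second slot holds S00000.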

------------------------------------------------------------------------


      Tokenizer Functions for Machine Learning

Version 2 also introduces tokenizer functions tailored for machine 
learning. These use 1180 SignWriting tokens for numerical encoding and 
decoding, which makes FSW text easier to feed into Transformer-based 
models and other NLP tooling.

Inspired by Amit's SignWriting Python library, which includes custom FSW 
tokenization, I recreated and extended its functionality for JavaScript. 
Bipin has further ported Amit's library to Flutter and Dart, adding 
visualizations and achieving rendering speeds 3,000 times faster than 
|sutton-signwriting/font-db|.

  * Amit's Python library:
    https://github.com/sign-language-processing/signwriting
  * Bipin's Flutter library:
    https://github.com/bipinkrish/signwriting-flutter
  * Bipin's Dart library: https://github.com/bipinkrish/signwriting-dart

------------------------------------------------------------------------


        Features of the Tokenizer

The tokenizer starts with *DEFAULT_SPECIAL_TOKENS*, a set of special 
tokens commonly used in NLP frameworks. These can be customized by 
changing index numbers or value strings, or by adding new tokens.

Default tokens:

    DEFAULT_SPECIAL_TOKENS = [
      { index: 0, name: 'UNK', value: '[UNK]' },
      { index: 1, name: 'PAD', value: '[PAD]' },
      { index: 2, name: 'CLS', value: '[CLS]' },
      { index: 3, name: 'SEP', value: '[SEP]' }
    ];
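
A customized list might look like the following sketch. The defaults are copied from this message; the added MASK token and the uniqueness check are hypothetical additions for illustration, and how the list is handed to the package is left to the linked documentation.

```javascript
// Defaults as listed in this announcement.
const DEFAULT_SPECIAL_TOKENS = [
  { index: 0, name: 'UNK', value: '[UNK]' },
  { index: 1, name: 'PAD', value: '[PAD]' },
  { index: 2, name: 'CLS', value: '[CLS]' },
  { index: 3, name: 'SEP', value: '[SEP]' },
];

// Copy the defaults so the originals stay untouched, then append a
// hypothetical extra token at the next free index.
const customTokens = [
  ...DEFAULT_SPECIAL_TOKENS.map((token) => ({ ...token })),
  { index: 4, name: 'MASK', value: '[MASK]' },
];

// Sanity check: every special token needs its own index.
const indexes = new Set(customTokens.map((token) => token.index));
const hasUniqueIndexes = indexes.size === customTokens.length;
```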

Utility functions:

  * Tokenize FSW: https://www.sutton-signwriting.io/core/#fswtokenize
  * Detokenize FSW: https://www.sutton-signwriting.io/core/#fswdetokenize
  * Chunk Tokens: https://www.sutton-signwriting.io/core/#fswchunktokens

The tokenizer generator creates an object with properties for encoding, 
decoding, and vocabulary management:
https://www.sutton-signwriting.io/core/#fswcreatetokenizer
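
To illustrate the general pattern of such an object, here is a toy vocabulary-based tokenizer. It is not the package's implementation or API: |createToyTokenizer|, its property names, and the short token list are all hypothetical, and the real vocabulary has 1180 tokens.

```javascript
// Toy sketch of an encode/decode object; names are hypothetical.
function createToyTokenizer(specialTokens, tokens) {
  const unknown = specialTokens.find((token) => token.name === 'UNK');
  if (!unknown) throw new Error('an UNK special token is required');

  const tokenToIndex = new Map();
  const indexToToken = new Map();
  for (const { index, value } of specialTokens) {
    tokenToIndex.set(value, index);
    indexToToken.set(index, value);
  }

  // Regular tokens are numbered after the highest special-token index.
  let next = Math.max(...specialTokens.map((token) => token.index)) + 1;
  for (const token of tokens) {
    if (tokenToIndex.has(token)) continue;
    tokenToIndex.set(token, next);
    indexToToken.set(next, token);
    next += 1;
  }

  return {
    vocabSize: tokenToIndex.size,
    // Unknown strings fall back to the UNK index.
    encode: (list) =>
      list.map((t) => (tokenToIndex.has(t) ? tokenToIndex.get(t) : unknown.index)),
    // Unknown indexes fall back to the UNK value.
    decode: (ids) =>
      ids.map((i) => (indexToToken.has(i) ? indexToToken.get(i) : unknown.value)),
  };
}

const toy = createToyTokenizer(
  [
    { index: 0, name: 'UNK', value: '[UNK]' },
    { index: 1, name: 'PAD', value: '[PAD]' },
    { index: 2, name: 'CLS', value: '[CLS]' },
    { index: 3, name: 'SEP', value: '[SEP]' },
  ],
  ['A', 'M', 'p507', 'p515'] // placeholder token strings
);

const encoded = toy.encode(['M', 'p507', 'not-in-vocab']);
const decoded = toy.decode(encoded);
```

In this toy setup the unknown string encodes to the UNK index and decodes back to |[UNK]|, which is the usual round-trip behavior for out-of-vocabulary input.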

*Note*: The tokenizer currently supports Formal SignWriting in ASCII 
(FSW). To use it with SignWriting in Unicode (SWU), convert to FSW first.

Thank you for reading!
–Steve