[Sw-l] Announcing version 2 of the Sutton SignWriting Core package for JavaScript

Valerie Sutton sutton at signwriting.org
Sat Nov 30 23:52:58 UTC 2024


SignWriting List
November 30, 2024

Hello SW List members,

A big THANK YOU to you, Steve, for these new software developments for sorting dictionaries with “null”, which I remember was important years ago for sorting by SignSpellings properly. I remember Adam needed this for his dictionary work...

And I am just learning about “tokenizers”. All in all, thank you for all you do to give SignWriting developers the tools they need.

I am enjoying watching all this unfold - thank you to all of you for your new GitHub libraries - so many new developments...

Val ;-)


Valerie Sutton
sutton at signwriting.org

---------------

> On Nov 30, 2024, at 2:19 PM, Steve Slevinski <slevin at signpuddle.net> wrote:
> 
> Hi SignWriting List.
> I'm happy to announce that Version 2 of the @sutton-signwriting/core package is now available on GitHub and npm. This update introduces two major features:
>     • SignWriting Null Symbol (S00000 / U+40000) for enhanced sorting and advanced sequence strategies.
>     • Tokenizer Functions for machine learning applications using 1180 SignWriting tokens with numerical encoding and decoding.
> GitHub: https://github.com/sutton-signwriting/core
> npm: https://www.npmjs.com/package/@sutton-signwriting/core
> Breaking Change: The Null Symbol
> Version 2 adds support for the SignWriting null symbol as S00000 for Formal SignWriting in ASCII (FSW) and U+40000 for SignWriting in Unicode (SWU). This is a breaking change because signs using the null symbol are not recognized by current tools and libraries. Although its use is limited, the null symbol introduces a range of possibilities for sorting and linguistic analysis.
> The null symbol was first published in January 2022 and is detailed in Appendix C of the Formal SignWriting draft specification.
> Formal SignWriting draft specification: https://www.ietf.org/archive/id/draft-slevinski-formal-signwriting-09.html#appendix-C
> Formal SignWriting now includes four types of symbols:
>     • Null Symbol: For sorting and custom processing in sequences.
>     • Writing Symbols: For standard sign representation.
>     • Detailed Location Symbols: For enhanced spatial details.
>     • Punctuation Symbols: For text-like structuring.
> A sign in Formal SignWriting is a two-part word:
>     • Sequence (one-dimensional): An optional prefix of writing symbols, detailed location symbols, and the null symbol.
>     • Signbox (two-dimensional): Contains writing symbols only; null and detailed location symbols are not permitted here.
> The null symbol supports sorting strategies like placing one-handed signs before two-handed ones. It also enables advanced strategies by filling sequence positions (e.g., torso, arm, hand) with the null symbol if a location is absent.
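
The slot-filling strategy described above can be sketched in a few lines of JavaScript. This is only an illustration and not part of the core package: the slot names (hand1, hand2, movement), the helper names, and the sample symbol keys are assumptions chosen for the example. It relies only on S00000 comparing lower than every other symbol key as a plain string.

```javascript
// Hypothetical sketch: fixed-slot sort keys using the null symbol.
const NULL_SYMBOL = 'S00000';
const SLOTS = ['hand1', 'hand2', 'movement']; // assumed slot order

// Build a sort key by filling every absent slot with the null symbol.
function sortKey(sign) {
  return SLOTS.map((slot) => sign[slot] ?? NULL_SYMBOL).join('');
}

// Compare two signs by their padded keys (plain string comparison).
function compareSigns(a, b) {
  const ka = sortKey(a);
  const kb = sortKey(b);
  return ka < kb ? -1 : ka > kb ? 1 : 0;
}

// Illustrative signs: same first hand and movement, one has a second hand.
const twoHanded = { name: 'two-handed', hand1: 'S10000', hand2: 'S10008', movement: 'S22a00' };
const oneHanded = { name: 'one-handed', hand1: 'S10000', movement: 'S22a00' };

const sorted = [twoHanded, oneHanded].sort(compareSigns).map((s) => s.name);
console.log(sorted);
```

Without the padding, the one-handed sequence would be compared as S10000 S22a00 against S10000 S10008 S22a00, and the two-handed sign would sort first because S10008 is lower than S22a00. Filling the empty second-hand slot with S00000 puts the one-handed sign first.
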
> Tokenizer Functions for Machine Learning
> Version 2 also introduces tokenizer functions tailored for machine learning. These use 1180 SignWriting tokens for numerical encoding and decoding, enhancing compatibility with NLP frameworks like Transformer-based models.
> Inspired by Amit's SignWriting Python library, which includes custom FSW tokenization, I recreated and extended its functionality for JavaScript. Bipin has further ported Amit's library to Flutter and Dart, adding visualizations and achieving rendering speeds 3,000 times faster than sutton-signwriting/font-db.
>     • Amit's Python library: https://github.com/sign-language-processing/signwriting
>     • Bipin's Flutter library: https://github.com/bipinkrish/signwriting-flutter
>     • Bipin's Dart library: https://github.com/bipinkrish/signwriting-dart
> Features of the Tokenizer
> The tokenizer starts with DEFAULT_SPECIAL_TOKENS, commonly used in NLP frameworks. These can be customized by modifying index numbers, value strings, or adding new tokens.
> Default tokens:
> 
> // Default special tokens: vocabulary index, short name, and token string.
> const DEFAULT_SPECIAL_TOKENS = [
>   { index: 0, name: 'UNK', value: '[UNK]' },
>   { index: 1, name: 'PAD', value: '[PAD]' },
>   { index: 2, name: 'CLS', value: '[CLS]' },
>   { index: 3, name: 'SEP', value: '[SEP]' }
> ];
> 
> Utility functions:
>     • Tokenize FSW: https://www.sutton-signwriting.io/core/#fswtokenize
>     • Detokenize FSW: https://www.sutton-signwriting.io/core/#fswdetokenize
>     • Chunk Tokens: https://www.sutton-signwriting.io/core/#fswchunktokens
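
To give a feel for what tokenizing, detokenizing, and chunking involve, here is a self-contained sketch that does not call the package. The token shapes (a box marker, pNNN for coordinates, Sxxx for the symbol base, cN for fill, rN for rotation) are an assumption loosely modeled on the style of Amit's Python tokenizer, and the function names are hypothetical. It handles only a sign with a signbox and no sequence prefix. Please check the linked documentation for the real behavior.

```javascript
// Hypothetical sketch; token shapes are assumptions, not the package's output.
const SIGNBOX = /^([BLMR])(\d{3})x(\d{3})((?:S[0-9a-f]{3}[0-5][0-9a-f]\d{3}x\d{3})*)$/;
const SPATIAL = /S([0-9a-f]{3})([0-5])([0-9a-f])(\d{3})x(\d{3})/g;

// Split a signbox-only FSW sign into small tokens.
function sketchTokenize(fsw) {
  const m = SIGNBOX.exec(fsw);
  if (!m) throw new Error('not a signbox-only FSW sign: ' + fsw);
  const tokens = [m[1], 'p' + m[2], 'p' + m[3]];
  for (const [, base, fill, rot, x, y] of m[4].matchAll(SPATIAL)) {
    tokens.push('S' + base, 'c' + fill, 'r' + rot, 'p' + x, 'p' + y);
  }
  return tokens;
}

// Rebuild the FSW string; coordinate tokens arrive in x, y pairs.
function sketchDetokenize(tokens) {
  let out = '';
  let expectY = false;
  for (const t of tokens) {
    if (t.startsWith('[')) continue; // skip special tokens such as [PAD]
    if (/^p\d{3}$/.test(t)) {
      out += (expectY ? 'x' : '') + t.slice(1);
      expectY = !expectY;
    } else if (/^[cr][0-9a-f]$/.test(t)) {
      out += t.slice(1); // fill or rotation digit
    } else {
      out += t; // box marker or symbol base
    }
  }
  return out;
}

// Cut a token list into fixed-size chunks, padding the last one.
function sketchChunk(tokens, size, pad = '[PAD]') {
  const chunks = [];
  for (let i = 0; i < tokens.length; i += size) {
    const chunk = tokens.slice(i, i + size);
    while (chunk.length < size) chunk.push(pad);
    chunks.push(chunk);
  }
  return chunks;
}

const sign = 'M518x533S1870a489x515S18701482x490';
const tokens = sketchTokenize(sign);
const chunks = sketchChunk(tokens, 8);
console.log(tokens.join(' '));
console.log(sketchDetokenize(chunks.flat()) === sign);
```
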
> The tokenizer generator creates an object with properties for encoding, decoding, and vocabulary management:
> https://www.sutton-signwriting.io/core/#fswcreatetokenizer
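
The idea of a tokenizer object with encoding, decoding, and a vocabulary can also be sketched without the package. Everything below is hypothetical: the function names, the vocabulary breakdown, and the index ordering are assumptions, and the [MASK] token is only an example of adding a custom special token. The breakdown used here (5 markers, the null symbol base, 652 symbol bases, 6 fills, 16 rotations, 500 coordinates) happens to total 1180, but the package's actual vocabulary and ordering may differ.

```javascript
// Default special tokens as listed in the announcement.
const DEFAULT_SPECIAL_TOKENS = [
  { index: 0, name: 'UNK', value: '[UNK]' },
  { index: 1, name: 'PAD', value: '[PAD]' },
  { index: 2, name: 'CLS', value: '[CLS]' },
  { index: 3, name: 'SEP', value: '[SEP]' }
];

// Assumed vocabulary layout (hypothetical, for illustration only).
function buildSketchVocabulary() {
  const hex = (n) => n.toString(16);
  const tokens = ['A', 'B', 'L', 'M', 'R', 'S000'];
  for (let b = 0x100; b <= 0x38b; b++) tokens.push('S' + hex(b)); // symbol bases
  for (let c = 0; c < 6; c++) tokens.push('c' + hex(c));          // fills
  for (let r = 0; r < 16; r++) tokens.push('r' + hex(r));         // rotations
  for (let p = 250; p < 750; p++) tokens.push('p' + p);           // coordinates
  return tokens;
}

// Hypothetical tokenizer generator: special tokens keep their given
// indices, regular tokens take the lowest indices that remain free.
function createSketchTokenizer(specialTokens = DEFAULT_SPECIAL_TOKENS) {
  const i2s = new Map();
  for (const t of specialTokens) i2s.set(t.index, t.value);
  let next = 0;
  for (const token of buildSketchVocabulary()) {
    while (i2s.has(next)) next++;
    i2s.set(next, token);
  }
  const s2i = new Map([...i2s].map(([i, s]) => [s, i]));
  // This sketch assumes an UNK entry is present among the special tokens.
  const unk = specialTokens.find((t) => t.name === 'UNK');
  return {
    size: i2s.size,
    encode: (tokens) => tokens.map((t) => s2i.get(t) ?? unk.index),
    decode: (ids) => ids.map((i) => i2s.get(i) ?? unk.value)
  };
}

const tokenizer = createSketchTokenizer();
const ids = tokenizer.encode(['M', 'p518', 'p533', 'S187', 'c0', 'ra']);
console.log(ids, tokenizer.decode(ids));

// Customizing: add a hypothetical [MASK] token at index 4.
const custom = createSketchTokenizer([
  ...DEFAULT_SPECIAL_TOKENS,
  { index: 4, name: 'MASK', value: '[MASK]' }
]);
console.log(custom.size);
```
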
> Note: The tokenizer currently supports Formal SignWriting in ASCII (FSW). To use it with SignWriting in Unicode (SWU), convert to FSW first.
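
For the symbols themselves, the SWU to FSW step is simple arithmetic, sketched below with hypothetical function names. It assumes the usual layout in which S10000 is U+40001 and each symbol base spans 96 code points (6 fills by 16 rotations), with the null symbol handled as the special pair S00000 / U+40000. Sign markers and coordinates are not covered here, so for whole signs the package's own conversion utilities are the better choice.

```javascript
// Hypothetical sketch: convert one SWU symbol character to an FSW symbol key.
function swuSymbolToKey(swuSym) {
  const code = swuSym.codePointAt(0);
  if (code === 0x40000) return 'S00000'; // null symbol
  const offset = code - 0x40001;
  const base = Math.floor(offset / 96);
  const fill = Math.floor((offset % 96) / 16);
  const rotation = offset % 16;
  return 'S' + (base + 0x100).toString(16) + fill.toString(16) + rotation.toString(16);
}

// Inverse direction: FSW symbol key to one SWU symbol character.
function keyToSwuSymbol(key) {
  if (key === 'S00000') return String.fromCodePoint(0x40000); // null symbol
  const base = parseInt(key.slice(1, 4), 16) - 0x100;
  const fill = parseInt(key[4], 16);
  const rotation = parseInt(key[5], 16);
  return String.fromCodePoint(0x40001 + base * 96 + fill * 16 + rotation);
}

console.log(swuSymbolToKey(String.fromCodePoint(0x40000)));
console.log(swuSymbolToKey(String.fromCodePoint(0x40001)));
console.log(swuSymbolToKey(keyToSwuSymbol('S1870a')));
```
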
> Thank you for reading!
> –Steve
> _______________________________________________
> Sw-l mailing list
> Sw-l at listserv.linguistlist.org
> https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/sw-l


