[Sw-l] Announcing version 2 of the Sutton SignWriting Core package for JavaScript
Valerie Sutton
sutton at signwriting.org
Sat Nov 30 23:52:58 UTC 2024
SignWriting List
November 30, 2024
Hello SW List members,
A big THANK YOU to you, Steve, for these new software developments for sorting dictionaries with the “null” symbol. I remember that, years ago, this was important for sorting properly by SignSpellings, and that Adam needed it for his dictionary work...
And I am just learning about “tokenizers” - all in all thank you - for all you do to give SignWriting developers needed tools -
I am enjoying watching all this unfold - thank you to all of you for your new Github libraries - so many new developments...
Val ;-)
Valerie Sutton
sutton at signwriting.org
---------------
> On Nov 30, 2024, at 2:19 PM, Steve Slevinski <slevin at signpuddle.net> wrote:
>
> Hi SignWriting List.
> I'm happy to announce that version 2 of the @sutton-signwriting/core package is now available on GitHub and npm. This update introduces two major features:
> • SignWriting Null Symbol (S00000 / U+40000) for enhanced sorting and advanced sequence strategies.
> • Tokenizer Functions for machine learning applications using 1180 SignWriting tokens with numerical encoding and decoding.
> GitHub: https://github.com/sutton-signwriting/core
> npm: https://www.npmjs.com/package/@sutton-signwriting/core
> Breaking Change: The Null Symbol
> Version 2 adds support for the SignWriting null symbol as S00000 for Formal SignWriting in ASCII (FSW) and U+40000 for SignWriting in Unicode (SWU). This is a breaking change because signs using the null symbol are not recognized by current tools and libraries. Although its use is limited, the null symbol introduces a range of possibilities for sorting and linguistic analysis.
> The null symbol was first published in January 2022 and is detailed in Appendix C of the Formal SignWriting draft specification.
> Formal SignWriting draft specification: https://www.ietf.org/archive/id/draft-slevinski-formal-signwriting-09.html#appendix-C
> Formal SignWriting now includes four types of symbols:
> • Null Symbol: For sorting and custom processing in sequences.
> • Writing Symbols: For standard sign representation.
> • Detailed Location Symbols: For enhanced spatial details.
> • Punctuation Symbols: For text-like structuring.
> A sign in Formal SignWriting is a two-part word:
> • Sequence (one-dimensional): An optional prefix of writing symbols, detailed location symbols, and the null symbol.
> • Signbox (two-dimensional): Contains writing symbols only; null and detailed location symbols are not permitted here.
> The null symbol supports sorting strategies like placing one-handed signs before two-handed ones. It also enables advanced strategies by filling sequence positions (e.g., torso, arm, hand) with the null symbol if a location is absent.
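A minimal sketch of the sorting idea above, not code from the package. It assumes a hypothetical two-slot sequence layout (non-dominant hand, then dominant hand) and a simplified symbol-key pattern; the names buildSequence, sequenceOf, and sortBySequence and the sample signs are illustrative only. Because S00000 compares lower than any other symbol key, a sign whose first slot is filled with the null symbol sorts ahead of signs that have a symbol in that slot.

```javascript
const NULL_SYMBOL = 'S00000';

// Simplified symbol key (assumption): 'S', 3 hex digits (base), fill 0-5, rotation 0-f.
const KEY = 'S[0-9a-f]{3}[0-5][0-9a-f]';

// Build a sequence prefix from fixed slots, filling absent slots with the null symbol.
function buildSequence(slots) {
  return 'A' + slots.map((key) => key ?? NULL_SYMBOL).join('');
}

// Extract the sequence symbols from a sign; returns [] when there is no prefix.
function sequenceOf(fswSign) {
  const prefix = new RegExp(`^A((?:${KEY})+)`).exec(fswSign);
  return prefix ? prefix[1].match(new RegExp(KEY, 'g')) : [];
}

// Sort signs by their sequence, compared as plain strings.
function sortBySequence(signs) {
  const keyOf = (sign) => sequenceOf(sign).join('');
  return [...signs].sort((a, b) => {
    const ka = keyOf(a);
    const kb = keyOf(b);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
}

// Hypothetical slot layout: [non-dominant hand, dominant hand].
const signbox = 'M518x529S14c20481x471';
const oneHanded = buildSequence([undefined, 'S14c20']) + signbox;
const twoHanded = buildSequence(['S10000', 'S14c20']) + signbox + 'S10000503x489';

console.log(sortBySequence([twoHanded, oneHanded]));
```

With this layout the one-handed sign comes first without any special-case comparison logic; a real dictionary would choose its own slot order.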
> Tokenizer Functions for Machine Learning
> Version 2 also introduces tokenizer functions tailored for machine learning. These use 1180 SignWriting tokens for numerical encoding and decoding, which makes FSW easier to feed into NLP pipelines such as Transformer-based models.
> Inspired by Amit's SignWriting Python library, which includes custom FSW tokenization, I recreated and extended its functionality for JavaScript. Bipin has further ported Amit's library to Flutter and Dart, adding visualizations and achieving rendering speeds 3,000 times faster than sutton-signwriting/font-db.
> • Amit's Python library: https://github.com/sign-language-processing/signwriting
> • Bipin's Flutter library: https://github.com/bipinkrish/signwriting-flutter
> • Bipin's Dart library: https://github.com/bipinkrish/signwriting-dart
> Features of the Tokenizer
> The tokenizer starts with DEFAULT_SPECIAL_TOKENS, commonly used in NLP frameworks. These can be customized by modifying index numbers, value strings, or adding new tokens.
> Default tokens:
>
> const DEFAULT_SPECIAL_TOKENS = [
>   { index: 0, name: 'UNK', value: '[UNK]' }, // unknown token
>   { index: 1, name: 'PAD', value: '[PAD]' }, // padding
>   { index: 2, name: 'CLS', value: '[CLS]' }, // classification / start of input
>   { index: 3, name: 'SEP', value: '[SEP]' }  // separator
> ];
>
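As a rough illustration of the customization described above, the sketch below copies the default list and appends one extra token at the next free index. The helper name withExtraToken and the '[MASK]' token are assumptions made for this example; how a customized list is handed to the package is described in its documentation and is not shown here.

```javascript
// Defaults as listed in the announcement.
const DEFAULT_SPECIAL_TOKENS = [
  { index: 0, name: 'UNK', value: '[UNK]' },
  { index: 1, name: 'PAD', value: '[PAD]' },
  { index: 2, name: 'CLS', value: '[CLS]' },
  { index: 3, name: 'SEP', value: '[SEP]' }
];

// Hypothetical helper: return a new list with one more token at the next free index.
// The original array is left unchanged.
function withExtraToken(tokens, name, value) {
  const nextIndex = Math.max(-1, ...tokens.map((t) => t.index)) + 1;
  return [...tokens, { index: nextIndex, name, value }];
}

const customTokens = withExtraToken(DEFAULT_SPECIAL_TOKENS, 'MASK', '[MASK]');
console.log(customTokens);
```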
> Utility functions:
> • Tokenize FSW: https://www.sutton-signwriting.io/core/#fswtokenize
> • Detokenize FSW: https://www.sutton-signwriting.io/core/#fswdetokenize
> • Chunk Tokens: https://www.sutton-signwriting.io/core/#fswchunktokens
> The tokenizer generator creates an object with properties for encoding, decoding, and vocabulary management:
> https://www.sutton-signwriting.io/core/#fswcreatetokenizer
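To make "encoding, decoding, and vocabulary management" concrete, here is a toy stand-in written from scratch. It is not the package's tokenizer: createToyTokenizer, its return shape, and the placeholder token strings are all assumptions for illustration. It only shows the general pattern of mapping token strings to numbers and back, with unknown strings falling back to the '[UNK]' index.

```javascript
// Toy vocabulary: special tokens keep their fixed indexes,
// other tokens are numbered after the highest special index.
// Assumes the special tokens include one whose value is '[UNK]'.
function createToyTokenizer(specialTokens, tokens) {
  const i2s = new Map(specialTokens.map((t) => [t.index, t.value]));
  let next = Math.max(-1, ...i2s.keys()) + 1;
  for (const token of tokens) i2s.set(next++, token);
  const s2i = new Map([...i2s].map(([index, value]) => [value, index]));
  const unk = s2i.get('[UNK]');
  return {
    vocabSize: i2s.size,
    encode: (list) => list.map((t) => s2i.get(t) ?? unk),
    decode: (ids) => ids.map((i) => i2s.get(i) ?? '[UNK]')
  };
}

const special = [
  { index: 0, name: 'UNK', value: '[UNK]' },
  { index: 1, name: 'PAD', value: '[PAD]' }
];

// Placeholder token strings, not the package's actual 1180-token vocabulary.
const toy = createToyTokenizer(special, ['A', 'M', 'S14c20']);

const ids = toy.encode(['M', 'S14c20', 'not-in-vocab']);
console.log(ids, toy.decode(ids));
```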
> Note: The tokenizer currently supports Formal SignWriting in ASCII (FSW). To use it with SignWriting in Unicode (SWU), convert to FSW first.
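The sketch below illustrates the symbol-level part of such a conversion only. It assumes the mapping S00000 to U+40000 from this announcement and S10000 to U+40001, with 96 code points per base symbol (6 fills by 16 rotations). The function names are made up for this example, and it does not handle coordinates or box markers, so it is not a full SWU to FSW converter.

```javascript
const NULL_CODE_POINT = 0x40000; // null symbol, S00000
const FIRST_SYMBOL = 0x40001;    // assumed code point of symbol key S10000

// Convert one SWU symbol character to its FSW symbol key.
function swuSymbolToKey(swuSymbol) {
  const codePoint = swuSymbol.codePointAt(0);
  if (codePoint === NULL_CODE_POINT) return 'S00000';
  const offset = codePoint - FIRST_SYMBOL;
  const base = Math.floor(offset / 96) + 0x100;
  const fill = Math.floor((offset % 96) / 16);
  const rotation = offset % 16;
  return 'S' + base.toString(16) + fill.toString(16) + rotation.toString(16);
}

// Convert one FSW symbol key to its SWU symbol character.
function keyToSwuSymbol(key) {
  if (key === 'S00000') return String.fromCodePoint(NULL_CODE_POINT);
  const base = parseInt(key.slice(1, 4), 16) - 0x100;
  const fill = parseInt(key.slice(4, 5), 16);
  const rotation = parseInt(key.slice(5, 6), 16);
  return String.fromCodePoint(FIRST_SYMBOL + base * 96 + fill * 16 + rotation);
}

console.log(swuSymbolToKey(keyToSwuSymbol('S14c20')));
```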
> Thank you for reading!
> –Steve