36.2795, Software: GaroVec: Word Embeddings for A’chik/Garo Language Technology

The LINGUIST List linguist at listserv.linguistlist.org
Wed Sep 17 14:05:02 UTC 2025


LINGUIST List: Vol-36-2795. Wed Sep 17 2025. ISSN: 1069 - 4875.

Subject: 36.2795, Software: GaroVec: Word Embeddings for A’chik/Garo Language Technology

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Daniel Swanson <daniel at linguistlist.org>

================================================================


Date: 16-Sep-2025
From: B Nyalang [nyalang at mwirelabs.com]
Subject: GaroVec: Word Embeddings for A’chik/Garo Language Technology


GaroVec is a set of static word embeddings trained on curated
monolingual corpora in Garo (A’chik), a language spoken across
Meghalaya and other parts of Northeast India. Developed by MWire
Labs, this resource is part of a growing effort to support inclusive,
regionally grounded NLP for underrepresented languages.

Linguistic Context

Garo belongs to the Tibeto-Burman family and is widely spoken in
districts such as West Garo Hills, East Garo Hills, and South Garo
Hills. Despite its vitality, Garo remains digitally underserved,
especially in foundational NLP infrastructure. GaroVec aims to
support a range of downstream tasks while respecting the linguistic
diversity and cultural depth of A’chik communities.

Model Overview

-       Type: Static word embeddings (FastText-style)
-       Dimensions: 300
-       Training Data: Cleaned and deduplicated Garo monolingual
        corpora
-       License: CC BY 4.0, permissive for research, civic tech, and
        educational use with attribution
-       Hosted on: Hugging Face with full documentation
        https://huggingface.co/MWirelabs/GaroVec
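
For readers unfamiliar with the "FastText-style" label: models of
this kind typically build a word's vector from the vectors of its
character n-grams, which lets them produce a vector even for a word
form not seen in training. The sketch below only illustrates that
n-gram extraction step. It assumes the common FastText defaults
(n-grams of length 3 to 6, with "<" and ">" as word boundary
markers); the announcement does not state which settings GaroVec
used, and the function name char_ngrams is hypothetical.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Return the character n-grams of a word, FastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    can be told apart from word-internal character sequences.
    min_n and max_n are assumed defaults, not confirmed GaroVec settings.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


if __name__ == "__main__":
    # Illustrative input only; any word string works.
    print(char_ngrams("garo", min_n=3, max_n=4))
```

In a subword model, each of these n-grams has its own vector, and
the word vector is composed from them together with the vector of
the whole word when it is in the vocabulary.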

Use Cases

GaroVec is designed to be modular and adaptable. It can support:
-       Semantic search and clustering
-       Text classification and topic modeling
-       Dialectal variation analysis
-       Educational tools and civic applications
-       Cross-lingual transfer for low-resource modeling
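
As a rough illustration of the semantic search use case, static
embeddings are usually queried by ranking words on cosine similarity
to a query vector. The sketch below uses only the Python standard
library and a toy three-dimensional vocabulary with placeholder keys
("word_a" and so on); these are not real GaroVec vectors, and the
function names cosine and nearest are hypothetical. With the actual
300-dimensional vectors, the same ranking logic would apply.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(query: str, vectors: Dict[str, List[float]],
            k: int = 3) -> List[Tuple[str, float]]:
    """Return the k words most similar to `query`, excluding the query itself."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors with placeholder keys; NOT real GaroVec data.
toy_vectors = {
    "word_a": [1.0, 0.0, 0.0],
    "word_b": [0.9, 0.1, 0.0],
    "word_c": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    for word, score in nearest("word_a", toy_vectors, k=2):
        print(word, round(score, 3))
```

Clustering and simple text classification can reuse the same
similarity function, for example by averaging the vectors of the
words in a document before comparing documents.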

Inclusive Design

This model is part of a broader movement to make language technology
more inclusive, especially for communities whose languages are often
overlooked in mainstream NLP. GaroVec is released with permissive
licensing and timestamped provenance to encourage reuse, adaptation,
and collaboration.

Linguistic Field(s): Computational Linguistics
                     General Linguistics
                     Language Documentation
                     Semantics
                     Text/Corpus Linguistics

Subject Language(s): Garo (grt)

Language Family(ies): Tibeto-Burman


