36.3531, Software: Kren-M: Meghalaya’s First Foundational AI Model for the Khasi Language

The LINGUIST List linguist at listserv.linguistlist.org
Wed Nov 19 17:05:03 UTC 2025


LINGUIST List: Vol-36-3531. Wed Nov 19 2025. ISSN: 1069 - 4875.

Subject: 36.3531, Software: Kren-M: Meghalaya’s First Foundational AI Model for the Khasi Language

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Daniel Swanson <daniel at linguistlist.org>

================================================================


Date: 18-Nov-2025
From: Badal Nyalang [nyalang at mwirelabs.com]
Subject: Kren-M: Meghalaya’s First Foundational AI Model for the Khasi Language


Kren-M™ is Meghalaya’s first foundational AI model and the first
open-source bilingual large language model (LLM) built specifically
for Khasi (ISO 639-3: kha), an Austroasiatic language spoken by ~1.4
million people in Meghalaya (Northeast India) and parts of
Bangladesh.

Khasi is an analytic, verb-initial language with rich derivational
morphology, phonemic aspiration and glottal contrasts, and a long
literary history. Despite its official status in Meghalaya, Khasi has
never been properly represented in modern AI systems or in large
language models for Northeast India.

Kren-M™ (2.6B parameters, built on a Gemma-2-2B base) is the first
generative Khasi-English foundational model trained end-to-end for
real bilingual performance.

Key Highlights

Custom Northeast tokenizer (Kren-NE) trained on mixed Khasi-Garo data,
extended with 2,135 new tokens
  → 36% token reduction on Khasi, 30% on Garo (A·chik)
  → Garo tokens already active, enabling smooth extension into a full
    Garo LLM.

Cleaned Khasi corpus of 5.43 million sentences.

Stable instruction tuning (33K samples) ensuring natural
code-switching, Khasi fluency, and correct bilingual behaviour without
auto-translation failures.
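
The token-reduction figures above can be read as a relative drop in token count when the same text is encoded with the extended tokenizer instead of the base one. The announcement does not say how the figures were measured, so the sketch below is only one plausible reading. The function name and the token counts are hypothetical and chosen purely for illustration; they are not measurements from Kren-M.

```python
def token_reduction(baseline_tokens: int, extended_tokens: int) -> float:
    """Return the percentage of tokens saved relative to the baseline.

    Assumption: both counts come from encoding the same text, once with
    the base tokenizer and once with the extended tokenizer.
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - extended_tokens) / baseline_tokens


# Hypothetical counts, not real measurements: a Khasi passage that needs
# 1000 tokens with the base tokenizer and 640 with the extended one.
baseline = 1000
extended = 640
print(f"{token_reduction(baseline, extended):.0f}% fewer tokens")
```

Fewer tokens for the same text means more Khasi or Garo content fits in a fixed context window, and generation takes fewer steps per sentence.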

All Resources

Model & Checkpoints: https://huggingface.co/MWirelabs/Kren-M
Project Page + Whitepaper: https://mwirelabs.com/models/kren-m

We have also released large, cleaned corpora for Assamese and Mizo, as
well as the first open Garo corpus, supporting future Northeast NLP
research.

Roadmap

Kren-M™ is the first step in the Kren-NE family, a planned
multilingual foundation model covering major Northeast Indian
languages (Khasi, Garo, Mizo, Assamese, Manipuri, Nagamese, and
others), to be built on a Gemma-2-9B backbone and targeted for early
2026.

Call for Collaboration

We welcome partnerships with linguists, dialectologists, educators,
and documentation teams working on Khasi, Garo, or any other Northeast
Indian language.

Linguistic Field(s): Applied Linguistics
                     Computational Linguistics
                     Language Documentation
                     Translation
                     Writing Systems

Subject Language(s): Assamese (asm)
                     Garo (grt)
                     Khasi (kha)
                     Lushai (lus)
                     Manipuri (mni)

Language Family(ies): Austro-Asiatic
                      Sino-Tibetan
                      Tibeto-Burman



