36.3581, Software: NE-BERT: Northeast India's First Multilingual Model for 9 Languages of the Region
The LINGUIST List
linguist at listserv.linguistlist.org
Fri Nov 21 21:05:02 UTC 2025
LINGUIST List: Vol-36-3581. Fri Nov 21 2025. ISSN: 1069 - 4875.
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Daniel Swanson <daniel at linguistlist.org>
================================================================
Date: 21-Nov-2025
From: Badal Nyalang [nyalang at mwirelabs.com]
Subject: NE-BERT: Northeast India's First Multilingual Model for 9 Languages of the Region
We are pleased to announce the release of NE-BERT, a state-of-the-art
domain-specific foundation model designed to bridge the digital divide
for the languages of Northeast India. Built on the ModernBERT
architecture, NE-BERT provides open-source, high-performance language
representations for 9 underserved languages of the region, spanning
the Tibeto-Burman, Austroasiatic, and Indo-Aryan families.
Linguistic Motivation: Standard multilingual models (such as mBERT or
IndicBERT) often perform poorly on Northeast Indian languages because
of the "curse of multilinguality," in which high-resource languages
crowd out low-resource ones. In addition, standard subword tokenizers
often fragment the highly agglutinative morphology of languages like
Mizo and Garo into meaningless character strings.
NE-BERT addresses these problems through:
- Curated Corpora: Trained on 8.3 million sentences, with smart-weighted
  upsampling for micro-languages like Pnar and Kokborok.
- Morphological Optimization: Uses a custom SentencePiece Unigram
  tokenizer that achieves 1.6x better token fertility (subword tokens
  per word) than mBERT, preserving semantic root integrity.
- Architecture: Builds on ModernBERT (Flash Attention 2, Rotary
  Embeddings) to support an 8192-token context window, crucial for
  processing long-form cultural and legal texts.
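The announcement does not say how the upsampling weights or the fertility figure were computed. As a rough illustration only, the sketch below uses exponent-smoothed sampling (a common way to upsample small corpora, swapped in here as an assumption) and the usual definition of fertility as mean subword tokens per word. The function names, the language labels, the sentence counts, and the toy tokenizers are all hypothetical.

```python
from typing import Callable, Dict, List


def sampling_probs(sentence_counts: Dict[str, int], alpha: float = 0.5) -> Dict[str, float]:
    """Return per-language sampling probabilities proportional to n ** alpha.

    alpha = 1.0 keeps the raw corpus proportions; smaller values shift
    probability mass toward languages with fewer sentences.
    """
    if not sentence_counts:
        raise ValueError("sentence_counts must not be empty")
    scaled = {lang: count ** alpha for lang, count in sentence_counts.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


def fertility(tokenize: Callable[[str], List[str]], words: List[str]) -> float:
    """Return the mean number of subword tokens per word (lower = less fragmentation)."""
    if not words:
        raise ValueError("words must not be empty")
    return sum(len(tokenize(word)) for word in words) / len(words)


# Hypothetical sentence counts, not figures from the NE-BERT corpus.
counts = {"large_language": 1_000_000, "micro_language": 10_000}
probs = sampling_probs(counts, alpha=0.5)

# Toy tokenizers standing in for real ones: one splits into characters,
# the other keeps each word whole.
sample_words = ["abc", "de"]
char_level = fertility(list, sample_words)
whole_word = fertility(lambda word: [word], sample_words)
```

With these made-up counts, the smaller language's share rises from about 1% of the raw sentences to about 9% of the samples. To compare real tokenizers, pass each tokenizer's own tokenize function and the same word list to `fertility`.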
Supported Languages: The model achieves regional State-of-the-Art
(SOTA) perplexity on:
- Austroasiatic: Khasi, Pnar
- Tibeto-Burman: Meitei (Manipuri), Mizo, Garo, Kokborok, Nyishi
- Indo-Aryan: Assamese, Nagamese Creole (distinguished from Naga tribal
  languages)
Availability: The model weights, tokenizer, and inference code are
openly available under a CC-BY-4.0 license.
Hugging Face Hub: https://huggingface.co/MWirelabs/ne-bert
Web Demo: https://huggingface.co/spaces/MWirelabs/ne-bert-demo
Documentation: https://mwirelabs.com/ne-bert
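For readers who want to try the weights from the Hub, the following is a minimal sketch rather than the project's official inference code. It assumes a recent release of the Hugging Face `transformers` library with ModernBERT support, and that the checkpoint exposes a masked-language-modelling head usable through the `fill-mask` pipeline. The helper names and the `{mask}` placeholder convention are hypothetical.

```python
MODEL_ID = "MWirelabs/ne-bert"  # repository name from the Hugging Face Hub URL


def insert_mask(template: str, mask_token: str) -> str:
    """Replace the hypothetical {mask} placeholder with the tokenizer's mask token."""
    if "{mask}" not in template:
        raise ValueError("template must contain a {mask} placeholder")
    return template.replace("{mask}", mask_token, 1)


def fill_mask(template: str, top_k: int = 5):
    """Return (token, score) pairs for the masked position.

    Downloads the checkpoint on first use, so it needs network access.
    Assumes the checkpoint works with the fill-mask pipeline.
    """
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    filler = pipeline("fill-mask", model=MODEL_ID)
    text = insert_mask(template, filler.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in filler(text, top_k=top_k)]
```

Calling `fill_mask` with a sentence in one of the supported languages, with `{mask}` in place of one word, should return the model's top candidates for that position. Reading the mask token from the tokenizer avoids hard-coding a value the announcement does not state.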
Linguistic Field(s): Applied Linguistics
                     Computational Linguistics
                     Language Acquisition
Subject Language(s): Assamese (asm)
                     Garo (grt)
                     Khasi (kha)
                     Kok Borok (trp)
                     Pnar (pbv)
Language Family(ies): Austro-Asiatic
                      Sino-Tibetan
                      Tibeto-Burman