37.830, Software: NE-OCR: Open-Source Unified Multilingual OCR for 10 Northeast Indian Languages

The LINGUIST List linguist at listserv.linguistlist.org
Mon Mar 2 16:05:02 UTC 2026


LINGUIST List: Vol-37-830. Mon Mar 02 2026. ISSN: 1069 - 4875.

Editor for this issue: Daniel Swanson <daniel at linguistlist.org>

================================================================


Date: 01-Mar-2026
From: Badal Nyalang [nyalang at mwirelabs.com]
Subject: NE-OCR: Open-Source Unified Multilingual OCR for 10 Northeast Indian Languages


We are pleased to announce the release of NE-OCR, a new unified
open-source OCR recognition model developed by MWire Labs (Shillong,
Meghalaya) specifically for the languages of Northeast India.

NE-OCR is built on the docTR ViTSTR-Base architecture (86M parameters)
and provides high-accuracy text recognition for 12 languages
(including Hindi and English) across 4 scripts:

Latin script: Khasi, Kokborok, Mizo, Garo, Nagamese, Nyishi, and English
Bengali script: Assamese, Meitei (Bengali variant), Hindi
Devanagari script: Bodo, Hindi
Meitei Mayek script: Meitei
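
For readers who want to sort ground-truth or predicted strings by the
four scripts listed above, a minimal sketch using Unicode block ranges
follows. The function names are hypothetical and not part of NE-OCR; the
Latin range is a rough approximation, and the sketch does not
distinguish languages that share a script.

```python
from collections import Counter

# Unicode block ranges for the four scripts named in the announcement.
# The Latin entry is approximate (basic letters plus Latin-1 and
# Latin Extended-A/B, which also contain a few non-letter symbols).
SCRIPT_RANGES = {
    "Devanagari": [(0x0900, 0x097F)],
    "Bengali": [(0x0980, 0x09FF)],
    "Meitei Mayek": [(0xABC0, 0xABFF), (0xAAE0, 0xAAFF)],
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}


def char_script(ch):
    """Return the script name for one character, or None if unlisted."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def dominant_script(text):
    """Return the most frequent listed script in text, or None."""
    counts = Counter(s for s in map(char_script, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else None
```

Digits, punctuation, and whitespace are ignored, so a string made up
only of such characters yields None.
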

The model is designed for word- and line-level cropped images and
supports a vocabulary of 1,056 characters covering all the above
scripts. It is released under the CC-BY-4.0 license and is freely
available on Hugging Face:
https://huggingface.co/MWirelabs/ne-ocr

A full public benchmark test set (26,000 samples) and per-language
subsets are also provided in the repository for easy evaluation and
comparison.
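
One common way to compare recognition output against a benchmark of
this kind is the character error rate (CER). The announcement does not
say which metric the authors report, so the sketch below is an
assumption, and the function names are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(pairs):
    """CER over (reference, hypothesis) pairs: total edits / total ref chars."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in pairs:
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    return total_edits / total_chars if total_chars else 0.0
```

For example, corpus_cer([("abcd", "abed")]) returns 0.25, one
substitution over four reference characters. Because lengths are counted
in Unicode code points, combining vowel signs in Bengali, Devanagari,
and Meitei Mayek text each count as a separate character.
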

NE-OCR can be used standalone or easily integrated into the popular
docTR pipeline for complete document processing (detection +
recognition). It fills a long-standing gap in open tools for
low-resource Northeast Indian languages and scripts that are often
poorly supported by general-purpose OCR systems.
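
The docTR integration described above might look like the following
sketch. It assumes that the Hugging Face repository is laid out so that
docTR's from_hub loader can read it, which the announcement does not
confirm. The detection architecture db_resnet50 and the helper names
are illustrative choices, not part of the release; the model card
remains the authoritative source for usage.

```python
def build_ne_ocr_predictor(repo_id="MWirelabs/ne-ocr", det_arch="db_resnet50"):
    """Pair a stock docTR text detector with the NE-OCR recognition model.

    Assumption: the repository can be loaded with docTR's from_hub.
    """
    # Imported lazily so that defining this helper does not require docTR.
    from doctr.models import from_hub, ocr_predictor

    reco_model = from_hub(repo_id)
    return ocr_predictor(det_arch=det_arch, reco_arch=reco_model, pretrained=True)


def read_page(predictor, image_path):
    """Run detection + recognition on one image and return plain text."""
    from doctr.io import DocumentFile

    pages = DocumentFile.from_images(image_path)
    return predictor(pages).render()
```

With docTR installed and network access available, calling
build_ne_ocr_predictor() once and then read_page(predictor, "page.png")
on a scanned page (the file name is a placeholder) would return the
recognized text as a single string.
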

The model and accompanying resources are intended to support
linguistic research, language documentation, digital archiving,
government digitization efforts, and cultural preservation work across
the Northeast.

For full details, usage examples, and the benchmark dataset, please
visit the model card on Hugging Face.

We welcome feedback, contributions, and collaborations from the
linguistics and NLP community.

MWire Labs
Shillong, Meghalaya, India

Linguistic Field(s): Applied Linguistics
                     Computational Linguistics
                     Language Acquisition
                     Typology

Subject Language(s): Garo (grt)
                     Khasi (kha)
                     Kok Borok (trp)
                     Manipuri (mni)
                     Nyishi (njz)

Language Family(ies): Austro-Asiatic
                      Indo-Aryan
                      Tibeto-Burman


