32.801, FYI: First News Text Corpus of Indian English (NTCIE)

The LINGUIST List linguist at listserv.linguistlist.org
Thu Mar 4 04:46:57 UTC 2021


LINGUIST List: Vol-32-801. Wed Mar 03 2021. ISSN: 1069 - 4875.

Subject: 32.801, FYI: First News Text Corpus of Indian English (NTCIE)

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Lauren Perkins, Nils Hjortnaes, Yiwen Zhang, Joshua Sims
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org


Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: Wed, 03 Mar 2021 23:46:28
From: Niladri Sekhar Dash [ns_dash at yahoo.com]
Subject: First News Text Corpus of Indian English (NTCIE)

 
The Linguistic Research Unit (LRU) of the Indian Statistical Institute (ISI),
Kolkata, has developed a 'News Text Corpus of Indian English (NTCIE)' from the
online edition of a widely circulated English-language newspaper published in
Kolkata, India. To date, this is the first corpus of its kind for Indian
newspaper English. The corpus contains around 10 million (1 crore) words of
running text drawn from news reports published between August and December
2015. The LRU team has processed the corpus and generated a lexical database
of 99,37,817 words, a syntax database of 4,82,532 sentences, and a list of
3,07,599 tokens after tokenization. The corpus has also been POS-tagged with
the Stanford POS Tagger (v3.6.0-2015-12-09). It has considerable practical
value for machine learning, technology development for Indian English, digital
lexicography, education, translation, language planning, discourse analysis,
and many other applications. Both the raw and the POS-annotated versions of
the corpus are available, for a fee, for commercial and academic use. The
corpus is the product of an entirely self-funded project (July 2016 to
December 2020).
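
The counts above are written with Indian digit grouping (lakh and crore).
For readers who want to work with them programmatically, the following is a
minimal sketch in Python. The function and variable names are hypothetical
and are not part of the corpus distribution. It converts the figures to
plain integers, reformats them with international grouping, and derives an
approximate average sentence length from the two reported totals.

```python
def parse_indian_number(text: str) -> int:
    """Convert a number with Indian (lakh/crore) digit grouping to an int."""
    # The commas are only visual separators, so removing them is sufficient.
    return int(text.replace(",", ""))


def format_international(n: int) -> str:
    """Format an integer with commas every three digits."""
    return f"{n:,}"


# Figures as reported in the announcement.
words = parse_indian_number("99,37,817")
sentences = parse_indian_number("4,82,532")
token_list_size = parse_indian_number("3,07,599")

# Derived value (an illustration, not a figure reported by the LRU team).
avg_words_per_sentence = words / sentences

print(format_international(words))            # 9,937,817
print(format_international(sentences))        # 482,532
print(format_international(token_list_size))  # 307,599
print(round(avg_words_per_sentence, 1))       # 20.6
```

The average of roughly 20.6 words per sentence is computed here only from
the two published totals. It assumes that both totals cover the same text.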

Those interested may contact Prof. Niladri Sekhar Dash, Head, LRU, ISI,
Kolkata.
Email: niladri at isical.ac.in 

Thank you.
 



Linguistic Field(s): Text/Corpus Linguistics

Subject Language(s): English (eng)

Language Family(ies): Indo-European





 


