[Corpora-List] New from LDC

Thu Apr 23 21:31:02 UTC 2009

LDC2009L01
An English Dictionary of the Tamil Verb, Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009L01>

LDC2009T08
Japanese Web N-gram Version 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T08>

The Linguistic Data Consortium (LDC) is pleased to announce the 
availability of two new publications.

------------------------------------------------------------------------

*New Publications*

(1) An English Dictionary of the Tamil Verb, Second Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009L01> 
represents over twenty-five years of work led by Harold F. Schiffman,
Professor Emeritus of Dravidian Linguistics and Culture at the
University of Pennsylvania's Department of South Asia Studies. It
contains translations for 6597 English verbs and defines 9716 Tamil
verbs. This release presents the dictionary in two formats: Adobe PDF
and XML. The PDF format displays the dictionary in a human-readable
form. The XML version is a purely electronic form intended mainly
for application development and the creation of searchable electronic
databases.

In the electronic XML version, each entry contains the following: the
English entry or head word; the Tamil equivalent (in Tamil script and
transliteration); the verb class and transitivity specification; the
spoken Tamil pronunciation (audio files in mp3 format); the English
definition(s); additional Tamil entries (if applicable); example
sentences or phrases in Literary Tamil and Spoken Tamil (the latter
with a corresponding audio file in mp3 format), each with an English
translation; and Tamil synonyms or near-synonyms, where appropriate.
The dictionary is expected to be useful for Tamil learners, scholars
and others interested in the Tamil language.
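
As a rough illustration of how an application might read such an entry,
here is a minimal Python sketch. Note that the element and attribute
names (entry, headword, tamil, class, audio, definition, synonym) and
all values are hypothetical placeholders invented for this example;
they are not taken from the actual LDC2009L01 schema, which should be
consulted in the release documentation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

# Hypothetical entry: tag names, attribute names and values are
# placeholders for illustration, NOT the real LDC2009L01 markup.
SAMPLE_XML = """
<entry>
  <headword>HEADWORD</headword>
  <tamil script="TAMIL-SCRIPT" translit="TRANSLITERATION"/>
  <class transitivity="TRANSITIVITY">VERB-CLASS</class>
  <audio file="PLACEHOLDER.mp3"/>
  <definition>DEFINITION</definition>
  <synonym>SYNONYM-1</synonym>
  <synonym>SYNONYM-2</synonym>
</entry>
"""


@dataclass
class Entry:
    """One dictionary entry, mirroring the fields listed above."""
    headword: str
    tamil_script: str
    transliteration: str
    verb_class: str
    transitivity: str
    audio_file: str
    definitions: List[str] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)


def parse_entry(xml_text: str) -> Entry:
    """Parse a single (hypothetical) <entry> element into an Entry."""
    root = ET.fromstring(xml_text.strip())
    tamil = root.find("tamil")
    verb_class = root.find("class")
    audio = root.find("audio")
    return Entry(
        headword=root.findtext("headword", default=""),
        tamil_script=tamil.get("script", "") if tamil is not None else "",
        transliteration=tamil.get("translit", "") if tamil is not None else "",
        verb_class=(verb_class.text or "") if verb_class is not None else "",
        transitivity=verb_class.get("transitivity", "") if verb_class is not None else "",
        audio_file=audio.get("file", "") if audio is not None else "",
        definitions=[d.text or "" for d in root.findall("definition")],
        synonyms=[s.text or "" for s in root.findall("synonym")],
    )


if __name__ == "__main__":
    print(parse_entry(SAMPLE_XML))
```

Collecting synonyms into a flat list reflects the kind of normalized
presentation a searchable database would likely want, but the real
structure may differ.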

What's New in the Second Edition?

    * Errors in the Tamil text and the roman transliteration have been
      corrected.
    * Audio files have been updated and corrected and missing files have
      been added.
    * A brand new search and browse application that can access the
      audio has been included in this edition. This application can be
      accessed from the tools directory.
    * The XML structure has been modified to normalize the presentation
      of synonyms.

An English Dictionary of the Tamil Verb seeks to meet needs not
currently addressed by existing English-Tamil dictionaries. Its main
goal is to get an English-knowing user to a Tamil verb, whether he or
she begins with an English verb or with some other item, such as an
adjective, because what is a verb in Tamil may not be a verb in
English, and vice versa. The dictionary also concentrates on supplying
the kinds of information lacking in all previous attempts to capture
the equivalencies between English and Tamil.


(2) Japanese Web N-gram Version 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T08> 
was created by Google Inc. It consists of Japanese "word" n-grams and 
their observed frequency counts generated from over 255 billion tokens 
of text. The length of the n-grams ranges from unigrams to seven-grams.

The n-grams were extracted from publicly accessible web pages that were
crawled by Google in July 2007. This data set contains only n-grams that
appear at least 20 times in the processed sentences; less frequent
n-grams were discarded. Web pages requiring user authentication, pages
containing "noarchive" or "noindex" meta tags, and pages under other
special restrictions were excluded from the final release. While the
aim was to process only Japanese pages, the corpus may contain some
pages in other languages due to language detection errors. This data
set will be useful for research in areas such as statistical machine
translation, language modeling and speech recognition, among others.
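
The counting and cutoff described above can be sketched in Python as
follows. This is only an illustration under stated assumptions, not
Google's actual pipeline: the function names count_ngrams and prune are
made up for this example, and the input is assumed to be sentences that
have already been segmented into "words".

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

NGram = Tuple[str, ...]


def count_ngrams(sentences: Iterable[List[str]], max_order: int = 7) -> Counter:
    """Count every n-gram of length 1..max_order within each sentence."""
    counts: Counter = Counter()
    for words in sentences:
        for n in range(1, max_order + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def prune(counts: Dict[NGram, int], min_count: int = 20) -> Dict[NGram, int]:
    """Keep only n-grams observed at least min_count times."""
    return {ng: c for ng, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    # Toy input with a lowered threshold so that something survives.
    toy = [["a", "b", "c"], ["a", "b"], ["a"]]
    for ngram, count in sorted(prune(count_ngrams(toy), min_count=2).items()):
        print(" ".join(ngram), count)
```

At web scale the counts would not fit in a single in-memory Counter, so
a real system would presumably distribute this work; the sketch only
shows the logic of the cutoff.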

Before the n-grams were collected, the web pages were converted into 
UTF-8 encoding, normalized into Unicode Normalization Form KC, and split 
into sentences. Ill-formed sentences were filtered out, and the 
remaining sentences were segmented into "words".  The vocabulary was 
restricted to "words" that appeared at least 50 times in the processed 
sentences. Less frequent words were replaced with the "<UNK>" special token.
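
The normalization and vocabulary restriction described above can be
sketched in Python as follows. This is a hedged illustration only: the
function names are invented for this example, the word segmenter used
for the release is not specified here and is therefore left out (the
input to build_vocab is assumed to be already segmented), and sentence
splitting and filtering are omitted.

```python
import unicodedata
from collections import Counter
from typing import Iterable, List, Set

UNK = "<UNK>"


def normalize(text: str) -> str:
    """Apply Unicode Normalization Form KC to already-decoded text."""
    return unicodedata.normalize("NFKC", text)


def build_vocab(sentences: Iterable[List[str]], min_count: int = 50) -> Set[str]:
    """Return the set of words seen at least min_count times."""
    freq = Counter(word for words in sentences for word in words)
    return {word for word, count in freq.items() if count >= min_count}


def replace_rare(sentences: Iterable[List[str]], vocab: Set[str]) -> List[List[str]]:
    """Replace every out-of-vocabulary word with the <UNK> token."""
    return [[w if w in vocab else UNK for w in words] for words in sentences]


if __name__ == "__main__":
    # NFKC folds full-width letters and digits to their ASCII forms.
    print(normalize("ＡＢＣ１２３"))
    # Toy segmented input with a lowered threshold.
    toy = [["x", "y"], ["x", "z"], ["x"]]
    vocab = build_vocab(toy, min_count=2)
    print(replace_rare(toy, vocab))
```

Because rare words are mapped to "<UNK>" before counting, n-grams in
the release can contain that token in place of an infrequent word.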

------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora