[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Thu Apr 23 21:31:02 UTC 2009
LDC2009L01 - An English Dictionary of the Tamil Verb, Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009L01>
LDC2009T08 - Japanese Web N-gram Version 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T08>
The Linguistic Data Consortium (LDC) is pleased to announce the
availability of two new publications.
------------------------------------------------------------------------
*New Publications*
(1) An English Dictionary of the Tamil Verb, Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009L01>
represents over twenty-five years of work led by Harold F. Schiffman,
Professor Emeritus of Dravidian Linguistics and Culture at the
University of Pennsylvania's Department of South Asia Studies. It
contains translations for 6597 English verbs and defines 9716 Tamil
verbs. This release presents the dictionary in two formats: Adobe PDF
and XML. The PDF version displays the dictionary in human-readable
form. The XML version is machine-readable and is intended mainly
for application development and the creation of searchable electronic
databases.
In the electronic XML version each entry contains the following: the
English entry or head word; the Tamil equivalent (in Tamil script and
transliteration); the verb class and transitivity specification; the
spoken Tamil pronunciation (audio files in mp3 format); the English
definition(s); additional Tamil entries (if applicable); example
sentences or phrases in Literary Tamil, Spoken Tamil (with a
corresponding audio file in .mp3 format) and an English translation; and
Tamil synonyms or near-synonyms, where appropriate. It is expected that
the dictionary will be useful for Tamil learners, scholars and others
interested in the Tamil language.
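
As a rough illustration of how an application might read one such entry,
here is a minimal Python sketch. The tag names, attribute names and
placeholder values are all hypothetical: the announcement lists the kinds
of fields an entry contains but does not give the actual XML schema, so
the release documentation should be consulted for the real element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry. Tag names, attributes and values are illustrative
# placeholders only; they are NOT taken from the LDC2009L01 schema.
SAMPLE_ENTRY = """
<entry>
  <headword>PLACEHOLDER_ENGLISH_VERB</headword>
  <tamil>
    <script>PLACEHOLDER_TAMIL_SCRIPT</script>
    <translit>PLACEHOLDER_TRANSLITERATION</translit>
  </tamil>
  <verbclass transitivity="PLACEHOLDER_TRANSITIVITY">PLACEHOLDER_CLASS</verbclass>
  <audio file="placeholder.mp3"/>
  <definition>PLACEHOLDER_DEFINITION_1</definition>
  <definition>PLACEHOLDER_DEFINITION_2</definition>
</entry>
"""


def parse_entry(xml_text):
    """Turn one (hypothetical) dictionary entry into a plain dict."""
    root = ET.fromstring(xml_text.strip())
    return {
        "headword": root.findtext("headword"),
        "tamil_script": root.findtext("tamil/script"),
        "transliteration": root.findtext("tamil/translit"),
        "verb_class": root.findtext("verbclass"),
        "transitivity": root.find("verbclass").get("transitivity"),
        "audio_file": root.find("audio").get("file"),
        # An entry may carry more than one English definition.
        "definitions": [d.text for d in root.findall("definition")],
    }


entry = parse_entry(SAMPLE_ENTRY)
```

A dict per entry like this could then be loaded into whatever searchable
store the application uses.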
What's New in the Second Edition?
* Errors in the Tamil text and the roman transliteration have been
corrected.
* Audio files have been updated and corrected and missing files have
been added.
* A brand new search and browse application that can access the
audio has been included in this edition. This application can be
accessed from the tools directory.
* The XML structure has been modified to normalize the presentation
of synonyms.
An English Dictionary of the Tamil Verb seeks to meet needs not
currently addressed by existing English-Tamil dictionaries. The main
goal of this dictionary is to get an English-knowing user to a Tamil
verb, irrespective of whether he or she begins with an English verb or
some other item, such as an adjective; this is because what may be a
verb in Tamil may in fact not be a verb in English, and vice versa. The
dictionary also concentrates specifically on supplying the kinds of
information lacking in all previous attempts to capture the
equivalencies between English and Tamil.
(2) Japanese Web N-gram Version 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T08>
was created by Google Inc. It consists of Japanese "word" n-grams and
their observed frequency counts generated from over 255 billion tokens
of text. The length of the n-grams ranges from unigrams to seven-grams.
The n-grams were extracted from publicly accessible web pages that were
crawled by Google in July 2007. This data set contains only n-grams that
appear at least 20 times in the processed sentences. Less frequent
n-grams were discarded. Web pages requiring user
authentication, pages containing "noarchive" or "noindex" meta tags, and
pages under other special restrictions were excluded from the final
release. While the aim was to process only Japanese pages, the corpus
may contain some pages in other languages due to language detection
errors. This dataset will be useful for research in areas such as
statistical machine translation, language modeling and speech
recognition, among others.
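
The counting and thresholding described above can be sketched in a few
lines of Python. This is only an illustration of the idea, not Google's
actual pipeline: the function name and the in-memory Counter are
assumptions, and a corpus of this size would need a distributed
implementation. The defaults (n-grams up to length 7, minimum count of
20) are the figures given in the announcement.

```python
from collections import Counter


def count_ngrams(sentences, max_n=7, min_count=20):
    """Count n-grams of length 1..max_n over tokenized sentences and
    keep only those observed at least min_count times."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Less frequent n-grams are discarded.
    return {ngram: c for ngram, c in counts.items() if c >= min_count}
```

Each element of `sentences` is expected to be a list of already
segmented "words"; n-grams do not cross sentence boundaries in this
sketch.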
Before the n-grams were collected, the web pages were converted into
UTF-8 encoding, normalized into Unicode Normalization Form KC, and split
into sentences. Ill-formed sentences were filtered out, and the
remaining sentences were segmented into "words". The vocabulary was
restricted to "words" that appeared at least 50 times in the processed
sentences. Less frequent words were replaced with the "<UNK>" special token.
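
Two of these preprocessing steps, NFKC normalization and vocabulary
restriction, can be sketched with the Python standard library. The
function names are illustrative assumptions. Sentence splitting,
filtering of ill-formed sentences and word segmentation are not shown,
since the announcement does not say which tools were used for them. The
threshold of 50 and the "<UNK>" token are taken from the description
above.

```python
import unicodedata
from collections import Counter

UNK = "<UNK>"


def normalize(text):
    """Normalize text into Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", text)


def restrict_vocabulary(sentences, min_count=50):
    """Replace every token seen fewer than min_count times with UNK.

    `sentences` is a list of token lists (already segmented "words").
    """
    freq = Counter(tok for sent in sentences for tok in sent)
    return [
        [tok if freq[tok] >= min_count else UNK for tok in sent]
        for sent in sentences
    ]
```

NFKC folds compatibility variants together, so for example full-width
Latin letters and half-width katakana are mapped to their standard
forms before counting.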
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu