[Corpora-List] New from the LDC

Wed Apr 30 17:35:47 UTC 2008

LDC2008L01
An English Dictionary of the Tamil Verb
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008L01>

LDC2008T06
GALE Phase 1 Chinese Blog Parallel Text
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T06>

The Linguistic Data Consortium (LDC) would like to announce the 
availability of two new publications.

------------------------------------------------------------------------

*New Publications*

(1) An English Dictionary of the Tamil Verb 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008L01> 
represents over twenty-five years of work led by Harold F. Schiffman, 
Professor Emeritus of Dravidian Linguistics and Culture at the 
University of Pennsylvania's Department of South Asia Studies. It 
contains translations for 6597 English verbs and defines 9716 Tamil 
verbs. This release presents the dictionary in two formats: Adobe PDF 
and XML. The PDF format displays the dictionary in a human-readable form 
and is suitable for printing. The XML version is a purely electronic 
form intended mainly for application development and the creation of 
searchable electronic databases.

In the XML version, each entry contains the following: the 
English entry or head word; the Tamil equivalent (in Tamil script and 
transliteration); the verb class and transitivity specification; the 
spoken Tamil pronunciation (audio files in .mp3 format); the English 
definition(s); additional Tamil entries (if applicable); example 
sentences or phrases in Literary Tamil and Spoken Tamil (the latter with a 
corresponding audio file in .mp3 format), each with an English translation; and 
Tamil synonyms or near-synonyms, where appropriate. It is expected that 
the dictionary will be useful for Tamil learners, scholars and others 
interested in the Tamil language.
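
For readers planning to build on the XML version, the following is a minimal 
Python sketch of reading such entries into dictionaries. Every element and 
attribute name in it (entry, headword, tamil, verb_class and so on) and every 
sample value is an assumption made for illustration; the actual markup of the 
release may differ and should be checked against its documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry layout. The tag names, the "transitivity" attribute and
# the placeholder values are illustrative assumptions, not the LDC schema.
SAMPLE_XML = """\
<dictionary>
  <entry>
    <headword>jump</headword>
    <tamil>
      <script>TAMIL-SCRIPT-PLACEHOLDER</script>
      <transliteration>TRANSLIT-PLACEHOLDER</transliteration>
    </tamil>
    <verb_class transitivity="intransitive">CLASS-PLACEHOLDER</verb_class>
    <audio>placeholder.mp3</audio>
    <definition>to spring off the ground</definition>
    <synonym>leap</synonym>
    <synonym>pounce</synonym>
  </entry>
</dictionary>
"""


def load_entries(xml_text):
    """Parse the XML text and return one plain dict per <entry> element."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.iter("entry"):
        verb_class = node.find("verb_class")
        entries.append({
            "headword": node.findtext("headword", default=""),
            "script": node.findtext("tamil/script", default=""),
            "transliteration": node.findtext("tamil/transliteration", default=""),
            # The attribute is read only if the element is present.
            "transitivity": verb_class.get("transitivity", "")
            if verb_class is not None else "",
            "audio": node.findtext("audio", default=""),
            "definitions": [d.text for d in node.findall("definition")],
            "synonyms": [s.text for s in node.findall("synonym")],
        })
    return entries


entries = load_entries(SAMPLE_XML)
print(entries[0]["headword"], entries[0]["synonyms"])
```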

An English Dictionary of the Tamil Verb seeks to meet needs not 
currently addressed by existing English-Tamil dictionaries. The main 
goal of this dictionary is to get an English-knowing user to a Tamil 
verb, irrespective of whether he or she begins with an English verb or 
some other item, such as an adjective; this is because what may be a 
verb in Tamil may in fact not be a verb in English, and vice versa. 
Since the number of English entries is limited (slightly fewer than 
10,000), there may not be main entries for certain low-frequency items 
such as 'pounce'. That item does, however, appear as a synonym for 
'jump, leap' and some other verbs, so searching for 'pounce' will get 
the user to a Tamil verb via the synonym field. The main goal is 
therefore to concentrate specifically on supplying the kinds of 
information lacking in all previous attempts to capture the 
equivalencies between English and Tamil.
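
The synonym-field lookup described above can be sketched roughly as follows. 
This is an illustration only: the two entries, their field names and the 
placeholder Tamil verbs are invented for the example and are not taken from 
the dictionary.

```python
from collections import defaultdict

# Invented miniature data; "tamil" holds placeholders, not real dictionary content.
ENTRIES = [
    {"headword": "jump", "synonyms": ["leap", "pounce"], "tamil": "TAMIL-VERB-A"},
    {"headword": "seize", "synonyms": ["grab", "pounce"], "tamil": "TAMIL-VERB-B"},
]


def build_index(entries):
    """Map every head word and every synonym to the Tamil verbs it leads to."""
    index = defaultdict(list)
    for entry in entries:
        for key in [entry["headword"], *entry["synonyms"]]:
            bucket = index[key.lower()]
            if entry["tamil"] not in bucket:
                bucket.append(entry["tamil"])
    return index


def lookup(index, english_word):
    """Return the Tamil verbs reachable from an English word, or an empty list."""
    return index.get(english_word.strip().lower(), [])


index = build_index(ENTRIES)
# 'pounce' has no main entry here, but is found through the synonym field.
print(lookup(index, "pounce"))
```

Indexing head words and synonyms in the same table means the user does not 
need to know whether a given English item has a main entry of its own.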

(2) GALE Phase 1 Chinese Blog Parallel Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T06> 
was prepared by the LDC and consists of 313K characters (277 files) of 
Chinese blog text and its translation selected from eight sources. This 
release was used as training data in Phase 1 of the DARPA-funded GALE 
program.

The task of preparing this corpus involved four stages of work: data 
scouting, data harvesting, formatting, and data selection.

Data scouting involved manually searching the web for suitable blog 
text. Data scouts were assigned particular topics and genres along with 
a production target in order to focus their web search. Formal 
annotation guidelines and a customized annotation toolkit helped data 
scouts to manage the search process and to track progress.

Data scouts logged their decisions about potential text of interest 
(sites, threads and posts) to a database. A nightly process queried the 
annotation database and harvested all designated URLs. Whenever 
possible, the entire site was downloaded, not just the individual thread 
or post located by the data scout.
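
As a rough illustration of the nightly step, the sketch below reads designated 
URLs from a database and reduces each one to the root of its site, so that the 
whole site rather than the single post becomes the download target. The table 
name scouting_log, its columns and the example URLs are assumptions made for 
this sketch; the announcement does not describe the real database or the 
download tool, and no downloading is done here.

```python
import sqlite3
from urllib.parse import urlsplit


def site_roots(conn):
    """Return the distinct site roots of all URLs flagged as designated."""
    rows = conn.execute(
        "SELECT url FROM scouting_log WHERE designated = 1 ORDER BY url"
    )
    roots = []
    for (url,) in rows:
        parts = urlsplit(url)
        root = f"{parts.scheme}://{parts.netloc}/"
        if root not in roots:
            roots.append(root)
    return roots


# Hypothetical in-memory stand-in for the annotation database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scouting_log (url TEXT, designated INTEGER)")
conn.executemany(
    "INSERT INTO scouting_log VALUES (?, ?)",
    [
        ("http://blog.example.com/thread/12", 1),
        ("http://blog.example.com/post/7", 1),
        ("http://other.example.org/post/3", 0),
    ],
)

targets = site_roots(conn)
print(targets)
```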

Once the text was downloaded, its format was standardized so that the 
data could be more easily integrated into downstream annotation 
processes. Typically a new script was required for each new domain name 
that was identified. After scripts were run, an optional manual process 
corrected any remaining formatting problems.
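
One way to organize a separate formatting script per domain name is a small 
registry keyed by domain, sketched below. The decorator, the function names, 
the example domain and the tag-stripping rule are all hypothetical; the actual 
LDC scripts and their target format are not described in this announcement.

```python
import re
from urllib.parse import urlsplit

# Maps a domain name to the function that standardizes text from that domain.
FORMATTERS = {}


def formatter_for(domain):
    """Decorator that registers a formatting function for one domain."""
    def register(func):
        FORMATTERS[domain] = func
        return func
    return register


@formatter_for("blog.example.com")
def format_example_blog(raw):
    """Illustrative rule only: drop markup tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", text).strip()


def standardize(url, raw):
    """Pick the formatter by domain; fail loudly when a new script is needed."""
    domain = urlsplit(url).netloc
    try:
        formatter = FORMATTERS[domain]
    except KeyError:
        raise LookupError(f"no formatting script registered for {domain}") from None
    return formatter(raw)


print(standardize("http://blog.example.com/post/7", "<p>Hello   <b>world</b></p>"))
```

Raising an error for an unregistered domain mirrors the point made above that 
each newly identified domain name typically needed its own script.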

The selected documents were then reviewed for content suitability using 
a semi-automatic process. A statistical approach was used to rank a 
document's relevance to a set of already-selected documents labeled as 
"good." An annotator then reviewed the list of relevance-ranked 
documents and selected those which were suitable for a particular 
annotation task or for annotation in general.
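
The announcement does not say which statistical approach was used. Purely as 
an illustration of relevance ranking against a "good" set, the sketch below 
scores each candidate by the cosine similarity between its character-bigram 
counts and the pooled bigram counts of the good documents. Character bigrams 
are chosen here only because they need no word segmentation; this is an 
assumed stand-in, not the LDC method.

```python
import math
from collections import Counter


def bigrams(text):
    """Count adjacent character pairs, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(u, v):
    """Cosine similarity of two Counters; 0.0 if either one is empty."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def rank_by_relevance(candidates, good_docs):
    """Return (score, document) pairs, most similar to the good set first."""
    profile = Counter()
    for doc in good_docs:
        profile.update(bigrams(doc))
    scored = [(cosine(bigrams(doc), profile), doc) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy strings standing in for documents.
ranked = rank_by_relevance(["xyz", "abcd"], ["abcd abcd"])
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

The ranked list would then go to an annotator, who makes the final selection 
as described above.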

Manual sentence unit/segment (SU) annotation was also performed on a 
subset of files following LDC's Quick Rich Transcription specification. 
Three types of end-of-sentence SU were identified: statement SU, 
question SU, and incomplete SU.

After files were selected, they were reformatted into a human-readable 
translation format, and the files were then assigned to professional 
translators for careful translation. Translators followed LDC's GALE 
Translation guidelines, which describe the makeup of the translation 
team, the source data format, the translation data format, best 
practices for translating certain linguistic features (such as names and 
speech disfluencies), and quality control procedures applied to 
completed translations.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
