[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Thu Oct 25 19:29:03 UTC 2012


*-  Fall 2012 LDC Data Scholarship Recipients  -*

*-  Language Resource Wiki  -*

/New publications:/

*-  GALE Chinese-English Word Alignment and Tagging Training Part 2 -- 
Newswire  -*

*-  GALE Phase 2 Arabic Broadcast News Parallel Text  -*

------------------------------------------------------------------------

*Fall 2012 LDC Data Scholarship Recipients*

LDC is pleased to announce the student recipients of the Fall 2012 LDC 
Data Scholarship program!  This program provides university and college 
students with access to LDC data at no cost. Students were asked to 
complete an application consisting of a proposal describing their 
intended use of the data and a letter of support from their 
thesis adviser. We received many solid applications and have chosen six 
proposals to support. The following students received no-cost copies 
of LDC data:

    Jaffar Atwan - National University of Malaysia (Malaysia), PhD
    candidate, Information Science and Technology.  Jaffar has been
    awarded a copy of Arabic Newswire Part 1 (LDC2001T55) for his work
    in information retrieval.

    Sarath Chandar - Indian Institute of Technology, Madras (India), MS
    candidate, Computer Science and Engineering. Sarath has been awarded
    a copy of Treebank-3 (LDC99T42) for his work in grammar induction.

    Kuruvachan K. George - Amrita Vishwa Vidyapeetham (India), PhD
    candidate, Electrical and Computer Engineering.  Kuruvachan has been
    awarded a copy of Fisher English Part 2 (LDC2005S13/T19) and 2008
    NIST Speaker Recognition Evaluation data (LDC2011S05/07/08/11) for
    his work in speaker recognition.

    Eduardo Motta - Pontifícia Universidade Católica do Rio de Janeiro
    (Brazil), PhD candidate, Information Sciences.  Eduardo has been
    awarded a copy of English Web Treebank (LDC2012T13) for his work in
    machine learning.

    Genevieve Sapijaszko - University of Central Florida (USA), PhD
    candidate, Electrical and Computer Engineering.  Genevieve has been
    awarded a copy of TIMIT Acoustic-Phonetic Continuous Speech Corpus
    (LDC93S1) and YOHO Speaker Verification (LDC94S16) for her work in
    digital signal processing.

    John Steinberg - Temple University (USA), MS candidate, Electrical
    and Computer Engineering.  John has been awarded a copy of CALLHOME
    Mandarin Chinese Lexicon (LDC96L15) and CALLHOME Mandarin Chinese
    Transcripts (LDC96T16) for his work in speech recognition.


*Language Resource Wiki*

The Language Resource Wiki <http://lrwiki.ldc.upenn.edu/> catalogs data, 
software, descriptive grammars and other resources for a variety of 
languages but especially those with a paucity of generally available 
resources for research. LDC is actively seeking editors knowledgeable in 
these and other languages to develop and maintain the pages, which are 
readable by anyone but writable only by editors. The wiki currently has 
resource listings for: Bengali, Berber, Breton, Ewe, Greek (Ancient), 
Indonesian, Hindi, Latin, Panjabi, Pashto, Sorani (Central Kurdish), 
Russian, Tagalog, Tamil, and Urdu, and for the following Sign Languages: 
American, British, Catalan, Dutch, Flemish, German, Japanese, New 
Zealand, Polish, Spanish, and Swiss German.

*New publications*

(1) GALE Chinese-English Word Alignment and Tagging Training Part 2 -- 
Newswire 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T18> 
was developed by LDC and contains 169,080 tokens of word-aligned Chinese 
and English parallel text enriched with linguistic tags. This material 
was used as training data in the DARPA GALE 
<http://projects.ldc.upenn.edu/gale/index.html> (Global Autonomous 
Language Exploitation) program.

Some approaches to statistical machine translation incorporate 
linguistic knowledge into word-aligned text as a means to improve 
automatic word alignment and machine translation quality. This is 
accomplished with two annotation schemes: alignment and tagging. 
Alignment identifies minimum translation units and translation relations 
by using minimum-match and attachment annotation approaches. The tagging 
scheme defines a set of word tags and alignment link tags to describe 
these translation units and relations. Tagging adds contextual, 
syntactic and language-specific features to the alignment annotation.

    The Chinese word alignment tasks consisted of the following components:

    Identifying, aligning, and tagging 8 different types of links

    Identifying, attaching, and tagging local-level unmatched words

    Identifying and tagging sentence/discourse-level unmatched words

    Identifying and tagging all instances of Chinese 的 (DE) except when
    they were a part of a semantic link.



(2) GALE Phase 2 Arabic Broadcast News Parallel Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T18> 
was developed by LDC. Along with other corpora, the parallel text in 
this release comprised training data for Phase 2 of the DARPA GALE 
(Global Autonomous Language Exploitation) Program. This corpus contains 
Modern Standard Arabic source text and corresponding English 
translations selected from broadcast news (BN) data collected by LDC 
between 2005 and 2007 and transcribed by LDC or under its direction.

GALE Phase 2 Arabic Broadcast News Parallel Text includes seven 
source-translation pairs, comprising 29,210 words of Arabic source text 
and its English translation. Data is drawn from six distinct Arabic 
programs broadcast between 2005 and 2007 from Abu Dhabi TV, based in Abu 
Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; 
Aljazeera, a regional broadcast programmer based in Doha, Qatar; Dubai 
TV, based in Dubai, United Arab Emirates; and Kuwait TV, a national 
television station based in Kuwait. The BN programming in this release 
focuses on current events topics.

The files in this release were transcribed by LDC staff and/or 
transcription vendors under contract to LDC in accordance with the Quick 
Rich Transcription 
<http://projects.ldc.upenn.edu/gale/Transcription/Arabic-XTransQRTR.V3.pdf> 
guidelines developed by LDC. Transcribers indicated sentence boundaries 
in addition to transcribing the text. Data was manually selected for 
translation according to several criteria, including linguistic 
features, transcription features and topic features. The transcribed and 
segmented files were then reformatted into a human-readable translation 
format and assigned to translation vendors. Translators followed LDC's 
Arabic-to-English translation guidelines. Bilingual LDC staff performed 
quality control on the completed translations.

------------------------------------------------------------------------


-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
