[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Dec 22 21:00:27 UTC 2010


/In this newsletter:/

*- Spring 2011 LDC Data Scholarship Program -*

/New publications:/

LDC2010T24
*- Indian Language Part-of-Speech Tagset: Hindi -*

LDC2010T22
*- Manually Annotated Sub-Corpus First Release -*

LDC2010T23
*- NIST 2009 Open Machine Translation (OpenMT) Evaluation -*

------------------------------------------------------------------------


*Spring 2011 LDC Data Scholarship Program*

Applications are now being accepted through January 31, 2011 for the 
Spring 2011 LDC Data Scholarship program!  The LDC Data Scholarship 
program provides university students with access to LDC data at no 
cost.  LDC offered data scholarships for the first time earlier this 
year.  We received many strong applications from students with a range 
of research interests.  Our student winners received no-cost copies of 
LDC data valued at over US$10,000.

This program is open to students pursuing both undergraduate and 
graduate studies in an accredited college or university. LDC Data 
Scholarships are not restricted to any particular field of study; 
however, students must demonstrate a well-developed research agenda and 
a bona fide inability to pay.

The application consists of two parts:

    (1) /*Data Use Proposal*/. Applicants must submit a proposal
    describing their intended use of the data. The proposal must contain
    the applicant's name, university, and field of study. The proposal
    should state which data the student plans to use and contain a
    description of their research project.  Students are advised to
    consult the LDC Corpus Catalog
    <http://www.ldc.upenn.edu/Catalog/index.jsp> for a complete list of
    data distributed by LDC. Note that a handful of LDC corpora are
    available only to members of the Consortium.

    (2) /*Letter of Support*/. Applicants must submit one letter of
    support from their thesis adviser or department chair. The letter
    must confirm that the department or university lacks the funding to
    pay the full Non-member Fee for the data and verify the student's
    need for data.

For further information on application materials and program rules, 
please visit the LDC Data Scholarship 
<http://www.ldc.upenn.edu/About/scholarships.html> page.

Students can email their applications to the LDC Data Scholarship 
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent 
by email from the same address.

The deadline for the Spring 2011 program cycle is January 31, 2011.


*New Publications*

(1) Indian Language Part-of-Speech Tagset: Hindi 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T24> 
is a corpus developed by Microsoft Research (MSR) India to support the 
task of Part-of-Speech Tagging (POS) and other data-driven linguistic 
research on Indian Languages in general. It was created as part of the 
Indian Language Part-of-Speech Tagset (IL-POST) 
<http://research.microsoft.com/en-us/groups/mls/default.aspx> project, a 
collaborative effort among linguists and computer scientists from MSR 
India, AU-KBC (Anna University, Chennai), Delhi University, IIT Bombay, 
Jawaharlal Nehru University (Delhi) and Tamil University (Tamilnadu).

The goal of the IL-POST project is to provide a common tagset framework 
for Indian Languages that offers flexibility, cross-linguistic 
compatibility and reusability across those languages. It supports a 
three-level hierarchy of Categories, Types and Attributes. The corpus 
therefore consists mainly of two levels of information for each lexical 
token: (a) lexical Category and Types, and (b) a set of morphological 
attributes and their associated values in context.

This corpus contains 4,859 sentences (98,450 words) of manually 
annotated Hindi text randomly collected from the Microsoft Hindi 
Research Corpus, sourced from the publisher WebDunia 
<http://www.webdunia.com/>. All annotated data is provided in both XML 
and text files. The XML files are contained in the "XML_files" folder 
and the text files in the "text_files" folder. Each data file contains 
between 900 and 5,000 words. The XML files contain metadata about the 
material, such as language, encoding and data size.

The Annotation Guidelines for Hindi, included in this release, contain a 
detailed description of the annotation methodology. The Annotation Tool 
Guideline 1.0, also included in this publication, describes the 
annotation interface developed for the IL-POST framework; the tool is 
not included in this corpus.

Non-members may license this data by submitting a completed copy of the 
Microsoft Research India License Agreement 
<http://www.ldc.upenn.edu/Catalog/nonmem_agree/Indian_Language_POS_Tagset_Hindi_License_Agreement.htm>. 
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to 
this address.  This data is available at no charge.

  *

(2) Manually Annotated Sub-Corpus First Release (MASC I) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T22> 
is the first of three releases of 500,000 words of MASC data developed 
as part of the American National Corpus 
<http://www.americannationalcorpus.org/> (ANC) project. MASC I consists 
of approximately 80,000 words of contemporary spoken and written 
American English annotated for a variety of linguistic phenomena. The 
MASC <http://www.americannationalcorpus.org/MASC/Home.html> project is 
sponsored by the National Science Foundation and was established to 
address, to the extent possible, many of the obstacles to the creation 
of large-scale, robust, multiply-annotated corpora of English covering a 
wide range of genres of written and spoken language data. Researchers 
from Vassar College, Columbia University and the International Computer 
Science Institute, University of California at Berkeley are the principal 
participants; the WordNet <http://wordnet.princeton.edu/> project 
provides consulting.

The source texts in MASC I are drawn from the open portion of the 
American National Corpus (ANC) Second Release LDC2005T35 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35>, which 
includes written texts and spoken transcripts of American English from a 
broad range of genres produced since 1990; and from the Language 
Understanding Annotation Corpus (LU Corpus) LDC2009T10 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T10>, 
a collection of various genres including broadcast, newswire, 
email and telephone speech annotated for committed belief, event and 
entity coreference, dialog acts and temporal relations. All of the words 
of data in MASC I have validated annotations for token, part of speech, 
sentence boundary, noun chunks, verb chunks, named entities and Penn 
Treebank <http://www.cis.upenn.edu/%7Etreebank/> syntax. Full-text 
FrameNet <http://framenet.icsi.berkeley.edu/> annotations are available 
for seventeen texts and WordNet word sense annotations are available for 
1,000 occurrences of each of fifty-three words. Annotations of all or 
portions of the sub-corpus for a wide variety of other linguistic 
phenomena have been contributed by other projects. Software and services 
available from the ANC project website 
<http://www.anc.org/MASC/Home.html> enable transduction of MASC into a 
wide variety of physical formats.

The MASC directory contains two folders: "masc-1.0.3" and 
"masc_wordsense". masc-1.0.3 contains the actual MASC corpus and 
consists of two folders, "spoken" and "written". The spoken folder 
contains data and annotations for spoken material, and the written 
folder contains the same for written texts. The files in each of the 
respective folders have naming conventions that describe the contents of 
the file.  masc_wordsense contains the MASC sentence samples with word 
sense annotations using WordNet sense numbers as the annotation values.
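
For readers who script against this layout, the following is a minimal 
Python sketch of how the folders described above might be traversed. 
The function name count_masc_files and the root path passed to it are 
assumptions made for illustration; only the folder names "masc-1.0.3", 
"spoken" and "written" come from the description above.

```python
from pathlib import Path


def count_masc_files(masc_root):
    """Count regular files under the "spoken" and "written" folders.

    masc_root is the directory that holds "masc-1.0.3" (an assumed
    location; adjust it to wherever the corpus was unpacked).
    """
    corpus_dir = Path(masc_root) / "masc-1.0.3"
    counts = {}
    for folder in ("spoken", "written"):
        folder_path = corpus_dir / folder
        if folder_path.is_dir():
            # rglob("*") also walks subfolders, in case files are nested.
            counts[folder] = sum(1 for p in folder_path.rglob("*") if p.is_file())
        else:
            counts[folder] = 0
    return counts
```

Calling count_masc_files with the unpacked MASC directory returns a 
dictionary with one file count per folder.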

Non-members may request this data by completing a copy of the LDC User 
Agreement for Non-Members 
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>. The 
agreement can be faxed to +1 215 573 2175 or scanned and emailed to this 
address. This data is available at no charge.


*

(3) NIST 2009 Open Machine Translation (OpenMT) Evaluation 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T23> 
is a package containing source data, reference translations and scoring 
software used in the NIST 2009 OpenMT evaluation. It is designed to help 
evaluate the effectiveness of machine translation systems. The package 
was compiled and scoring software was developed by researchers at NIST, 
making use of broadcast, newswire and web data and reference 
translations collected and developed by LDC. The 2009 task was to 
evaluate translation from Arabic to English and Urdu to English.

This release contains 373 documents with corresponding sets of four 
separate human expert reference translations. The source data is 
comprised of Arabic and Urdu broadcast, newswire and weblog data 
collected by LDC in 2007 and 2009. The newswire and broadcast material 
are from Asharq Al-Awsat (Arabic), Agence France-Presse (Arabic), 
Al-Ahram (Arabic), Al Hayat (Arabic), Assabah (Arabic), An Nahar 
(Arabic), Al-Quds Al-Arabi (Arabic), Xinhua News Agency (Arabic), 
British Broadcasting Corporation (Urdu), Deutsche Welle (Urdu), Mehr 
News Agency (Urdu) and Voice of America (Urdu).

For each language, the test set consists of two files: a source file 
and a reference file. Each reference file contains four independent 
reference translations of the data set unless noted otherwise in the 
accompanying README.txt. The file name reflects the evaluation year, 
source language, test set (which, by default, is "evalset"), version of 
the data, and whether the file is a source or a reference file (the 
latter being indicated by "-ref").
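
As a rough illustration of the "-ref" naming convention, here is a 
minimal Python sketch that separates reference files from source files. 
The function name and the example file names are hypothetical and are 
not taken from the release; the only detail drawn from the description 
above is that reference files carry "-ref" in their names.

```python
from pathlib import PurePath


def split_source_and_reference(file_names):
    """Separate reference files from source files by the "-ref" marker."""
    sources, references = [], []
    for name in file_names:
        # Check only the stem so the file extension cannot interfere.
        stem = PurePath(name).stem
        if "-ref" in stem:
            references.append(name)
        else:
            sources.append(name)
    return sources, references


# Hypothetical file names, made up for this example only.
example_names = ["example_evalset.xml", "example_evalset-ref.xml"]
sources, references = split_source_and_reference(example_names)
```

The actual file names in the package should be checked against the 
accompanying README.txt before relying on a rule like this.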

This evaluation kit includes scoring software. The data is provided in 
both SGML and XML formats.

Non-members may request this data by completing a copy of the LDC User 
Agreement for Non-Members 
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>. The 
agreement can be faxed to +1 215 573 2175 or scanned and emailed to this 
address. This data is available for US$150.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
