[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Dec 22 21:00:27 UTC 2010
/In this newsletter:/
*- Spring 2011 LDC Data Scholarship Program -*
/New publications:/
LDC2010T24
*- Indian Language Part-of-Speech Tagset: Hindi -*
LDC2010T22
*- Manually Annotated Sub-Corpus First Release -*
LDC2010T23
*- NIST 2009 Open Machine Translation (OpenMT) Evaluation -*
------------------------------------------------------------------------
*Spring 2011 LDC Data Scholarship Program*
Applications are now being accepted through January 31, 2011 for the
Spring 2011 LDC Data Scholarship program! The LDC Data Scholarship
program provides university students with access to LDC data at
no cost. LDC offered data scholarships for the first time earlier this
year. We received many strong applications from students with a range
of research interests. Our student winners received no-cost copies of
LDC data valued at over US$10,000.
This program is open to students pursuing both undergraduate and
graduate studies in an accredited college or university. LDC Data
Scholarships are not restricted to any particular field of study;
however, students must demonstrate a well-developed research agenda and
a bona fide inability to pay.
The application consists of two parts:
(1) /*Data Use Proposal*/. Applicants must submit a proposal
describing their intended use of the data. The proposal must contain
the applicant's name, university, and field of study. The proposal
should state which data the student plans to use and contain a
description of their research project. Students are advised to
consult the LDC Corpus Catalog
<http://www.ldc.upenn.edu/Catalog/index.jsp> for a complete list of
data distributed by LDC. Because of certain restrictions, a handful of
LDC corpora are available only to members of the Consortium.
(2) /*Letter of Support*/. Applicants must submit one letter of
support from their thesis adviser or department chair. The letter
must confirm that the department or university lacks the funding to
pay the full Non-member Fee for the data and verify the student's
need for data.
For further information on application materials and program rules,
please visit the LDC Data Scholarship
<http://www.ldc.upenn.edu/About/scholarships.html> page.
Students can email their applications to the LDC Data Scholarship
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent
by email from the same address.
The deadline for the Spring 2011 program cycle is January 31, 2011.
*New Publications*
(1) Indian Language Part-of-Speech Tagset: Hindi
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T24>
is a corpus developed by Microsoft Research (MSR) India to support
Part-of-Speech (POS) tagging and other data-driven linguistic research
on Indian languages in general. It was created as part of the
Indian Language Part-of-Speech Tagset (IL-POST)
<http://research.microsoft.com/en-us/groups/mls/default.aspx> project, a
collaborative effort among linguists and computer scientists from MSR
India, AU-KBC (Anna University, Chennai), Delhi University, IIT Bombay,
Jawaharlal Nehru University (Delhi) and Tamil University (Tamilnadu).
The goal of the IL-POST project is to provide a common tagset framework
for Indian Languages that offers flexibility, cross-linguistic
compatibility and reusability across those languages. It supports a
three-level hierarchy of Categories, Types and Attributes. The corpus
therefore mainly consists of two levels of information for each lexical
token: (a) the lexical Category and Type, and (b) a set of
morphological attributes and their associated values in context.
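As a rough illustration of the three-level hierarchy described above, a single token's annotation could be modelled as follows. This is only a sketch: the class name, field names and example tag values are hypothetical and are not taken from the IL-POST tag inventory or from the corpus file format.

```python
from dataclasses import dataclass, field


@dataclass
class TokenAnnotation:
    """One lexical token with Category, Type and Attribute levels.

    All names and values here are illustrative assumptions, not the
    actual IL-POST tags.
    """
    token: str
    category: str                                    # level 1: lexical Category
    type_: str                                       # level 2: Type within the Category
    attributes: dict = field(default_factory=dict)   # level 3: attribute -> value

    def label(self) -> str:
        """Render the annotation as a single readable label."""
        base = f"{self.category}/{self.type_}"
        if not self.attributes:
            return base
        attrs = ";".join(f"{k}={v}" for k, v in sorted(self.attributes.items()))
        return f"{base}[{attrs}]"


# Hypothetical example: a transliterated token with made-up tag values.
example = TokenAnnotation(
    token="ladka",
    category="Noun",
    type_="Common",
    attributes={"number": "singular", "gender": "masculine"},
)
print(example.label())
```

Keeping the attributes in a dictionary reflects the idea that the set of attributes can vary by Category and by language while the top two levels stay shared.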
This corpus contains 4,859 sentences (98,450 words) of manually annotated
Hindi text randomly collected from the Microsoft Hindi Research Corpus,
sourced from the publisher WebDunia <http://www.webdunia.com/>. All
annotated data is provided in both XML and text files. The XML files are
contained in the "XML_files" folder and the text files in the
"text_files" folder. Each data file contains between 900 and 5,000 words.
The XML files contain metadata about the material, such as language,
encoding and data size.
The Annotation Guidelines for Hindi, included in this release, contain a
detailed description of the annotation methodology. The Annotation Tool
Guideline 1.0, also included in this publication, describes the
annotation interface developed for the IL-POST framework; the tool is
not included in this corpus.
Non-members may license this data by submitting a completed copy of the
Microsoft Research India License Agreement
<http://www.ldc.upenn.edu/Catalog/nonmem_agree/Indian_Language_POS_Tagset_Hindi_License_Agreement.htm>.
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to
this address. This data is available at no charge.
(2) Manually Annotated Sub-Corpus First Release (MASC I)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T22>
is the first of three releases of 500,000 words of MASC data developed
as part of the American National Corpus
<http://www.americannationalcorpus.org/> (ANC) project. MASC I consists
of approximately 80,000 words of contemporary spoken and written
American English annotated for a variety of linguistic phenomena. The
MASC <http://www.americannationalcorpus.org/MASC/Home.html> project is
sponsored by the National Science Foundation and was established to
address, to the extent possible, many of the obstacles to the creation
of large-scale, robust, multiply-annotated corpora of English covering a
wide range of genres of written and spoken language data. Researchers
from Vassar College, Columbia University and the International Computer
Science Institute, University of California at Berkeley are the principal
participants; the WordNet <http://wordnet.princeton.edu/> project
provides consulting.
The source texts in MASC I are drawn from the open portion of the
American National Corpus (ANC) Second Release LDC2005T35
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35>, which
includes written texts and spoken transcripts of American English from a
broad range of genres produced since 1990; and from the Language
Understanding Annotation Corpus (LU Corpus) LDC2009T10
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T10>,
a collection of various genres including broadcast, newswire,
email and telephone speech annotated for committed belief, event and
entity coreference, dialog acts and temporal relations. All of the data
in MASC I has validated annotations for token, part of speech,
sentence boundary, noun chunks, verb chunks, named entities and Penn
Treebank <http://www.cis.upenn.edu/%7Etreebank/> syntax. Full-text
FrameNet <http://framenet.icsi.berkeley.edu/> annotations are available
for seventeen texts and WordNet word sense annotations are available for
1,000 occurrences of each of fifty-three words. Annotations of all or
portions of the sub-corpus for a wide variety of other linguistic
phenomena have been contributed by other projects. Software and services
available from the ANC project website
<http://www.anc.org/MASC/Home.html> enable transduction of MASC into a
wide variety of physical formats.
The MASC directory contains two folders: "masc-1.0.3" and
"masc_wordsense". masc-1.0.3 contains the actual MASC corpus and
consists of two folders, "spoken" and "written". The spoken folder
contains data and annotations for spoken material, and the written
folder contains the same for written texts. The files in each of the
respective folders have naming conventions that describe the contents of
the file. masc_wordsense contains the MASC sentence samples with word
sense annotations using WordNet sense numbers as the annotation values.
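A minimal sketch of how one might take stock of this layout after unpacking the release, assuming the folder names given above. The function name and the root path passed to it are hypothetical, and the sketch makes no assumptions about file names or formats inside the folders.

```python
from pathlib import Path


def summarize_masc(root):
    """Count the files under each area of a MASC I directory tree.

    `root` is the directory holding "masc-1.0.3" and "masc_wordsense";
    its location is an assumption made by the caller.
    """
    root = Path(root)
    areas = {
        "spoken": root / "masc-1.0.3" / "spoken",
        "written": root / "masc-1.0.3" / "written",
        "wordsense": root / "masc_wordsense",
    }
    counts = {}
    for name, folder in areas.items():
        if folder.is_dir():
            # Walk recursively so nested genre folders are included.
            counts[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            counts[name] = 0  # folder missing from this copy
    return counts
```

For example, `summarize_masc("/path/to/MASC")` returns a dictionary with one count each for the spoken, written and word sense areas; a count of zero suggests the corresponding folder was not unpacked.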
Non-members may request this data by completing a copy of the LDC User
Agreement for Non-Members
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>. The
agreement can be faxed to +1 215 573 2175 or scanned and emailed to this
address. This data is available at no charge.
(3) NIST 2009 Open Machine Translation (OpenMT) Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T23>
is a package containing source data, reference translations and scoring
software used in the NIST 2009 OpenMT evaluation. It is designed to help
evaluate the effectiveness of machine translation systems. The package
was compiled and scoring software was developed by researchers at NIST,
making use of broadcast, newswire and web data and reference
translations collected and developed by LDC. The 2009 task was to
evaluate translation from Arabic to English and Urdu to English.
This release contains 373 documents with corresponding sets of four
separate human expert reference translations. The source data
consists of Arabic and Urdu broadcast, newswire and weblog data
collected by LDC in 2007 and 2009. The newswire and broadcast material
is from Asharq Al-Awsat (Arabic), Agence France-Presse (Arabic),
Al-Ahram (Arabic), Al Hayat (Arabic), Assabah (Arabic), An Nahar
(Arabic), Al-Quds Al-Arabi (Arabic), Xinhua News Agency (Arabic),
British Broadcasting Corporation (Urdu), Deutsche Welle (Urdu), Mehr
News Agency (Urdu) and Voice of America (Urdu).
For each language, the test set consists of two files: a source file and
a reference file. Each reference file contains four independent
reference translations of the data set unless noted otherwise in the
accompanying README.txt. The evaluation year, source language, test set
(which, by default, is "evalset"), version of the data, and source vs.
reference file (with the latter being indicated by "-ref") are reflected
in the file name.
This evaluation kit includes scoring software. The data is provided in
both SGML and XML formats.
Non-members may request this data by completing a copy of the LDC User
Agreement for Non-Members
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>. The
agreement can be faxed to +1 215 573 2175 or scanned and emailed to this
address. This data is available for US$150.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium    Phone: 1 (215) 573-1275
University of Pennsylvania    Fax: 1 (215) 573-2175
3600 Market St., Suite 810    ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA    http://www.ldc.upenn.edu