[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Feb 24 17:18:31 UTC 2012


- Spring 2012 LDC Data Scholarship Recipients!

/New publications:/

LDC2012S03
- Digital Archive of Southern Speech (DASS)

LDC2012T01
- ModeS TimeBank 1.0

------------------------------------------------------------------------

*Spring 2012 LDC Data Scholarship Recipients!*

LDC is pleased to announce the student recipients of the Spring 2012 LDC 
Data Scholarship program! This program provides university students 
with access to LDC data at no cost. Students were asked to complete an 
application consisting of a proposal describing their intended use of 
the data and a letter of support from their thesis adviser. We received 
many solid applications and have chosen six proposals to support. The 
following students will receive no-cost copies of LDC data:

    Zainab Ali Khalaf -- University of Science, Malaysia (Malaysia),
    graduate student, Computer Science. Zainab has been awarded a copy
    of /1996 English Broadcast News Transcripts (HUB4)/ (LDC97T22) for
    her work in spoken document retrieval.

    Daniel Jettka -- Trinity College Dublin (Ireland), graduate student,
    Centre for Language & Communication Studies. Daniel has been awarded
    copies of /Penn Discourse Treebank Version 2.0/ (LDC2008T05) and
    /RST Discourse Treebank/ (LDC2002T07) for his work in anaphora
    resolution.

    Olga Nickolaevna Ladoshko -- National Technical University of Ukraine
    "KPI" (Ukraine), graduate student, Acoustics and Acoustoelectronics.
    Olga has been awarded copies of /NTIMIT/ (LDC93S2) and /STC-TIMIT
    1.0/ (LDC2008S03) for her research in automatic speech recognition
    for Ukrainian.

    Ming Yang, Xiaoxiao Ma, and Jiajia Huang -- Wuhan University
    (China), graduate students, Computer Science. Ming, Xiaoxiao, and
    Jiajia have been awarded copies of /ACE Time Normalization (TERN)
    2004 English Training Data v 1.0/ (LDC2005T07) and /GALE Phase 1
    Chinese Broadcast News Parallel Text -- Part 1/ (LDC2007T23) for
    their work in summarization and data mining.

    Daria Vazhenina -- University of Aizu (Japan), graduate student,
    Human Interface Lab. Daria has been awarded a copy of /2005 Spring
    NIST Rich Transcription (RT-05S) Evaluation Set/ (LDC2011S06) for
    her work in speaker diarization.

    Tanina Zappone -- University of Rome "La Sapienza" (Italy), graduate
    student, Oriental Studies. Tanina has been awarded a copy of /Chinese
    Treebank 7.0/ (LDC2010T07) for her work on China's political
    communications.

Please join us in congratulating our student recipients! The next LDC 
Data Scholarship program is scheduled for the Fall 2012 semester.


*New publications*

(1) Digital Archive of Southern Speech (DASS) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012S03> 
was developed by the University of Georgia. It is a subset of the 
Linguistic Atlas of the Gulf States 
<http://www.lap.uga.edu/Site/LAGS.html> (LAGS), which is in turn part of 
the Linguistic Atlas Project <http://www.lap.uga.edu/> (LAP). DASS 
contains approximately 370 hours of English speech data from 30 female 
speakers and 34 male speakers in both .wav and .mp3 formats, along with 
associated metadata about the speakers and recordings, and maps in 
.jpeg format of the recording locations.

LAP consists of a set of survey research projects about the words and 
pronunciation of everyday American English, the largest project of its 
kind in the United States. Interviews with thousands of native speakers 
across the country have been carried out since 1929. LAGS surveyed the 
everyday speech of Georgia, Tennessee, Florida, Alabama, Mississippi, 
Arkansas, Louisiana, and Texas in a series of 914 audio-taped interviews 
conducted from 1968 to 1983. Interviews average approximately six hours in 
length; the systematic LAGS tape archive amounts to 5500 hours of sound 
recordings. DASS is a collection of 64 interviews from LAGS selected to 
cover a range of speech across the region and to represent multiple 
education levels and ethnic backgrounds.

Also included in this release is a version of the LICHEN software 
developed at the University of Oulu, Finland. LICHEN allows users to 
browse and search through the audio data in a more advanced fashion 
using a graphical interface.

(2) ModeS TimeBank 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T01> 
was developed by researchers at Technical University of Madrid 
<http://www.upm.es/internacional> and Barcelona Media 
<http://www.barcelonamedia.org/en> and is a corpus of Modern Spanish 
(17th and 18th centuries) annotated with temporal and event information 
according to TimeML mark-up and annotated with spatial information 
following the SpatialML scheme.

TimeML (Pustejovsky et al., 2005) is a specification language for 
annotating eventualities and time expressions in natural language as 
well as the temporal relations among them, thus facilitating the task of 
extraction, representation and exchange of temporal information. 
SpatialML (Mani et al., 2008) is a specification language for annotating 
and normalizing spatial expressions by means of geographic coordinates.

ModeS TimeBank 1.0 contains 102 documents reporting a sea-crossing 
cruise by a ship called La Princesa, which took place from December 1768 
to April 1769. There exist copious logbooks from that period that not 
only provide information about shipping routes, but also contain 
valuable data concerning information flows, commercial agents and social 
networks.

All text is encoded in UTF-8. The data in ModeS TimeBank 1.0 has been 
tokenized, POS-tagged, and annotated with space, time and event 
information according to the TimeML and SpatialML specification schemes.

ModeS TimeBank 1.0 is distributed via web download.

Non-members may request this data by completing a copy of the LDC User 
Agreement for Non-Members 
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>.  
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to 
this address.  This data is available at no charge.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu


