[Corpora-List] New LDC Corpora

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue May 3 15:40:56 UTC 2005


LDC2005S13
*Fisher English Training Part 2 Speech*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S13>

LDC2005T19
*Fisher English Training Part 2 Transcripts*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T19>

LDC2005L01
*Mawukakan Lexicon*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005L01>

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of three new corpora.

------------------------------------------------------------------------


Fisher English Training Part 2 Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S13>
represents the second half of a collection of conversational telephone
speech (CTS) that was collected at the LDC.  It contains 5849 audio
files, each one containing a full conversation of up to 10 minutes.
Corresponding transcripts are available as Fisher English Training
Part 2 Transcripts (LDC2005T19).

The individual audio files are presented in NIST SPHERE format, and
contain two-channel mu-law sample data; "shorten" compression has been
applied to all files.
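
As a rough illustration of the format described above, a NIST SPHERE
file begins with a plain-text header that can be inspected before any
audio is decoded. The sketch below is not LDC code; the function name
parse_sphere_header and the demo header values are made up for
illustration. It only reads the header: shorten-compressed sample data
still has to be decompressed with a separate tool (for example
sph2pipe) before the mu-law samples can be used.

```python
import io

def parse_sphere_header(stream):
    """Parse the text header of a NIST SPHERE file from a seekable binary stream."""
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    # The second line holds the total header size in bytes (commonly 1024).
    header_size = int(stream.readline().decode("ascii").strip())
    stream.seek(0)
    header = stream.read(header_size).decode("ascii", errors="replace")
    fields = {}
    for line in header.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        # Each field line has the form: name -type value
        name, ftype, value = line.split(None, 2)
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN": a string of N characters
            fields[name] = value
    return header_size, fields

# Synthetic header for demonstration only; real files carry more fields.
demo = (b"NIST_1A\n   1024\n"
        b"channel_count -i 2\n"
        b"sample_rate -i 8000\n"
        b"sample_coding -s27 ulaw,embedded-shorten-v2.00\n"
        b"end_head\n").ljust(1024, b" ")
size, fields = parse_sphere_header(io.BytesIO(demo))
print(size, fields["channel_count"], fields["sample_coding"])
```

With a real file, the same function can be called on a file opened in
binary mode, e.g. open(path, "rb"); the audio data starts at the byte
offset given by the returned header size.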

*

Fisher English Training Part 2 Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T19>
contains the corresponding transcripts for the Fisher English Training
Part 2 Speech collection. About 12% of the conversations were
transcribed at the LDC, and the rest were transcribed by BBN and WordWave
using a significantly different approach to the task.  A central goal in
both sets was to maximize the speed and economy of the transcription
process, and this in turn involved forgoing certain aspects of mark-up
detail and quality control that may have been common in previous,
smaller corpora.

*

Mawukakan Lexicon
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005L01>
is the first publication of an ongoing project aiming to build an
electronic dictionary of four Mandekan varieties (Eastern Manding
languages of the Mande Group of the Niger-Congo family).  The lack of a
written tradition makes such a dictionary project extremely important.
Our expectation is that once this initial goal is reached, it will become
easier to extend the dictionary to all the other varieties of Mandekan.

The lexicon is trilingual, that is, the target language is Mawukakan,
while English and French are used as glossing languages.  Both the
Toolbox and the XML versions of this dictionary use the Unicode (UTF-8)
encoding.
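
For readers who have not worked with Toolbox files, the sketch below
shows one way to read backslash-marked records from UTF-8 text. It is
an illustration under assumptions, not a description of this lexicon's
actual layout: the markers \lx, \ge and \gf and the sample entries are
hypothetical placeholders, and the function name parse_sfm is made up.

```python
def parse_sfm(text):
    """Group backslash-marked lines into records, starting a new record at each \\lx."""
    records, current = [], None
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue  # skip blank lines and unmarked continuation lines
        marker, _, value = line[1:].partition(" ")
        if marker == "lx":  # assumed record-start marker
            current = {}
            records.append(current)
        if current is not None:
            current.setdefault(marker, []).append(value.strip())
    return records

# Placeholder data, encoded and decoded as UTF-8 to mirror the file encoding.
sample = ("\\lx placeholder-ɛ\n"
          "\\ge English gloss\n"
          "\\gf glose française\n"
          "\n"
          "\\lx placeholder-ɔ\n"
          "\\ge another gloss\n").encode("utf-8")
records = parse_sfm(sample.decode("utf-8"))
print(len(records), records[0]["gf"])
```

Decoding explicitly as UTF-8 keeps non-ASCII letters intact regardless
of the platform's default encoding.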

------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call
+1 215 573 1275.


--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                      ldc at ldc.upenn.edu
Philadelphia, PA 19104                         http://www.ldc.upenn.edu
