[Corpora-List] New Data from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Thu Jan 6 19:29:53 UTC 2005
The Linguistic Data Consortium (LDC) is pleased to announce the
availability of three (3) new databases.
------------------------------------------------------------------------
(1) The Buckwalter Arabic Morphological Analyzer Version 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02>
consists primarily of three Arabic-English lexicon files: prefixes (299
entries), suffixes (618 entries), and stems (82158 entries representing
38600 lemmas). The lexicons are supplemented by three morphological
compatibility tables used for controlling prefix-stem combinations (1648
entries), stem-suffix combinations (1285 entries), and prefix-suffix
combinations (598 entries). The documentation consists of a readme file
with a description of the lexicon files, the morphological compatibility
tables, the morphology analysis algorithm, a summary of stem
morphological categories, and a table with the author's Arabic
transliteration system.
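As a rough illustration of how three lexicons and three compatibility tables can work together, the sketch below tries every split of a transliterated word into prefix, stem, and suffix, and keeps a split only when all three pairings are allowed. It is a minimal sketch, not the analyzer's actual code: the lexicon entries, the category names (P_NONE, S_NOUN, and so on), and the table contents are toy placeholders invented for this example, and the real file formats are described in the readme.

```python
from itertools import product

# Toy lexicons: surface form -> list of (category, gloss).
# All entries and category names are hypothetical placeholders.
PREFIXES = {"": [("P_NONE", "")], "w": [("P_CONJ", "and")]}
STEMS = {"ktAb": [("S_NOUN", "book")]}
SUFFIXES = {"": [("X_NONE", "")], "y": [("X_POSS", "my")]}

# Toy compatibility tables: sets of allowed category pairs.
PREFIX_STEM = {("P_NONE", "S_NOUN"), ("P_CONJ", "S_NOUN")}
STEM_SUFFIX = {("S_NOUN", "X_NONE"), ("S_NOUN", "X_POSS")}
PREFIX_SUFFIX = {("P_NONE", "X_NONE"), ("P_NONE", "X_POSS"), ("P_CONJ", "X_NONE")}


def segmentations(word):
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    n = len(word)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield word[:i], word[i:j], word[j:]


def analyze(word):
    """Return the splits whose parts are in the lexicons and mutually compatible."""
    results = []
    for pre, stem, suf in segmentations(word):
        candidates = product(
            PREFIXES.get(pre, []), STEMS.get(stem, []), SUFFIXES.get(suf, [])
        )
        for (pc, pg), (sc, sg), (xc, xg) in candidates:
            if (
                (pc, sc) in PREFIX_STEM
                and (sc, xc) in STEM_SUFFIX
                and (pc, xc) in PREFIX_SUFFIX
            ):
                gloss = " + ".join(g for g in (pg, sg, xg) if g)
                results.append((pre, stem, suf, gloss))
    return results
```

With this toy data, analyze("wktAb") finds the single split w + ktAb, while analyze("wktAby") returns nothing because the toy prefix-suffix table does not pair P_CONJ with X_POSS.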
Institutions that have membership in the LDC for the Membership Year
(MY) 2004 will be able to receive this corpus free of charge. Please
note that this corpus is designated 'Members Only' and is, therefore,
not available for nonmember licensing. You can find information on
becoming an LDC member at our Members FAQ
<http://www.ldc.upenn.edu/Membership/FAQ_Members.shtml>.
*
(2) Fisher English Training Speech Part 1 Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S13>
represents the first half of a collection of conversational telephone
speech (CTS) that was created at the LDC during 2003. It contains 5850
audio files, each one containing a full conversation of up to 10
minutes. The individual audio files are presented in NIST SPHERE
format, and contain two-channel mu-law sample data; "shorten"
compression has been applied to all files. Fisher English Training
Speech Part 1 Speech is distributed on seven DVD-ROMs.
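For readers who have not worked with NIST SPHERE files, the sketch below reads the plain-text header that precedes the sample data. It assumes the usual SPHERE layout: a "NIST_1A" line, a line giving the header size in bytes, then "name -type value" fields terminated by "end_head". It also assumes the header stays plain ASCII when the samples are compressed. The function name and the sample field values are hypothetical and are not taken from the Fisher files, and decoding the shorten-compressed samples is outside the scope of this sketch.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    first_lines = data.split(b"\n", 2)
    if len(first_lines) < 3 or first_lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(first_lines[1].strip())

    fields = {}
    text = data[:header_size].decode("ascii", errors="replace")
    for raw in text.split("\n")[2:]:
        line = raw.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip anything that is not "name -type value"
        name, ftype, value = parts
        if ftype == "-i":
            value = int(value)
        elif ftype == "-r":
            value = float(value)
        fields[name] = value
    return fields


# Hypothetical header for illustration only, padded to the declared size.
sample = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 2\n"
    b"sample_rate -i 8000\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
).ljust(1024)

header = parse_sphere_header(sample)
```

In practice the bytes would come from opening a file in binary mode and reading the first block, for example the first 1024 bytes.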
Institutions that have membership in the LDC for the Membership Year
(MY) 2004 will be able to receive this corpus free of charge. Nonmembers
may license this corpus for US$7000.
*
(3) Fisher English Training Speech Part 1 Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T19>
represents the first half of a collection of conversational telephone
speech (CTS) that was created at the LDC. It contains transcript data
for 5850 complete conversations, each lasting up to 10 minutes. In
addition to the transcriptions, there is a complete set of tables
describing the speakers, the properties of the telephone calls, and the
set of topics that were used to initiate the conversations. Fisher
English Training Speech Part 1 Transcripts is distributed on one CD-ROM.
Institutions that have membership in the LDC for the Membership Year
(MY) 2004 will be able to receive this corpus free of charge. Nonmembers
may license this corpus for US$1000.
------------------------------------------------------------------------
For further information on LDC data, please visit our online catalog
<http://www.ldc.upenn.edu/Catalog/>. Should you have any questions
concerning the licensing of data, or if you are interested in becoming
an LDC member, please call +1 215 573 1275 or email ldc at ldc.upenn.edu.
--------------------------------------------------------------------
Linguistic Data Consortium     Phone: (215) 573-1275
3600 Market Street             Fax: (215) 573-2175
Suite 810                      ldc at ldc.upenn.edu
Philadelphia, PA 19104         www.ldc.upenn.edu