[Corpora-List] New LDC Publications

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon May 19 20:40:34 UTC 2003


                             LDC2003S03
            *   Korean Telephone Conversations Speech   *

                             LDC2003T08
          *   Korean Telephone Conversations Transcripts   *

                             LDC2003L02
            *   Korean Telephone Conversations Lexicon   *

                             LDC2003P01
          *   Korean Telephone Conversations Complete Set   *



The Linguistic Data Consortium (LDC) is pleased to announce the
availability of several new publications.


1.  The Korean Telephone Conversations Speech corpus was originally
recorded as part of the Callfriend project.  The conversations were
collected by the Linguistic Data Consortium primarily in support of the
Language Identification (LID) project, sponsored by the U.S. Department
of Defense.

The Korean Telephone Conversations Speech corpus consists of 100
telephone conversations between native speakers of Korean. Of these, 49
were published by the LDC in 1996 as LDC96S54 CALLFRIEND Korean; 51
conversations are previously unreleased material.  The recorded
conversations last up to 30 minutes.

There are 100 speech files, totaling approximately 44 hours of audio.
All speech files are in SPHERE format (shorten-compressed), recorded as
2-channel u-law with a sampling rate of 8 kHz.  This publication consists
of three CD-ROMs.
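
As a rough illustration of what these figures imply, the sketch below
estimates the uncompressed size of the audio.  It assumes the standard
8 bits per sample for u-law and takes the 44-hour total at face value;
the function name is ours, and the actual size on disc will differ
because the files are shorten-compressed.

```python
# Back-of-the-envelope estimate of the uncompressed audio size.
# Assumption (not stated in the announcement): u-law uses 1 byte per sample.
HOURS = 44              # approximate total duration from the announcement
SAMPLE_RATE_HZ = 8000   # 8 kHz
CHANNELS = 2            # 2-channel recordings
BYTES_PER_SAMPLE = 1    # 8-bit u-law


def uncompressed_bytes(hours, sample_rate_hz, channels, bytes_per_sample):
    """Return the raw audio size in bytes, ignoring file headers."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * channels * bytes_per_sample


total = uncompressed_bytes(HOURS, SAMPLE_RATE_HZ, CHANNELS, BYTES_PER_SAMPLE)
print(f"{total / 1e9:.2f} GB")  # prints 2.53 GB
```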

For further information, including a link to online documentation,
please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S03

Institutions that have membership in the LDC during the 2003
Membership Year will be able to receive this corpus free of charge.
Nonmembers may license this publication for $1000.


2. The Korean Telephone Conversations Transcripts consists of 100
transcribed telephone conversations between native speakers of Korean.
The transcripts correspond to the 100 conversations in Korean Telephone
Conversations Speech.  The recorded conversations last up to 30 minutes,
of which the transcribed speech covers between 15 and 18 minutes.

The Korean Telephone Conversations Transcripts contains 100 text files,
totaling approximately 190K words and 25K unique words. All files are in
Korean orthography, using the KSC-5601 character set.  This publication
is distributed by ftp.
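
For readers who want to work with the transcripts programmatically, here
is a minimal sketch of decoding KSC-5601 text and counting words in
Python.  It assumes Python's "euc_kr" codec (for which "ksc5601" is a
documented alias) matches the encoding of the files, and it simply
splits on whitespace; the real transcripts may carry timestamps, speaker
labels or other markup that would need to be stripped first.  The
function name and the sample text are ours, not taken from the corpus.

```python
from collections import Counter


def count_words(raw: bytes, encoding: str = "euc_kr") -> Counter:
    """Decode KSC-5601 bytes and count whitespace-separated tokens."""
    text = raw.decode(encoding)
    return Counter(text.split())


# Made-up sample standing in for the contents of one transcript file.
sample = "안녕 하세요 안녕".encode("euc_kr")
counts = count_words(sample)

total_words = sum(counts.values())  # running words (tokens)
unique_words = len(counts)          # distinct words (types)
print(total_words, unique_words)    # prints 3 2
```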

For further information, including a link to a sample transcript,
please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T08

Institutions that have membership in the LDC during the 2003
Membership Year will be able to receive this corpus free of charge.
Nonmembers may license this publication for $1000.


3.  The Korean Telephone Conversations Lexicon consists of 25,251
words, and contains separate fields with phonological, morphological,
and frequency information for each word.  The lexicon covers the tokens
occurring in the 100 telephone conversations transcribed and published
as Korean Telephone Conversations Transcripts. The token coverage is 100%.

The lexicon contains five tab-separated information fields:

        1. orthographic form in Hangul (headword), encoded in the
           KSC-5601 character set
        2. orthographic form in Yale romanization
        3. pronunciation
        4. frequency of the word in Korean Telephone Conversations
           Transcripts
        5. morphological analysis of the word
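
The five fields listed above could be read with something like the
following sketch.  It assumes one entry per line, that the frequency
field is a plain integer, and that the whole file decodes with Python's
"euc_kr" codec; none of this is confirmed by the announcement.  The
class and function names are ours, and the sample entry uses
placeholder values rather than real lexicon data.

```python
from typing import List, NamedTuple


class LexiconEntry(NamedTuple):
    hangul: str         # field 1: headword in Hangul
    yale: str           # field 2: Yale romanization
    pronunciation: str  # field 3: pronunciation
    frequency: int      # field 4: frequency in the transcripts
    morphology: str     # field 5: morphological analysis


def parse_line(line: str) -> LexiconEntry:
    """Split one tab-separated line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    hangul, yale, pron, freq, morph = fields
    return LexiconEntry(hangul, yale, pron, int(freq), morph)


def parse_lexicon(raw: bytes, encoding: str = "euc_kr") -> List[LexiconEntry]:
    """Decode the raw lexicon bytes and parse every non-empty line."""
    text = raw.decode(encoding)
    return [parse_line(line) for line in text.splitlines() if line.strip()]


# Placeholder entry: only the five-field layout follows the announcement.
sample = "학교\thakkyo\tPRON\t12\tMORPH\n".encode("euc_kr")
entries = parse_lexicon(sample)
print(entries[0].hangul, entries[0].frequency)
```

Raising an error on a malformed line, rather than skipping it, makes it
easier to notice if the actual file layout differs from these
assumptions.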

This publication is distributed by ftp.

For more information, including a link to a sample page from the
lexicon, please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003L02

Institutions that have membership in the LDC during the 2003
Membership Year will be able to receive this corpus free of charge.
Nonmembers may license this publication for $1500.


4.  The Korean Telephone Conversations Complete Set consists of the
following:

LDC2003S03  Korean Telephone Conversations Speech
LDC2003T08  Korean Telephone Conversations Transcripts
LDC2003L02  Korean Telephone Conversations Lexicon

All three of the above publications may be licensed together as a
package for the nonmember fee of $3000, a savings of $500 off the
sum of the individual corpora licensing fees.

                                    *


If you need additional information before placing your order, or
would like to inquire about membership in the LDC, please send email to
<ldc at ldc.upenn.edu> or call (215) 573-1275.


--------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 573-1275
3600 Market Street                  Fax:   (215) 573-2175
Suite 810                           email: ldc at ldc.upenn.edu
Philadelphia, PA 19104-2653         www: http://www.ldc.upenn.edu


