[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Jan 24 21:12:07 UTC 2014


/New publications:/

*CALLFRIEND Farsi Second Edition Speech* <#speech>

*CALLFRIEND Farsi Second Edition Transcripts* <#trans>

------------------------------------------------------------------------
*New Publications*

(1) CALLFRIEND Farsi Second Edition Speech 
<http://catalog.ldc.upenn.edu/LDC2014S01> was developed by LDC and 
consists of approximately 42 hours of telephone conversation (100 
recordings) among native Farsi speakers. The calls were recorded in 1995 
and 1996 as part of the CALLFRIEND collection, a project designed 
primarily to support research in automatic language identification. One 
hundred native Farsi speakers living in the continental United States 
each made a single telephone call, lasting up to 30 minutes, to a family 
member or friend living in the United States.

This release represents all calls from the collection. LDC released 
recordings from 60 calls without transcripts in 1996 as CALLFRIEND Farsi 
(LDC96S50 <http://catalog.ldc.upenn.edu/LDC96S50>) after 20 of those 
calls were used as evaluation data in the first NIST Language 
Recognition Evaluation <http://www.itl.nist.gov/iad/mig/tests/lre/1996/> 
(LRE).

Corresponding transcripts are available in CALLFRIEND Farsi Second 
Edition Transcripts (LDC2014T01 <http://catalog.ldc.upenn.edu/LDC2014T01>).

All recordings involved domestic calls routed through LDC's automated 
telephone collection platform and were stored as 2-channel (4-wire), 
8-kHz mu-law samples taken directly from the public telephone network 
via a T-1 circuit. Each audio file is a FLAC 
<https://xiph.org/flac/>-compressed MS-WAV (RIFF) format audio file 
containing 2-channel, 8-kHz, 16-bit PCM sample data.
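
For readers unfamiliar with the relationship between mu-law samples and 
16-bit PCM, the following is a minimal sketch of the standard G.711 
mu-law expansion. It is an assumption that the released PCM data was 
derived this way; the announcement does not describe the conversion. The 
function names ulaw_to_pcm16 and decode_ulaw are hypothetical and are 
not part of any LDC tool.

```python
BIAS = 0x84  # bias constant used by G.711 mu-law companding


def ulaw_to_pcm16(code: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear PCM value.

    Assumes standard G.711 mu-law encoding (an assumption, not stated
    in the announcement).
    """
    code = ~code & 0xFF            # mu-law bytes are stored bit-inverted
    sign = code & 0x80             # top bit carries the sign
    exponent = (code >> 4) & 0x07  # 3-bit segment number
    mantissa = code & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_ulaw(data: bytes) -> list:
    """Expand a buffer of mu-law bytes to a list of PCM values."""
    return [ulaw_to_pcm16(b) for b in data]
```

Each 8-bit mu-law code expands to a value that fits in 16 bits, so the 
PCM form needs twice the storage of the mu-law form before FLAC 
compression.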

This release also includes metadata on speaker gender, the number of 
speakers on each channel and call duration.

(2) CALLFRIEND Farsi Second Edition Transcripts 
<http://catalog.ldc.upenn.edu/LDC2014T01> was developed by LDC and 
consists of transcripts for approximately 42 hours of telephone 
conversation (100 recordings) among native Farsi speakers. The calls 
were recorded in 1995 and 1996 as part of the CALLFRIEND collection, a 
project designed primarily to support research in automatic language 
identification. One hundred native Farsi speakers living in the 
continental United States each made a single telephone call, lasting up 
to 30 minutes, to a family member or friend living in the United States.

Corresponding speech data is available as CALLFRIEND Farsi Second 
Edition Speech (LDC2014S01 <http://catalog.ldc.upenn.edu/LDC2014S01>).

Transcripts are presented in three formats: romanized transcripts 
(*asc.txt), Arabic-script transcripts (*ntv.txt) and both romanized and 
Arabic forms in a simple XML format (*.xml). For the *.txt files, the 
four main fields on each line (start-offset, end-offset, speaker-label, 
transcript-text) are separated by tabs. Each file begins with a single 
comment line containing the file_id string. This is followed immediately 
by the list of time-stamped segments, in order according to their 
start-offset values, with no blank lines. The XML form of the 
transcripts contains both Arabic-script and romanized forms for Farsi words.
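
The *.txt layout described above can be read with a few lines of code. 
The following is a minimal sketch, not an LDC-provided tool. It assumes 
that the offsets are decimal numbers and that any text after the third 
tab belongs to the transcript field; neither detail is stated in this 
announcement. The names Segment and parse_transcript are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Segment(NamedTuple):
    start: float   # start-offset (assumed to be a decimal number)
    end: float     # end-offset (assumed to be a decimal number)
    speaker: str   # speaker-label
    text: str      # transcript-text


def parse_transcript(lines: Iterable[str]) -> Tuple[str, List[Segment]]:
    """Return the header comment line and the time-stamped segments."""
    it = iter(lines)
    # The first line is the single comment line containing the file_id.
    header = next(it).rstrip("\r\n")
    segments = []
    for line in it:
        line = line.rstrip("\r\n")
        if not line:
            continue  # none are expected, but skip blank lines defensively
        # Split on the first three tabs only, so the text stays intact.
        start, end, speaker, text = line.split("\t", 3)
        segments.append(Segment(float(start), float(end), speaker, text))
    return header, segments
```

A file could then be read with something like 
parse_transcript(open(path, encoding="utf-8")), where the UTF-8 encoding 
is also an assumption, particularly for the Arabic-script files.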


------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
