[Corpora-List] Query about corpora of spoken English

Mon Dec 12 21:14:19 UTC 2005

Rayson, Paul wrote:
> Hi,
> 
> I've been told by Anne Wichmann and Gerry Knowles that the latest
> version of MARSEC is held by Daniel Hirst in Aix-en-Provence:
> 
> http://aune.lpl.univ-aix.fr/~hirst/home.html

This is the Aix-MARSEC project: a more direct link is

http://www.lpl.univ-aix.fr/~EPGA/en_marsec_com.html

The Aix-MARSEC project takes the work of MARSEC a great deal further.

1) The original SEC was not time-aligned in any way with the speech data: it 
consisted of transcripts only, at various linguistic levels.

2) The MARSEC project time-aligned the speech data with a word-level 
transcription and also a transcription at the level of the tone group.

3) The Aix-MARSEC project time-aligns the speech data at several linguistic 
levels, namely: the phoneme, the syllable, sub-syllabic constituents, the 
rhythmic unit, the stress foot, the word, major and minor intonation units, 
and the MOMEL/INTSINT intonational coding.
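
As a rough illustration of what time alignment at several levels makes
possible, here is a minimal Python sketch. It does not read the actual
Aix-MARSEC files: the tier names, times and labels below are invented for
illustration, and the helper functions are hypothetical, not part of the
project's tools.

```python
from typing import Dict, List, NamedTuple, Optional


class Interval(NamedTuple):
    """One labelled stretch of speech on a single annotation level."""
    start: float  # seconds
    end: float    # seconds
    label: str


# Hypothetical tiers for the phrase "good morning".
# All times and labels are made up; they are not taken from the corpus.
TIERS: Dict[str, List[Interval]] = {
    "phonemes": [
        Interval(0.00, 0.08, "g"),
        Interval(0.08, 0.20, "U"),
        Interval(0.20, 0.27, "d"),
        Interval(0.27, 0.35, "m"),
        Interval(0.35, 0.50, "O:"),
        Interval(0.50, 0.58, "n"),
        Interval(0.58, 0.68, "I"),
        Interval(0.68, 0.80, "N"),
    ],
    "syllables": [
        Interval(0.00, 0.27, "gUd"),
        Interval(0.27, 0.50, "mO:"),
        Interval(0.50, 0.80, "nIN"),
    ],
    "words": [
        Interval(0.00, 0.27, "good"),
        Interval(0.27, 0.80, "morning"),
    ],
}


def label_at(tier: List[Interval], t: float) -> Optional[str]:
    """Return the label of the interval covering time t, or None."""
    for iv in tier:
        if iv.start <= t < iv.end:
            return iv.label
    return None


def units_at(tiers: Dict[str, List[Interval]], t: float) -> Dict[str, Optional[str]]:
    """Look up the unit active at time t on every tier."""
    return {name: label_at(tier, t) for name, tier in tiers.items()}


def contained_in(tier: List[Interval], start: float, end: float) -> List[str]:
    """Return the labels of all intervals lying wholly inside [start, end]."""
    return [iv.label for iv in tier if iv.start >= start and iv.end <= end]
```

With this toy data, units_at(TIERS, 0.40) reports the phoneme, syllable and
word in progress at 0.40 s, and contained_in(TIERS["phonemes"], 0.27, 0.80)
lists the phonemes that fall within the second word. Lookups of this kind
across levels only become possible once every level shares one time axis.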

The annotations are available online from the Aix-MARSEC project, but the 
recordings are available from them on CD-ROM only.

-----------------------------------------------------------------------
Quoting from: 
http://www.lpl.univ-aix.fr/~EPGA/marsec_com/Auran_Bouzon_Hirst_SP2004.pdf

"For compatibility and processing reasons, the 332-minute long
audio component is available under the form of 408
16 kHz .wav format files.

"The annotation component currently comprises the 9
different levels mentioned earlier: phonemes, syllables,
subsyllabic constituents, words, stress feet, rhythm units, minor
and major intonation units, INTSINT coding and the
corresponding values of the targets in Hz. Each level is
represented by a separate tier in Praat TextGrids (as illustrated
in figure 1). Two supplementary levels, based on the syntactic
annotation of the corpus using the CLAWS system and a
Property Grammar system developed in the Laboratoire Parole
et Langage in Aix-en-Provence are to be integrated soon, thus
allowing not only future analyses taking into account the
grammatical tagging and parsing of the data, but also the direct
comparison of automatic syntactic annotation systems.

"The Aix-MARSEC tools consist of a set of reference files
(grapheme-phoneme conversion dictionaries) and (multiplatform)
Praat and Perl scripts."
------------------------------------------------------------------------

As one of the two prosodic transcribers of the original IBM/Lancaster SEC 
project, I am delighted that this work has evolved into such a rich resource, 
which will be of immense benefit to those studying the phonetics and 
structure of spoken UK English.

Best regards

Briony Williams