[Corpora-List] misalignments in Boston University Radio Speech Corpus

Sabine Buchholz sabine.buchholz at crl.toshiba.co.uk
Wed May 26 15:46:20 UTC 2004


Dear list members,

I am supervising a student who works with the .wrd, .brk and .pos files of
the Boston University Radio Speech Corpus. Although in theory all these file
types should contain the same number of words/lines for any given file name,
in practice there are many differences. For example, "school-based" or
"they're" is treated as one word in one file and as two words in another.
I guess that everybody who has worked with these files will have noticed
this at some point and I wondered how other people dealt with it. Does
anybody have a script to correct at least the easy cases? Or are there newer
versions of the corpus where this has been corrected?

Thank you very much for any information,
Sabine Buchholz

