hyperlinking timecoded audio data in dictionaries

Patrick McConvell Patrick.McConvell at aiatsis.gov.au
Fri Aug 13 04:00:55 UTC 2004


I am very impressed by the performance of the Switchboard corpus hosted
on the LDC site (mentioned by Nick) in this regard. Even from my 56k
modem at home I can download audio clips of 20 seconds or more in only a
couple of seconds and then run them through PRAAT or whatever. You can
do a global search for items (words or phrases) across the very large
corpus of transcripts, get a concordance with large sections of text on
either side of the target item, and play or download either the item or
the section containing it with very little delay, then carry on with
further searches and downloads. Several formats are available,
including WAV.

I don't know what technology lies behind this, if this could be
replicated or how much it would cost, but maybe Steven Bird could
advise.


Pat

>>> Pascale Jacq <pascale.jacq at anu.edu.au> 08/13/04 01:35pm >>>
The use of data streaming and time-coded audio data for hyperlinking
audio in interactive dictionary making

Hello RNLD list subscribers. My name is Pascale Jacq, and I'm working
on a Hans Rausing Endangered Language Funded project to document the
moribund Jawoyn language of Arnhem Land, with the Chief Investigator
Prof. Francesca Merlan, at the Australian National University.
I just joined RNLD yesterday after asking Nick Thieberger for advice on
how to create hyperlinks to our audio data within the Jawoyn interactive
dictionary we're creating in Shoebox (MDF) for use by the language
community in language teaching, maintenance etc. He suggested I send my
query (which I've reworded somewhat) to the RNLD list so others can
benefit from our experience.

Background:
The analogue and DAT tape audio data we have has now been digitised (by
AIATSIS) onto more than 100 CDs (each CD holds a single unsegmented WAV
file of an hour or 90 minutes in length, i.e. in "real time"). We hold
Master copies of each, as does AIATSIS in its archive, and the original
tapes are archived there too. Any future copy made from the Master CD
is, I believe, called a 'red book' copy and is in read-only format.

Problem:
The problem is that when I wish to make a hyperlink to a sentence
exemplifying a dictionary entry (easily done when the Shoebox lexicon is
exported to Word), I can only link to the whole WAV file, not to the
relevant time-coded segment where the sentence occurs.
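
One way to get a linkable file per example sentence, without a streaming
server, would be to cut each time-coded segment out of the hour-long WAV
into its own small WAV file. The following is only a minimal sketch
using Python's standard wave module; the function name extract_segment
and the file names in the usage note are hypothetical, and it assumes
uncompressed PCM WAV files such as those on the CDs.

```python
import wave


def extract_segment(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s (in seconds) into a new WAV file.

    Returns the number of sample frames written.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Convert times to frame indices and clamp them to the file length.
        start = max(0, min(total, round(start_s * rate)))
        end = max(start, min(total, round(end_s * rate)))
        src.setpos(start)
        frames = src.readframes(end - start)
        with wave.open(dst_path, "wb") as dst:
            # Keep the channel count, sample width and rate of the source.
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
    return end - start
```

For example, extract_segment("tape01.wav", "tape01_0083.wav", 83.0, 95.5)
would write the 12.5 seconds starting at 1 min 23 s to a separate file
that a dictionary entry could link to directly.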

Solution:
The solution Nick suggested is the following (extracted from his email
reply to me dated 12/08/2004):
>"It sounds like you could use a streaming server for which you would
>have all of your CDs loaded onto a hard disk and be able to access
>timecoded segments anywhere within that data. You could also convert
>the wav files to MP3 for this purpose and it would take up a tenth of
>the disk space. The LDC/TalkBank use a streaming server to deliver
>their data. There is also the work being done on Annodex by CSIRO
>(also known as CMWeb) http://www.cmis.csiro.au/maaate/, and
>http://www.annodex.net/. They may be able to provide a solution and I
>would be interested to hear about anything you come up with with
>them".

[I'm currently investigating the streaming server solution]
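
If the audio is served over the web, one way to express "this segment of
that file" in a hyperlink is a temporal fragment of the form
#t=start,end appended to the file's URL (the W3C Media Fragments URI
syntax). The sketch below only builds such a link; the function name
segment_link and the example.org address are hypothetical, and whether
the link actually plays just the segment depends on the player or
server honouring the fragment.

```python
def _fmt_seconds(value):
    """Format seconds with up to millisecond precision, dropping trailing zeros."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def segment_link(base_url, start_s, end_s):
    """Return base_url with a '#t=start,end' temporal fragment (times in seconds)."""
    if start_s < 0 or end_s <= start_s:
        raise ValueError("need 0 <= start_s < end_s")
    return f"{base_url}#t={_fmt_seconds(start_s)},{_fmt_seconds(end_s)}"


# Hypothetical address for one digitised tape.
link = segment_link("http://example.org/audio/tape01.mp3", 83, 95.5)
```

A link like this could be stored in the example-sentence field of the
lexicon in place of a link to the whole file.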

Final Questions:
A further concern which emerged when I thought about converting to MP3
was: would the time coding change from the original WAV format? The aim
of archiving linguistic data is to make it consistent, durable,
catalogued and thus easily accessible (always back to the original
source) in the future by those to whom the speakers allow data access.
I've already had the experience of a DAT tape 'drop out' of 23 seconds
in the digitisation process. Luckily the Master copy kept at AIATSIS had
these 23 seconds of material and they could make a 'red book' copy for
our use. However, I noticed that the time coding of the first Master CD
we had was now one second out from the 'red book' copy (in addition to
the 23 seconds), and so I wonder whether copies made from the original
Master might not share the same time coding.
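
If the differences between two copies can be measured, time codes made
against one copy can be converted to the other rather than redone. The
sketch below assumes the situation described above: a constant lead
offset (the one second) plus a stretch of restored material (the 23
seconds) at a known position. The function name and the position used
in the example (600 seconds) are hypothetical, and the size and
direction of each offset would need to be checked by ear or in a
waveform editor. MP3 encoders also typically add a short stretch of
padding at the start of a file, so a small constant offset of the same
kind may need to be measured after conversion.

```python
def remap_timecode(t, lead_offset, gap_at, gap_len):
    """Map a time (seconds) on the old copy to the matching time on the new copy.

    lead_offset: constant shift between the two copies, in seconds.
    gap_at:      position on the old copy where material was missing.
    gap_len:     length of the material restored on the new copy.
    """
    shifted = t + lead_offset
    # Everything at or after the drop-out is pushed later by the restored material.
    if t >= gap_at:
        shifted += gap_len
    return shifted


# Hypothetical values: 1 s lead offset, 23 s restored at 600 s on the old copy.
before = remap_timecode(100.0, 1.0, 600.0, 23.0)   # 101.0
after = remap_timecode(700.0, 1.0, 600.0, 23.0)    # 724.0
```

Keeping the measured offsets alongside each copy's catalogue entry would
let every hyperlink be traced back to the archived Master.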

This is a serious issue to consider if we are to use hyperlinks to audio
recordings, and I'd appreciate any advice, comments or similar
experiences you may have.


