portable storage

William J Poser wjposer at LDC.UPENN.EDU
Thu Dec 6 08:47:15 UTC 2007

>As far as I'm aware, you cannot operate raw pcm files on a computer;
>they have to be embedded inside some file format with a necessary header,
>be it wav, au or aiff.

This isn't strictly true - nothing in principle prevents the use of
raw pcm files and there is a fair amount of software that can manipulate
them. It just isn't advisable in most circumstances because in the absence
of a header the file itself contains no information about the sampling
rate and other parameters. Software that can read raw pcm files has
to be told the sampling rate etc. by means of command line flags or
something in the GUI. So, I wouldn't advise using raw files, but
such things do exist. In fact, I've got some old data files in this
format, from a time when there weren't any standards and I couldn't
assume that I would again have the particular header system with which
they were created.
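
To illustrate the point about having to tell the software the parameters,
here is a minimal sketch in Python that wraps headerless PCM in a wav
container using the standard-library wave module. The function name
wrap_raw_pcm and the parameter values in the usage note are hypothetical;
the sketch assumes linear PCM that is already in the byte layout wav
expects (little-endian, signed for 16-bit samples), since the raw data
itself cannot tell us otherwise.

```python
import io
import wave


def wrap_raw_pcm(raw_bytes, sample_rate, sample_width, channels):
    """Wrap headerless linear PCM in a wav container and return the bytes.

    Every parameter must come from the caller: nothing in the raw data
    records the sampling rate, sample width (in bytes) or channel count.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)   # bytes per sample, e.g. 2 for 16-bit
        w.setframerate(sample_rate)
        w.writeframes(raw_bytes)       # samples are written unchanged
    return buf.getvalue()
```

For example, wrap_raw_pcm(data, 16000, 2, 1) would treat data as 16 kHz,
16-bit, mono. If those values are wrong the file will still be written,
it will just play back at the wrong speed or as noise, which is exactly
why raw files are risky.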

>Wav is proprietary right? Owned by IBM? Is there an open-source equivalent?
Wav is a Microsoft format (developed jointly with IBM). It is actually a
special case of the RIFF format (also Microsoft/IBM), which is a derivative
of the Interchange File Format (IFF) created by Electronic Arts. Wav is
proprietary in the sense
that it was promulgated by Microsoft and is not, as far as I know, an ISO
standard. However, it is fully published and Microsoft probably does not
own anything other than the trademark, meaning that you couldn't define
a different format and give it the same name, but that you can implement
the format without a license from Microsoft. Much software that reads and
writes wav files is open-source, e.g. libsndfile.

I'm not aware of any strictly open-source audio file format, that
is, one created by a non-commercial entity and standardized, but
wav, snd/au, and aiff are close enough in practice that I don't think
it matters. Ogg/Vorbis is an open-source audio file/data format for
compressed audio, intended as a replacement for MP3, which really
is proprietary. 

File formats suitable for linguistic data are so simple that a new one could
easily be created if there were a need for it. Really all one needs is
a header containing:

(a) file format identifier
(b) sampling rate
(c) resolution
(d) audio format (if anything other than linear PCM is permitted)
(e) endianness, if not fixed by the spec
(f) channels

Not strictly necessary, but appreciated by programmers on some OSs:

(g) size of following data chunk

One might also include a field of arbitrary length for metadata such
as speaker name, language, record of processing, etc.

Choose which fields to include, decide what order to put them in
and how many bytes are needed for each, and you've defined a new
file format that can easily be implemented. The reason that some of
the commercial formats, like wav, are so complicated is that they
include all sorts of stuff that is of interest only, as far as I can
see, to the entertainment industry. We have no need for playlists
and looping instructions or embedded synthesizer commands.
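
As a rough illustration of how little is involved, the header fields
(a) through (g) listed above could be laid out as follows in Python with
the standard struct module. Everything specific here is an assumption
made up for the sketch, not an existing format: the "LPCM" identifier,
the field order, the field widths, and the function names.

```python
import struct

MAGIC = b"LPCM"  # hypothetical format identifier, field (a)

# Hypothetical fixed layout; the header itself is always little-endian:
#   4s magic | I sampling rate | H resolution in bits | H audio format
#   B endianness of sample data (1 = little) | B channels | Q data size
HEADER = struct.Struct("<4sIHHBBQ")


def pack_header(sample_rate, bits, channels, data_size,
                audio_format=0, little_endian=True):
    """Build the header; audio_format 0 stands for linear PCM here."""
    return HEADER.pack(MAGIC, sample_rate, bits, audio_format,
                       int(little_endian), channels, data_size)


def unpack_header(blob):
    """Parse the header from the start of a file's bytes."""
    magic, rate, bits, fmt, le, channels, size = HEADER.unpack(
        blob[:HEADER.size])
    if magic != MAGIC:
        raise ValueError("not a file in this format")
    return {"sample_rate": rate, "bits": bits, "audio_format": fmt,
            "little_endian": bool(le), "channels": channels,
            "data_size": size}
```

A file would then be pack_header(...) followed by the sample data. With
this particular layout the header is 22 bytes. A variable-length metadata
field could be added by storing its length in one more fixed-width field
and appending the text after the header.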


More information about the Resource-network-linguistic-diversity mailing list