Labelling and metadata

Aidan Wilson aidan.wilson at SYDNEY.EDU.AU
Tue May 4 06:46:39 UTC 2010


Hi Gwen,

I'd avoid using periods as much as spaces, more so in fact. One of the less
critical reasons spaces are avoided is that on Unix systems, spaces are
often part of the syntax of commands, so they have to be 'escaped' so that
they don't affect how the command runs. Periods, I find, tend to confuse
Windows, which seems to treat everything after the first period as being
the file extension, so in your example, if you double-clicked on
KS2009.09.19.01.wav, Windows might tell you that it doesn't know how to
handle filetype 09.19.01.wav (as if the whole thing were a file
extension). I get exactly this happening with Kirrkirr, when I try to
install a new dictionary and have to point Kirrkirr to the
dictionary.properties file.
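
To make the difference concrete, here is a minimal Python sketch. It only
contrasts two ways a program could pick out the extension: splitting at the
last period or at the first. Which rule Windows or Kirrkirr actually applies
is an assumption on my part, not something the code can tell you.

```python
import os.path

name = "KS2009.09.19.01.wav"

# Conventional rule: the extension starts at the LAST period.
stem, ext = os.path.splitext(name)

# Naive rule: everything after the FIRST period counts as the extension.
naive_stem, _, naive_ext = name.partition(".")

print(stem, ext)              # KS2009.09.19.01 .wav
print(naive_stem, naive_ext)  # KS2009 09.19.01.wav
```

With underscores or hyphens instead of periods, both rules give the same
answer, which is the safer position to be in.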

You may also want to reserve delimiters for separating pieces of information.
The entire date in your example is, I find, a single piece of information.
Peter, I think, said earlier in this discussion that dates of the format
20090919 can be difficult to read - I have no problem with it myself - so
perhaps try something like:
KS_2009-09-19_01.wav
where hyphens join the parts of the date and underscores separate the
pieces of information.
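
One practical payoff of keeping the two delimiters apart, sketched in
Python. The function name parse_name and the field names are made up for
illustration, and the sketch assumes every name has exactly three
underscore-separated fields.

```python
from datetime import date

def parse_name(filename):
    """Split a name like KS_2009-09-19_01.wav into its fields."""
    stem, _, ext = filename.rpartition(".")
    prefix, day, item = stem.split("_")   # underscores separate the fields
    year, month, dom = day.split("-")     # hyphens stay inside the date
    return {
        "prefix": prefix,
        "date": date(int(year), int(month), int(dom)),
        "item": int(item),
        "ext": ext,
    }

info = parse_name("KS_2009-09-19_01.wav")
print(info["prefix"], info["date"].isoformat(), info["item"], info["ext"])
```

Because the date never contains an underscore, a single split on "_"
recovers the fields without any guessing about where the date ends.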

Generally relating to this whole thread, filenames are a constant 
difficulty. And as if it weren't bad enough for individuals, we get the 
problem of having different filename conventions from one archive to 
another, which are invariably different again from the depositor's 
convention. The best bet is to have something transparent and fairly 
semantic, but also computationally sound (allow for the possibility that 
someone may want to perform regular pattern matches on your names - be 
consistent) and never, ever treat your filenames as a substitute for
collecting metadata!!
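
As a rough illustration of the pattern-matching point, the Python sketch
below checks a handful of names against one regular expression. The pattern
encodes the KS_2009-09-19_01.wav layout suggested above; the two-letter
prefix, the two-digit item number and the list of names are assumptions for
the example, not a recommended standard.

```python
import re

# Hypothetical convention: PREFIX_YYYY-MM-DD_NN.wav
PATTERN = re.compile(r"[A-Z]{2}_\d{4}-\d{2}-\d{2}_\d{2}\.wav")

names = [
    "KS_2009-09-19_01.wav",
    "KS_2009-09-19_02.wav",
    "KS2009.09.19.01.wav",   # periods as separators
    "KS_2009_09_19_03.wav",  # underscores inside the date
    "ks_2009-09-19_04.wav",  # lower-case prefix
]

# Anything that does not match the whole pattern breaks the convention.
nonconforming = [n for n in names if not PATTERN.fullmatch(n)]
print(nonconforming)
```

A check like this only works if the names are consistent to begin with, and
it says nothing about speakers, places or languages: that still belongs in
the metadata.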

-- 
Aidan Wilson

The University of Sydney
+612 9036 9558
+61428 458 969
aidan.wilson at sydney.edu.au

On Mon, 3 May 2010, Gwendolyn Hyslop wrote:

> Dear RNLDers,
>
>
>
> Thanks so much for this discussion; I have appreciated following it. I'd
> like to add to it by asking one small question. There has been discussion
> about not using capital letters or spaces in file names, but I am wondering
> about the use of periods (.) as a way to separate information in file names,
> as opposed to an underscore (_) or hyphen (-). In other words, is there any
> known problem with doing something like: KS2009.09.19.01.wav, as opposed to
> KS2009_09_19_01.wav?
>
>
>
> Best,
>
> Gwen
>
>
>
> =====
>
> Gwendolyn Hyslop
>
> Department of Linguistics
>
> 1290 University of Oregon
>
> Eugene, OR 97403, USA
>
> +1-541-505-1594 (USA)
>
> +975-1776-2177 (Bhutan)
>
> http://www.uoregon.edu/~glow
>
>
>
>
>
>
>
> From: Peter Austin [mailto:pa2 at soas.ac.uk]
> Sent: Monday, May 03, 2010 10:28 PM
> To: Alex Francois
> Cc: Resource-Network-Linguistic-Diversity; munanga at bigpond.com
> Subject: Re: Labelling and metadata
>
>
>
> Alex
>
>
>
> Here's a possible solution.
>
>
>
> In your Excel metadata sheet add a column called 'file' and for each row go
> to the Insert Menu > Hyperlink and you can insert a hyperlink to the
> relevant file (with its short unique ID) -- in Windows Control-K is the
> shortcut for this. Save your Excel file. When you click on this cell now the
> file will be opened by the software that you have associated with it (eg.
> 2010-05-04.prj will open with Toolbox etc -- you might get a warning from
> Excel about the dangers of opening the file but that's just Microsoft being
> helpful).
>
>
>
> So, Excel has a function that does what you said you do manually:  "try and
> identify the string of digits which I'm looking for, write it down, then try
> and access the recording among hundreds of files, essentially in a
> non-automatic way". This is the solution David Nathan was alluding to in his
> post. (You can also do this in HTML if you prefer that over Excel.)
>
>
>
> If you set up your Excel columns with the semantic fields that are useful to
> you (ie. the ones you listed out for us) then simply sort on whatever column
> you like, find what you're after and click the hyperlink to open the file.
> You can easily add new columns, like "Have you transcribed this file yet?"
> or "Date of last checking of this file" etc. and then use them to sort and
> access your data files.
>
>
>
> As David suggested in his post, this is an information management system
> solution to the problem. The SIL promise-ware that was mentioned in an
> earlier post is a packaged application solution.
>
>
>
> I think this kind of "dirty laundry confessions" is really useful for us to
> share experiences and solutions that work for each of us, so thanks Alex.
> It's the kind of "bottom up" development of good practice ideas that I find
> valuable from a forum like this one.
>
>
>
> Best,
>
> Peter
>
>
>
> On 4 May 2010 14:43, Alex Francois <Alexandre.Francois at vjf.cnrs.fr> wrote:
>
> dear Greg, dear all,
>
> Useful thread indeed.
> I am especially curious about the contrast suggested in the earlier
> discussion, between trying to include semantics in filenames, vs using
> opaque filenames and then searching a database.
>
> The reason is, during the last decade, I have experienced the two ends of
> the spectrum, and I'm not sure where I should stand now.
>
> For many years, I had taken the habit of naming my audio files with
> maximally informative (and therefore rather long) names, such as:
>
> *	BD04-24 Veraa Harold ch Jesus mtp-vrs.wav
> *	DD04-13 Lovoko Mamuli leg Laperus2 tnm.wav
> *	ED10-30 Yaqane Edwad-Bilis conv chamanisme hiw.wav
>
> [NB:  At that time I would use spaces in filenames (I'm not doing this
> anymore), but this can be easily changed to underscore with some file
> utility.  Sometimes I even used non-Ascii characters, I confess! ]
>
> These file names would begin with a unique alphanumerical ID, so that the
> chronological order of recordings would be easily retrieved by automatic
> sorting.  The other reason for starting a long file name with an id, was
> that, should some software truncate the filename to the first 8 characters,
> it would still remain unique.
> Here is how my (customised) system worked:
>
> *	first letter is a code for a whole collection = a single fieldtrip
> [A for my first fieldtrip, B for my second.... F for my 6th];
> *	second letter is a code for the support (D for digital audio
> recording, P for photo, V for video...)
> *	then 2 digits for a subcollection (in the olden days this was the
> number of a minidisc);  This subcollection ID is also the name of the folder
> in the folder-tree.
> *	then hyphen plus 2 digits for item in this subcollection (never more
> than 99)
>
> and then the Homo Sapiens-friendly stuff came in:
>
> *	location of recording, spelled out - usually a village in Vanuatu:
> e.g. Veraa (=Vera'a, a village in Vanua Lava), Yaqane (a hamlet in Hiw);  or
> in the Solomons (Lovoko, Vanikoro);
>
> *	name of main speaker, spelled out
> ("Harold"; "Mamuli"; "Edwad-Bilis" as this was a conversation between two
> men);
> names also uttered in full in the recording itself.
>
> *	genre of recording, using a limited set of abbreviations:
> ch= chant (song), ct='conte' (tale), leg='legend', conv='conversation', etc.
>
> *	a very short title:
> "Jesus" (a church song on someone with a name like this);
> "Laperus2" (the legend of Lapérouse's wreckage - second version by same
> speaker that day);
> "chamanisme" (a conversation on shamanism);
>
> *	a 3-letter id for the language
> => Very useful as several languages can be spoken in the same village, and
> sometimes the very same person would tell me the same story in 2 different
> languages.
> e.g. tnm=Tanema, hiw=Hiw;  mtp-vrs= Mwotlap and Vurës, because this church
> song was exceptionally mixing the two languages.
> [I'm not using ISO codes because they are opaque, and poorly designed for my
> area; but the equivalence between the codes I use and ISO codes is made
> easily accessible in my publications & homepage
> <http://alex.francois.free.fr/AF-field.htm#Vanuatu>  anyway.]
>
> Admittedly some info is missing, e.g. my own name, or the date:  but the
> date is usually retrievable from the collection & subcollection, and I
> always uttered it orally in the recording itself. Maybe one day I should
> hardcode it in the filename.
>
> These (relatively) transparent long names have proven very useful to me as I
> was working on all these files, whether to transcribe them, compare
> different versions of similar stories, or whatever.  Because I have 1150
> different sound files in my corpus, it proved also convenient to perform
> automatic search queries on filenames, say, to easily retrieve all
> recordings with the same storyteller over the years, or to filter all
> recordings of the same language.  I don't know if I would recommend such a
> system (maybe not) but at least I found it convenient for myself: the file
> name says it all. The good thing was also that most of these filenames were
> easily interpretable to people other than myself, with a minimal amount of
> abbreviations or codes.  The initial id (BD04-24...) doesn't really need to be
> interpreted anyway (it's an id), but the village & speaker's names (+title)
> are explicit, and a simple Txt file can help make sense of language names or
> genres (and collections).  In parallel I've always used a spreadsheet for
> metadata, with full name of speaker, their age, precise location, date, full
> name, etc.
>
> And then a few years ago, I wanted to archive these hundreds of files into
> our open archive (LACITO's Archivage
> <http://lacito.vjf.cnrs.fr/archivage/presentation_en.htm> ).
> When they saw these long file names, our IT people were horrified.  They
> insisted that they should all be shortened to a simple id, as short as
> possible, getting rid of all the semantics.  They thought it would be much
> more convenient, or more elegant perhaps, to handle filenames like
> "AF03-05-02.wav" [AF03=my initials + 3rd field trip, etc.], coupled with
> some metadata file. Fair enough, they were surely right.  (my earlier use of
> spaces and occasionally non-Ascii was probably at fault, together with the
> sheer length of each string).
>
> So I created a copy of my 1150 audio files, and renamed them all (manually)
> with these elegant numbers, which are now opaque even to myself.  Took me
> ages (weeks? months?).  In parallel I would fill a metadata sheet for each
> item, and send it to the IT people for them to encode in Xml/Xsl format onto
> the server. (I didn't know Xml/Xsl/Php well enough to create the search
> interface myself.) This was several years ago, and it never became as
> convenient as I was hoping it would be. In fact a fair part of the metadata
> is still waiting to be format-converted & transferred to a new server,
> which was stopped halfway due to a shortage of funding... but this is another
> story.
>
> In the meantime, I now have my whole audio archives (37 Gb) in two versions:
> exactly the same sound files, but one set has the old filenames, one has the
> numbers. This is very silly, and was meant to be temporary, yet has lasted
> for some reason.
> Finally what happens is, every time I want to quickly retrieve a file from
> my archives, I basically have the choice between accessing the set of files
> with the long, transparent names which are visually readable, easily
> searchable, and instantly clickable
> ...  OR accessing my metadata spreadsheet, try and identify the string of
> digits which I'm looking for, write it down, then try and access the
> recording among hundreds of files, essentially in a non-automatic way.  Now
> guess which solution I end up choosing.  (*grin*)
>
> There's probably something I've done wrong (as always) but I'm still
> wondering what the ideal combination would be.  It seems that different
> usages (working on one's own files vs long-term archiving...) may warrant
> different decisions, but of course this is not a good answer to Greg.
> I am especially trying to identify the best procedure in terms of archiving
> for the future, and making access easy for other prospective users.
>
> regards,
> Alex.
>
>  _____
>
>
> Margaret Carew wrote:
>
> Useful thread, and I am now looking back at my various drives with one
> eyebrow raised...
>
> I'm wondering, what is the role of folders in all this?
>
> I have an almost well organised system of audio recordings that is in the
> main not archived (although carefully backed up!), from various years and
> places. I have established a folder for each year that has passed since I
> commenced recording in digital (ie. 2006 2007 etc). Within each of these
> year folders is a recording session folder with a name that includes the
> year and month (sometimes day) the place and the event or key topic. Within
> each of these secondary folders are the recordings that are part of that
> session, with a date, speaker and other semantic info (eg.
> 20100209_BP_kurdu_wita.WAV). The metadata files (marked up text files) are
> stored within each folder, and the name of the folder is entered as a field
> in the metadata.
>
> Like my erstwhile colleague Greg I'm probably closer to the hodge-podge end
> of things, doing lots of recordings with students, sometimes in a bit of a
> random fashion, multi-tasking like crazy, yet trying to keep some order in
> it. I'm now wondering whether the folder based system is going to be a
> problem when it comes to archiving - one thing that has popped up is the
> existence of these lots of folder based metadata files - this might need to
> be consolidated into one file.
>
> I might also add that I've become fond of using iTunes to make playlists of
> recordings - usually edited ones - and to use as a secondary database (a
> kind of partial mirror if you like). You can use the file info to point back
> to the folderised filenames as described. And it's great for making CDs for
> students of their recordings, to repatriate materials quickly etc. Also good
> for compiling files that will be used in a resource (eg. a set of clips for
> a voiceover). Am I committing an archiving crime by using iTunes in this way?
>
> Regards
>
> Marg Carew
>
>
> -----Original Message-----
> From: Claire Bowern [mailto:clairebowern at gmail.com]
> Sent: Tue 04/05/2010 00:49
> To: David Nathan
> Cc: Resource-Network-Linguistic-Diversity
> Subject: Re: Labelling and metadata
>
> David, that would work at the end of the documentation (in fact I'm
> doing something pretty close to that right now for One Arm Point
> School for Bardi stories) but while working on the collection, doing
> searches, transcribing, etc, I'm constantly using the underlying
> files, and I'm not sure that creating another layer of reference would
> solve the problem. It would be useful for managing collections where
> there are several numbering systems though (e.g. I have tapes that
> have 3 references - the AIATSIS archive tape number, the internal
> collection number, and the number they'd get if I put them in my
> scheme...)
> Claire
>
> On Mon, May 3, 2010 at 6:58 AM, David Nathan <dn2 at soas.ac.uk> wrote:
>
>
> Dear all
>
> About the filenames, there are some excellent suggestions in this
> thread, but I think that there is a tendency to conflate the function
> of filenames as identifiers with the functions that enable retrieval
> and access to resources. This conflation remains invisible only while
> we all keep imagining that documentation materials are merely "data" -
> without some genres, granularities, interface considerations etc. that
> relate to the presentation and usage of the resources. In that sense,
> you might think (even hypothetically) of the interface by which you
> might wish people to access them, and it is probably likely to be some
> kind of link. As those familiar with HTML and related technologies
> know, a link has a target as well as a "display text" (and other
> possible attributes in semantic web formalisms). Translating this back
> to one's local data management, there seems a good case for separating
> out the two functions mentioned above, and thinking about a simple
> linking system (that you can implement easily in spreadsheet pages, or
> HTML), and then the relevant considerations for what you want the
> "display text" to be - for yourself, and, quite possibly differently,
> for other users. This might help resolve out the different issues that
> are most relevant for each function in your contexts.
>
> best wishes
>
> David
>
> At 18:11 03/05/2010, you wrote:
>
>
> If you are going to include semantics in the file names can I make a plea
> that your labels are a little more transparent -- why not use:
>
> fm_2009_session10_audio.wav
> fm_2009_session10_video.wav
>
> rather than FM09_v10A ?? v could stand for "version" or "volume" or who
> knows what else, and, as for "A", well that's anyone's guess. Also, if the
> "09" is a year then write it as 2009 (one might even argue for "felicity"
> or "meakins" rather than "FM"). I recommend separators like _ as well, as
> Bill Poser did in his contribution to this discussion. Note also that if
> you have more than 99 video sessions you'll need the label to be:
>
> fm_2009_session010_audio.wav
>
> I think there are good reasons for being a little more explicit in file
> names if you want to put in some (useful) semantics like this -- after all
> YOU know what "FM" "09" "v" "A" mean but who else could guess? Compare that
> with:
>
> felicity_2009_session10_video.wav
>
> Best,
> Peter
>
>
> On 3 May 2010 18:19, Felicity Meakins <f.meakins at uq.edu.au> wrote:
> This is a good point, particularly if you use two recorders (e.g. audio
> recorder plus video camera) to record the same session. I use 'v' and 'a' to
> distinguish these. In this respect, it is the recording _session_ that's
> primary, not the actual recording.
>
> FM09_v10A
>
> FM=me
> 09=year (full date is in metadata)
> v=video
> 10=recording session
> A=part of recording session
>
> e.g. recording session may have taken place at X place but over two hours we
> recorded 3 stories A, B, C.
>
>
> On 3/5/10 6:13 PM, "Joe Blythe" <blythe.joe at gmail.com> wrote:
>
>
>
> The only two cents worth I'd like to add to this discussion is that I had to
> modify my numbering system to indicate whether the original
> recording was made with a video or dedicated audio recorder. I only mark the
> video ones as "vid".
>
> Thus video files might be
> 20100503JBvid01.mov
>
> Because you sometimes need to extract audio files from video files, such an
> extracted audio file would be
> 20100503JBvid01.wav
>
> This ensures that any files recorded on the same date from a dedicated audio
> recorder (e.g., 20100503JBv01.wav) don't end up with the same file name.
>
> Joe
>
> --
> Prof Peter K. Austin
> Marit Rausing Chair in Field Linguistics
> Department of Linguistics, SOAS
> Thornhaugh Street, Russell Square
> London WC1H 0XG
> United Kingdom
>
> web: http://www.hrelp.org/aboutus/staff/index.php?cd=pa
> -------------
> David Nathan
> Endangered Languages Archive
> SOAS
> -------------
>
>  _____
>
> Dr Alex FRANÇOIS
>
> LACITO - CNRS, France
>
> 2009-2011:  Visiting Fellow
>        Dept of Linguistics
>        School of Culture, History and Language
>        Australian National University
>        ACT 0200, Australia
>
>        http://alex.francois.free.fr
>
>
>
>
>
>

