A query...

Brian MacWhinney macw at cmu.edu
Tue Oct 24 22:34:31 UTC 2006


Dear Miriam,

     You ask some very good questions.  Mark-up is certainly an  
important consideration.  The TalkBank approach to this
has been to develop a highly structured and heavily semantic  
underlying XML mark-up language called CHAT.  During the last 12  
years, CHAT has been extended to include CA, ISL, Discourse  
Transcription, and four other coding systems. We are currently  
working on translators from LDC formats such as SLX and Switchboard.   
The underlying XML Schema is published at
http://www.talkbank.org/talkbank.xsd and documented in readable English
in the manuals that are available from childes.psy.cmu.edu.  All of the
data in CHILDES and TalkBank, which come from some 12 different
disciplines, including parts of sociolinguistics, are in CHAT.  These
CHAT files can be automatically converted to XML by a program,
validated, and then
reformatted back to CHAT to make sure that they are the same as the  
originals (round-tripping).
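
As a rough illustration of the round-trip check described above, here is
a minimal sketch in Python.  It uses a toy format of speaker-prefixed
lines, not the real CHAT format or the published TalkBank schema; the
element name "u", the attribute "who", and the functions to_xml,
from_xml, and round_trips are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET


def to_xml(text):
    """Convert toy '*SPK:<tab>utterance' lines to an XML string."""
    root = ET.Element("transcript")
    for line in text.splitlines():
        # A line without the ':<tab>' separator raises ValueError,
        # which serves as a crude validation step in this sketch.
        speaker, utterance = line.split(":\t", 1)
        u = ET.SubElement(root, "u", who=speaker.lstrip("*"))
        u.text = utterance
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_string):
    """Rebuild the toy line format from the XML string."""
    root = ET.fromstring(xml_string)
    return "\n".join(
        f"*{u.get('who')}:\t{u.text or ''}" for u in root.iter("u")
    )


def round_trips(text):
    """True if text -> XML -> text reproduces the original exactly."""
    return from_xml(to_xml(text)) == text


sample = "*CHI:\tmore cookie .\n*MOT:\tyou want more ?"
print(round_trips(sample))
```

A file that fails this comparison has lost or changed something in
conversion, which is the point of round-tripping.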
   The CHILDES and TalkBank databases include perhaps 40 languages,  
many with non-Roman orthographies (Thai, Chinese, Japanese) and lots  
and lots of different types of people.  The tools include methods for  
linking transcripts to audio and video which allow for playback from  
individual sentences both locally and over the web.
    Should the various projects you mention that are emerging in  
North Carolina and Philadelphia use these tools and formats?
Obviously, I am biased.  But one has to ask the simple question:  why  
not?
    Some people seem to confuse the issue of metadata with specific  
transcription markup.  TalkBank and CHILDES use the OLAC metadata  
format.  However, they are also included in the MPI IMDI format.
But settling on these metadata formats is really not the core issue.   
The core issue is transcription.  When I look at the various  
databases emerging in field linguistics, I see none that have treated  
transcription as a structured object.  Instead, the idea is typically  
to post some PDF or Word files on the web.  Even though this format  
is far from optimal, when these documents are accompanied by audio,  
they do at least provide a great community resource.  But we can and  
should do better.  I have spent hours trying to rework a PDF into a  
CHAT file.  It would make a lot of sense to put the structure in  
during the process of transcription to allow for the full power of  
computational linguistics.
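
To make "transcription as a structured object" concrete, here is a small
hedged sketch in Python.  The Utterance class, its field names, and the
helper functions are hypothetical and are not taken from CHAT or the
TalkBank schema; the sketch only shows the kind of query and media
linking that becomes possible once each utterance carries a speaker and
time offsets, which a PDF or Word file cannot offer directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    speaker: str   # speaker code, e.g. "CHI"
    text: str      # transcribed words
    start_ms: int  # offset into the media file where the utterance begins
    end_ms: int    # offset where it ends (exclusive)


def by_speaker(utterances, speaker):
    """Return all utterances produced by one speaker."""
    return [u for u in utterances if u.speaker == speaker]


def playing_at(utterances, t_ms):
    """Return the utterance whose time span contains t_ms, or None."""
    for u in utterances:
        if u.start_ms <= t_ms < u.end_ms:
            return u
    return None


transcript = [
    Utterance("CHI", "more cookie .", 0, 1200),
    Utterance("MOT", "you want more ?", 1200, 2600),
]

hit = playing_at(transcript, 1500)
print(hit.speaker if hit else "no utterance at this time")
```

The same time offsets can be handed to a media player to start playback
at an individual sentence.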

--Brian MacWhinney


On Oct 24, 2006, at 5:06 PM, Miriam Meyerhoff wrote:

> At the risk of returning this discussion to general topics of  
> discussion (rather than personal proclamations of what  
> outstandingly caring and responsible researchers we are as  
> individuals -- of course we are, this is Funknet, right? -- and  
> whether or not untenured members of our community are or are not  
> paranoid about how much time they ought to spend on peer-reviewed  
> papers vs creating web-based archives) ...
>
> I was interested in the off-hand way in which the emergence of  
> different archiving systems was glossed over in the debate. Someone  
> (Dan Everett, I believe -- forgive me if I am misattributing, the  
> thread was very long by the time I joined it) made some comment to  
> the effect that they would prefer it if funding were given to  
> thoroughly document (and archive through to public access) fewer  
> languages than to document in less open archives a larger number of  
> languages.
>
> I'm interested by this for several reasons. One is that I have  
> started to get the impression that the very limited NSF funding for  
> linguistics is doubling-up on different archiving systems. My own  
> area of research is sociolinguistics, and I am dismayed when I see  
> funding going on digitising different sociolinguistics archives to  
> different standards when so much basic research in sociolinguistics  
> is left unfunded. We have standards or systems emerging in North  
> Carolina, Philadelphia, to say nothing of the International Corpora  
> of English which do not (sadly) all adhere to the same mark-up  
> norms. In Oceanic linguistics (my other research interest) there is  
> the excellent PARADISEC archive which has been set up, but the  
> discussants on this list are clearly thinking of many others, and  
> Helen Dry and Anthony Aristar have been trying to lead with  
> archiving and mark-up standards for years.
>
> Is it being too unbearably cynical to suggest that people are  
> pursuing their own archive projects because this suits the current  
> priorities/worries of funding agencies (and, not coincidentally,  
> enhances our own professional standing or mana), rather than  
> because it best serves the immediate and long-term goals of the  
> user groups (whether speakers of these languages or linguists)?
>
> The example of the Jesuit grammars was raised early in the piece --  
> I have no experience whatsoever with these, so I will simply take  
> it as writ that they are exemplary -- but surely these guys did not  
> have a standardised format that they presented data in? If they  
> did, or to the extent that they did, surely the standard was  
> something more like the "archiving" standard adopted by Malcolm  
> Ross, Andy Pawley and Darrell Tryon at Pacific Linguistics years  
> ago: if you go to a Pacific Linguistics grammar now, you know what  
> to expect to find in section 4.3.2 and you know what to expect to  
> find in section 4.3.2.1. etc. etc.
>
> No, I know we don't have easy access to the authors' original  
> notebooks or recordings in all cases so we can't check where they  
> have perhaps made honest category errors (though -- by the way --  
> PARADISEC does make written records and recordings available...).  
> But notebooks are bloody good ways of archiving data (Peter  
> Ladefoged's name has been invoked in this discussion and he was  
> quite clear in the last few years that hard copy is absolutely  
> essential for sustaining further research). And yes, I agree that  
> there are some things we can and should be more forthcoming about  
> sharing with the academic community more widely. But I'm sorry,  
> people, the recording of the woman telling me about her rape -- you  
> can't have that. Not because I promised her the conversation was  
> private, but because it is quite simply not my story to share. But  
> sure, the argument about who should have won the beauty contest ...  
> when I have time, because she understood the recordings would be  
> used for academic research. But I hope that is not time that is  
> funded at the expense of some energetic, and fresh-minded new  
> researcher in the field, whose work will challenge me and mine.
>
> In short... my point is: I disagree with the idea that the extremely  
> limited funding to linguistics should go principally to projects  
> feeding labour-intensive digital archiving. Yes, it would be lovely  
> if there were more and larger grants in linguistics so we didn't  
> have to make this kind of choice. But at the moment we do and I  
> think we would be doing our community a dis-service if we backed  
> the Big Few at the expense of the Small Many.
>
> And no, I have nothing to do with PARADISEC, but their web page is  
> here if you don't know about their enterprise and would like to  
> learn more: http://paradisec.org.au/
>
> best,  Miriam
> -- 
> Miriam Meyerhoff
> Professor of Sociolinguistics
> Linguistics & English Language
> University of Edinburgh
> 14 Buccleuch Place
> Edinburgh EH8 9LN
> SCOTLAND
>
> ph.: +44 131 650-3961/3628 (main office) or 651-1836 (direct line)
> fax: +44 131 650-6883
>
> http://www.ling.ed.ac.uk/~mhoff
>
>


