[Corpora-List] Re: transcribing video corpora

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Nov 15 16:21:32 UTC 2006


CHILDES/CHAT does actually now have a lot of support for XML (and Unicode). The CHILDES collection (originally in CHAT format) is nowadays also made available as XML using (principally) the familiar u and w tags. Software for converting between the two formats is provided on Talkbank (it's not easily findable from CHILDES) at http://talkbank.org/software.html , though I've never made use of it myself.

The CHAT manual states in a couple of places that the definition of CHAT is based on an underlying XML schema; it seems to be the one found here: http://www.talkbank.org/talkbank.xsd . 

The CHILDES website used to have a very useful little gizmo that used stylesheets (I presume) to render XML files as a more readable HTMLised version of CHAT. It seems to have gone now, however.
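
For anyone wondering what that sort of rendering involves, here is a rough sketch in Python rather than a stylesheet. It is not the CHILDES tool and not the Talkbank converter. The sample document, the "who" attribute, the namespace URI and the function names are all my own assumptions for illustration; real files following the schema carry a good deal more structure. It simply walks the u and w elements and prints one CHAT-like line per utterance.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample. The "who" attribute and the
# namespace URI are assumptions; real TalkBank XML has far more structure.
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="CHI"><w>more</w><w>cookie</w></u>
  <u who="MOT"><w>you</w><w>want</w><w>more</w></u>
</CHAT>"""


def local_name(tag):
    """Drop any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def render_chat_like(xml_text):
    """Return one '*SPEAKER:<tab>words' line per <u> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "u":
            continue
        # Collect the text of every <w> nested anywhere inside this <u>.
        words = [
            (w.text or "").strip()
            for w in elem.iter()
            if local_name(w.tag) == "w"
        ]
        speaker = elem.get("who", "UNK")
        lines.append("*%s:\t%s" % (speaker, " ".join(words)))
    return lines


if __name__ == "__main__":
    print("\n".join(render_chat_like(SAMPLE)))
```

Matching on local tag names keeps the sketch independent of whichever namespace the files actually declare. It ignores utterance terminators, dependent tiers and media links, so the output is only CHAT-like, not valid CHAT.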

Andrew.

Andrew Hardie
Department of Linguistics
Bowland College
Lancaster University
Lancaster LA1 4YT
United Kingdom
 
a.hardie at lancaster.ac.uk



-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On Behalf Of Martin Wynne
Sent: 15 November 2006 15:17
To: James_L._Fidelholtz
Cc: Alex Boulton; corpora at uib.no
Subject: Re: [Corpora-List] Re: transcribing video corpora

Saying that the CHILDES system is "basically ASCII" doesn't tell much of 
the story (as well as begging lots of questions about non-English texts, 
Unicode compliance, etc....).

I think anyone considering this route should think very carefully. Using 
an annotation system such as CHAT, which is not conformant to open 
standards, and which requires specific software to use the texts, can 
mean that the usability of the data is very restricted. Some of the 
software is open source and available under a GNU licence, but not all, 
as far as I can see. CHAT is a de facto standard for a few communities 
of linguists, but not for the vast majority of researchers who might 
want to use language resources, and is not even widely known in 
mainstream corpus linguistics. CHAT-encoded texts cannot easily be used 
with generic software that deals with texts or the other data streams in 
multimedia data. To put it simply, with XML files you can use style 
sheets, web browsers, and more sophisticated programs made available via 
web services, and with CHAT files you can't. Finally, the reliance on 
the CHAT software means that it is not a format which is appropriate for 
the long-term preservation of the data.

While using the CHILDES transcription system may appear to be a viable 
route for data development because of the current availability of tools, 
guidelines and a lively user community, choosing this route will block 
the majority of potentially interested researchers from using the data, 
and restrict the ways in which it can be exploited. Unless there now 
exist some migration tools from CHAT to a sensible form of XML (or some 
more standards-based system), I wouldn't recommend this route. Can anyone 
shed more light on the migration facilities?

Martin

James_L._Fidelholtz wrote:
> Alex Boulton wrote:
> <<Does anyone know of any free tools which help with transcription of 
> video corpora? What we would ideally like would be a kind of video 
> version of Transcriber (WinPitch is a bit complicated for our needs), 
> ie which allows multimedia alignment of transcription, sound & video, 
> plus the usual tools for annotating etc.>>
> Hi, Alex,
> Check out the CHILDES site. They have all sorts of transcription aids, 
> as well as analysis tools (as long as you transcribe in their system, 
> which is basically ASCII (I'm sure it's updated by now to permit 
> slightly more 'elegant' transcriptions, ie ANSI)). The tools are quite 
> useful. (Child Language Data Exchange System: childes.psy.cmu.edu/)
> Don't let the 'child language' label fool you: the system is quite 
> versatile and general (and, of course, includes video capabilities of 
> the sort you are looking for). Now, I have not entered deeply into 
> this system, being an oldie but fogey, but my wife (who works on child 
> language) uses it and swears by it.
> Jim
> James L. Fidelholtz
> Posgrado en Ciencias del Lenguaje, ICSyH
> Benemérita Universidad Autónoma de Puebla     MÉXICO
>
>


-- 
Martin Wynne
Head of the Oxford Text Archive and
AHDS Literature, Languages and Linguistics

Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275
martin.wynne at oucs.ox.ac.uk


