extract plain text from CHAT ?

Frank Binder fbinder at eva.mpg.de
Fri Aug 1 10:46:08 UTC 2008


Dear Brian and Leonid,

thanks for your replies.

To take up your example: what I would like to have is something like the
following (built by hand from Leonid's mail):

export +t*chi +t%spa -s$con* sample.cha +d +d3

From file <sample.cha>
yeah
yeah
mommy
what's that
neat chalk chalk


Ideally this could then automatically be sent to a file that's named

sample.cha.CHI.mainTier.txt (*)

when running over a collection of CHAT files.


And yes, Brian, I know that I cannot use this further in CLAN, but that 
is precisely what I want. With a text file that contains just lines like 
the five above, where each utterance is on one line with no additional 
codes (or line breaks) and the file contains nothing but tier contents, 
many things could be done.
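
As a rough illustration of the target format, here is a minimal Python sketch that flattens kwal output like Leonid's second example into one utterance per line. It is not a CLAN feature and not a full CHAT parser: the function names, the terminator set, and the handling of bracketed codes are assumptions made for this example, and it assumes every utterance fits on a single "*CHI:" line (no continuation lines).

```python
import re
from pathlib import Path

# Assumption: utterance terminators are limited to these three symbols.
TERMINATORS = {".", "?", "!"}

# A shortened copy of the kwal output quoted in this thread.
KWAL_SAMPLE = """From file <sample.cha>
*CHI:    yeah .
%spa:    $res:sel:ve $des:tes:ve
*CHI:    what's that ?
%spa:    $imi:sel:ve
*CHI:    neat chalk chalk .
*CHI:    .
"""


def flatten_main_tier(kwal_output, speaker="CHI"):
    """Return the speaker's main-tier utterances as plain strings."""
    prefix = f"*{speaker}:"
    utterances = []
    for line in kwal_output.splitlines():
        if not line.startswith(prefix):
            continue  # skips dependent tiers and the "From file" header
        text = line[len(prefix):]
        text = re.sub(r"\[[^\]]*\]", " ", text)  # drop codes such as [+ Q]
        words = [w for w in text.split() if w not in TERMINATORS]
        if words:  # skip utterances that are empty after cleaning
            utterances.append(" ".join(words))
    return utterances


def export_name(chat_file, speaker="CHI"):
    """Hypothetical naming scheme, modelled on the file name suggested above."""
    return f"{Path(chat_file).name}.{speaker}.mainTier.txt"


if __name__ == "__main__":
    print(export_name("sample.cha"))
    print("\n".join(flatten_main_tier(KWAL_SAMPLE)))
```

Special forms such as xxx are left untouched here; whether they should be dropped is exactly the kind of decision that CLAN's shared options would settle more consistently than a hand-written script.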

Many software packages for computational language processing take input 
of that kind. Just one prototypical example is Ted Pedersen et al.'s 
NGRAM statistics package [1]. Hence, an export to "plain flat text" from 
CHAT could be passed directly to their tokenizer [2]. No self-made CHAT 
parser would be necessary. (**)
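
To show why one utterance per line is convenient for this kind of processing, here is a small Python sketch that counts bigrams over such lines. It does not reproduce the NSP tokenizer: splitting on whitespace and counting only within an utterance are assumptions of this example, and the function name is made up.

```python
from collections import Counter


def bigram_counts(utterances):
    """Count adjacent word pairs within each utterance (whitespace tokens)."""
    counts = Counter()
    for utterance in utterances:
        tokens = utterance.split()
        # zip pairs each token with its successor; single words yield nothing
        counts.update(zip(tokens, tokens[1:]))
    return counts


if __name__ == "__main__":
    lines = ["yeah", "yeah", "mommy", "what's that", "neat chalk chalk"]
    for pair, n in bigram_counts(lines).items():
        print(n, " ".join(pair))
```

Counting within utterances keeps pairs from spanning two separate utterances; a real n-gram package may make a different choice here.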

Furthermore, with such a format, under Mac/Linux/Unix operating systems 
one could easily count word tokens and utterances via 'wc', select 
utterances matching a pattern via 'grep', 'sort' things, and list 
distinct items with 'uniq', etc. Okay, that may be no real added value, 
since we could do these things in CLAN. But it helps with some really 
quick checks for those who are familiar with the Unix command line 
and/or the GNU Coreutils [2].
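
For readers without those tools at hand, the same quick checks can be sketched in Python over a list of flat utterance lines. The function name and return shape are invented for this example; the comments only name the command-line tool each step roughly corresponds to.

```python
import re
from collections import Counter


def quick_checks(utterances, pattern):
    """Return utterance count, token count, pattern matches and frequencies."""
    n_utterances = len(utterances)                       # roughly `wc -l`
    n_tokens = sum(len(u.split()) for u in utterances)   # roughly `wc -w`
    matches = [u for u in utterances if re.search(pattern, u)]  # roughly `grep`
    # roughly `sort | uniq -c | sort -rn`
    frequencies = Counter(utterances).most_common()
    return n_utterances, n_tokens, matches, frequencies


if __name__ == "__main__":
    lines = ["yeah", "yeah", "mommy", "what's that", "neat chalk chalk"]
    print(quick_checks(lines, r"chalk"))
```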

Do not get me wrong: I am not calling for data to be moved away from 
CLAN into text file collections that are probably more difficult to 
manage. I think CLAN is an excellent platform for building annotated 
corpora for language acquisition research. Most importantly, it allows 
for consistency checks, sound file linking, various automated or 
assisted annotation procedures, etc. And above all it allows the data to 
be shared in a consistent format. But as it is difficult to export the 
data into some "plain, simple, flat working format", some people might 
not consider CLAN (or the available data) for their use.


Just thinking ...

All the best,

Frank





(*) Selected tiers would probably need to be identified in the names of 
the exported text files. It would be cool, for instance, if one could 
export the %mor tier in a similar way, say, to have a file of lemmas. 
Note that I have removed utterance terminators. I would also remove any 
codings for retracings, transcribers' comments, etc. And I think that 
you guys "must have" already implemented all this =o) for any of the 
stats within CLAN. =o)


(**) Self-made CHAT parsers are very likely to introduce all kinds of 
problems and inconsistencies with respect to how CLAN's built-in tools 
parse the files (and do their stats and checks).


[1] http://www.d.umn.edu/~tpederse/nsp.html

[2] http://search.cpan.org/dist/Text-NSP/doc/README.pod#3._The_Tokenization_Process:

[2] http://www.gnu.org/software/coreutils/manual/html_node/index.html




Leonid Spektor wrote:
> Frank,
> 
>     You can use the kwal program to extract any tiers using the +/-t option
> and/or +/-s to extract only certain codes. I am using the sample.cha file
> that comes with every CLAN and is located in the clan/lib/samples folder.
> Here is the command to extract only *CHI tiers:
> 
> kwal +t*chi +t%spa sample.cha +d +d3
> 
> From file <sample.cha>
> *CHI:    yeah . [+ Q]
> %spa:    $RES:sel:ve $DES:tes:ve
> *CHI:    yeah . [+ Q]
> %spa:    $RES:sel:in $DES:tes:non
> *CHI:    Mommy .
> %spa:    $RFA:sel:non $DES:sel:non $INI:sel:non
> *CHI:    what's that ? [+ I]
> %spa:    $IMI:sel:ve $CON:sel:in
> *CHI:    neat chalk chalk .
> %spa:    $CON:sel:in $RES:tes:in
> *CHI:    xxx .  [+ V]
> 
> Here is the command to extract only *CHI tiers and to exclude any $CON
> codes from the data:
> 
> kwal +t*chi +t%spa -s$con* sample.cha +d +d3
> 
> From file <sample.cha>
> *CHI:    yeah . 
> %spa:    $res:sel:ve $des:tes:ve
> *CHI:    yeah . 
> %spa:    $res:sel:in $des:tes:non
> *CHI:    mommy .
> %spa:    $rfa:sel:non $des:sel:non $ini:sel:non
> *CHI:    what's that ?
> %spa:    $imi:sel:ve
> *CHI:    neat chalk chalk .
> %spa:    $res:tes:in
> *CHI:    . 
> 
> 
> This output is legal CHAT format and can be used as input to any of the
> CLAN programs. Also, the output is in UTF-8 Unicode plain text encoding and
> can be opened by any text editor that can decode UTF-8 encoded text files.
> 
> Leonid.
> 
> 
> 
> On 31-07-08 08:17, "Frank Binder" <fbinder at eva.mpg.de> wrote:
> 
>> Dear chibolts,
>>
>> there is this simple question that for some reason nobody asks, but I
>> am feeling lucky today ...
>>
>>
>> Do you know of any (CLAN?) tool that can extract data from CHAT
>> files, such as the main tier contents, and export it to "plain"
>> (unicode) text?
>>
>>
>> That is, I am looking for a tool that removes the CHAT from the CHAT.
>> Ideally this would support CLAN's shared options - such as +R +S +T etc.
>> -  to select speakers and include or exclude certain annotations and
>> symbols/punctuation. Although this would probably be a one-way ticket,
>> it seems needed sometimes.
>>
>> Also, if there's no such tool, any suggestions or experience on how to
>> do it?
>>
>> Thanks in advance and best regards,
>>
>> Frank
>>
>>
>>
