CLAN: Text Extraction
Leonid Spektor
spektor at andrew.cmu.edu
Tue Aug 13 18:50:35 UTC 2024
I have changed FLO to not wrap long lines and updated everything on the web.
Leonid.
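
For output produced before this change, the workaround Brian describes in the quoted message (removing each line break plus tab) can be sketched in Python. This is only a sketch under assumptions: the function name `unwrap_flo_lines` is made up for illustration, the continuation lines are assumed to start with a single tab, and the break is replaced with one space rather than nothing, on the assumption that the wrap falls between words with no trailing space.

```python
import re


def unwrap_flo_lines(text: str) -> str:
    """Join wrapped continuation lines (line break + tab) onto the previous line."""
    # Normalize Windows (\r\n) and bare carriage-return (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: a line break followed by a tab marks a wrapped continuation.
    # Replace it (and any spaces around it) with a single space.
    return re.sub(r"[ \t]*\n\t[ \t]*", " ", text)


if __name__ == "__main__":
    wrapped = (
        "no it's not Mr Munsters it's only the Munsters what if the monsters won't\n"
        "\tbe on anymore and will be with other movie what if it's at with the other\n"
        "\tprogram.\n"
    )
    print(unwrap_flo_lines(wrapped))
```

Lines that do not start with a tab are left alone, so separate utterances stay on separate lines.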
> On Aug 13, 2024, at 09:37, Brian MacWhinney <macw at andrew.cmu.edu> wrote:
>
> Yes, I see the problem. The longest lines get wrapped and a tab is added. You could replace carriage return and tab \r\t with nothing. Better yet, Leonid may be able to fix this problem.
>
> — Brian MacWhinney
> Teresa Heinz Professor of Cognitive Psychology,
> Language Technologies and Modern Languages, CMU
>
>
>
>> On Aug 9, 2024, at 11:28 PM, Xiaowei Zhao <xiaoweizhao at gmail.com> wrote:
>>
>> Hello,
>>
>> First of all, sorry to pick up this conversation from so long ago!
>>
>> I am also trying to use the "flo" command to extract "clean" text from .cha files, and it works very well except for one small thing: it seems to automatically wrap any line that exceeds a certain length, breaking it into several lines.
>>
>> For example, for a file (060002c.cha) in the MacWhinney database, I run
>> flo +cr +t* 060002c.cha
>>
>> and for a long line in the original .cha file
>> "
>> *MAR: no (.) it's not Mr Munsters (.) it's only the Munsters (.) what if the monsters won't be on anymore and xxx will be with other movie (.) what if it's at with the other program .
>> "
>>
>> I got three lines
>> "
>> no it's not Mr Munsters it's only the Munsters what if the monsters won't
>> be on anymore and will be with other movie what if it's at with the other
>> program.
>> "
>> Is there any command, option, or switch within CLAN that avoids this and keeps each utterance on a single line? I tried "LONGTIER", but it did not work.
>>
>> Many thanks!
>>
>> Sincerely,
>> Xiaowei
>>
>> Xiaowei Zhao, Ph.D.
>> Professor of Psychology
>>
>> Emmanuel College
>> 400 The Fenway | Boston | MA 02115
>> www.emmanuel.edu
>>
>> On Tue, Feb 6, 2024 at 4:39 PM Leonid Spektor <spektor at andrew.cmu.edu> wrote:
>> The command flo +ca +t* *.cha should work.
>>
>>
>> Leonid.
>>> On Feb 6, 2024, at 16:14, Snigdha Khanna <snkhanna at iu.edu> wrote:
>>>
>>> I want to remove all annotations like the gestures and errors. Hence, I would like to use the txt format of just the transcribed text without annotations.
>>>
>>> Any idea how to do that?
>>>
>>>
>>> On Tuesday, February 6, 2024 at 4:10:32 PM UTC-5 macw wrote:
>>> CLAN’s FLO program does most of this. Alternatively, you could grab all the <w> tags from the XML version of the database.
>>>
>>> What kind of NLP do you want to use? You could apply Universal Dependencies directly.
>>>
>>> — Brian MacWhinney
>>> Teresa Heinz Professor of Cognitive Psychology,
>>> Language Technologies and Modern Languages, CMU
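
The XML route mentioned above (collecting the <w> tags) might look roughly like the following Python sketch. It is not an official TalkBank tool: the function names are made up for illustration, it assumes that utterances are `<u>` elements containing `<w>` word elements, and it matches tags by local name so that no particular XML namespace has to be assumed.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def words_per_utterance(xml_text: str) -> list[str]:
    """Return one space-joined string of <w> words per <u> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "u":
            continue
        words = []
        for child in elem.iter():
            if local_name(child.tag) == "w":
                # Only the text directly at the start of <w> is used;
                # anything inside nested child elements is ignored.
                token = (child.text or "").strip()
                if token:
                    words.append(token)
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    # Tiny hand-written sample, not taken from a real corpus file.
    sample = "<doc><u><w>hello</w><w>there</w></u><u><w>bye</w></u></doc>"
    for line in words_per_utterance(sample):
        print(line)
```

Because only the leading text of each `<w>` is kept, words with internal markup may come out truncated; that trade-off keeps nested annotation text out of the result.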
>>>
>>>> On Feb 6, 2024, at 3:08 PM, Snigdha Khanna <snkh... at iu.edu> wrote:
>>>>
>>>> Hello!
>>>>
>>>> I am trying to extract "clean" text from annotated transcripts that I have. Is there any way to use CLAN to export a txt file format, or a simpler method to remove annotations from the transcripts, so that I can parse it using NLP?
>>>>
>>>> Any help is appreciated!
>>>>
>>>> Thanks,
>>>> Snigdha
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups "chibolts" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an email to chibolts+u... at googlegroups.com.
>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/chibolts/237e8996-63ba-4476-859f-4b1e6841ab3an%40googlegroups.com.