From daniellekaryllehu at gmail.com Tue Apr 8 00:21:37 2025
From: daniellekaryllehu at gmail.com (Danielle Hu)
Date: Mon, 7 Apr 2025 19:21:37 -0500
Subject: Use of FREQ in Bilingual Corpora
Message-ID:

Hi all,

I'm doing some CLAN analyses for my Master's thesis, and maybe I'm missing it in the manual trying to teach myself how to do this, but I'm struggling to code the analyses I want. For context, I am trying to count the number of parent/caregiver utterances in a transcript that are in each language, and of those utterances, which are related to a postcode that says whether the utterance was related to directly reading from the story provided, or extratextual speech (e.g., questioning, commenting about the story, etc.). I am analyzing this to determine whether changing the order of language presentation in a bilingual book impacts caregiver language use in Tagalog and English.

I've processed the following command to get a frequency of the number of caregiver utterances that are marked as English and Tagalog:

    freq +l1 +t*CAR +s"<- eng>" +s"<- tgl>" +d2 *.cha

I want to do the following things:

- Exclude utterances that have code switching. I have attempted freq +l1 +t*CAR +s"<- eng> -s"@s:tgl" +d2 *.cha, but this does not work. It outputs the same number of English utterances as the first command.
- Identify utterances that have both a precode of [- eng] and a postcode of [+ b] (book) or [+ e] (extra-textual), and vice versa for [- tgl]
- Of code-switched utterances, which ones are related to the postcode of [+ b] or [+ e]

I hope that's a clear explanation of what I'm looking for! If CLAN doesn't have a way to do this, that's also okay, but wanted to check with everyone to make sure that it wasn't a possibility before I start manually counting things. Luckily, it's only for 6 participants.

Thanks in advance!

All the best,
Danielle Hu

--
You received this message because you are subscribed to the Google Groups "chibolts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chibolts+unsubscribe at googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/chibolts/CAEGyqUg21cgc-1K41R5hu%3D%2BNXjjCOCxa2A8c34cwRc6a_h8wwQ%40mail.gmail.com.

From spektor at andrew.cmu.edu Tue Apr 8 03:12:06 2025
From: spektor at andrew.cmu.edu (Leonid Spektor)
Date: Mon, 7 Apr 2025 23:12:06 -0400
Subject: Use of FREQ in Bilingual Corpora
In-Reply-To:
References:
Message-ID:

Hi Danielle,

1. To exclude utterances that have code switching, you need to do it in two passes. First, exclude the utterances that have words with @s:tgl. The following KWAL command assumes that you did not code words with @s:tgl on [- tgl] utterances:

    kwal +o@ +o% +d -s*@s:tgl* +f *.cha

After that you can run your FREQ command on the KWAL output:

    freq +l1 +t*CAR +s"<- eng>" +d2 *.kwal.cex

You can take care of both @s:tgl and @s:eng code switching with one KWAL command. As with the KWAL command above, this command also assumes that you did not code words with @s:eng on [- eng] utterances:

    kwal +o@ +o% +d -s*@s:tgl* -s*@s:eng* +f *.cha

and then run the command:

    freq +l1 +t*CAR +s"<- eng>" +s"<- tgl>" +d2 *.kwal.cex

2. To identify utterances that have both a precode of [- eng] and a post-code of [+ b], again you can do it with two passes:

    kwal -d +o@ +o% +s"[+ b]" +f *.cha
    kwal -d +l1 +s"[- eng]" *.kwal.cex

OR one COMBO command:

    combo +o@ +o% -d +l1 +s"[- eng]^*^[+ b]" +d +f *.cha

3. Of code-switched utterances, which ones are related to the post-code of [+ b]. First pass KWAL command:

    kwal +o@ +o% +d +s*@s:tgl* +s*@s:eng* +s"[+ b]" +f *.cha

Next FREQ command:

    freq +l1 +t*CAR +s"<- eng>" +s"<- tgl>" +d2 *.kwal.cex

Leonid.
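Since the fallback mentioned in this thread is counting by hand, a small script can serve as an independent cross-check on the CLAN output for a handful of transcripts. The sketch below is plain Python, not CLAN, and it is only a rough approximation of what KWAL and FREQ do. It assumes that caregiver main-tier lines start with `*CAR:` followed by a tab, that continuation lines start with a tab, that the language is marked with a `[- eng]` or `[- tgl]` precode, that code-switched words carry an `@s` marker, and that each utterance has at most one `[+ b]` or `[+ e]` postcode. The names `main_tier_utterances`, `tally`, and `SAMPLE` are hypothetical, and the sample transcript is made up for illustration.

```python
import re
from collections import Counter

# Assumed CHAT conventions, taken from the codes used in this thread.
PRECODE = re.compile(r"\[- (\w+)\]")      # language precode, e.g. [- eng]
POSTCODE = re.compile(r"\[\+ (\w+)\]")    # postcode, e.g. [+ b] or [+ e]
SWITCH = re.compile(r"\S@s(?::\w+)?\b")   # word-level switch, e.g. aso@s:tgl

# Made-up transcript fragment, for illustration only.
SAMPLE = "\n".join([
    "@Begin",
    "@Languages:\teng, tgl",
    "*CAR:\t[- eng] the dog ran home . [+ b]",
    "*CAR:\t[- eng] where is the aso@s:tgl ? [+ e]",
    "*CHI:\t[- eng] there !",
    "*CAR:\t[- tgl] tumakbo ang aso . [+ b]",
    "%com:\tpoints at page",
    "*CAR:\t[- tgl] nasaan ang dog@s:eng ? [+ e]",
    "*CAR:\t[- tgl] ang ganda",
    "\tng bahay . [+ e]",
    "@End",
])


def main_tier_utterances(chat_text, speaker="CAR"):
    """Yield each main-tier utterance for one speaker as a single string.

    Continuation lines (starting with a tab) are joined to the line above.
    Header (@) and dependent-tier (%) lines end the current utterance.
    """
    current = None
    for line in chat_text.splitlines():
        if line.startswith("\t") and current is not None:
            current += " " + line.strip()
            continue
        if current is not None:
            yield current
            current = None
        if line.startswith(f"*{speaker}:"):
            current = line.split(":", 1)[1].strip()
    if current is not None:
        yield current


def tally(chat_text, speaker="CAR"):
    """Count utterances by (language precode, switched/single, postcode)."""
    counts = Counter()
    for utt in main_tier_utterances(chat_text, speaker):
        pre = PRECODE.search(utt)
        post = POSTCODE.search(utt)  # only the first postcode is used
        lang = pre.group(1) if pre else "unmarked"
        code = post.group(1) if post else "none"
        kind = "switched" if SWITCH.search(utt) else "single"
        counts[(lang, kind, code)] += 1
    return counts


for key, n in sorted(tally(SAMPLE).items()):
    print(key, n)
```

Each key is a (language, switched or single, postcode) triple, so the three questions in the original message correspond to reading off different subsets of the same table. If the script and the CLAN commands disagree on a file, the transcript coding is the first thing to check, for example `@s` words on utterances that already carry the matching precode.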
From macw at andrew.cmu.edu Wed Apr 30 02:04:56 2025
From: macw at andrew.cmu.edu (Brian Macwhinney)
Date: Wed, 30 Apr 2025 10:04:56 +0800
Subject: CLAN in non-roman scripts
In-Reply-To:
References:
Message-ID: <8B822B98-B50E-499F-88F6-76497FFBD00D@andrew.cmu.edu>

Dear Marisa,

Good and important question. I wouldn't call either Russian or Chinese underrepresented. We have an enormous amount of Chinese in TalkBank and Russian is growing. One major language still not represented is Tagalog. And we have remarkably little Arabic.

The important thing is that CLAN supports transcription in native script in ALL languages. This is due to five major things.

1. CLAN is in Unicode and Unicode supports all languages.
2. Leonid recently modified CLAN to even work with languages like Hebrew, Farsi, and Arabic that write right-to-left.
3. I spent months working to convert ALL of the older romanizations in CHILDES and SLABank to native script and that is now done, although Hebrew is still a bit of a challenge.
4. Houjun Liu's Batchalign program provides ASR for at least the 100 languages that Whisper supports, and for languages with weaker support such as Cantonese we have improved on its methods.
5. Batchalign also allows automatic morphosyntactic analysis using Universal Dependencies (UD) for over 100 languages directly with CHAT files. UD requires native script.

I hope this encourages your students to transcribe in CLAN. They should definitely not be using romanization if the language has its own native script.

Best,

-- Brian MacWhinney
Teresa Heinz Professor of Cognitive Psychology, Language Technologies and Modern Languages, CMU

> On 30 Apr 2025, at 3:16 AM, Marisa Casillas wrote:
>
> Hi Brian!
>
> I hope all's very well with you. I'm teaching an online course to folks eager to do observational and experimental work on a diverse group of under-represented languages (many participants are, themselves, members of those language communities). A question came up today in class that I did not know the answer to, so I am hoping you can provide one if you have a moment:
>
> What expectations should researchers have about using CLAN if they are transcribing language in a non-roman script (e.g., Farsi, Russian, Chinese)? For example, which CLAN tools/functions will "work" as expected? Should they anticipate doing any kinds of adaptations to the transcription (e.g., converting to a romanized script) to use any of the tools?
>
> In all my experience with CHAT/CLAN, I hadn't run into this issue. But I'm sure you have!
>
> Thanks in advance for any insights!
> Best,
> Middy
>
> --
> Dr. Marisa Casillas (she/her)
> Assistant Professor, Comparative Human Development,
> University of Chicago
> chatterlab.uchicago.edu