CLAN in non-roman scripts

Brian MacWhinney macw at andrew.cmu.edu
Wed Apr 30 02:04:56 UTC 2025


Dear Marisa,
   
Good and important question.  I wouldn’t call either Russian or Chinese underrepresented.  We have an enormous amount of Chinese in TalkBank and Russian is growing.  One major language still not represented is Tagalog.  And we have remarkably little Arabic.
    
The important thing is that CLAN supports transcription in native script in ALL languages.  This rests on five main factors.
1.  CLAN is in Unicode and Unicode supports all languages. 
2.  Leonid recently modified CLAN so that it also works with right-to-left languages such as Hebrew, Farsi, and Arabic.
3.  I spent months converting ALL of the older romanizations in CHILDES and SLABank to native script.  That work is now done, although Hebrew is still a bit of a challenge.
4.  Houjun Liu's Batchalign program provides ASR for at least the 100 languages that Whisper supports, and for languages with weaker support, such as Cantonese, we have improved on its methods.
5.  Batchalign also allows automatic morphosyntactic analysis using Universal Dependencies (UD) for over 100 languages, working directly with CHAT files.  UD requires native script.
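
As a rough illustration of points 1 and 2, here is a minimal Python sketch of how one might check that an utterance is in native script rather than romanized, and whether it runs right-to-left.  It is not part of CLAN or Batchalign; it uses only Python's standard unicodedata module.  The function names script_profile and is_right_to_left are hypothetical, and the sample utterances are made up for illustration.

```python
import unicodedata


def script_profile(utterance: str) -> dict:
    """Count letters by the first word of their Unicode character name.

    This is a rough proxy for the script (e.g. CYRILLIC, HEBREW, CJK, LATIN),
    not a full Unicode script-property lookup.
    """
    counts = {}
    for ch in utterance:
        if not ch.isalpha():
            continue  # skip spaces, punctuation, digits, combining marks
        script = unicodedata.name(ch, "UNKNOWN").split()[0]
        counts[script] = counts.get(script, 0) + 1
    return counts


def is_right_to_left(utterance: str) -> bool:
    """Return True if the first strongly directional character is RTL."""
    for ch in utterance:
        direction = unicodedata.bidirectional(ch)
        if direction in ("R", "AL"):  # Hebrew-type or Arabic-type letter
            return True
        if direction == "L":  # left-to-right letter
            return False
    return False


# Made-up sample utterances (text only, without the CHAT speaker code).
for text in ["привет мама", "שלום", "你好", "privet mama"]:
    print(text, script_profile(text), is_right_to_left(text))
```

An utterance whose profile is all LATIN in a language that has its own script would be a sign that the transcript is still romanized.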

I hope this encourages your students to transcribe in CLAN.  They should definitely not be using romanization if the language has its own native script.

Best,

— Brian MacWhinney
Teresa Heinz Professor of Cognitive Psychology, 
Language Technologies and Modern Languages, CMU 

> On 30 Apr 2025, at 3:16 AM, Marisa Casillas <mcasillas at uchicago.edu> wrote:
> 
> Hi Brian!
> 
> I hope all's very well with you. I'm teaching an online course to folks eager to do observational and experimental work on a diverse group of under-represented languages (many participants are, themselves, members of those language communities). A question came up today in class that I did not know the answer to, so I am hoping you can provide one if you have a moment:
> 
> What expectations should researchers have about using CLAN if they are transcribing language in a non-roman script (e.g., Farsi, Russian, Chinese)? For example, which CLAN tools/functions will "work" as expected? Should they anticipate making any kind of adaptation to the transcription (e.g., converting to a romanized script) to use any of the tools?
> 
> In all my experience with CHAT/CLAN, I hadn't run into this issue. But I'm sure you have!
> 
> Thanks in advance for any insights!
> Best,
> Middy
> 
> -- 
> Dr. Marisa Casillas (she/her)
> Assistant Professor, Comparative Human Development,
> University of Chicago
> chatterlab.uchicago.edu

