[Linganth] Recommendations for tools transcribing and analyzing large amounts of data

Jocelyn Aznar contact at jocelynaznar.eu
Thu Apr 9 19:28:13 UTC 2026


Hi everyone,

I'm curious, how do you end up with so much data without first thinking 
about how you will handle it?

As you are within an English department, I assume you work with English? 
Do you have a budget? What kind of annotation do you need? Which 
format? How do you do your analysis? Using CSV files? XML? Should the 
data be reusable by other researchers? Meant to be archived? FAIR? etc.

Using online AI tools is probably not ethical, as you have no way to 
know what the companies will do with the data and with what the people 
you recorded said... If you have a recent computer, some budget, or 
access to University servers, you can for instance use Whisper and a 
model from Mistral (such as the 7B) to do some annotations 
automatically. With languages like English, French and so on, it works 
quite well. But that requires some scripting.
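To give an idea of the scripting involved: Whisper's Python API returns transcripts as a list of segments, each a dict with "start", "end", and "text" keys. A minimal sketch of turning those into a CSV ready for manual coding in a spreadsheet (the segments_to_csv helper and the sample segments below are illustrative, not something discussed in this thread):

```python
import csv
import io

def segments_to_csv(segments):
    """Convert Whisper-style segments into CSV rows with an empty
    "code" column left for manual annotation."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["start", "end", "text", "code"])
    for seg in segments:
        writer.writerow([f"{seg['start']:.2f}",
                         f"{seg['end']:.2f}",
                         seg["text"].strip(),
                         ""])
    return buf.getvalue()

# In practice the segments would come from something like:
#   whisper.load_model("small").transcribe("interview_01.wav")["segments"]
# (requires `pip install openai-whisper` and ffmpeg installed locally).
# Hypothetical sample data for illustration:
segments = [
    {"start": 0.0, "end": 2.4, "text": " Hello, how are you?"},
    {"start": 2.4, "end": 5.1, "text": " Fine, thanks."},
]
print(segments_to_csv(segments))
```

The same segment dicts could just as easily be written out as XML or ELAN-importable formats; CSV is simply the lowest-friction target for coding and later analysis.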

Best,
Jocelyn

On 09/04/2026 at 21:13, Nathan Straub 曹內森 wrote:
> Hi Dominika,
> 
> I use Vook.ai (an AI-based subscription service) for rapid automatic 
> transcription of English. (It also does Spanish, French, Italian, 
> Portuguese, and German.) You would likely have to sort out overlaps and 
> speaker labels on your own after that.
> 
> For field recordings, I liked using SIL's Saymore software, because it 
> provided a place to store recordings, break a recording into short 
> breath groups, listen again and again at slow speed, and type up 
> rough transcriptions; I could then port the vernacular and free-
> translation lines into FLEx.
> 
> Which languages are you working with?
> 
> Nathan
> 
> We are sent into this world for some end.  It is our duty to discover by 
> close study what this end is & when we once discover it to pursue it 
> with unconquerable perseverance.
> JQA at age 12 to his brother Charles (June 1778)
> 
> On Thu, Apr 9, 2026, 12:02 Dominika Baran, Ph.D. 
> <dominika.baran at duke.edu <mailto:dominika.baran at duke.edu>> wrote:
> 
>     Dear Colleagues,
> 
>     I am looking for recommendations of your favorite tool(s), at the
>     moment, for processing large amounts of recorded spoken & written
>     conversational data (informal interviews, free conversations), for
>     both transcription and coding & analysis.
> 
>     I have about 100 hours of digitally recorded conversations,
>     including those among multiple speakers, with lots of simultaneous
>     speech, two conversations going on at once, overlap, and code-
>     switching (mostly bilingual, occasionally trilingual). I also have
>     13 years of written group chat conversations, which don’t need
>     transcribing but total over 300,000 words. I am looking for
>     suggestions for software, online or otherwise, for both
>     transcription (which is tricky because of the multilingual and
>     overlapping conversations) and, more importantly, organization,
>     coding, and analysis. It has been a while since I have dealt with
>     THIS much data and I am sure there is a lot out there that I don’t
>     know about - all and any suggestions of what has worked for folks
>     are very much appreciated!
> 
>     Best,
>     Dominika
> 
> 
>     Dominika M. Baran
> 
>     Associate Professor
> 
>     English Department
> 
>     Duke University
> 
>     Allen Building 303
> 
>     Durham, NC 27708
> 
>     Pronouns: she/her/hers
> 
>     _______________________________________________
>     Linganth mailing list
>     Linganth at listserv.linguistlist.org
>     <mailto:Linganth at listserv.linguistlist.org>
>     https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/linganth
>     <https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/linganth>
> 
> 
