[Corpora-List] language sort

Daniel Zeman zeman at ufal.mff.cuni.cz
Wed Jan 10 22:17:24 UTC 2007


Oh, I see. I was thinking about a language recognizer that would not 
require you to open each file manually but would instead read the files 
specified on the command line (and then do something reasonable with 
them, like putting the language ID into their names, moving them to a 
directory, etc.). I do not know whether any off-the-shelf recognizers 
behave that way; however, some time ago I tried to write such a thing 
myself, and I have been assigning language recognition as a student 
exercise, too. I just have to check whether I have something that other 
people could use without my spending hours on adjusting and documenting 
it first. Stay tuned,

Dan
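
P.S. To make the idea concrete, here is a minimal sketch in Python of the
kind of tool described above. It is not the program I mentioned; it is an
illustration only. The stopword lists, the function names (guess_language,
sort_files) and the script name in the usage line are all assumptions made
up for this example, and a serious recognizer would more likely use
character n-gram statistics trained on sample text.

```python
import re
import shutil
import sys
from pathlib import Path

# Tiny illustrative stopword lists (an assumption for this sketch; they are
# far too small for real use and only cover the four languages in question).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it", "with", "for"},
    "es": {"el", "la", "los", "las", "y", "que", "en", "es", "por", "con", "una"},
    "fr": {"le", "les", "et", "est", "des", "une", "dans", "pour", "qui", "avec"},
    "pt": {"o", "os", "e", "que", "em", "um", "uma", "não", "para", "com"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or 'unknown'."""
    # Split into lowercase alphabetic words (Unicode letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def sort_files(paths, out_dir):
    """Move each file into out_dir/<language code>/, keeping its file name."""
    for path in map(Path, paths):
        # UTF-8 is assumed; undecodable bytes are replaced rather than
        # aborting, so files in other encodings are still processed.
        text = path.read_text(encoding="utf-8", errors="replace")
        target = Path(out_dir) / guess_language(text)
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target / path.name))


# Hypothetical usage: python langsort.py OUT_DIR FILE [FILE ...]
if __name__ == "__main__" and len(sys.argv) >= 3:
    sort_files(sys.argv[2:], sys.argv[1])
```

Because each file is labelled by its majority language, a Spanish document
with a few foreign names, titles or addresses should still end up in the
Spanish directory. Files with no recognizable stopwords go to "unknown".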

Maria Esteva wrote:
> Daniel
>
> I have tons and tons of files, so it would be very time-consuming for 
> me to load each file into the programme. I might just as well open 
> each file and read the content to recognize the language.
>
> I do have more than one language within one file, but I will deal with 
> that. Many files are in Spanish but have names, titles, addresses, 
> etc. in other languages. I guess that will not bother me as much.
>
> Any ideas?
>
> Maria
>
> At 03:07 PM 1/10/2007, you wrote:
>> Maria,
>>
>> why does the file-by-file approach not work for you? Does that mean 
>> that you potentially have more than one language within one file?
>>
>> Dan
>>
>> Maria Esteva wrote:
>>> Dear all,
>>>
>>> I am wondering if somebody knows of a program that will recognize 
>>> and sort large sets of files according to language. For my text 
>>> mining project, I need to sort sets of files that contain electronic 
>>> texts mostly in Spanish and English (although there is some French 
>>> and some Portuguese as well). There are many free language 
>>> recognition programmes out there, but they work on a file-by-file 
>>> basis. Let me know if you have some advice.
>>>
>>> Thanks,
>>>
>>> Maria Esteva
>>> PhD Candidate
>>> School of Information
>>> University of Texas at Austin
