[Corpora-List] language sort
Daniel Zeman
zeman at ufal.mff.cuni.cz
Wed Jan 10 22:17:24 UTC 2007
Oh, I see. I was thinking about a language recognizer that would not
require you to open a file manually but would read files specified on
the command line instead (and then do something reasonable with them,
like putting the lang id into their name, moving them to a directory
etc.) I do not know whether any off-the-shelf recognizers behave that
way; however, some time ago I tried to write such a thing myself and I
have been assigning language recognition as a student exercise, too. I
just have to check whether I have something that other people could use
without my spending hours on adjusting and documenting it first. Stay tuned,
Dan
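
For illustration only, here is a minimal sketch of the kind of
command-line recognizer described above. It is not the tool mentioned
in this message. It assumes UTF-8 text files, guesses the language by
counting a few common function words, and moves each file into a
directory named after the guessed language. The word lists, the
function names (guess_language, sort_files) and the output directory
"sorted" are all hypothetical choices made for this example; a real
recognizer would more likely use character n-gram statistics trained
on sample text.

```python
#!/usr/bin/env python3
"""Sort text files into per-language directories (illustrative sketch)."""
import re
import shutil
import sys
from pathlib import Path

# Tiny hand-picked function-word lists: an assumption for illustration,
# far too small for reliable results on short or mixed-language files.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "that", "with", "for", "this", "was"},
    "es": {"el", "los", "las", "y", "del", "una", "con", "pero", "muy", "hay"},
    "fr": {"et", "des", "est", "une", "dans", "pour", "pas", "du", "sur", "avec"},
    "pt": {"os", "em", "dos", "uma", "não", "com", "são", "mas", "você", "isso"},
}


def guess_language(text):
    """Return the code whose function words occur most often, or 'unknown'."""
    # Runs of letters only (Unicode-aware), lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def sort_files(paths, out_dir):
    """Move each file into out_dir/<lang>/ and return {original path: lang}."""
    result = {}
    for path in map(Path, paths):
        text = path.read_text(encoding="utf-8", errors="replace")
        lang = guess_language(text)
        target = Path(out_dir) / lang
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target / path.name))
        result[str(path)] = lang
    return result


if __name__ == "__main__":
    # Usage (hypothetical script name): python langsort.py file1.txt file2.txt
    for name, lang in sort_files(sys.argv[1:], "sorted").items():
        print(f"{lang}\t{name}")
```

Because the whole file gets a single score, a Spanish document with a
few foreign names, titles or addresses should still come out as
Spanish as long as the Spanish function words dominate. Files with
the same name would overwrite or collide in the target directory, so
a real run over a large collection would need to handle that.
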
Maria Esteva wrote:
> Daniel
>
> I have tons and tons of files, so it would be very time-consuming for me
> to load each file into the programme. I might just as well open each file
> and read its content to recognize the language.
>
> I do have more than one language within one file, but I will deal with
> that. Many files are in Spanish but have names, titles, addresses,
> etc. in other languages. I guess that will not bother me as much.
>
> Any ideas?
>
> Maria
>
> At 03:07 PM 1/10/2007, you wrote:
>> Maria,
>>
>> Why does a file-by-file approach not work for you? Does that mean that
>> you have potentially more than one language within one file?
>>
>> Dan
>>
>> Maria Esteva wrote:
>>> Dear all,
>>>
>>> I am wondering if somebody knows of a program that will recognize
>>> and sort large sets of files according to language. For my text
>>> mining project, I need to sort sets of files that contain electronic
>>> texts mostly in Spanish and English (although there is some French
>>> and some Portuguese as well). There are many free language
>>> recognition programmes out there, but they work on a file-by-file
>>> basis. Let me know if you have some advice.
>>>
>>> Thanks,
>>>
>>> Maria Esteva
>>> PhD Candidate
>>> School of Information
>>> University of Texas at Austin