[Corpora-List] language sort

Eric Atwell eric at comp.leeds.ac.uk
Wed Jan 10 21:45:26 UTC 2007


Maria,

this is probably only a last resort if no one else comes up with a better
solution: why not follow the Web-as-Corpus trend and use Google?

Specifically: copy all your files to a website, say http://mysite.atu.edu
... then use Google Advanced search,
     Domain set to return results from your website http://mysite.atu.edu
     Language set to return results in Spanish
      ... (and then in English, then in French, then in Portuguese)...
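
The four searches above could also be written out as query URLs rather
than filled in by hand. This is only a minimal sketch: it assumes that
Google's "site:" operator and the "lr=lang_xx" URL parameter behave the
same way as the Domain and Language boxes on the Advanced Search form.
The function name and the language-code table are illustrative, and
mysite.atu.edu is just the example host used above.

```python
from urllib.parse import urlencode

# Assumed codes for Google's "lr" (language restrict) URL parameter.
LANG_CODES = {
    "Spanish": "lang_es",
    "English": "lang_en",
    "French": "lang_fr",
    "Portuguese": "lang_pt",
}

def build_search_urls(domain, languages=LANG_CODES):
    """Return {language name: search URL}, each restricted to one domain."""
    urls = {}
    for name, code in languages.items():
        # "site:" limits results to the given host; "lr" limits the language.
        params = {"q": f"site:{domain}", "lr": code}
        urls[name] = "https://www.google.com/search?" + urlencode(params)
    return urls

if __name__ == "__main__":
    for name, url in build_search_urls("mysite.atu.edu").items():
        print(name, url)
```

Each URL can be pasted into a browser; the result lists still have to be
saved by hand, one per language.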

This should return the URLs of the Spanish texts the first time, then the
English texts, French texts, and Portuguese texts; you then need to
download and collate the files from each Google search.
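
The collating step might look something like the sketch below. It
assumes you have saved the URLs from each search into a dictionary keyed
by language, and that the last part of each URL is the name of one of
your original files, so the local copy can be used instead of a fresh
download. The function name and directory layout are hypothetical.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse

def collate_by_language(results, source_dir, target_dir):
    """Copy each local file into target_dir/<language>/.

    results: {language: [URL, ...]}, gathered by hand from each search.
    Assumes each URL's last path component names a file in source_dir.
    Returns the sorted names of files that no search returned.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    unsorted = {p.name for p in source_dir.iterdir() if p.is_file()}
    for language, urls in results.items():
        lang_dir = target_dir / language
        lang_dir.mkdir(parents=True, exist_ok=True)
        for url in urls:
            name = Path(urlparse(url).path).name
            # A file listed under two languages is copied only the first time.
            if name in unsorted:
                shutil.copy2(source_dir / name, lang_dir / name)
                unsorted.discard(name)
    return sorted(unsorted)
```

Whatever comes back in the returned list was not matched by any of the
searches and would need checking by hand.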

Of course, it would be nice not to have to do all this using the Google
interface, but instead using a web-as-corpus tool such as BootCaT...

Eric Atwell, Leeds University

On Wed, 10 Jan 2007, Maria Esteva wrote:

> Dear all,
>
> I am wondering if somebody knows of a program that will recognize and sort 
> large sets of files according to language. For my text mining project, I need 
> to sort sets of files that contain electronic texts mostly in Spanish and 
> English (although there is some French and some Portuguese as well). There are
> many free language recognition programmes out there but they work on a
> file-by-file basis. Let me know if you have some advice.
>
> Thanks,
>
> Maria Esteva
> PhD Candidate
> School of Information
> University of Texas at Austin
>

Eric Atwell,
Senior Lecturer, Language research group leader, School of Computing,
Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
TEL: +44-113-3435430  FAX: +44-113-3435468  http://www.comp.leeds.ac.uk/eric


