[Corpora-List] language sort

Trond Trosterud trond.trosterud at hum.uit.no
Thu Jan 11 17:09:57 UTC 2007


Maria Esteva wrote on 10 Jan 2007 at 22:02:

> Dear all,
>
> I am wondering if somebody knows of a program that will recognize  
> and sort large sets of files according to language.

In my experience, a single file may well contain several languages.
For our work, we identify the language down to the paragraph level,
although we would often like to be as fine-grained as the sentence
level.

We use text_cat, cf.
http://www.let.rug.nl/~vannoord/TextCat/
and have had very good experiences with it.
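
To illustrate what paragraph-level identification could look like, here is a
minimal sketch in Python. It is not text_cat itself and not our actual setup:
it assumes the ranked character n-gram approach (Cavnar and Trenkle) that
text_cat is based on, treats blank lines as paragraph boundaries, and trains
on three toy one-sentence samples. The names ngram_profile, out_of_place,
classify_paragraphs, SAMPLES and PROFILES are all made up for the example;
real use would need far larger training texts per language.

```python
import re
from collections import Counter

TOP = 300  # assumed profile size; also used as the penalty for unseen n-grams

def ngram_profile(text, max_n=3, top=TOP):
    """Map the most frequent character n-grams (length 1..max_n) to their rank."""
    counts = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language cost TOP."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else TOP
        for gram, rank in doc_profile.items()
    )

def classify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))

def split_paragraphs(text):
    """Split on blank lines (an assumption about how paragraphs are marked)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def classify_paragraphs(text, profiles):
    """Return a list of (language, paragraph) pairs for one file's contents."""
    return [(classify(p, profiles), p) for p in split_paragraphs(text)]

# Toy training data, far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is a sentence in the english language",
    "nb": "dette er en setning på norsk og vi identifiserer språket i hvert avsnitt av teksten",
    "fi": "tämä on suomenkielinen lause ja me tunnistamme kielen jokaisesta kappaleesta",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

Sorting whole files by language could then be done by, say, taking the most
common paragraph label per file, while still keeping the per-paragraph labels
for files that turn out to be mixed.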

Trond.

----------------------------------------------------------------------
Trond Trosterud                                        t +47 7764 4763
Institutt for språkvitskap, Det humanistiske fakultet  m +47 950 70140
N-9037 Universitetet i Tromsø, Noreg                   f +47 7764 5216
Trond.Trosterud (a) hum.uit.no          http://www.hum.uit.no/a/trond/
----------------------------------------------------------------------


