[Corpora-List] charset identifier

Simon Carter s.c.carter at uva.nl
Mon Apr 18 07:59:23 UTC 2011


Along the same lines as Mike Maxwell's contribution, there is a version of TextCat (Languid) that uses information about encodings for language ID: http://languid.cantbedone.org/ and http://search.cpan.org/~mceglows/Language-Guess-0.01/

Otherwise, this page may be of help: http://odur.let.rug.nl/~vannoord/TextCat/competitors.html
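
For the case in the original question (language known, encoding unknown), the TextCat-style idea can be sketched roughly as below. This is only an illustration under my own assumptions, not the actual code of TextCat, Languid or Language::Guess: the function names, the profile size and the n-gram length are made up for the example. It builds one byte n-gram profile per candidate encoding from a sample of text in the expected language (the "multiple clusters for a single language" idea), then picks the encoding whose profile is closest to the unknown bytes using an out-of-place rank distance.

```python
from collections import Counter


def byte_ngram_profile(data, n_max=3, top=300):
    """Return the `top` most frequent byte n-grams (n = 1..n_max), most frequent first."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Sort by descending count; break ties on the bytes so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, ref_profile):
    """Sum of rank differences; n-grams missing from the reference get a fixed penalty."""
    ref_rank = {gram: rank for rank, gram in enumerate(ref_profile)}
    penalty = len(ref_profile)
    return sum(
        abs(rank - ref_rank[gram]) if gram in ref_rank else penalty
        for rank, gram in enumerate(doc_profile)
    )


def guess_encoding(data, sample_text, candidates):
    """Guess which candidate encoding `data` (bytes) is in.

    `sample_text` is a Unicode sample of the expected language (hypothetical
    training data); it is encoded once per candidate to build reference profiles.
    Returns the best candidate name, or None if no candidate can encode the sample.
    """
    doc_profile = byte_ngram_profile(data)
    best_name, best_dist = None, None
    for enc in candidates:
        try:
            ref_profile = byte_ngram_profile(sample_text.encode(enc))
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the sample at all
        dist = out_of_place(doc_profile, ref_profile)
        if best_dist is None or dist < best_dist:
            best_name, best_dist = enc, dist
    return best_name
```

One would call it as guess_encoding(raw_bytes, sample_of_expected_language, ["utf-8", "latin-1", "cp1252"]). With realistic amounts of sample text this should separate encodings that differ on the non-ASCII characters of the language, but it cannot tell apart encodings that produce identical bytes for the text at hand, and proprietary font encodings like the Hindi ones Mike mentions would need sample text already in those encodings.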

Simon


On 17 Apr 2011, at 14:56, Mike Maxwell wrote:

> On 4/16/2011 6:51 AM, Joerg Tiedemann wrote:
>> Can someone point me to reliable (freely available) tools for
>> character set identification?
>> I would like to have a rather universal tool that can give me the used
>> char encoding for a given text and given the expected language of that
>> text.
> 
> You might have a look at Kevin Scannell's site:
>   http://borel.slu.edu/crubadan/stadas.html
> Not so much about character set identification as language ID.  I'm not sure what he does about character codes, although I suppose one could create multiple clusters for a single language that uses multiple encoding systems.  We did something like that some years back in the TIDES Surprise Language exercise for Hindi, where there were multiple proprietary encodings on the web, and very little Unicode-encoded text.  Perhaps the situation has improved since then!
> -- 
> 	Mike Maxwell
> 	maxwell at umiacs.umd.edu
> 	"My definition of an interesting universe is
> 	one that has the capacity to study itself."
>        --Stephen Eastmond
> 
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

Simon Carter
ISLA, Informatics Institute,
University of Amsterdam,
Science Park 107
1098 XG Amsterdam
Phone: +31 (0)20 525 6731
Email: s.c.carter at uva.nl
Web: www.scarter.org


