[Corpora-List] charset identifier

Julien Nioche lists.digitalpebble at gmail.com
Mon Apr 18 08:21:42 UTC 2011


Jorg,

Have a look at Tika (http://tika.apache.org). It does MIME type, charset and
language detection, is released under the Apache License and is widely used.
You can find quite a bit of documentation on the Tika website, but for those
who want to go a bit deeper, the book Tika in Action is available through the
Manning Early Access Program [1].
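
For anyone who wants a feel for how a language hint can help before trying a
real detector, here is a minimal sketch using only the Python standard
library. It is not Tika's algorithm or API: guess_charset, CANDIDATES and the
scoring rule are hypothetical names and assumptions made up for illustration.
It simply tries each candidate encoding and scores it by how many of the
decoded non-ASCII characters belong to the alphabet expected for the language.

```python
# Illustrative sketch only: guess_charset and CANDIDATES are hypothetical
# names, not part of Tika or of any tool mentioned in this thread.
CANDIDATES = ("utf-8", "iso-8859-1", "cp1252")


def guess_charset(data, alphabet, candidates=CANDIDATES):
    """Rank candidate encodings for `data` (bytes), best first.

    `alphabet` holds the non-ASCII letters expected in the language.
    The score is the share of non-ASCII decoded characters found in
    `alphabet`; it is a crude confidence value, not a probability.
    """
    expected = set(alphabet)
    ranked = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        non_ascii = [ch for ch in text if ord(ch) > 127]
        if non_ascii:
            score = sum(ch in expected for ch in non_ascii) / len(non_ascii)
        else:
            score = 0.0  # pure ASCII gives no evidence either way
        ranked.append((enc, score))
    # sorted() is stable, so ties keep the order of `candidates`
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


swedish = "åäöÅÄÖé"
sample = "Jörg från Uppsala".encode("utf-8")
print(guess_charset(sample, swedish))
```

With the UTF-8 sample above, utf-8 should come out on top. Note that
encodings which agree on the bytes in question (iso-8859-1 and cp1252 for
most Western European text) will tie, which is one reason a trained detector
is the better choice for real data.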

HTH

Julien

[1] http://www.manning.com/mattmann/

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 16 April 2011 11:51, Joerg Tiedemann <jorg.tiedemann at lingfil.uu.se> wrote:

> Can someone point me to reliable (freely available) tools for
> character set identification?
> I would like to have a rather universal tool that can give me the used
> char encoding for a given text and given the expected language of that
> text.
> (Possibly with confidence values if available.)
>
> I know about these tools (but I would also appreciate any comments
> about their quality):
>
> enca:  http://gitorious.org/enca
> This is exactly what I need but does not support a lot of languages.
> Maybe someone knows how to extend it with more languages/encodings?
>
> utrac:  http://utrac.sourceforge.net/
> I haven't tested it but it seems to be quite restricted as well.
>
> cpdetector:  http://cpdetector.sourceforge.net/
>
> convert_encoding.py:  https://github.com/goerz/convert_encoding.py
> includes a "guess encoding" option.
>
> These tools do not seem to be freely available:
> http://www.lingua-systems.com/language-identifier/lidc-application/
> http://www.lingua-systems.com/unicode-converter/autouniconv-library/
>
> The standard unix tool 'file' is of course also sometimes helpful but
> too restricted.
>
> Is there anything else (that I can use without training specific models
> myself)?
>
> Thanks!
> Jörg
>
>
>
> --
>
> **********************************************************************************
>  Jörg Tiedemann
> jorg.tiedemann at lingfil.uu.se
>  Dep. of Linguistics and Philology
> http://stp.lingfil.uu.se/~joerg/
>  Uppsala University                  tel: +46 (0)18 - 471 1412
>  Box 635, SE-751 26 Uppsala/SWEDEN   fax: +46 (0)18 - 471 1094
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>