[Corpora-List] charset identifier
Julien Nioche
lists.digitalpebble at gmail.com
Mon Apr 18 08:21:42 UTC 2011
Jorg,
Have a look at Tika (http://tika.apache.org). It does MIME-type, charset and
language detection, is released under the Apache License and is widely used.
You can find quite a bit of documentation on the Tika website, but for those
who want to go a bit deeper, the book Tika in Action is available through the
Manning Early Access Program [1].
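
To make the task in the original question concrete, here is a minimal
sketch of language-conditioned charset guessing with a crude confidence
value. It is not how Tika, enca or any other tool mentioned in this thread
works; it is a naive trial-decoding heuristic using only the Python
standard library. The function name rank_encodings, the candidate list and
the "expected alphabet" string are all assumptions made up for this
illustration.

```python
def rank_encodings(data, expected_chars, candidates):
    """Rank candidate encodings for `data` (bytes).

    Hypothetical helper: the score is the fraction of decoded characters
    that are either ASCII or part of the expected language's non-ASCII
    alphabet. Candidates that cannot decode the bytes at all are dropped.
    """
    expected = set(expected_chars)
    ranked = []
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        if not text:
            continue  # nothing to score
        plausible = sum(1 for ch in text if ord(ch) < 128 or ch in expected)
        ranked.append((encoding, plausible / len(text)))
    # Highest score first; sorted() is stable, so ties keep candidate order.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


if __name__ == "__main__":
    # Swedish sample text, encoded as Latin-1 for the demonstration.
    sample = "Jörg åker till Örebro".encode("latin-1")
    swedish_letters = "åäöÅÄÖ"  # assumed alphabet for the expected language
    candidates = ["utf-8", "latin-1", "cp1251", "iso8859-2"]
    for encoding, score in rank_encodings(sample, swedish_letters, candidates):
        print(f"{encoding}: {score:.2f}")
```

Closely related encodings (for example Latin-1 and cp1252, which agree on
the bytes 0xA0-0xFF) will tie under this scoring, and short or pure-ASCII
input gives no signal at all, which is why a real detector with proper
statistical models is preferable.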
HTH
Julien
[1] http://www.manning.com/mattmann/
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
On 16 April 2011 11:51, Joerg Tiedemann <jorg.tiedemann at lingfil.uu.se> wrote:
> Can someone point me to reliable (freely available) tools for
> character set identification?
> I would like to have a fairly universal tool that can tell me the
> character encoding of a given text, given the expected language of that
> text.
> (Possibly with confidence values if available.)
>
> I know about these tools (but I would also appreciate any comments
> about their quality):
>
> enca: http://gitorious.org/enca
> This is exactly what I need but does not support a lot of languages.
> Maybe someone knows how to extend it with more languages/encodings?
>
> utrac: http://utrac.sourceforge.net/
> I haven't tested it but it seems to be quite restricted as well.
>
> cpdetector: http://cpdetector.sourceforge.net/
>
> convert_encoding.py: https://github.com/goerz/convert_encoding.py
> This includes a "guess encoding" option.
>
> These tools do not seem to be freely available:
> http://www.lingua-systems.com/language-identifier/lidc-application/
> http://www.lingua-systems.com/unicode-converter/autouniconv-library/
>
> The standard unix tool 'file' is of course also sometimes helpful but
> too restricted.
>
> Is there anything else (that I can use without training specific models
> myself)?
>
> Thanks!
> Jörg
>
>
>
> --
>
> **********************************************************************************
> Jörg Tiedemann
> jorg.tiedemann at lingfil.uu.se
> Dep. of Linguistics and Philology
> http://stp.lingfil.uu.se/~joerg/
> Uppsala University tel: +46 (0)18 - 471 1412
> Box 635, SE-751 26 Uppsala/SWEDEN fax: +46 (0)18 - 471 1094
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>