Jörg, <br><br>Have a look at Apache Tika (<a href="http://tika.apache.org">http://tika.apache.org</a>). It does MIME type, charset and language detection, is released under the Apache License and is widely used.<br>There is quite a bit of documentation on the Tika website, and for those who want to go deeper, the book Tika in Action is available through the Manning Early Access Program [1].<br>
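<br>To give a rough idea of what charset detection with a known language involves, here is a minimal Python sketch using only the standard library. It is not how Tika (or enca) works internally; the function name rank_encodings, the candidate list and the scoring rule are assumptions made purely for illustration. It trial-decodes the bytes with each candidate encoding and scores a candidate by the share of non-ASCII characters that belong to the letters expected for the language.<br>

```python
# Illustrative sketch only: rank_encodings is a hypothetical helper,
# not part of Tika, enca, or any other tool mentioned in this thread.

def rank_encodings(data, candidates, expected_letters):
    """Return (encoding, score) pairs, best first; score lies in [0, 1]."""
    expected = set(expected_letters)
    ranked = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # bytes are invalid in this encoding, or codec unknown
        # ASCII decodes identically in most encodings, so only the
        # non-ASCII characters carry evidence.
        non_ascii = [ch for ch in text if not ch.isascii()]
        if non_ascii:
            hits = sum(ch in expected for ch in non_ascii)
            score = hits / len(non_ascii)
        else:
            score = 1.0  # pure ASCII: every surviving candidate fits
        ranked.append((enc, score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Swedish letters outside ASCII: a-ring, a-umlaut, o-umlaut (both cases).
SWEDISH_LETTERS = "\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"

# A short Swedish phrase encoded as ISO-8859-1.
sample = b"J\xf6rg \xe4r i Uppsala"
print(rank_encodings(sample, ["utf-8", "iso-8859-1", "koi8-r"], SWEDISH_LETTERS))
```

With these Latin-1 bytes and Swedish letters as the expectation, ISO-8859-1 should rank first, while UTF-8 is dropped because the bytes do not decode. Real detectors use much richer statistics, so treat the score as a crude confidence value only.<br>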
<br>HTH<br><br>Julien<br><br>[1] <a href="http://www.manning.com/mattmann/">http://www.manning.com/mattmann/</a><br clear="all"><br>-- <br><span style="border-collapse: separate; color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; font-size: medium;"><span style="font-family: arial; font-size: small;"><b style="color: rgb(0, 0, 0); font-family: arial,helvetica,sans-serif;"><img src="http://digitalpebble.com/img/logo.gif" height="38" width="200"><br style="color: rgb(51, 51, 51); font-family: arial,helvetica,sans-serif;">
</b><span style="color: rgb(102, 102, 102); font-family: arial,helvetica,sans-serif;"><span style="color: rgb(51, 51, 51);">Open Source Solutions for Text Engineering</span><br><br></span></span></span><span style="color: rgb(102, 102, 102);"><a href="http://digitalpebble.blogspot.com/" target="_blank">http://digitalpebble.blogspot.com/</a></span><br style="color: rgb(102, 102, 102);">
<span style="color: rgb(102, 102, 102);"><a href="http://www.digitalpebble.com/" target="_blank">http://www.digitalpebble.com</a></span><br><br><div class="gmail_quote">On 16 April 2011 11:51, Joerg Tiedemann <span dir="ltr">&lt;<a href="mailto:jorg.tiedemann@lingfil.uu.se">jorg.tiedemann@lingfil.uu.se</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">Can someone point me to reliable (freely available) tools for<br>
character set identification?<br>
I would like a fairly universal tool that reports the character<br>
encoding of a given text, given the expected language of that<br>
text.<br>
(Possibly with confidence values if available.)<br>
<br>
I know about these tools (but I would also appreciate any comments<br>
about their quality):<br>
<br>
enca: <a href="http://gitorious.org/enca" target="_blank">http://gitorious.org/enca</a><br>
This is exactly what I need, but it does not support many languages.<br>
Maybe someone knows how to extend it with more languages/encodings?<br>
<br>
utrac: <a href="http://utrac.sourceforge.net/" target="_blank">http://utrac.sourceforge.net/</a><br>
I haven't tested it but it seems to be quite restricted as well.<br>
<br>
cpdetector: <a href="http://cpdetector.sourceforge.net/" target="_blank">http://cpdetector.sourceforge.net/</a><br>
<br>
convert_encoding.py: <a href="https://github.com/goerz/convert_encoding.py" target="_blank">https://github.com/goerz/convert_encoding.py</a><br>
This includes a "guess encoding" option.<br>
<br>
These tools do not seem to be freely available:<br>
<a href="http://www.lingua-systems.com/language-identifier/lidc-application/" target="_blank">http://www.lingua-systems.com/language-identifier/lidc-application/</a><br>
<a href="http://www.lingua-systems.com/unicode-converter/autouniconv-library/" target="_blank">http://www.lingua-systems.com/unicode-converter/autouniconv-library/</a><br>
<br>
The standard unix tool 'file' is of course also sometimes helpful but<br>
too restricted.<br>
<br>
Is there anything else (that I can use without training specific models myself)?<br>
<br>
Thanks!<br>
Jörg<br>
<br>
<br>
<br>
--<br>
**********************************************************************************<br>
Jörg Tiedemann <a href="mailto:jorg.tiedemann@lingfil.uu.se">jorg.tiedemann@lingfil.uu.se</a><br>
Dep. of Linguistics and Philology<br>
<a href="http://stp.lingfil.uu.se/%7Ejoerg/" target="_blank">http://stp.lingfil.uu.se/~joerg/</a><br>
Uppsala University tel: +46 (0)18 - 471 1412<br>
Box 635, SE-751 26 Uppsala/SWEDEN fax: +46 (0)18 - 471 1094<br>
<br>
_______________________________________________<br>
UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
</blockquote></div><br><br>