Jörg, <br><br>Have a look at Apache Tika (<a href="http://tika.apache.org">http://tika.apache.org</a>). It handles MIME type, charset and language detection, is released under the Apache License and is widely used.<br>The Tika website has quite a bit of documentation; for those who want to go deeper, the book Tika in Action is available through the Manning Early Access Program [1].<br>

<br>HTH<br><br>Julien<br><br>[1] <a href="http://www.manning.com/mattmann/">http://www.manning.com/mattmann/</a><br clear="all"><br>-- <br>Open Source Solutions for Text Engineering<br><br><a href="http://digitalpebble.blogspot.com/" target="_blank">http://digitalpebble.blogspot.com/</a><br>

<a href="http://www.digitalpebble.com/" target="_blank">http://www.digitalpebble.com</a><br><br><div class="gmail_quote">On 16 April 2011 11:51, Joerg Tiedemann <span dir="ltr">&lt;<a href="mailto:jorg.tiedemann@lingfil.uu.se">jorg.tiedemann@lingfil.uu.se</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">Can someone point me to reliable (freely available) tools for<br>

character set identification?<br>

I would like to have a rather universal tool that can give me the used<br>

char encoding for a given text and given the expected language of that<br>

text.<br>

(Possibly with confidence values if available.)<br>

<br>

I know about these tools (but I would also appreciate any comments<br>

about their quality):<br>

<br>

enca:  <a href="http://gitorious.org/enca" target="_blank">http://gitorious.org/enca</a><br>

This is exactly what I need but does not support a lot of languages.<br>

Maybe someone knows how to extend it with more languages/encodings?<br>

<br>

utrac:  <a href="http://utrac.sourceforge.net/" target="_blank">http://utrac.sourceforge.net/</a><br>

I haven't tested it but it seems to be quite restricted as well.<br>

<br>

cpdetector:  <a href="http://cpdetector.sourceforge.net/" target="_blank">http://cpdetector.sourceforge.net/</a><br>

<br>

<a href="https://github.com/goerz/convert_encoding.py" target="_blank">https://github.com/goerz/convert_encoding.py</a><br>

includes a "guess encoding" option.<br>

<br>

These tools do not seem to be freely available:<br>

<a href="http://www.lingua-systems.com/language-identifier/lidc-application/" target="_blank">http://www.lingua-systems.com/language-identifier/lidc-application/</a><br>

<a href="http://www.lingua-systems.com/unicode-converter/autouniconv-library/" target="_blank">http://www.lingua-systems.com/unicode-converter/autouniconv-library/</a><br>

<br>

The standard unix tool 'file' is of course also sometimes helpful but<br>

too restricted.<br>

<br>

Is there anything else (that I can use without training specific models myself)?<br>

<br>

Thanks!<br>

Jörg<br>

<br>

<br>

<br>

--<br>

**********************************************************************************<br>

 Jörg Tiedemann                                     <a href="mailto:jorg.tiedemann@lingfil.uu.se">jorg.tiedemann@lingfil.uu.se</a><br>

 Dep. of Linguistics and Philology<br>

<a href="http://stp.lingfil.uu.se/%7Ejoerg/" target="_blank">http://stp.lingfil.uu.se/~joerg/</a><br>

 Uppsala University                                  tel:  +46 (0)18 - 471 1412<br>

 Box 635, SE-751 26 Uppsala/SWEDEN   fax: +46 (0)18 - 471 1094<br>

<br>


</blockquote></div><br><br>