Have a look at <a href="http://tika.apache.org/">Apache Tika</a>. It extracts text and metadata from many formats, including MS Word, and provides a unified API over libraries such as Apache POI.<br><br>HTH<br><br>Julien<br>
<br><div class="gmail_quote">On 9 February 2012 12:26, Jakub Piskorski <span dir="ltr"><<a href="mailto:jakub.piskorski@frontex.europa.eu">jakub.piskorski@frontex.europa.eu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Maybe this will come in handy as well:<br>
<a href="http://poi.apache.org/" target="_blank">http://poi.apache.org/</a><br>
<br>
cheers,<br>
<br>
Jakub<br>
<div class="HOEnZb"><div class="h5"><br>
> -----Original Message-----<br>
> From: <a href="mailto:corpora-bounces@uib.no">corpora-bounces@uib.no</a> [mailto:<a href="mailto:corpora-bounces@uib.no">corpora-bounces@uib.no</a>] On Behalf Of Tino<br>
> Didriksen<br>
> Sent: 09 February 2012 13:08<br>
> To: Josep M. Fontana<br>
> Cc: <a href="mailto:corpora@uib.no">corpora@uib.no</a><br>
> Subject: Re: [Corpora-List] Tools for batch conversion Word to UTF-8.<br>
><br>
> Modern MS Word .docx files are ZIPs with XML documents, which don't require<br>
> much scripting to extract plain text from.<br>
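<br>
As a rough illustration of the point above, here is a minimal Python sketch that uses only the standard library. It assumes the main text lives in word/document.xml inside the ZIP and that one line per paragraph is enough. The function names docx_to_text and convert_dir are made up for this example. Headers, footnotes, tabs and manual line breaks are ignored, and text boxes nested inside paragraphs may be emitted twice.<br>

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The visible text of a paragraph is spread over its w:t elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def convert_dir(src_dir, out_dir):
    """Write a UTF-8 .txt file into out_dir for every .docx in src_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for docx in sorted(Path(src_dir).glob("*.docx")):
        target = out / (docx.stem + ".txt")
        target.write_text(docx_to_text(docx), encoding="utf-8")
        written.append(target)
    return written
```

Calling convert_dir("in", "out") would then leave one UTF-8 text file per .docx in the hypothetical "out" directory.<br>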
><br>
> Older .doc files will need a trip through some tool. It is possible to use<br>
> OpenOffice/LibreOffice in headless mode for this, and OOo/LO's Office reader gets<br>
> most of the doc format right.<br>
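<br>
A hedged Python sketch of the headless route follows. It assumes a LibreOffice binary named soffice is on the PATH and that the "txt:Text (encoded):UTF8" export filter is available in the installed version. Both are assumptions to check locally, and older OpenOffice builds spell the option -headless. The helper names build_command and convert_docs are made up for this example.<br>

```python
import subprocess
from pathlib import Path


def build_command(doc_paths, out_dir, soffice="soffice"):
    """Build the argument list for one headless batch conversion."""
    return [
        soffice,
        "--headless",
        "--convert-to", "txt:Text (encoded):UTF8",  # assumed filter name
        "--outdir", str(out_dir),
        *[str(p) for p in doc_paths],
    ]


def convert_docs(src_dir, out_dir, soffice="soffice"):
    """Convert every .doc file in src_dir to UTF-8 text in out_dir."""
    docs = sorted(Path(src_dir).glob("*.doc"))
    if docs:
        # check=True raises CalledProcessError if the converter fails.
        subprocess.run(build_command(docs, out_dir, soffice), check=True)
    return docs
```

Passing all files in a single call avoids starting the office suite once per document, which is usually the slow part.<br>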
><br>
> -- Tino Didriksen<br>
><br>
><br>
> On Thu, Feb 9, 2012 at 12:38, Josep M. Fontana <<a href="mailto:josepm.fontana@upf.edu">josepm.fontana@upf.edu</a>> wrote:<br>
><br>
><br>
> Does anyone here know of a good free application to batch convert Word<br>
> documents to UTF-8? (Linux, OS X or Windows, it doesn't matter)<br>
><br>
><br>
<br>
</div></div><div class="HOEnZb"><div class="h5">
</div></div></blockquote></div><br><br clear="all"><br>-- <br><span style="border-collapse:separate;color:rgb(0,0,0);font-family:'Times New Roman';font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;font-size:medium"><span style="font-family:arial;font-size:small"><b style="color:rgb(0,0,0);font-family:arial,helvetica,sans-serif"><img src="http://digitalpebble.com/img/logo.gif" height="38" width="200"><br style="color:rgb(51,51,51);font-family:arial,helvetica,sans-serif">
</b><span style="color:rgb(102,102,102);font-family:arial,helvetica,sans-serif"><span style="color:rgb(51,51,51)">Open Source Solutions for Text Engineering</span><br><br></span></span></span><span style="color:rgb(102,102,102)"><a href="http://digitalpebble.blogspot.com/" target="_blank">http://digitalpebble.blogspot.com/</a></span><br style="color:rgb(102,102,102)">
<span style="color:rgb(102,102,102)"><a href="http://www.digitalpebble.com" target="_blank">http://www.digitalpebble.com</a><br><a href="http://twitter.com/digitalpebble" target="_blank">http://twitter.com/digitalpebble</a></span><br>
<br>