[Corpora-List] Tools for batch conversion Word to UTF-8.

Julien Nioche lists.digitalpebble at gmail.com
Thu Feb 9 12:45:12 UTC 2012


Have a look at Apache Tika <http://tika.apache.org/>. It can extract
text and metadata from various formats including MS Word and provides a
unified API over libraries such as Apache POI.

HTH

Julien
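
A minimal sketch of the .docx route mentioned in the quoted message below (a .docx is a ZIP holding XML), using only the Python standard library. The function names `docx_to_text` and `convert_folder` are illustrative, not part of Tika or POI, and the sketch assumes the main text lives in `word/document.xml`. It ignores headers, footers, footnotes, tabs and manual line breaks, and it does not handle old .doc files at all.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split over several <w:t> runs; join them.
        runs = (node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def convert_folder(src_dir, dst_dir):
    """Write one UTF-8 .txt file into dst_dir for every .docx in src_dir."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for docx in sorted(Path(src_dir).glob("*.docx")):
        target = out / (docx.stem + ".txt")
        target.write_text(docx_to_text(docx), encoding="utf-8")
```

Calling `convert_folder("in", "out")` (both folder names are placeholders) would leave one .txt per document in `out`. For anything beyond plain body text, or for .doc input, a real converter such as Tika or LibreOffice is the safer choice.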

On 9 February 2012 12:26, Jakub Piskorski <jakub.piskorski at frontex.europa.eu> wrote:

>
> Maybe this will come in handy as well:
> http://poi.apache.org/
>
> cheers,
>
> Jakub
>
> > -----Original Message-----
> > From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Tino Didriksen
> > Sent: 09 February 2012 13:08
> > To: Josep M. Fontana
> > Cc: corpora at uib.no
> > Subject: Re: [Corpora-List] Tools for batch conversion Word to UTF-8.
> >
> > Modern MS Word .docx files are ZIP archives of XML documents, so extracting
> > plain text from them doesn't require much scripting.
> >
> > Older .doc files will need a trip through some tool. It is possible to use
> > OpenOffice/LibreOffice in headless mode for this, and OOo/LO's Office reader
> > gets most of the doc format right.
> >
> > -- Tino Didriksen
> >
> >
> > On Thu, Feb 9, 2012 at 12:38, Josep M. Fontana <josepm.fontana at upf.edu> wrote:
> >
> >
> >       Does anyone here know of a good free application to batch convert Word
> >       documents to UTF-8? (Linux, OS X or Windows, it doesn't matter)
> >
> >
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble