[Corpora-List] Tools for batch conversion Word to UTF-8.

Adam Radziszewski kocikikut at gmail.com
Thu Feb 9 12:07:05 UTC 2012


Dear Josep,
there is a fairly universal solution: OpenOffice's “UNO” API, which can be
driven from Python, and existing Python scripts already do this. The
advantage is that this way you can convert any type of text document that
OpenOffice is able to open. A typical session requires a running OO
instance, e.g.

$ openoffice.org "-accept=socket,host=localhost,port=2002;urp;"
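
A minimal sketch of such a session follows. It is not the script mentioned
in this message. It assumes a Python 3 interpreter with the PyUNO bindings
available (for example the one shipped with the office suite), an instance
listening on port 2002 as started above, and the "Text (encoded)" export
filter with the "UTF8" filter option. The helper names (file_url,
target_path, make_property, connect, convert_to_utf8_text,
convert_directory) and the "*.doc" pattern are illustrative choices only.

```python
from pathlib import Path


def file_url(path):
    """Return an absolute file:// URL, the form UNO expects for documents."""
    return Path(path).resolve().as_uri()


def target_path(src, out_dir):
    """Map e.g. in/report.doc to <out_dir>/report.txt (hypothetical naming)."""
    return Path(out_dir) / (Path(src).stem + ".txt")


def make_property(name, value):
    """Build a com.sun.star.beans.PropertyValue; needs the PyUNO bindings."""
    import uno  # noqa: F401  (importing uno enables the com.sun.star imports)
    from com.sun.star.beans import PropertyValue
    prop = PropertyValue()
    prop.Name = name
    prop.Value = value
    return prop


def connect(host="localhost", port=2002):
    """Connect to the running office instance and return its Desktop."""
    import uno
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        "uno:socket,host=%s,port=%d;urp;StarOffice.ComponentContext"
        % (host, port))
    return ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)


def convert_to_utf8_text(desktop, src, dst):
    """Open src without showing a window and export it as UTF-8 plain text."""
    doc = desktop.loadComponentFromURL(
        file_url(src), "_blank", 0, (make_property("Hidden", True),))
    try:
        # Assumed filter name and option for UTF-8 plain-text export.
        doc.storeToURL(file_url(dst), (
            make_property("FilterName", "Text (encoded)"),
            make_property("FilterOptions", "UTF8"),
        ))
    finally:
        doc.close(True)


def convert_directory(src_dir, dst_dir, pattern="*.doc"):
    """Convert every file in src_dir matching pattern; one .txt per input."""
    desktop = connect()
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob(pattern)):
        convert_to_utf8_text(desktop, src, target_path(src, out_dir))
```

With the office instance running, calling convert_directory("in", "out")
would write one UTF-8 text file into "out" for each matching document in
"in"; both directory names are placeholders.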

I can't find the script I was using originally, but something similar can
be found here: http://rajeeshknambiar.wordpress.com/tag/pyuno/

I've got a modified version of the original script I found. This version
outputs a very simple XML format in which each paragraph is written
separately (a variant of XCES, as employed in the IPI PAN Corpus of
Polish). If you're interested, I'll share the script (it's able to process
all the text files in a given directory).
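
To illustrate the idea of writing each paragraph separately, here is a rough
sketch only. It is not the script offered above and does not reproduce the
XCES variant used in the IPI PAN Corpus: the element names "doc" and "p" are
placeholders, and the assumption that the exported plain text holds one
paragraph per line may not hold for every document.

```python
import xml.etree.ElementTree as ET


def paragraphs_to_xml(text):
    """Wrap each non-empty line of text in its own <p> element.

    "doc" and "p" are placeholder element names, not actual XCES markup.
    """
    root = ET.Element("doc")
    for line in text.splitlines():
        # Collapse runs of whitespace and skip lines that end up empty.
        paragraph = " ".join(line.split())
        if paragraph:
            ET.SubElement(root, "p").text = paragraph
    # ElementTree escapes characters such as "&" and "<" on serialization.
    return ET.tostring(root, encoding="unicode")
```

The returned string can be written to disk with
open(path, "w", encoding="utf-8") so that the output stays in UTF-8.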

Best,
Adam Radziszewski