[Corpora-List] Tools for batch conversion Word to UTF-8.

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu Feb 9 16:56:16 UTC 2012


Hi Josep & others,

It's worth knowing that you don't need a separate program to do this. It can be scripted within Word itself using Word's own macro language, Visual Basic for Applications (VBA).

Here's a very quick script I threw together to do just that, built from a couple of prerecorded macros and a few hints from web searches. It finds every file matching *.doc in the target directory (here C:\, but you can change it by editing the path variable) and saves each one as UTF-8 text under the same name with an extra ".txt" appended, so foo.doc becomes foo.doc.txt. If you paste this text into the macros of a dummy document or template and run the SaveAllDocsAsUtf8 macro, it will generate the files you need.

(begin bit to copy into the macro file)

Sub SaveAllDocsAsUtf8()
    Dim path As String
    Dim file As String
    Dim newfile As String
    Dim doc As Document

    ' Folder to process; it must end with a backslash.
    path = "C:\"
    ' Dir returns the first file name matching the pattern, or "" if there is none.
    file = Dir(path & "*.doc")

    Do While file <> ""
        ' Output name is the original name plus ".txt".
        newfile = path & file & ".txt"
        'MsgBox newfile

        ' Open the document without conversion prompts and keep a reference to it,
        ' so the save and close below do not depend on which window is active.
        Set doc = Documents.Open(FileName:=path & file, ConfirmConversions:=False, _
            ReadOnly:=False, AddToRecentFiles:=False)

        ' Save as plain text; Encoding 65001 is the UTF-8 code page.
        doc.SaveAs FileName:=newfile, FileFormat:=wdFormatText, _
            AddToRecentFiles:=False, Encoding:=65001, _
            InsertLineBreaks:=False, AllowSubstitutions:=False, _
            LineEnding:=wdCRLF

        ' Close without prompting to save again.
        doc.Close SaveChanges:=wdDoNotSaveChanges

        ' Dir with no arguments returns the next match.
        file = Dir
    Loop
End Sub

(end bit to copy into the macro file)
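
If you would rather get foo.txt than foo.doc.txt, a small helper along the following lines could replace the extension instead of appending to it. This is only a sketch: TxtNameFor is a made-up name, not something built into Word, and it relies only on the standard VBA functions InStrRev and Left$.

```vba
' Hypothetical helper: returns the file name with its last extension
' replaced by ".txt" (or with ".txt" appended if there is no extension).
Function TxtNameFor(ByVal docName As String) As String
    Dim dotPos As Long

    ' Position of the last dot in the name, or 0 if there is none.
    dotPos = InStrRev(docName, ".")

    If dotPos > 1 Then
        TxtNameFor = Left$(docName, dotPos - 1) & ".txt"
    Else
        TxtNameFor = docName & ".txt"
    End If
End Function
```

To use it, the line that builds newfile in the macro would become newfile = path & TxtNameFor(file). Bear in mind that two source files differing only in extension would then map to the same .txt name, and the second would overwrite the first.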

Hope this is useful.

best

Andrew.


> On Thu, Feb 9, 2012 at 12:38, Josep M. Fontana <josepm.fontana at upf.edu> wrote:
>
> Does anyone here know of a good free application to batch convert Word
> documents to UTF-8? (Linux, OS X or Windows, it doesn't matter)
>
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
