More on Transcriber and Unicode input
Andrew Margetts
apmargetts at IPRIMUS.COM.AU
Sat Nov 20 06:15:15 UTC 2010
This is a follow up to my earlier post, 'Toolbox as Transcriber'.
Several people responded with the suggestion that an alternative
strategy might be to use Microsoft Keyboard Layout Creator (MSKLC) with
Transcriber to facilitate the direct input of Unicode special characters
(using UTF-8). MSKLC is freely available from
http://msdn.microsoft.com/en-us/goglobal/bb964665.aspx
This sounded like a great idea (albeit one that rather negated my own),
so I had a look at it. Unfortunately, as far as I can see, MSKLC doesn't
work in Transcriber (at least with version 1.5.1 on Windows XP
Professional). If anybody knows otherwise I would be very interested to
hear.
The good news is that MSKLC is easy to set up and use, and does work
well in (among others):
Toolbox
ELAN
Notepad
Regarding what IS possible in Transcriber:
1) you can do search-and-replace operations within Transcriber, but it
seems you must paste special characters into the dialog box from an
external editor that can handle the required input. It is therefore
simpler to do all such editing in a text editor after completing the
Transcriber file. Notepad can be used for this task, since it can handle
UTF-8. (Avoid Wordpad and Word, which just introduce problems; Notepad is
reliable because it is purely a text editor).
2) you can also paste text strings which include special characters
directly into Transcriber units.
In any case, it is crucial to explicitly set the encoding in
Transcriber to UTF-8 thus:
'Options > General > Encoding > Unicode(UTF-8)'
The result is that the top line in the .trs file will read:
<?xml version="1.0" encoding="UTF-8"?>
rather than
<?xml version="1.0" encoding="ISO-8859-1"?>
which is the default.
This technique, however, does not always work well on existing
Transcriber files (you have to at least make a change to the file so
that you can save it); but of course you can instead just make the
substitution in a text editor rather than using the Transcriber commands.
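For anyone who would rather script that substitution than do it by hand, here is a minimal sketch in Python. It assumes the existing file really was saved as ISO-8859-1 (the default mentioned above), so it re-encodes the content as well as changing the declaration. The function name to_utf8_trs and the file names in the usage note are made up for illustration.

```python
import re

def to_utf8_trs(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 .trs content as UTF-8 and update the XML declaration.

    Assumption: the input bytes are ISO-8859-1, as declared on the first line.
    """
    text = data.decode("iso-8859-1")
    # Change only the first encoding declaration (the one on the top line).
    text = re.sub(r'encoding="ISO-8859-1"', 'encoding="UTF-8"', text,
                  count=1, flags=re.IGNORECASE)
    return text.encode("utf-8")
```

To use it, read the old file in binary mode, e.g. open("old.trs", "rb").read(), pass the result to to_utf8_trs, and write the returned bytes to a new file opened with "wb". Keep a copy of the original until you have checked the result in Transcriber.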
For good measure I suggest also doing in Transcriber:
'Options > Save configuration'
to keep UTF-8 as the default encoding for new Transcriber files.
Failure to do this may result in Transcriber discarding all your Unicode
characters on save, close and re-open - which is really very annoying.
If you are having this problem check that the top line is correct!
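If you have many files to check, the top line can also be inspected with a short script. The following is only a sketch: declared_encoding is a hypothetical helper name, and it simply reads whatever the XML declaration claims, without verifying that the rest of the file is actually encoded that way.

```python
import re

def declared_encoding(first_line: str):
    """Return the encoding named in an XML declaration line, or None if absent."""
    match = re.match(r'<\?xml[^>]*\bencoding=["\']([^"\']+)["\']',
                     first_line.strip())
    return match.group(1) if match else None
```

For a file on disk, read the first line in binary mode and decode it as ASCII before passing it in, e.g. open(path, "rb").readline().decode("ascii", "replace"). Any result other than "UTF-8" means the file needs the fix described above.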
To summarise this as a work-flow, in case you do wish to use
search-and-replace techniques with MSKLC I suggest:
1) define and load your custom MSKLC keyboard - it will show up as one
of the options in the 'Language bar' (usually present in the Windows
Taskbar at the bottom of the screen - if it is not there you will have
to enable it via 'Control Panel > Regional and Language Options >
Languages > Details > Settings > Language Bar > Show the language bar on
the desktop').
2) set Transcriber to encode as UTF-8, but then use a working
orthography in Transcriber.
3) open each finished file in Notepad (or other text editor) and
transform to the real orthography with search-and-replace, using the
default keyboard for the 'search' term and switching to your custom
keyboard (using the Language bar) for the 'replace' term.
4) (Optionally reopen the file in Transcriber to see what it should have
looked like all along).
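The search-and-replace step could also be scripted instead of done by hand in Notepad. The sketch below is one possible approach in Python, not a tested tool: the ORTHOGRAPHY table is an invented example of a working-to-real orthography mapping, and the function names are hypothetical. It replaces only text outside the XML tags, so that tag and attribute names in the .trs file are left alone, and it tries longer search terms first so that digraphs are not broken up.

```python
import re

# Hypothetical mapping from working orthography to real orthography.
ORTHOGRAPHY = {"ng": "\u014b", "E": "\u025b", "O": "\u0254"}  # ŋ, ɛ, ɔ

def convert_text(text, mapping=ORTHOGRAPHY):
    """Apply all replacements in one pass, longest search terms first."""
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

def convert_trs(xml_text, mapping=ORTHOGRAPHY):
    """Convert only the text between tags; tags themselves are kept as-is."""
    parts = re.split(r"(<[^>]*>)", xml_text)
    return "".join(p if p.startswith("<") else convert_text(p, mapping)
                   for p in parts)
```

The file should be read and written as UTF-8, e.g. with open(path, encoding="utf-8"). Note that character entities such as &amp; count as ordinary text here, so a mapping that includes the letters a, m or p would need extra care.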
Incidentally, the on-line Transcriber to Toolbox converter can handle
(and display) UTF-8 so you should have no problem using it to convert
such a Transcriber file to an accurate Toolbox representation. As
mentioned above, both Toolbox and ELAN support MSKLC keyboards.
If you subsequently import a Toolbox file that uses UTF-8 special
characters into ELAN you must use 'File > Import > Toolbox File...',
rather than 'File > Import > Shoebox File...', and you must tick the
box 'All markers are Unicode'. Similarly, you should use 'File > Export
As > Toolbox File(UTF-8)...'
I hope these notes save someone some pain.
Andrew Margetts