More on Transcriber and Unicode input

Andrew Margetts apmargetts at IPRIMUS.COM.AU
Sat Nov 20 06:15:15 UTC 2010


This is a follow-up to my earlier post, 'Toolbox as Transcriber'. 
Several people responded with the suggestion that an alternative 
strategy might be to use Microsoft Keyboard Layout Creator (MSKLC) with 
Transcriber to facilitate the direct input of Unicode special characters 
(using UTF-8). MSKLC is freely available from 
http://msdn.microsoft.com/en-us/goglobal/bb964665.aspx

This sounded like a great idea (albeit one that rather negated my own), 
so I had a look at it. Unfortunately, as far as I can see, MSKLC doesn't 
work in Transcriber (at least not with version 1.5.1 on Windows XP 
Professional). If anybody knows otherwise, I would be very interested to 
hear.

The good news is that MSKLC is easy to set up and use, and does work 
well in (among others):
Toolbox
ELAN
Notepad

Regarding what IS possible in Transcriber:
1) you can do search-and-replace operations within Transcriber, but it 
seems you must paste special characters into the dialog box from an 
external editor that can handle the required input. It is therefore 
simpler to do all such editing in a text editor after completing the 
Transcriber file. Notepad can be used for this task, since it can 
handle UTF-8. (Avoid Wordpad and Word, which just introduce problems; 
Notepad is reliable because it is purely a text editor.)
2) you can also paste text strings which include special characters 
directly into Transcriber units.

In any case, it is crucial to explicitly set the encoding in 
Transcriber to UTF-8 thus:
'Options > General > Encoding > Unicode(UTF-8)'

The result is that the top line in the .trs file will read:
<?xml version="1.0" encoding="UTF-8"?>
rather than
<?xml version="1.0" encoding="ISO-8859-1"?>
which is the default.
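
A quick way to see which of the two lines a given file has is to read 
its first line with a script. The following is only a minimal sketch in 
Python (not something Transcriber provides); the function names are my 
own, and it assumes the XML declaration is on the first line of the file 
as shown above.

```python
import codecs
import re

# Matches the encoding named in an XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?>
_DECLARATION = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(first_line: bytes):
    """Return the encoding named in an XML declaration, or None if absent."""
    # Skip a UTF-8 byte order mark, which some editors put at the start.
    if first_line.startswith(codecs.BOM_UTF8):
        first_line = first_line[len(codecs.BOM_UTF8):]
    match = _DECLARATION.match(first_line)
    return match.group(1).decode("ascii") if match else None


def trs_encoding(path):
    """Read the first line of a .trs file and report its declared encoding."""
    with open(path, "rb") as handle:
        return declared_encoding(handle.readline())
```

Calling trs_encoding on a file (the file name is up to you) should 
give "UTF-8" for a file saved with the setting above, and "ISO-8859-1" 
for one saved with the default.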

This technique does not always work well on existing Transcriber files, 
however (you have to make at least one change to the file before you 
can save it); but of course you can instead just make the substitution 
in a text editor rather than using the Transcriber commands.
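
For a batch of existing files, the same substitution could be scripted. 
The sketch below is an illustration only, in Python, with hypothetical 
function names. It assumes the file really is encoded as ISO-8859-1, as 
its top line declares; it re-encodes the content as UTF-8 and changes 
the declaration to match. Work on copies, not originals.

```python
def to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 file content as UTF-8 and update the top line."""
    # Assumption: the bytes really are ISO-8859-1, as the declaration says.
    text = raw.decode("iso-8859-1")
    # Only touch the first line, where the XML declaration lives.
    head, newline, rest = text.partition("\n")
    head = head.replace('encoding="ISO-8859-1"', 'encoding="UTF-8"', 1)
    return (head + newline + rest).encode("utf-8")


def convert_file(source_path, target_path):
    """Write a UTF-8 copy of source_path to target_path."""
    with open(source_path, "rb") as source:
        converted = to_utf8(source.read())
    with open(target_path, "wb") as target:
        target.write(converted)
```

Writing to a separate target file keeps the original untouched in case 
the assumption about its encoding turns out to be wrong.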

For good measure I suggest also doing in Transcriber:
'Options > Save configuration'
to keep UTF-8 as the default encoding for new Transcriber files.

If the encoding is not set to UTF-8, Transcriber may discard all your 
Unicode characters on save, close and re-open - which is really very 
annoying. If you are having this problem, check that the top line of the 
file is correct!

To summarise this as a work-flow, in case you wish to use 
search-and-replace techniques with MSKLC, I suggest:
1) define and load your custom MSKLC keyboard - it will show up as one 
of the options in the 'Language bar' (usually present in the Windows 
Taskbar at the bottom of the screen - if it is not there you will have 
to enable it via 'Control Panel > Regional and Language Options > 
Languages > Details > Settings > Language Bar > Show the language bar on 
the desktop').
2) set Transcriber to encode as UTF-8, but then use a working 
orthography in Transcriber.
3) open each finished file in Notepad (or other text editor) and 
transform to the real orthography with search-and-replace, using the 
default keyboard for the 'search' term and switching to your custom 
keyboard (using the Language bar) for the 'replace' term.
4) (Optionally reopen the file in Transcriber to see what it should have 
looked like all along).
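
The search-and-replace step could also be scripted rather than done by 
hand in Notepad. The following is a rough sketch in Python; the mapping 
is purely hypothetical (a working orthography that types N, E and O for 
the special characters), and the function name is my own. Substitute 
whatever correspondences your own orthography needs.

```python
import re

# Hypothetical working orthography: capital letters stand in for
# special characters that are awkward to type in Transcriber.
WORKING_TO_REAL = {"N": "ŋ", "E": "ɛ", "O": "ɔ"}


def apply_orthography(text, mapping):
    """Replace each working-orthography sequence with its real form."""
    if not mapping:
        return text
    # Try longer sequences first so a digraph wins over its first letter.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass means replaced text is never replaced a second time.
    return pattern.sub(lambda match: mapping[match.group(0)], text)
```

For example, apply_orthography("Na bEl", WORKING_TO_REAL) gives 
"ŋa bɛl". Note that, just like search-and-replace in a text editor, 
running this over a whole .trs file will also change matching 
characters inside the XML markup, so choose working symbols that do not 
occur there, or apply it only to the transcribed text.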

Incidentally, the on-line Transcriber to Toolbox converter can handle 
(and display) UTF-8 so you should have no problem using it to convert 
such a Transcriber file to an accurate Toolbox representation. As 
mentioned above, both Toolbox and ELAN support MSKLC keyboards.

If you subsequently import a Toolbox file that uses UTF-8 special 
characters into ELAN, you must use 'File > Import > Toolbox File...' 
rather than 'File > Import > Shoebox File...', and you must tick the 
box 'All markers are Unicode'. Similarly, when exporting you should use 
'File > Export As > Toolbox File(UTF-8)...'

I hope these notes save someone some pain.

Andrew Margetts


