Utilities for analyzing keyboards?

Don Osborn
Sun Jun 29 23:35:11 UTC 2008

Hi Andrew,

One problem we run up against in talking about various advanced
applications is the issue of corpora. There is a need to find ways to (1)
more effectively digitize existing text, and (2) generate new text. On the
former (1), I would really like to see (a) a project to ensure that extended
Latin texts already scanned for projects like Google Books are OCR'd
properly where extended Latin characters and diacritics are involved (I've
already written to that project about this), and (b) a new, focused
effort to digitize all extant texts in under-resourced
languages. On the latter (2), Mark Liberman and colleagues at the
Linguistic Data Consortium (University of Pennsylvania) have an interesting
project concept for involving school students in transcribing oral histories,
which could then become part of local heritage resources as well as
building the corpora for the languages (makes me wonder if OLPC and
similar projects could be involved in a pilot effort along these lines).


That said, and returning to the topic of analyzing keyboards: I would hope
that even a relatively small amount of text could in the meantime give us an
idea of how efficient alternative keyboard layouts are. We can make an
educated guess about which key arrangements might be more advantageous in
one way or another, but until we can begin to collect and
statistically analyze basic data on keystrokes, etc., these are just estimates.
With small texts that are probably not "representative samplings" (if such a
thing were possible in language), there is a risk that a particular text
could give a misleading result. But at this stage in the discussion we may
just be talking about beginning to get some better ideas about the efficiency
of alternative layouts.

Your second tool would necessitate having a large corpus in each language to
use for the analysis.

As a quick experiment, I thought I'd look at some character frequencies in a
single text. It is only an experiment, since a single text couldn't be
considered adequate for a proper analysis.

Since the draft Yoruba keyboard layout uses combining characters for all the
diacritics, I took the Yoruba translation of the UDHR and normalised the
text using NFD. I then ran it through a script to count the occurrences of
each character.
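
For anyone wanting to repeat this kind of count, the steps described above could be sketched in Python roughly as follows. This is only an illustration, not the script that was actually used: the function name and the short sample string are assumptions, and whitespace is left out of the counts as a design choice.

```python
import unicodedata
from collections import Counter


def char_frequencies(text):
    """Count characters after NFD normalisation, ignoring whitespace.

    NFD splits precomposed letters into a base letter followed by
    combining marks, so each diacritic is counted as its own character.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return Counter(ch for ch in decomposed if not ch.isspace())


if __name__ == "__main__":
    # Illustrative sample only; substitute the full text being analysed.
    sample = "Èdè Yorùbá"
    for ch, count in char_frequencies(sample).most_common():
        name = unicodedata.name(ch, "UNKNOWN")
        print(f"U+{ord(ch):04X}  {name}: {count}")
```

Sorting the result with `most_common()` puts the most frequently typed characters, including any combining diacritics, at the top of the list.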

Of the four most frequent characters, three were the combining diacritics:
acute, grave and dot-below. Although a single text is inconclusive, it is
suggestive that for Yoruba the combining diacritics need to be typed
frequently and should be in positions allowing them to be typed easily and

And yes, I converted the vertical line below to a dot below before running
the test on the UDHR translation.
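
A conversion of that kind could be sketched as below, assuming the marks in question are U+0329 COMBINING VERTICAL LINE BELOW and U+0323 COMBINING DOT BELOW. The function name is hypothetical, and this is not necessarily how the conversion was actually done.

```python
import unicodedata

VERTICAL_LINE_BELOW = "\u0329"  # COMBINING VERTICAL LINE BELOW
DOT_BELOW = "\u0323"            # COMBINING DOT BELOW


def unify_underdots(text):
    """Return NFD text with every vertical line below replaced by a dot below."""
    # Decompose first so precomposed dot-below letters and combining
    # sequences end up in the same form.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.replace(VERTICAL_LINE_BELOW, DOT_BELOW)
```

Both marks have the same canonical combining class, so swapping one for the other leaves the order of the combining sequence unchanged.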

Andrew Cunningham
Research and Development Coordinator
State Library of Victoria

andrewc at vicnet.net.au

