[Lexicog] Incorporating an existing English/Vernacular word list/dictionary into a dictionary project....
Ronald Moe
ron_moe at SIL.ORG
Wed Mar 26 17:22:57 UTC 2008
Thapelo Otlogetswe wrote:
“Will marking tone therefore demand manual labour or there is an elegant
computational way of doing it?”
Whenever some aspect of a language is unpredictable, there is no way to
record that feature entirely automatically. It requires a speaker of
the language to indicate it manually. However, there are ways to make this
easy. For instance, if you need to mark tone on 40,000 headwords, the
following procedure would make it simpler:
1. Generate a CV pattern field. FieldWorks can do this in a few
minutes. Then sort the database on the CV pattern field. This brings
together all those words with the same syllable structure. When you read
these words, you will be dealing with fewer tone patterns, and therefore the
decision making is easier.
2. Indicate the grammatical category. In some languages this can be
done semi-automatically. For instance if the citation form has an
inflectional or derivational affix, in many cases the affix indicates the
grammatical category. FieldWorks enables you to filter for the affix and
then assign the correct grammatical category to each word. For instance in a
Bantu language you could assign the grammatical category to 75% of the words
in a few minutes. Once you have marked the grammatical category, you can
sort the database on the grammatical category and the CV pattern field. This
further restricts the number of tone patterns, making the decision making
even easier.
3. Use a Find/Replace function to mark the tone on many lexemes at
once. In order to do this, your software has to enable you to work
interactively with the data. For instance Toolbox has a Replace function,
but it is not efficient to interact with it. In contrast FieldWorks has a
Replace function that works on a browse view so that you can see what it is
doing. It has a “Preview” function that shows what changes will be performed
by the Replace tool. In addition it has check boxes so that you can
eliminate exceptions to a rule. It is a little difficult to explain how this
all works in FieldWorks. But trust me, the combination of the Replace
function, the Preview function, and the checkboxes makes it very easy and
efficient to rapidly make changes to your database, such as assigning tone
marks to the citation form or a pronunciation field. These tools in
FieldWorks were specifically designed to do such work in the most efficient
way possible.
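The grouping idea in steps 1 and 2 can be sketched outside FieldWorks as well. The following is only an illustration, not how FieldWorks itself computes the CV pattern field: the vowel set, the function names, and the sample word "kitabu" are assumptions made for the example, and digraphs or syllabic nasals are naively treated as plain consonants.

```python
from collections import defaultdict

# Assumed vowel inventory; adjust it for the language being described.
VOWELS = set("aeiou")

def cv_pattern(word):
    """Return a CV skeleton for a word, e.g. 'mganga' -> 'CCVCCV'.

    Every letter not in VOWELS counts as a consonant; non-letters are skipped.
    """
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in word.lower()
        if ch.isalpha()
    )

def group_by_pattern(entries):
    """Group (headword, category) pairs by (category, CV pattern)."""
    groups = defaultdict(list)
    for headword, category in entries:
        groups[(category, cv_pattern(headword))].append(headword)
    return dict(groups)

# Illustrative data only: three words from this thread plus one extra noun.
entries = [("tabibu", "n"), ("daktari", "n"), ("mganga", "n"), ("kitabu", "n")]
groups = group_by_pattern(entries)
for key in sorted(groups):
    print(key, groups[key])
```

Words that land in the same group share a syllable skeleton and a grammatical category, so a speaker reviewing them together is choosing among far fewer tone patterns at a time.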
If you were to mark tone on 40,000 headwords one by one, it
would probably take you nearly three weeks of difficult, tedious labor.
However, doing it the way I have described would probably take you only a
couple of days, and the results would be much more accurate. This is why I
highly recommend FieldWorks as the best software for rapidly developing a
dictionary database. It will save you huge amounts of time and help you keep
your data consistent. We carefully thought through each task involved in
developing a dictionary and have designed tools to efficiently do each task.
One such task is the need to generate a phonemic rendering of each word from
the orthographic rendering. FieldWorks enables you to transform your
orthographic script into IPA characters and then refine the transcription
wherever the orthography under-differentiates a phonemic (or phonetic)
distinction. The phonemic rendering can then go into the Pronunciation field
and be included in the published dictionary. This process can also help you
analyze your phonology and determine where your orthography may need
revision.
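As a rough sketch of the first half of that process, a correspondence table can be applied with longest-match-first substitution so that digraphs are handled before single letters. The table and the test words below are invented for illustration and do not describe any real orthography; the function name is likewise hypothetical and is not part of FieldWorks.

```python
def transliterate(word, table):
    """Rewrite `word` using `table`, always trying the longest key first.

    Characters with no entry in the table are copied through unchanged.
    """
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No correspondence applies at this position.
            out.append(word[i])
            i += 1
    return "".join(out)

# Hypothetical orthography-to-IPA correspondences (assumed, not from the source).
TABLE = {"sh": "ʃ", "ch": "tʃ", "ny": "ɲ", "y": "j"}

print(transliterate("shanya", TABLE))
print(transliterate("yacha", TABLE))
```

A table like this only covers the regular correspondences. Wherever the orthography under-differentiates a distinction, such as unmarked tone, the output still has to be refined by someone who knows the language.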
Ron Moe
_____
From: lexicographylist at yahoogroups.com
[mailto:lexicographylist at yahoogroups.com] On Behalf Of Thapelo Otlogetswe
Sent: Wednesday, March 26, 2008 8:10 AM
To: lexicographylist at yahoogroups.com
Subject: RE: [Lexicog] Incorporating an existing English/Vernacular word
list/dictionary into a dictionary project....
Ron
I found the part of your message about generating a phonemic rendering of
words from an orthographic list of words interesting. Certainly with
languages in which the pronunciation is predictable from the spelling such a
tool would assist in the production of pronunciation dictionaries, although
this may not immediately address the marking of tone in tonal language words
(??). The question that I wish to ask is if one had a wordlist of say 40,000
headwords and they wished to upload it into FieldWorks and then generate a
phonemic rendering of each word, how would they go about it? Assuming such
a process was successful computationally, how would one then go about
marking tone in phonemic words? In many African languages, while tone is
lexicalised, it is not marked orthographically. Will marking tone therefore
demand manual labour or there is an elegant computational way of doing it?
Many thanks
Thapelo
Ronald Moe <ron_moe at sil.org> wrote:
Heather Souter wrote:
“Soon I will become part of a team that will be working to create the first
dictionary of our language that focuses on the vernacular.”
Hi Heather,
It would be a very simple matter to incorporate the existing
English-vernacular word list into a monolingual or bilingual
vernacular-English dictionary. There are tools available that can reverse a
dictionary. For instance we could take the following input:
\lx doctor
\de tabibu; mganga; daktari
and transform it into:
\lx tabibu
\de doctor
\lx mganga
\de doctor
\lx daktari
\de doctor
This can be done in a couple of minutes, no matter how large your dictionary
is.
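To make the reversal concrete, here is a minimal sketch. It assumes that each record has exactly one \lx line followed by one \de line and that glosses are separated by semicolons, as in the example above; the function names are hypothetical and this is not the actual tool referred to.

```python
def reverse_sfm(text):
    """Turn English-headword records into (vernacular, English) pairs.

    Only \\lx and \\de lines are read; any other marker is ignored.
    """
    pairs = []
    headword = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("\\lx "):
            headword = line[4:].strip()
        elif line.startswith("\\de ") and headword is not None:
            # Each semicolon-separated gloss becomes its own reversed record.
            for gloss in line[4:].split(";"):
                gloss = gloss.strip()
                if gloss:
                    pairs.append((gloss, headword))
    return pairs

def format_sfm(pairs):
    """Write (headword, definition) pairs back out as \\lx / \\de records."""
    return "\n".join(f"\\lx {lx}\n\\de {de}" for lx, de in pairs)

source = "\\lx doctor\n\\de tabibu; mganga; daktari"
print(format_sfm(reverse_sfm(source)))
```

This sketch does not merge records when the same vernacular word appears under several English headwords, and it drops fields such as example sentences. A real conversion would need to carry those along.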
There are also tools available that can help you update an orthography or
transliterate one script into another (e.g. orthography into IPA). The
length of time it would take would depend on how much you need to interact
with the changes. If the changes are regular, we could set up a table of
correspondences. The table could then be applied to your database in a
matter of minutes. However if your orthography does not accurately reflect
the phonology of the language, then you will need a tool that allows you to
interact with a Find/Replace function. The FieldWorks program has a tool
specifically designed for such a task. FieldWorks is available free of
charge from the SIL website. FieldWorks also includes a tool for collecting
and typing words using the DDP word collection method. I would highly
recommend that you use FieldWorks, since it has the most powerful tools that
I am aware of for rapidly developing a dictionary database.
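The idea of interacting with a Find/Replace function can be imitated in a small script: first list what a rule would change, then apply it to everything except the items marked as exceptions. This is only a sketch of the workflow, not the FieldWorks interface; the function names, the spelling rule, and the sample words are all invented for the example.

```python
import re

def preview_replace(words, pattern, replacement):
    """Return (old, new) pairs for every word the rule would change."""
    rule = re.compile(pattern)
    changes = []
    for word in words:
        new = rule.sub(replacement, word)
        if new != word:
            changes.append((word, new))
    return changes

def apply_replace(words, pattern, replacement, exceptions=()):
    """Apply the rule to every word except those listed in `exceptions`."""
    rule = re.compile(pattern)
    skip = set(exceptions)
    return [w if w in skip else rule.sub(replacement, w) for w in words]

# Hypothetical spelling reform: the old orthography writes "oo" where the
# new one writes "u". The words are made-up strings, not real data.
words = ["tooka", "moona", "kooma", "bala"]

for old, new in preview_replace(words, "oo", "u"):
    print(old, "->", new)

# After reviewing the preview, suppose "kooma" is judged an exception.
updated = apply_replace(words, "oo", "u", exceptions=["kooma"])
print(updated)
```

Separating the preview from the application is what makes an irregular orthography manageable: the rule does the bulk of the work, and a speaker only has to spot the exceptions.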
Since time is of the essence in your situation, DDP is the most efficient
method of collecting lots of words in a short time. Many teams are
collecting 10,000 to 20,000 words in a few weeks. The number of words
collected depends on a number of factors, such as the number of mother
tongue speakers available to work on the project, how vigorous language
use is, etc. If you only have a few speakers of the language left, your results
might be far less, but will still be much better than other methods. You
should also collect as many texts as possible, since this will supplement
the DDP method and provide solid evidence for semantic research.
If you have other questions, post them to this discussion group and one of
us will try to help you.
Ron Moe
_____
From: lexicographylist at yahoogroups.com
[mailto:lexicographylist at yahoogroups.com] On Behalf Of Heather Souter
Sent: Monday, March 24, 2008 6:48 PM
To: lexicographylist at yahoogroups.com
Subject: [Lexicog] Incorporating an existing English/Vernacular word
list/dictionary into a dictionary project....
Hello! I am a community linguist (with some formal and informal training at
the master's level) and a member of a community with a highly endangered
language. I have been involved in some basic phonological analysis and also
revitalization efforts (creation of basic pedagogical materials). Soon I
will become part of a team that will be working to create the first
dictionary of our language that focuses on the vernacular. In other words,
it will not be a translation of an English dictionary. It is exciting.
However, not being trained in lexicography, I am finding the learning curve
quite steep!
Here, I have a question. An English-vernacular word list/dictionary of our
language exists. The headwords are English and there are one, two or three
possible translations given in our language as well as some example
sentences. There is no grammatical information included at all. (Still, it
is a wonderful resource!) I would like to know how this could be included
in the dictionary project that will be starting shortly. To complicate
matters, the orthography is pretty good but not linguistically adequate
(being based on English spellings!). We likely will be using a different
orthography (as well as IPA for research purposes). The creation of a
digital version of the existing word list/dictionary is possible (once
permission is secured).
As the project is somewhat politically charged at present, I would prefer
not to divulge the name of our language. I trust that you all understand how
touchy projects like this can be and will not press me on this matter.
I thank you for your understanding....
H.S.
PS: I have taken a look at the DDP developed by Ron Moe and have asked the
project leader to consider this approach. I think it could work very well
for us as time is of the essence! Our Elders are passing away every day....
----------------------------------
Dr. Thapelo J. Otlogetswe
Corpus linguist & lexicographer
University of Botswana
Department of English
Private Bag 00703
Gaborone, Botswana
Tel: (+267) 355 2093