[Lexicog] Tshwanelex DTD Tree

Mike Maxwell maxwell at LDC.UPENN.EDU
Tue Sep 22 02:06:18 UTC 2009


(I'm going to reply to the list, but other list members should feel free
to ask us to take this off-line.  Of course if some people want it on
the list and others don't, the former win :-).  In the meantime, I'm
guessing that a discussion of dictionary structure will be of general
interest, if only because it seems like it ought to be simple, but in
fact it's quite complicated.)

pwyll4 wrote (and my original questions are after '>>'):
>> Which fields are repeating, and which are optional (or both)?
> 
> Actually almost all of them are optional, because some entries will 
> just be cross-references to the "main form".

OK, there are two cross-cutting things here.  One is optionality,
whether something must appear.  The other is repeatability, whether
something (if it appears) can appear multiple times.
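
In DTD terms the two properties combine into four cases, each marked by
an occurrence suffix in the content model.  A minimal sketch follows;
the element names and the choice of which child is optional or
repeating are assumptions for illustration, not the actual tree:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Entry [
  <!-- Lemma          required, not repeatable (no suffix) -->
  <!-- Pronunciation? optional, not repeatable             -->
  <!-- Variant*       optional, repeatable                 -->
  <!-- Sense+         required, repeatable                 -->
  <!ELEMENT Entry (Lemma, Pronunciation?, Variant*, Sense+)>
  <!ELEMENT Lemma (#PCDATA)>
  <!ELEMENT Pronunciation (#PCDATA)>
  <!ELEMENT Variant (#PCDATA)>
  <!ELEMENT Sense (#PCDATA)>
]>
<Entry>
  <Lemma>debriñ</Lemma>
  <Variant>debro</Variant>
  <Sense>to eat</Sense>
</Entry>
```

The sample entry omits Pronunciation and still validates, because that
child is optional; it could not omit Lemma or Sense.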

>> That's understandable, but what are the Variants?
> 
> For example, the verb "to eat" may be both "debriñ" and "debro". I'll
>  use "debriñ" as the main entry, and "debro" will cross-refer to 
> "debriñ".

These are dialectal variants, right?

>> If so, do you want to label them with the dialect that they come
>> from?
> 
> I'll indicate the dialect and the speaker after every phonetic 
> transcription and every example etc, that's what I call "Source" in 
> my dictionary.

Since dialect and speaker are two different things, I would suggest
creating separate fields for them, e.g.
    source
       |----dialect
       |----speaker
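
Put together with the cross-reference described above, that could look
roughly like the following.  This is only a sketch: the Dictionary,
Variant and CrossReference elements and the id/target attributes are
hypothetical names, and the dialect and speaker values are placeholders
rather than real data.

```xml
<Dictionary>
  <!-- Main entry: lists the variant and records where it was heard -->
  <Entry id="debrin">
    <Lemma>debriñ</Lemma>
    <Variant>
      <Orthography>debro</Orthography>
      <Source>
        <Dialect>(dialect in which debro is used)</Dialect>
        <Speaker>(speaker who supplied it)</Speaker>
      </Source>
    </Variant>
  </Entry>
  <!-- Variant entry: nothing but a pointer back to the main entry -->
  <Entry id="debro">
    <Lemma>debro</Lemma>
    <CrossReference target="debrin"/>
  </Entry>
</Dictionary>
```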

>> Also, there is a single Pronunciation field for the lemma (citation
>>  form or stem, or maybe these are the same?),
> 
> well there will be several pronunciation fields after the lemma - 
> I'll make that feature repeatable.

Why are there several pronunciation fields after a single lemma?  You
already accounted for dialectal variants in your Variants element (at
least if my guess above is correct), and you account for morphological
forms in the Morphology field (see below).  What additional variation in
pronunciation do you have?

>> but under the Morphology element you have both Pronunciation and 
>> Forms. This doesn't seem to be parallel--what does the Forms field 
>> do?
> 
> It would be useful to indicate the pronunciation of plural forms, 
> past participles etc. For example, the infinitive of "to do/make" is 
> "gober", and the past participle is "gwraet". I have to say how both 
> are pronounced.

OK, here I must have misunderstood the Morphology element.  The
original tree looked like this:
|__Morphology
|        |__Pronunciation
|        |__Forms
|        |__Notes
|        |__Source

From what I understand, you want each form (infinitive, past participle,
etc.) to have its individual pronunciation field, and maybe its
individual Notes field and Source field.  If that's the case, then it
seems to me that the above structure--the entire Morphology
element--needs to repeat, once for each distinct morphological form
(past tense, past participle,...).  And in that case you probably want
only a single Form (not Forms) inside a given Morphology element.  So
you would have something like (I'll fake the pronunciation field):
    <Morphology>
       <Form>broken</Form>
       <Pronunciation>brookn</Pronunciation>
       <Notes>past participle</Notes>
       <Source>
           <Dialect>Midwestern</Dialect>
           <Speaker>Max Mikewell</Speaker>
       </Source>
    </Morphology>
    <Morphology>
       <Form>broke</Form>
       <Pronunciation>brook</Pronunciation>
       <Notes>past tense</Notes>
       <Source>
           <Dialect>Southern</Dialect>
           <Speaker>Sue Sweettalker</Speaker>
       </Source>
    </Morphology>
You might also consider using <Orthography> instead of <Form>, to
maintain the parallelism between the Orthography/Pronunciation of the 
Lemma and the Orthography/Pronunciation of the morphological variant.
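
With that renaming, the "gober" example might be laid out as in the
sketch below.  The nesting of Orthography and Pronunciation under Lemma
is an assumption, and the transcriptions are left as placeholders:

```xml
<Entry>
  <Lemma>
    <Orthography>gober</Orthography>
    <Pronunciation>(transcription of gober)</Pronunciation>
  </Lemma>
  <!-- Same two child elements as under Lemma, so tools and readers
       can treat the citation form and the inflected form alike -->
  <Morphology>
    <Orthography>gwraet</Orthography>
    <Pronunciation>(transcription of gwraet)</Pronunciation>
    <Notes>past participle</Notes>
  </Morphology>
</Entry>
```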

>> For the ones you include, how are you notating their 
>> morphosyntactic properties (plural, past tense,...)?

What I was asking here is where you record the fact that a given
<Morphology> element represents a past tense, a past participle, etc.
I've taken a guess in the above examples (namely, that this
information goes in the Notes element).

> ...Homonym numbers are supplied, but the sense numbers, when
> supplied, are sorted one after another in the same paragraph, like
> this:
> 
> 1. First meaning 2. Second meaning Example for the first meaning 
> Source for the example for the first meaning.

To be blunt, I think that's unwise, and I don't know any dictionary that
does it that way.

> Tshwanelex put the meaning together and the examples after, I don't 
> know how to change that, so I've chosen to put the Sense Numbers by 
> myself -- till I find the way to change the "auto-sorting"...

I think Tshwanelex does it the right way (and as per David Joffe's
response today, that automatically solves the sense numbering problem).
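
As a rough illustration of why the numbering problem goes away: if each
sense is its own element, the sense number is just the element's
position and never has to be typed in.  The element names below are
assumptions for illustration, not a description of how Tshwanelex
stores its data:

```xml
<Entry>
  <Lemma>(headword)</Lemma>
  <!-- Sense 1: its number comes from its position, not from the text -->
  <Sense>
    <Definition>First meaning</Definition>
    <Example>
      <Text>Example for the first meaning</Text>
      <Source>Source for the example for the first meaning</Source>
    </Example>
  </Sense>
  <!-- Sense 2 -->
  <Sense>
    <Definition>Second meaning</Definition>
  </Sense>
</Entry>
```

Reordering or deleting a Sense then renumbers the rest with no manual
editing, and each example stays attached to the sense it illustrates.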

In general, the people who built Tshwanelex (like the people who built
SIL's FLEx) have a pretty good sense about how dictionaries should be
structured.  Deciding to do something different is like replacing the
brakes on your car with some other kind of brakes.  You can probably do
it, but you had better be sure you know what you're doing, and have a 
good reason for doing so.

(Disclaimer: I was involved in some of the design work for FLEx.)

>> Part of Speech appears twice: once where I would expect it, as an 
>> immediate daughter of Lemma, and once embedded down inside Sense.
>> That seems odd. I think the former is more normal.
> 
> Sometimes the same noun is feminine or masculine according to its 
> precise meaning ; many verbs have the same form and are both 
> transitive and intransitive : if I want to give examples of both in 
> separate paragraphs, I think I need to make two paragraphs.

Understood, but the structure is still messy.  One thing you lose by
doing it that way is an unambiguous way to determine the "scope" of a
given PoS.  If I see "masculine noun" up at the top of a lexical entry
with five senses, and then down in sense 3 I see "feminine noun", does
that mean that only sense 3 is feminine?  What is the gender of senses 4
and 5?  (By the way, there are dictionaries that do exactly that, and
the result is that even specialists disagree about the intended
interpretation.)

I would suggest instead making separate entries (lexemes) of such
masculine and feminine nouns.  (By the way, I'm guessing that the
citation form of these masculine/feminine nouns is the same, but that
other forms of the nouns--maybe the plural, or some other case-marked
form--distinguishes gender.  Right?)
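
A sketch of what the split might look like, with Part of Speech
appearing exactly once per entry so that its scope is never in doubt.
The element names are assumptions, and the parenthesized values are
placeholders:

```xml
<Dictionary>
  <!-- Homonym 1: everything in this entry is masculine -->
  <Entry>
    <Lemma>(citation form)</Lemma>
    <HomonymNumber>1</HomonymNumber>
    <PartOfSpeech>masculine noun</PartOfSpeech>
    <Sense>(meaning that goes with the masculine noun)</Sense>
  </Entry>
  <!-- Homonym 2: same citation form, everything here is feminine -->
  <Entry>
    <Lemma>(same citation form)</Lemma>
    <HomonymNumber>2</HomonymNumber>
    <PartOfSpeech>feminine noun</PartOfSpeech>
    <Sense>(meaning that goes with the feminine noun)</Sense>
  </Entry>
</Dictionary>
```

The same split would work for a verb that is both transitive and
intransitive, if those uses are to be kept in separate paragraphs.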

    Mike Maxwell

