[Lexicog] part-of-speech subcats

Sebastian Drude sebastian.Drude at GMAIL.COM
Fri Mar 13 01:52:11 UTC 2009


Dear Toolboxers and Lexicographers,


I hope it is not inappropriate to put this question to both the Toolbox and
lexicography groups (sorry for possible cross-postings). Like others I have
posted in the Toolbox group only, this question came up in the process of
transferring an older idiosyncratic lexical database to the MDF standard.  I
guess that some of my questions might be interesting for other users who
have similar needs or wishes, and maybe something could be included in
possible future versions of MDF.

Such a case is the representation of Part-of-Speech subcategories.

This post deals with Toolbox (Standard Format, MDF) databases in particular,
but I would like to know which solutions other or related technology offers
(in particular, LexiquePro). I am particularly interested in exporting from
and to Toolbox.


In an older (idiosyncratic and inconsistent, but MDF-based) system, I used a
combination of the standard *\ps* field (obligatory) and, in many cases, a
*\pss* field (*"part of speech subcategory"*, optional).  In the first I
(ideally) put only abbreviations for the major word categories -- in my
case, *adv*(erb), *id*(eophone), *int*(er)*j*(ection), *n*, *num*(eral),
*part*(icle), *p*(ost)*p*(osition), *pron*(oun), *v*(erb).
In the second, I put labels for subcategories, such as *inal*(ienable) etc.
(for nouns), *dem*(onstrative), *pers*(on) (for pronouns),
*i*(n)*tr*(ansitive), *st*(ative), *tr*(ansitive) (for verbs).  Some of
these may hold for several major classes (cross-classifications), for
instance *inter*(rogative) or *neg*(ation) (for adverbs, pronouns,
particles, ...).

All abbreviations in both fields are obligatorily linked (jump-path) with
corresponding entries in another database, where I give the long name and
explain what exactly this label stands for and which properties these words
have.

Using only MDF fields, I have only one *\ps* field (*\pn* for national
language, ok). As far as I can see, I have several options to organize the
same information:

1) one complex abbreviation in the \ps field, such as *n.inal* or *v.tr*.

2) several separate abbreviations in the *\ps* field.

3) main word classes in *\ps*, as before, subcategories in another field.
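
To make the three options concrete, here is a minimal Python sketch, not part of Toolbox or MDF. The headword `apo`, its gloss, and the helper names `parse_record` and `pos_info` are invented for illustration, and option (3) is shown with the non-MDF *\pss* field only as a placeholder for "another field". It assumes that a dot or a space separates the major class from its subcategories.

```python
# Three hypothetical encodings of "inalienable noun" in a Standard Format
# record. The headword "apo" and the gloss are invented for illustration.
OPTION_1 = "\\lx apo\n\\ps n.inal\n\\ge hand"          # one complex label
OPTION_2 = "\\lx apo\n\\ps n inal\n\\ge hand"          # several labels in \ps
OPTION_3 = "\\lx apo\n\\ps n\n\\pss inal\n\\ge hand"   # subcategory in its own field


def parse_record(text):
    """Parse one record into a list of (marker, value) pairs."""
    fields = []
    for line in text.splitlines():
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            fields.append((marker, value.strip()))
    return fields


def pos_info(text):
    """Return (major class, [subcategories]) for any of the three options.

    Assumes "." or " " separates labels inside \\ps, and that a record
    has at most one \\ps and one \\pss field.
    """
    fields = dict(parse_record(text))
    parts = fields.get("ps", "").replace(".", " ").split()
    major = parts[0] if parts else ""
    subs = parts[1:] + fields.get("pss", "").split()
    return major, subs


for option in (OPTION_1, OPTION_2, OPTION_3):
    print(pos_info(option))
```

All three records carry the same information; they differ only in how much parsing is needed to get the major class back out.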


None of these fully satisfies me, for at least the following reasons:

*A)* With (3) I have the problem of choosing an appropriate field. *\pd* seems
to be an obvious option (many of the subcategories are indeed relevant for
the paradigm structure). However, the subcategory abbreviation will be put
at the end of the entry, and will be formatted with a label such as
"*Parad:*", which does not make sense or is at least counter-intuitive for
those subclasses which are of a rather semantic kind.

*B)* With (2), I would have to add punctuation manually after the first
(main) word class. This causes potential problems for consistency, for
defining range sets etc.

*C)* (1) and (2) are much clumsier for interlinearization, filtering and
sorting.
*C1)* Interlinearization: In the part-of-speech line the whole complex label
(1) can be too much information (be it only for formatting reasons), and (2)
does not interlinearize well at all or at least produces fields with
internal spaces (if I define the data type as "single item") which are
painful for exporting to other formats such as ELAN's *eaf*-files.
*C2)* Filters that make reference to word classes will be much more
difficult to formulate correctly.
*C3)* Sorting: Sometimes, I just want to sort by major word classes,
searching, say, for verbs ending in a certain letter. Depending on the
number of combined subcategories, I will have many internally alphabetically
ordered groups of verbs.

*D)* With (1), I would create many complex labels which are hard to
administer and which are, in the printed dictionary, much less aesthetically
pleasing and less easy to read than separate abbreviations.  True, many
subcategories apply only to one major word class anyway; but this does not
hold for others such as *inter* or *neg* (see above).
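
The filtering and sorting problem in (C2) and (C3) can be illustrated with a small Python sketch. The word forms and the helper names are invented, and the code only mimics what a sort or filter on the *\ps* field would do; it is not how Toolbox itself implements sorting.

```python
# Invented (headword, \ps label) pairs using complex labels as in option (1).
entries = [
    ("kara", "v.tr"),
    ("mita", "n.inal"),
    ("tupa", "v.itr"),
    ("seko", "v.tr"),
    ("pora", "v.st"),
]


def major_class(label):
    """Strip subcategories: 'v.tr' -> 'v' (assumes '.' as separator)."""
    return label.split(".", 1)[0]


def sort_by_full_label(entries):
    """Sort on the whole label: verbs fall apart into v.itr / v.st / v.tr."""
    return [lx for lx, ps in sorted(entries, key=lambda e: (e[1], e[0]))]


def sort_by_major_class(entries):
    """Sort on the major class only: all verbs form one alphabetical group."""
    return [lx for lx, ps in sorted(entries, key=lambda e: (major_class(e[1]), e[0]))]


def verbs_ending_in(entries, letter):
    """Filter on major class 'v' and on the final letter of the headword."""
    return sorted(lx for lx, ps in entries
                  if major_class(ps) == "v" and lx.endswith(letter))


print(sort_by_full_label(entries))    # verbs split into three sub-groups
print(sort_by_major_class(entries))   # verbs in one group
print(verbs_ending_in(entries, "a"))
```

With separate fields the `major_class` step disappears, which is why option (3) is easier to filter and sort on.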


I guess that a solution can be set up using appropriate cc-tables or some
other mechanism doing replacements with regular expressions, or by splitting
fields automatically for sort, jump, interlinearization and similar
functions, or by joining fields (as the MDF *\ps* and my *\pss* field), for
formatting and printing.
But this still has the disadvantage of being difficult to set up generally
and in a sustainable way, and of having to keep track of different versions
of the 'same' database for different purposes.
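
As a rough idea of what such a joining and splitting step would involve, here is a Python sketch standing in for a cc-table. It is an assumption-laden illustration: the function names are hypothetical, the record is invented, a dot is assumed as the separator, and *\pss* is assumed to follow *\ps* directly.

```python
def join_fields(lines, sep="."):
    """Fold each \\pss value into the preceding \\ps line (e.g. for printing)."""
    out = []
    for line in lines:
        if line.startswith("\\pss ") and out and out[-1].startswith("\\ps "):
            out[-1] = out[-1] + sep + line[len("\\pss "):].strip()
        else:
            out.append(line)
    return out


def split_fields(lines, sep="."):
    """Inverse step: split a complex \\ps label into \\ps plus \\pss."""
    out = []
    for line in lines:
        if line.startswith("\\ps ") and sep in line:
            major, sub = line[len("\\ps "):].split(sep, 1)
            out.append("\\ps " + major.strip())
            out.append("\\pss " + sub.strip())
        else:
            out.append(line)
    return out


# Invented record with the subcategory in a separate field.
record = ["\\lx apo", "\\ps n", "\\pss inal", "\\ge hand"]
printed = join_fields(record)
print(printed)
print(split_fields(printed) == record)
```

Even in this toy form the round trip only works if the separator never occurs inside a label, which is the kind of hidden convention that makes such setups hard to maintain.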


How do you all represent and organize this kind of information?
What would you recommend?

With your solution, what happens if you export MDF databases to LexiquePro,
LEXUS or other formats, and back to Toolbox?

Thank you in advance

Sebastian

-- 
| Sebastian Drude (Linguist)
| Sebastian.Drude at fu-berlin.de & Sebastian.Drude at googlemail.com
| http://www.germanistik.fu-berlin.de/il/pers/drude-en.html