[Lexicog] part-of-speech subcats

Fri Mar 13 20:07:50 UTC 2009

Hi Sebastian,

Modern dictionaries treat POS subcategories in a separate field. For
instance the Longman Dictionary of American English puts the subcategory
after the first sense number. (Their pronunciation field is IPA which I'm
too lazy to duplicate.)

liberty /'librti/ n 1 [C,U] the freedom to do.

C = 'count', U = 'uncounted'.

If a verb has more than one subcategory, it will be specified after each
sense number:

lighten /'laitn/ v 1 [T] to reduce the amount. 2 [I,T] to become lighter. 4
[I,T] to reduce the weight.

T = 'transitive', I = 'intransitive'.

I think the reason for this is that students are taught the meaning of
'verb' and 'noun' in school. The subcategories are separated so that the
serious student has the information, but it isn't going to confuse the
lazy(?!) student who only wants to know the primary category. Labels like
'nC' or 'vi/t' are less likely to communicate.

MDF was great in its day, but has proven to be (1) limited and (2)
inflexible. The CC tables that it uses were far too complicated and
impossible for anyone to modify. All standards are somewhat inflexible by
their very nature. Some of the newer standards (e.g. LIFT) are attempting to
use a better model of the lexicon and to take into consideration the wide
range of linguistic typology.

With MDF you have two options.

(1) Hijack an unused field that prints like you want it to. It is often
difficult to find a field that prints where and how you want it to, so this
rarely works.

(2) Store your information in a custom field. Then, just before you print,
modify the database so that it will print like you want it to. This requires
using something like a CC table to merge two fields and add printer codes.
You have to be a bit of a computer guru to pull this off.

You do have a third option: (3) Use a different program. For instance
FieldWorks allows you to set up POS (grammatical) subcategories, but also
allows you to specify how each subcategory will print. Because FieldWorks
uses a POS hierarchy, there is no problem with the interlinear.
Subcategories inherit from their parent category. So you can write general
rules that apply to the major categories and all the subcategories under
them.
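
The inheritance idea can be sketched in a few lines of Python. This is only
an illustration of the principle, not FieldWorks code or its API; the class
name Category, the function find_rule and the rule strings are all
hypothetical.

```python
# Hypothetical illustration only; this is not FieldWorks code or its API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Category:
    abbr: str
    parent: Optional["Category"] = None

    def ancestors(self):
        """Yield this category, then each parent up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent

    def major(self):
        """Return the top-level category (e.g. 'v' for the subcategory 'tr')."""
        return list(self.ancestors())[-1]


def find_rule(category, rules):
    """Return the rule of the nearest category in the chain that has one."""
    for node in category.ancestors():
        if node.abbr in rules:
            return rules[node.abbr]
    return None


verb = Category("v")
transitive = Category("tr", parent=verb)
noun = Category("n")
inalienable = Category("inal", parent=noun)

# One general rule per major category, plus one override for a subcategory.
rules = {"v": "verb rule", "n": "noun rule", "inal": "inalienable noun rule"}

print(find_rule(transitive, rules))   # falls back to the rule stated for "v"
print(find_rule(inalienable, rules))  # uses its own, more specific rule
print(transitive.major().abbr)        # an interlinear line can show just "v"
```

A rule stated once for "v" covers every verb subcategory that does not
override it, and the major category is always recoverable for the
interlinear.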

Ron Moe

  _____  

From: lexicographylist at yahoogroups.com
[mailto:lexicographylist at yahoogroups.com] On Behalf Of Sebastian Drude
Sent: Thursday, March 12, 2009 6:52 PM
To: Shoebox/Toolbox Field Linguist's Toolbox; Lexicography
Subject: [Lexicog] part-of-speech subcats

Dear Toolboxers and Lexicographers,

I hope it is not an imposition to put this question to both the Toolbox and
Lexicography groups (sorry for the cross-posting). Like others that I posted
in the Toolbox group only, this question came up while transferring an
older, idiosyncratic lexical database to the MDF standard. I guess that some
of my questions might interest other users who have similar needs or wishes,
and maybe something could be included in possible future versions of MDF.

Such a case is the representation of Part-of-Speech subcategories.

This post deals with Toolbox (Standard Format, MDF) databases in particular,
but I would like to know what solutions other, related tools offer (in
particular, LexiquePro). I am particularly interested in exporting from and
to Toolbox.

In an older (idiosyncratic and inconsistent, but MDF-based) system, I used a
combination of the standard \ps field (obligatory) and, in many cases, a
\pss field ("part of speech subcategory", optional).  In the first I
(ideally) put only abbreviations for the major word categories -- in my
case, adv(erb), id(eophone), int(er)j(ection), n, num(eral), part(icle),
p(ost)p(osition), pron(oun), v(erb). 
In the second, I put labels for subcategories, such as inal(ienable) etc.
(for nouns), dem(onstrative), pers(on) (for pronouns), i(n)tr(ansitive),
st(ative), tr(ansitive) (for verbs).  Some of these may hold for several
major classes (cross-classifications), for instance inter(rogative) or
neg(ation) (for adverbs, pronouns, particles,...).

All abbreviations in both fields are obligatorily linked (jump-path) with
corresponding entries in another database, where I give the long name and
explain what exactly this label stands for and which properties these words
have.

Using only MDF fields, I have only one \ps field (\pn for national language,
ok). As far as I can see, I have several options to organize the same
information:

1) one complex abbreviation in the \ps field, such as n.inal or v.tr.

2) several separate abbreviations in the \ps field.

3) main word classes in \ps, as before, subcategories in another field.

None of these fully satisfies me, for at least the following reasons:

A) With (3) I have the problem of choosing an appropriate field. \pd seems to
be an obvious option (many of the subcategories are indeed relevant for the
paradigm structure). However, the subcategory abbreviation will be put at
the end of the entry and formatted with a label such as "Parad:", which
makes no sense, or is at least counter-intuitive, for those subclasses which
are of a rather semantic kind.

B) With (2), I would have to add punctuation manually after the first (main)
word class. This causes potential problems for consistency, for defining
range sets, etc.

C) (1) and (2) are much clumsier for interlinearization, filtering and
sorting.  
C1) Interlinearization: In the part-of-speech line the whole complex label
(1) can be too much information (be it only for formatting reasons), and (2)
does not interlinearize well at all or at least produces fields with
internal spaces (if I define the data type as "single item") which are
painful for exporting to other formats such as ELAN's eaf-files.
C2) Filters that make reference to word classes will be much more difficult
to formulate correctly.
C3) Sorting: Sometimes, I just want to sort by major word classes,
searching, say, for verbs ending in a certain letter. Depending on the
number of combined subcategories, I will have many internally alphabetically
ordered groups of verbs.

D) With (1), I would create many complex labels which are hard to administer
and which are, in the printed dictionary, much less attractive and harder to
read than separate abbreviations.  True, many subcategories apply to only
one major word class anyway; but this does not hold for others such as inter
or neg (see above).
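
To make (C2) and (C3) concrete: if the complex labels of option (1) use a
fixed separator, the major class can be recovered mechanically for filtering
and sorting. A minimal sketch in Python (not a CC table), assuming labels
such as v.tr or n.inal with the major class first; the lexemes and the
function name split_label are made up for illustration.

```python
def split_label(label):
    """Split 'v.tr' into ('v', ['tr']); a bare 'n' gives ('n', [])."""
    major, *subcats = label.split(".")
    return major, subcats


# (lexeme, part-of-speech label) pairs; invented data for illustration.
entries = [
    ("aba", "v.tr"),
    ("ako", "n.inal"),
    ("eta", "v.intr"),
    ("ita", "v.tr.neg"),
    ("oka", "n"),
]

# Filter: all verbs, whatever their subcategories.
verbs = [lexeme for lexeme, label in entries if split_label(label)[0] == "v"]

# Sort: by major class first, then alphabetically by lexeme, so that all
# verbs form one alphabetical group instead of one group per complex label.
by_major = sorted(entries, key=lambda e: (split_label(e[1])[0], e[0]))

print(verbs)
print([lexeme for lexeme, _ in by_major])
```

This only works if every label follows the separator convention strictly,
which is exactly the consistency burden mentioned under (B) and (D).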

I guess that a solution can be set up using appropriate cc-tables or some
other mechanism doing replacements with regular expressions, or by splitting
fields automatically for sort, jump, interlinearization and similar
functions, or by joining fields (as the MDF \ps and my \pss field), for
formatting and printing.  
But this still has the disadvantage of being difficult to set up generally
and in a sustainable way, and of having to keep track of different versions
of the 'same' database for different purposes.
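
The joining step could look roughly like the following. It is a minimal
sketch in Python rather than a CC table, assuming one field per line, a \pss
line that follows the \ps line it belongs to, and a plain space as the
joining punctuation; the function name merge_pss_into_ps and the sample
record are hypothetical.

```python
def merge_pss_into_ps(lines, sep=" "):
    """Return a print copy in which each \\pss value is appended to the
    preceding \\ps field and the \\pss line itself is dropped."""
    out = []
    last_ps = None  # index in `out` of the most recent \ps line of this entry
    for line in lines:
        marker, _, value = line.partition(" ")
        value = value.strip()
        if marker == r"\ps":
            last_ps = len(out)
            out.append(line)
        elif marker == r"\pss" and last_ps is not None:
            if value:
                out[last_ps] = out[last_ps].rstrip() + sep + value
        else:
            if marker == r"\lx":
                last_ps = None  # a new entry starts; forget the old \ps
            out.append(line)
    return out


# Invented sample record for illustration.
record = [
    r"\lx ako",
    r"\ps n",
    r"\pss inal",
    r"\ge hand",
]

for line in merge_pss_into_ps(record):
    print(line)
```

Running such a step on the fly just before printing would leave the working
database as the only copy to maintain, with the print version derived from
it each time.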

How do you all represent and organize this kind of information?  
What would you recommend?

With your solution, what happens if you export MDF databases to LexiquePro,
LEXUS or other formats, and back to Toolbox?

Thank you in advance

Sebastian

-- 
| Sebastian Drude (Linguist)
| Sebastian.Drude@fu-berlin.de & Sebastian.Drude@googlemail.com
| http://www.germanistik.fu-berlin.de/il/pers/drude-en.html
