[Lexicog] sorting of digraphs
Allan Johnson
allan_johnson at SIL.ORG
Sun Mar 12 07:21:03 UTC 2006
This is similar to an issue that has come up in the Philippines regarding the sorting of the digraph "ng". The way it is typically done is to put "ng" in the sort order as a unit, so it is handled as a digraph. And for purposes of linguistic analysis, that's probably what is needed. But one of our teams wrote up a report on a difficulty that came up with people not being able to find words where they looked for them when this "ng" was sorted as a digraph in print. Then one of our consultants did a survey in his language area to test this out, and came to the same conclusion - that for the purposes of sorting words alphabetically in vernacular dictionaries in the Philippines, "ng" needs to be treated as a sequence of two letters, "n" followed by "g", rather than as a digraph.
The same thing is routinely done in English dictionaries. In the ones I've checked, digraphs such as "ng", "th", "ch", and "sh" are alphabetized as sequences of individual letters, not as digraphs. The quote below from Sidney Landau says "Dictionaries usually alphabetize letter by letter rather than word by word". Maybe we could add to that, "Dictionaries usually alphabetize letter by letter rather than phoneme by phoneme".
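To make the two options concrete, here is a minimal sketch in Python that sorts the same word list both ways. The alphabet used (the basic Latin letters with "ng" placed after "n"), the function names, and the sample words are illustrative assumptions, not taken from any particular dictionary project or from Toolbox.

```python
# Assumed alphabet: basic Latin letters with the digraph "ng" placed after "n".
ALPHABET = list("abcdefghijklmn") + ["ng"] + list("opqrstuvwxyz")
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

def tokenize(word):
    """Split a lowercase word into sort units, treating every "ng" as one unit."""
    tokens = []
    i = 0
    while i < len(word):
        if word.startswith("ng", i):
            tokens.append("ng")
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens

def digraph_key(word):
    """Sort key in which "ng" is a single letter that follows "n"."""
    return [RANK[t] for t in tokenize(word)]

def letter_key(word):
    """Sort key in which "ng" is simply "n" followed by "g"."""
    return [RANK[ch] for ch in word]

# Illustrative sample words only.
WORDS = ["noo", "ngipin", "nanay", "niyog", "ngayon"]

print(sorted(WORDS, key=digraph_key))
print(sorted(WORDS, key=letter_key))
```

With the digraph key, the words beginning with "ng" come after every other word beginning with "n" (nanay, niyog, noo, ngayon, ngipin). With the letter-by-letter key they fall between "na" and "ni" (nanay, ngayon, ngipin, niyog, noo), which is where the readers in the survey apparently looked for them. Note that this simple tokenizer treats every "n" followed by "g" as the digraph, which may not be right across a morpheme boundary.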
Allan Johnson
----- Original Message -----
From: Paula Jones
To: lexicographylist at yahoogroups.com
Sent: Friday, March 10, 2006 8:55 PM
Subject: Re: [Lexicog] Sort order problem
Dear Wayne & Greg,
I am sorry I don't have time to read all of the correspondence carefully. I will just add the following -
From Dictionaries: the art and craft of lexicography, Sidney Landau
p.107
ALPHABETIZATION - Dictionaries usually alphabetize letter by letter rather than word by word. They place power, powerful, and power of attorney in that order, whereas a word-by-word arrangement would place power of attorney before powerful. Letter-by-letter arrangement has the great virtue that readers need not know whether a compound is spelled as one word, as a hyphenated word, or as two words. Since usage is often divided about compounds - witness database and data base, e-mail and email - and is constantly shifting, the ability to locate such terms is of considerable practical importance.
When I read this, even though it is written regarding English dictionaries, it gave me the incentive to ask about changing the sort order for Eduria & Barasano, Eastern Tucanoan, which has a confusing mixup of word breaks. We have tried to make them more consistent based on meaning (i.e. if the part separated by a space has no meaning, it is combined), and the speakers agree with this most of the time, but there are cases where they want the space based on the phonetics. This inconsistency caused us to lean toward a sort order that ignores the spaces or word breaks. The subentry system the speakers desire also fed into this, but I won't give the details on that. When I asked about changing, I was told to put a _ where there was a space / word break in the \lx field and to add it to the sort order properties to be ignored. I am sure it is not the best way to do it (did some by hand and others in Notepad), but this change has caused the Eduria & Barasana to be much happier with the sort order, though that may be due to the change in the subentry organization that was done at the same time. Paula S. Jones
----- Original Message -----
From: Wayne Leman
To: lexicographylist at yahoogroups.com
Sent: Thursday, March 09, 2006 1:23 AM
Subject: [Lexicog] Sort order problem
Greg, FWIW, I don't know if there is any lexicographical rationale for computer sort orders, such as having <space> float to the top. There's probably some rationale, but whatever it is comes from computer people. As you probably know, having <space>, punctuation marks, and numbers before letters is what is called an ASCII sort, which just refers to the default way that computers sort. Of course, they have been programmed to sort that way. (Side thought: If people stopped programming computers that way, they would be out of sorts, so to speak, of course!!)
I'm just glad that we can modify sort orders as much as we can with some lexicography programs. It would be too bad if we always had to accept whatever order computers sorted words into.
I'm not out of sorts now, but out of time,
it's time to hit the sack,
Wayne
-----
Wayne Leman
Cheyenne website: http://www.geocities.com/cheyenne_language
Wayne,
I have not gotten either Toolbox or LexiquePro to ignore spaces. However I am also interested in understanding the rationale for adopting a given sort order.
I assume Toolbox and LexiquePro users usually sort with spaces sorted to the top of the list.
Thus the order: child, child care, child restraint, childish, children.
The rationale for this order is that entries involving the simplest form of the word take precedence over others.
I have been advised to sort ignoring the space.
Thus the sort order: child, child care, childish, children, child restraint.
The rationale for this order is that it is strictly alphabetical.
Another possibility is to sort with the space at the bottom.
Thus the sort order: child, childish, children, child care, child restraint.
The rationale for this order is that all morphological forms of the head word take precedence over phrasal forms.
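The three orders above can be reproduced with three sort keys. This is only a sketch in Python, assuming lowercase entries written in plain ASCII letters and spaces; the function names are made up for illustration and do not correspond to any Toolbox or LexiquePro setting.

```python
ENTRIES = ["children", "child restraint", "childish", "child care", "child"]

def space_first(entry):
    """Space sorts before every letter (the default ASCII behaviour)."""
    return entry

def space_ignored(entry):
    """Spaces are dropped before comparing: strict letter-by-letter order."""
    return entry.replace(" ", "")

def space_last(entry):
    """Space sorts after every letter.

    Assumption: U+FFFF is used as a stand-in for "higher than any letter";
    it never appears in real entries.
    """
    return entry.replace(" ", "\uffff")

for key in (space_first, space_ignored, space_last):
    print(key.__name__, sorted(ENTRIES, key=key))
```

The first key gives child, child care, child restraint, childish, children; the second gives child, child care, childish, children, child restraint; the third gives child, childish, children, child care, child restraint. In all three cases the bare head word "child" comes first, because a shorter string sorts before any longer string that begins with it.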
Is there a preferred rationale and sort order in the world of lexicography?
Regards, Greg