[Lexicog] missing word

Mike Maxwell maxwell at LDC.UPENN.EDU
Thu Aug 5 17:59:32 UTC 2004


Chaz Mortensen wrote:
> The only word I was given was "constellation".

Google lists hits as follows:
range of symptoms            29200
set of symptoms              22600
constellation of symptoms     9410
cluster of symptoms           6520
complex of symptoms           2960
syndrome of symptoms           410
confluence of symptoms          13

I think this illustrates the problems of doing lexicography with Google
(and to some extent, with corpora in general), more than its usefulness.
The technical term is (I think) "syndrome", but its meaning includes
the notion of being a set _of symptoms_; hence its appearance in the
phrase "syndrome of symptoms" is relatively rare, because the phrase is
redundant.

The terms "range of symptoms" and "set of symptoms", which are the
obvious winners in this popularity poll, probably win because people are
less familiar with the technical terms (or they're having senior moments
too).  So if you want technical terms to enrich your dictionary, this
method may not get you there.

A more sophisticated statistical method might, however, single out
the more technical terms, like "constellation of symptoms".  What you'd
do is to look at the probability (frequency, to be technical) of each
phrase vs. its probability as estimated from the probability of the
individual words.  My guess (I haven't done the math) is that terms like
"constellation of symptoms" would turn out to be much more common than
the probability of the individual words would suggest, whereas "range of
symptoms" and "set of symptoms" would not be much more common than the
probability of their individual words would suggest.  This is another
way of saying that "constellation" often collocates with "symptoms".
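
The comparison described above can be sketched in a few lines of
Python.  This is only an illustration: the function names and the
"total" argument (the number of positions in the corpus) are my own
assumptions, and Google gives no reliable value for that total.  The
log of this ratio is what is usually called pointwise mutual
information.

```python
import math

def association_ratio(phrase_count, first_count, second_count, total):
    """Observed phrase probability divided by the probability expected
    if the two parts occurred independently of each other.

    All names here are hypothetical; `total` is the assumed corpus size.
    """
    observed = phrase_count / total
    expected = (first_count / total) * (second_count / total)
    return observed / expected

def pmi(phrase_count, first_count, second_count, total):
    """Base-2 log of the association ratio (pointwise mutual information)."""
    return math.log2(
        association_ratio(phrase_count, first_count, second_count, total)
    )
```

A ratio near 1 (PMI near 0) means the phrase is about as common as its
parts predict; a ratio well above 1 means the parts collocate.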

Well, I decided to do a bit more Googling: "constellation of" appears
251,000 times, "set of" 14.4 million times, and "range of" 15.8 million
times.  Since the occurrence of "symptoms" is common to all these, we
can compare ratios:

   "range of symptoms" / "range of" = .0018
   "set of symptoms" / "set of" = .0016
   "constellation of symptoms" / "constellation of" = .037
   "cluster of symptoms" / "cluster of" = .0044
   "complex of symptoms" / "complex of" = .0030
   "syndrome of symptoms" / "syndrome of" = .0019
   "confluence of symptoms" / "confluence of" = .000039

So given this analysis, "constellation of symptoms" does stand out from
the crowd.
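
For anyone who wants to redo the arithmetic, here is a minimal sketch
using only the three pairs of counts quoted in full above (the other
"X of" counts are not given in this message, so they are left out).
The variable and function names are just my own choices.

```python
# Google hit counts quoted in this message (August 2004).
phrase_hits = {"range": 29200, "set": 22600, "constellation": 9410}
prefix_hits = {"range": 15_800_000, "set": 14_400_000, "constellation": 251_000}

def symptom_ratios(phrase_hits, prefix_hits):
    """Share of each "X of" count that is followed by "symptoms"."""
    return {word: phrase_hits[word] / prefix_hits[word] for word in phrase_hits}

ratios = symptom_ratios(phrase_hits, prefix_hits)

# Print the ratios from highest to lowest, to two significant figures.
for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f'"{word} of symptoms" / "{word} of" = {ratio:.2g}')
```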
--
	Mike Maxwell
	Linguistic Data Consortium
	maxwell at ldc.upenn.edu

