[Corpora-List] Moving Lexical Semantics from Alchemy to Science

Lothar Lemnitzer lemnitzer at bbaw.de
Sun Jan 23 11:37:06 UTC 2011


If anyone is interested in German data: I have been sifting through data from
online journals for over 10 years now, on a daily basis. This adds up to 1.9
billion words. For space and copyright reasons I do not archive the
original documents, only the word type lists. German is nice to look at
because most compounds form one string. The data are not publicly
downloadable, but if you are interested in seeing them, just drop me a line.

Regards

Lothar Lemnitzer
(www.wortwarte.de)

2011/1/21 <amsler at cs.utexas.edu>

> The comments re: 'shopping cart' and 'shopping trolley' seem to me to
> reinforce a problem that keeps the field of lexical semantics as alchemy
> rather than as a more scientific pursuit. We just don't have enough data
> about compound nouns to be certain of what they are doing in the language
> overall; to know whether they are manifestations of underlying rules or
> happenstance creations. The OED provides us with some historical dates for
> first occurrences of open compounds and large contemporary corpora provide
> us with statistics on the extant forms in use today, but until now we've
> lacked access to the statistical (frequency) history of open
> compounds over time. Fortunately, the Google nGrams from Google Books
> have now filled that void.
>
> The reason compounds are important is that while we also have access to
> isolated words, those can't easily be automatically disambiguated, so
> knowing their frequencies over time doesn't tell us as much as we need to
> know about what they meant in context. Most (not all) open compounds are
> unambiguous (I still get taken in by 'solar system' when it is used to refer
> to a bank of solar panels!), but mostly we can depend on open compounds being
> unambiguous.
>
> To me, that means the next big advance in lexical semantics could come from
> a large database of statistics by language variant and yearly chronology of
> the frequencies of open compounds. I'd like to be able to easily compare the
> historical frequency record of 'shopping cart' and 'shopping trolley' in
> British and American (and Australian and ...) English to watch the growth of
> the terms in frequency year-by-year AS WELL AS to be able to easily find a
> list of all the other open compounds formed from 'shopping', 'cart' and
> 'trolley' over the same chronology.
>
> Until such time as we can reliably disambiguate the isolated word forms in
> historical corpora, the open compounds may provide the next best clue to the
> discovery of the facts on which a science of lexical semantics can be built.
>
> ... P.S. Anyone have some other ambiguous open compounds they are familiar
> with, besides 'solar system'?
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
Lothar Lemnitzer
DWDS
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstr. 22/23
10117 Berlin