[Corpora-List] Frequency of grammatical constructions or fine-grained parts of speech?
Linas Vepstas
linasvepstas at gmail.com
Tue Jul 7 21:33:05 UTC 2009
I recently made a curious graph of the frequency of grammatical
constructions in English, and am fishing for an explanation of its shape.
I'm using a parser (link-grammar) which allows me to attach to
every word of a sentence a pattern (a "disjunct") that defines how
that word was used in the sentence. One can think of the disjunct
as being a very fine-grained part of speech: for example, it
distinguishes not only transitive from intransitive verbs, but also
transitive verbs from ditransitive ones, verbs that take particles,
and even verbs with singular vs. plural objects. The disjunct precisely
captures the syntactic usage of a given word in a given sentence.
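As a rough illustration of the counting step, the sketch below tallies how
often each disjunct occurs across a set of parsed words. The (word, disjunct)
pairs are hypothetical: the strings are loosely modeled on link-grammar
connector notation and are not actual parser output, and the helper name
count_disjuncts is made up for this example.

```python
from collections import Counter

# Hypothetical (word, disjunct) pairs, standing in for parser output.
# The disjunct strings are illustrative only, not real link-grammar results.
tagged = [
    ("saw", "S- O+"),      # transitive use: subject to the left, object to the right
    ("ran", "S-"),         # intransitive use: subject only
    ("saw", "S- O+"),
    ("gave", "S- O+ O+"),  # ditransitive use: two objects
    ("ran", "S-"),
    ("saw", "S- O+"),
]

def count_disjuncts(pairs):
    """Count occurrences of each disjunct, ignoring which word carried it."""
    return Counter(disjunct for _, disjunct in pairs)

disjunct_counts = count_disjuncts(tagged)

# Print disjuncts from most to least frequent.
for disjunct, count in disjunct_counts.most_common():
    print(f"{count:4d}  {disjunct}")
```

Sorting these counts from most to least frequent gives the ranking used in a
rank-versus-frequency plot.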
The attached graph shows rank versus frequency of usage, taken
from a corpus of about 1M sentences from Wikipedia articles.
There is a nice long tail, showing a Zipfian power-law distribution,
with exponent 1.5. There is also a knee at the top-ranked end: the
most frequent disjuncts are less frequent than they "should be" for
a pure Zipfian distribution.
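One common way to estimate such an exponent is a least-squares fit of
log(frequency) against log(rank). The sketch below assumes the convention
frequency ~ rank**(-s), so that s = 1.5 would correspond to the tail described
above. It is not necessarily how the graph was produced. The function names,
the min_rank cutoff (a way to skip the knee region), and the synthetic data
are all assumptions made for illustration.

```python
import math

def rank_frequency(counts):
    """Return [(rank, frequency)] pairs, with rank 1 as the most frequent item."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def fit_exponent(rank_freq, min_rank=1):
    """Estimate s in frequency ~ rank**(-s) by least squares in log-log space.

    Only ranks >= min_rank are used, so the knee at the top ranks
    can be excluded from the fit.
    """
    pts = [(math.log(r), math.log(f))
           for r, f in rank_freq if r >= min_rank and f > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx  # slope is negative for a decaying power law

# Synthetic counts following an exact power law with exponent 1.5.
# These are made-up numbers, not the Wikipedia disjunct counts.
synthetic = {f"d{r}": 1e6 * r ** -1.5 for r in range(1, 2001)}

rf = rank_frequency(synthetic)
print(fit_exponent(rf))                # fit over all ranks
print(fit_exponent(rf, min_rank=100))  # fit over the tail only
```

On real counts, comparing the fit over all ranks with the fit over the tail
alone gives a rough sense of how much the knee pulls the estimate away from
the tail exponent.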
The questions are then:
1) Why a power-law exponent of 1.5?
2) Why is there a knee?
3) What about other languages?
I blogged this in slightly more detail at:
http://opencog.wordpress.com/2009/07/06/frequency-of-grammatical-disjuncts/
--linas
[Attachment: disjunct-true-rank.png (image/png, 4704 bytes)
<http://listserv.linguistlist.org/pipermail/corpora/attachments/20090707/8805355b/attachment-0001.png>]