[Corpora-List] Corpus Development

Tue Apr 29 13:17:50 UTC 2008

Marco,

>> ... however, can you define "flexible" queries over sequences of
>> annotations?
>> Suppose, e.g., that I want to study verb-noun
>> collocations in a certain corpus. I want to be able to mine them with
>> something like:
>> VERB ADV? DET? ADJ* NOUN
>> I could of course factor this into multiple queries (VERB NOUN, VERB ADJ NOUN, VERB ADJ ADJ NOUN, VERB DET NOUN, etc.),
>> but this would get very tedious very fast.
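
For reference, the quoted pattern can be read as a regular expression over the sequence of POS tags. What follows is only a minimal sketch in Python, assuming a toy tagged sentence and coarse tag names (VERB, ADV, DET, ADJ, NOUN); the data, the tag names and the function `find_verb_noun` are hypothetical and do not correspond to any particular corpus tool:

```python
import re

# Hypothetical toy data: (word, coarse POS tag) pairs. The tagset is an assumption.
tagged = [
    ("they", "PRON"), ("broke", "VERB"), ("the", "DET"), ("old", "ADJ"),
    ("strict", "ADJ"), ("rules", "NOUN"), ("and", "CONJ"),
    ("kept", "VERB"), ("quietly", "ADV"), ("silence", "NOUN"),
]

# VERB ADV? DET? ADJ* NOUN, written over space-terminated tag tokens.
# The lookbehind makes sure a match starts at a tag boundary.
PATTERN = re.compile(r"(?<!\S)VERB (?:ADV )?(?:DET )?(?:ADJ )*NOUN ")


def find_verb_noun(tagged):
    """Return (verb, noun) word pairs for every match of the tag pattern."""
    tag_string = "".join(tag + " " for _, tag in tagged)

    # Map each tag's character offset in tag_string back to its token index.
    offset_to_index = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 1

    pairs = []
    for match in PATTERN.finditer(tag_string):
        first = offset_to_index[match.start()]
        # Each matched tag contributes exactly one trailing space.
        last = first + match.group().count(" ") - 1
        pairs.append((tagged[first][0], tagged[last][0]))
    return pairs


print(find_verb_noun(tagged))
```

The optional and repeated elements are handled by the regex engine, so the single pattern covers VERB NOUN, VERB DET NOUN, VERB ADJ ADJ NOUN and so on without separate queries.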

With my architecture (http://corpus.byu.edu), this is possible (if I understand your query correctly). Suppose that you want nouns anywhere within a span of five words to the right of a given verb (e.g. 'break'). In the interface, just select:

WORD(S): [break].[v*]  (i.e. all forms of 'break' as a verb)
CONTEXT: [nn*]  (and also select [0]  [5], i.e. zero words to the left and five to the right)

In a little less than two seconds (for the 18,000 tokens of [break].[v*] in the 100-million-word BNC) it gives: law, silence, leg, news, rules, etc. For the 72,000 tokens of [break].[v*] in the 360+ million word BYU Corpus of American English (http://www.americancorpus.org) it takes a bit longer -- about three seconds. The architecture is also quite scalable. For example, finding nouns near the 900,000+ tokens of [get].[v*] in the American Corpus takes only about four seconds.
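
Conceptually, the search above collects the nouns found in a window of up to five tokens to the right of each form of the verb. A minimal sketch of that idea in Python, assuming a toy list of (word, lemma, tag) triples; the data, the tag prefixes and the function `right_collocates` are illustrative only and say nothing about how the actual architecture is implemented:

```python
from collections import Counter

# Hypothetical toy corpus: (word, lemma, POS tag) triples. Tags are illustrative.
corpus = [
    ("he", "he", "pp"), ("broke", "break", "vvd"), ("the", "the", "at"),
    ("law", "law", "nn1"), ("and", "and", "cc"), ("she", "she", "pp"),
    ("breaks", "break", "vvz"), ("all", "all", "db"), ("the", "the", "at"),
    ("rules", "rule", "nn2"), ("of", "of", "io"), ("the", "the", "at"),
    ("law", "law", "nn1"),
]


def right_collocates(corpus, lemma, node_tag="v", coll_tag="nn", span=5):
    """Count collocate lemmas whose tag starts with coll_tag and that occur
    within `span` tokens to the right of any form of `lemma` tagged node_tag*."""
    counts = Counter()
    for i, (_, node_lemma, tag) in enumerate(corpus):
        if node_lemma == lemma and tag.startswith(node_tag):
            # Slice the window of up to `span` tokens after the node.
            for _, coll_lemma, coll_pos in corpus[i + 1 : i + 1 + span]:
                if coll_pos.startswith(coll_tag):
                    counts[coll_lemma] += 1
    return counts


print(right_collocates(corpus, "break").most_common())
```

A linear scan like this would be far too slow on a corpus of hundreds of millions of words; it is meant only to show what the query asks for.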

You can also do much more advanced collocational searches -- essentially anything (word, lemma, POS, synonym of a given word, words in a user-defined list, or any combination of these) "near" anything else. In addition, you can MI-rank the collocates, compare the collocates across genres (e.g. collocates of 'chair' in fiction vs. academic), or compare the collocates of two different words (e.g. little vs small, or democrats vs republicans).
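
As an illustration of MI-ranking, here is a minimal sketch using the common pointwise Mutual Information score, log2(f(node, collocate) * N / (f(node) * f(collocate))). This is an assumption about the general technique; the exact formula used by the interface (for example, whether it adjusts for span size) may differ, and the frequencies in the usage example are invented:

```python
import math


def mutual_information(pair_freq, node_freq, coll_freq, corpus_size):
    """Pointwise MI: log2 of observed co-occurrence over expected co-occurrence."""
    if min(pair_freq, node_freq, coll_freq, corpus_size) <= 0:
        raise ValueError("all frequencies must be positive")
    return math.log2(pair_freq * corpus_size / (node_freq * coll_freq))


def rank_by_mi(pair_freqs, node_freq, coll_freqs, corpus_size):
    """Return (collocate, MI) pairs sorted from highest to lowest MI."""
    scored = [
        (coll, mutual_information(freq, node_freq, coll_freqs[coll], corpus_size))
        for coll, freq in pair_freqs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented frequencies, for illustration only.
ranked = rank_by_mi(
    pair_freqs={"law": 8, "the": 8},
    node_freq=16,
    coll_freqs={"law": 32, "the": 512},
    corpus_size=1024,
)
print(ranked)
```

With equal co-occurrence counts, the rarer collocate ('law') scores higher than the very frequent one ('the'), which is why MI is useful for filtering out high-frequency function words.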

Is this the type of query that you're referring to?

Best,

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
