<html>

I'd be interested in references or pointers to any large corpora that

have the following characteristics:<br>

<br>

1) Fairly large -- at least 50 million words<br>

<br>

2) Publicly available / obtainable, hopefully even via the web<br>

<br>

3) [And most important:] The organization of the corpus is more or less

as follows:<br>

<br>

The corpus itself is only marginally annotated, if at all.  However,

there are databases containing a list of all distinct n-grams (at least

1-, 2-, and 3-grams), which can be queried, and whose output can then be used

to search the actual corpus itself.  Most importantly, these

databases of n-grams are linked to other databases that contain info on

POS, lemma, synonyms, etc.  This joining of databases means that

searches can be made not just on the n-grams, but on POS and lemma as

well, allowing searches like (for Spanish):<br>

<br>

*.pn_obj    querer.*    *.v_inf<br>

[a clitic followed by any form of "querer" (to want) followed

by an infinitive]<br>

<br>

!mandar.*    *    *.v_subj_se<br>

[all of the forms of any synonym of "mandar" (to

order) followed two words later by a past subjunctive]<br>
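<br>
To make the first search above concrete, here is a minimal sketch of how an n-gram table joined to a lexicon table could answer it. Everything in it is an assumption made for illustration: the use of SQLite through Python, the table and column names, the toy data, and the simplification that each word form has exactly one lemma and one POS tag.<br>

```python
import sqlite3

# Hypothetical schema: one table of distinct 3-grams with frequencies, and
# one lexicon table mapping each word form to a single lemma and POS tag.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lexicon  (word TEXT PRIMARY KEY, lemma TEXT, pos TEXT);
    CREATE TABLE trigrams (w1 TEXT, w2 TEXT, w3 TEXT, freq INTEGER);
""")
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", [
    ("lo", "lo", "pn_obj"),
    ("me", "me", "pn_obj"),
    ("quiero", "querer", "v_pres"),
    ("quería", "querer", "v_imperf"),
    ("hacer", "hacer", "v_inf"),
    ("decir", "decir", "v_inf"),
    ("la", "la", "det"),
    ("casa", "casa", "n"),
    ("blanca", "blanco", "adj"),
])
conn.executemany("INSERT INTO trigrams VALUES (?, ?, ?, ?)", [
    ("lo", "quiero", "hacer", 12),
    ("me", "quería", "decir", 5),
    ("la", "casa", "blanca", 40),
])


def search_trigrams(conn, pos1, lemma2, pos3):
    """Return 3-grams matching [POS] [any form of a lemma] [POS]."""
    sql = """
        SELECT t.w1, t.w2, t.w3, t.freq
        FROM trigrams AS t
        JOIN lexicon AS l1 ON l1.word = t.w1
        JOIN lexicon AS l2 ON l2.word = t.w2
        JOIN lexicon AS l3 ON l3.word = t.w3
        WHERE l1.pos = ? AND l2.lemma = ? AND l3.pos = ?
        ORDER BY t.freq DESC
    """
    return conn.execute(sql, (pos1, lemma2, pos3)).fetchall()


# Rough equivalent of the search:  *.pn_obj  querer.*  *.v_inf
hits = search_trigrams(conn, "pn_obj", "querer", "v_inf")
for w1, w2, w3, freq in hits:
    print(w1, w2, w3, freq)
```

The word strings returned by such a query are what would then be used to look up the matching passages in the unannotated corpus text.<br>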

<br>

Since the lists of n-grams are merely linked to other databases

containing POS, lemma, synonyms, etc., the number of levels of annotation is

essentially unlimited.  It's just a function of however many

separate databases a person wants to create and link to the main n-grams

database.  This would even allow users of the corpus to create their

own "custom lists" of words, which could be stored in a certain

database, and then used as part of the syntax for subsequent

searches.<br>
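<br>
The "custom lists" idea can be sketched the same way: a user-defined list is just one more table joined to the n-grams. This is only an illustration under assumed names; the `custom_lists` table, the `save_list` helper, the list name "colores" and the toy 2-grams are all hypothetical.<br>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bigrams      (w1 TEXT, w2 TEXT, freq INTEGER);
    CREATE TABLE custom_lists (list_name TEXT, word TEXT);
""")
conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?)", [
    ("camisa", "roja", 7),
    ("casa", "verde", 3),
    ("casa", "grande", 9),
    ("libro", "azul", 2),
])


def save_list(conn, name, words):
    """Store a user-defined word list under a name for later searches."""
    conn.executemany(
        "INSERT INTO custom_lists VALUES (?, ?)",
        [(name, word) for word in words],
    )


def bigrams_ending_in_list(conn, name):
    """Return 2-grams whose second word belongs to the named custom list."""
    sql = """
        SELECT b.w1, b.w2, b.freq
        FROM bigrams AS b
        JOIN custom_lists AS c ON c.word = b.w2
        WHERE c.list_name = ?
        ORDER BY b.freq DESC
    """
    return conn.execute(sql, (name,)).fetchall()


save_list(conn, "colores", ["roja", "verde", "azul"])
color_hits = bigrams_ending_in_list(conn, "colores")
print(color_hits)
```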

<br>

In addition, since the databases are fairly static, they can contain

frequency information that can be included as part of the search, e.g.

a search for all of the 2-grams whose second element is a synonym of a

given word, and which appear more than three times in a given segment of

the corpus.<br>
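<br>
A frequency-restricted search of this kind might look like the sketch below. It assumes, purely for illustration, a `synonyms` table and a `bigram_freq` table that stores one count per 2-gram per corpus segment; the segment labels and the counts are invented toy data.<br>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE synonyms    (headword TEXT, synonym TEXT);
    CREATE TABLE bigram_freq (w1 TEXT, w2 TEXT, segment TEXT, freq INTEGER);
""")
conn.executemany("INSERT INTO synonyms VALUES (?, ?)", [
    ("mandar", "ordenar"),
    ("mandar", "exigir"),
])
conn.executemany("INSERT INTO bigram_freq VALUES (?, ?, ?, ?)", [
    ("puede", "ordenar", "1800s", 5),
    ("puede", "exigir", "1800s", 2),
    ("puede", "ordenar", "1900s", 1),
    ("quiso", "exigir", "1800s", 8),
    ("quiso", "decir", "1800s", 30),
])


def frequent_synonym_bigrams(conn, headword, segment, min_freq):
    """2-grams whose second word is a synonym of `headword` and that occur
    more than `min_freq` times in the given corpus segment."""
    sql = """
        SELECT b.w1, b.w2, b.freq
        FROM bigram_freq AS b
        JOIN synonyms AS s ON s.synonym = b.w2
        WHERE s.headword = ? AND b.segment = ? AND b.freq > ?
        ORDER BY b.freq DESC
    """
    return conn.execute(sql, (headword, segment, min_freq)).fetchall()


# Synonyms of "mandar" in second position, more than three times in "1800s"
freq_hits = frequent_synonym_bigrams(conn, "mandar", "1800s", 3)
print(freq_hits)
```

Because the counts are precomputed and stored with the n-grams, the frequency condition is just one more filter in the query rather than a pass over the corpus text.<br>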

<br>

My reason for asking is two-fold.  First, I'm working on a corpus

similar to this for Spanish, and would like to look at other corpora that

have taken the same approach.  Second, I was talking to a colleague

last week, and his impression is that corpora such as these are quite

common, and that they've been around since the mid-1980s.  Since I

work primarily in Spanish, however, I'm less familiar with the underlying

structure of corpora in English and other languages, so I'm not so sure

that corpora such as these are in fact all that common.  Most of the

large publicly available corpora that I'm familiar with have (I believe)

an organization in which most of the annotation is in the corpus itself,

rather than in separate databases (based on n-grams) whose output is then

linked to the corpus itself.<br>

<br>

At any rate, I'd appreciate any references that you might have, and will

post a summary if there's interest.<br>

<br>

Thanks,<br>

<br>

Mark Davies<br>

<br>

<br>

==================================================== <br>

Mark Davies, Associate Professor, Spanish Linguistics <br>

4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300

<br>

309-438-7975 (voice) / 309-438-8083 (fax) <br>

<font color="#0000FF">

<a href="http://mdavies.for.ilstu.edu/" eudora="autourl"><u>http://mdavies.for.ilstu.edu</u></a><br>

</font>** Historical and dialectal Spanish and Portuguese syntax **

<br>

** Corpus design and use / Web-database scripting /  Distance

education ** <br>

=====================================================<br>

</html>