<html>
I'd be interested in references or pointers to any large corpora that
have the following characteristics:<br>
<br>
1) Fairly large -- at least 50 million words<br>
<br>
2) Publicly available / obtainable, hopefully even via the web<br>
<br>
3) [And most important:] The organization of the corpus is more or less
as follows-- <br>
<br>
The corpus itself is only marginally annotated, if at all. However,
there are databases containing a list of all distinct n-grams (at least
1-, 2-, and 3-grams), which can be queried, and whose output can then be used
to search the actual corpus itself. Most importantly, these
databases of n-grams are linked to other databases that contain info on
POS, lemma, synonyms, etc. This joining of databases means that
searches can be made not just on the n-grams, but on POS and lemma as
well, allowing searches like (for Spanish):<br>
<br>
*.pn_obj querer.* *.v_inf<br>
[a clitic followed by any form of "querer" (to want) followed
by an infinitive]<br>
<br>
!mandar.* * *.v_subj_se<br>
[all of the forms of any synonym of "mandar" (to
order) followed two words later by a past subjunctive]<br>
<br>
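To make the idea concrete, here is a rough sketch of how the first query above might be resolved against linked tables. It is only an illustration, not an actual implementation: the table names (lexicon, trigrams), the column names, the tag v_pres and the sample frequencies are all hypothetical, and giving each word form a single POS tag is a deliberate simplification.<br>
<br>
```python
import sqlite3

# Hypothetical schema: one table of distinct 3-grams, plus a lexicon table
# that maps each word form to a lemma and a POS tag (one tag per form,
# which is a simplification).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lexicon  (word TEXT PRIMARY KEY, lemma TEXT, pos TEXT);
    CREATE TABLE trigrams (w1 TEXT, w2 TEXT, w3 TEXT, freq INTEGER);
""")

# Invented sample data, for illustration only.
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", [
    ("lo", "lo", "pn_obj"),
    ("me", "me", "pn_obj"),
    ("quiero", "querer", "v_pres"),
    ("quieres", "querer", "v_pres"),
    ("hacer", "hacer", "v_inf"),
    ("ver", "ver", "v_inf"),
    ("casa", "casa", "n"),
    ("la", "la", "det"),
])
conn.executemany("INSERT INTO trigrams VALUES (?, ?, ?, ?)", [
    ("lo", "quiero", "hacer", 12),
    ("me", "quieres", "ver", 5),
    ("la", "casa", "quiero", 2),
    ("lo", "quiero", "casa", 1),
])

# Equivalent of:  *.pn_obj querer.* *.v_inf
# Each slot of the 3-gram is joined to the lexicon separately, so one slot
# can be restricted by POS and another by lemma.
query = """
    SELECT t.w1, t.w2, t.w3, t.freq
    FROM trigrams AS t
    JOIN lexicon AS a ON a.word = t.w1
    JOIN lexicon AS b ON b.word = t.w2
    JOIN lexicon AS c ON c.word = t.w3
    WHERE a.pos = 'pn_obj'
      AND b.lemma = 'querer'
      AND c.pos = 'v_inf'
    ORDER BY t.freq DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```
<br>
The matching 3-grams returned by such a query would then serve as the search strings for the corpus text itself.<br>
<br>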
Since the lists of n-grams are merely linked to other databases
containing POS, lemma, synonyms, etc, the number of annotation levels is
essentially unlimited. It's just a function of however many
separate databases a person wants to create and link to the main n-grams
database. This would even allow users of the corpus to create their
own "custom lists" of words, which could be stored in a certain
database, and then used as part of the syntax for subsequent
searches.<br>
<br>
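A user-defined list of this kind could be as simple as one more linked table. The sketch below assumes a hypothetical custom_lists table keyed by list name and lemma; the list names, word forms and frequencies are invented for illustration.<br>
<br>
```python
import sqlite3

def search_with_list(conn, list_name):
    """Return 2-grams whose first word is a form of a lemma on the named list."""
    return conn.execute("""
        SELECT b.w1, b.w2, b.freq
        FROM bigrams AS b
        JOIN lexicon AS x ON x.word = b.w1
        JOIN custom_lists AS c ON c.lemma = x.lemma
        WHERE c.list_name = ?
        ORDER BY b.freq DESC
    """, (list_name,)).fetchall()

# Hypothetical schema and invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lexicon      (word TEXT PRIMARY KEY, lemma TEXT);
    CREATE TABLE bigrams      (w1 TEXT, w2 TEXT, freq INTEGER);
    CREATE TABLE custom_lists (list_name TEXT, lemma TEXT);
""")
conn.executemany("INSERT INTO lexicon VALUES (?, ?)", [
    ("mandaron", "mandar"),
    ("ordena", "ordenar"),
    ("exigen", "exigir"),
    ("dijo", "decir"),
])
conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?)", [
    ("mandaron", "que", 40),
    ("ordena", "que", 9),
    ("dijo", "que", 100),
    ("exigen", "la", 3),
])
# A user stores two lists; either one can be named in later searches.
conn.executemany("INSERT INTO custom_lists VALUES (?, ?)", [
    ("order_verbs", "mandar"),
    ("order_verbs", "ordenar"),
    ("order_verbs", "exigir"),
    ("speech_verbs", "decir"),
])

order_hits = search_with_list(conn, "order_verbs")
speech_hits = search_with_list(conn, "speech_verbs")
print(order_hits)
print(speech_hits)
```
<br>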
In addition, since the databases are fairly static, they can contain
frequency information that can be included as part of the search, e.g.
all of the 2-grams whose second element is a synonym of a
given word, and which appear more than three times in a given segment of
the corpus.<br>
<br>
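A frequency-restricted search like the one just described might look roughly like this. Again, this is a sketch under assumptions: the per-segment frequency table, the synonyms table, the segment labels (seg_a, seg_b) and all counts are hypothetical.<br>
<br>
```python
import sqlite3

def frequent_synonym_bigrams(conn, lemma, segment, min_freq):
    """2-grams in one segment whose second word is a form of a synonym of
    `lemma` and whose frequency in that segment exceeds `min_freq`."""
    return conn.execute("""
        SELECT b.w1, b.w2, b.freq
        FROM bigram_freq AS b
        JOIN lexicon  AS x ON x.word = b.w2
        JOIN synonyms AS s ON s.synonym = x.lemma
        WHERE s.lemma = ?
          AND b.segment = ?
          AND b.freq > ?
        ORDER BY b.freq DESC
    """, (lemma, segment, min_freq)).fetchall()

# Hypothetical schema and invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lexicon     (word TEXT PRIMARY KEY, lemma TEXT);
    CREATE TABLE synonyms    (lemma TEXT, synonym TEXT);
    CREATE TABLE bigram_freq (w1 TEXT, w2 TEXT, segment TEXT, freq INTEGER);
""")
conn.executemany("INSERT INTO lexicon VALUES (?, ?)", [
    ("manda", "mandar"),
    ("ordena", "ordenar"),
    ("exige", "exigir"),
    ("come", "comer"),
])
conn.executemany("INSERT INTO synonyms VALUES (?, ?)", [
    ("mandar", "ordenar"),
    ("mandar", "exigir"),
])
conn.executemany("INSERT INTO bigram_freq VALUES (?, ?, ?, ?)", [
    ("le", "ordena", "seg_a", 7),
    ("le", "ordena", "seg_b", 2),
    ("se", "exige", "seg_a", 3),
    ("le", "exige", "seg_a", 4),
    ("le", "come", "seg_a", 10),
    ("le", "manda", "seg_a", 8),
])

# Synonyms of "mandar" in second position, more than three times in seg_a.
hits = frequent_synonym_bigrams(conn, "mandar", "seg_a", 3)
print(hits)
```
<br>
Because the frequencies are stored with the n-grams, the threshold is applied in the database, before the corpus text is ever touched.<br>
<br>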
My reason for asking is two-fold. First, I'm working on a corpus
similar to this for Spanish, and would like to look at other corpora that
have taken the same approach. Second, I was talking to a colleague
last week, and his impression is that corpora such as these are quite
common, and that they've been around since the mid-1980s. Since I
work primarily in Spanish, however, I'm less familiar with the underlying
structure of corpora in English and other languages, so I'm not so sure
that corpora such as these are in fact all that common. Most of the
large publicly available corpora that I'm familiar with have (I believe)
an organization in which most of the annotation is in the corpus itself,
rather than in separate databases (based on n-grams) whose output is then
linked to the corpus itself.<br>
<br>
At any rate, I'd appreciate any references that you might have, and will
post a summary if there's interest.<br>
<br>
Thanks,<br>
<br>
Mark Davies<br>
<br>
<br>
==================================================== <br>
Mark Davies, Associate Professor, Spanish Linguistics <br>
4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
<br>
309-438-7975 (voice) / 309-438-8083 (fax) <br>
<font color="#0000FF">
<a href="http://mdavies.for.ilstu.edu/" eudora="autourl"><u>http://mdavies.for.ilstu.edu</u></a><br>
</font>** Historical and dialectal Spanish and Portuguese syntax **
<br>
** Corpus design and use / Web-database scripting / Distance
education ** <br>
=====================================================<br>
</html>