[Corpora-List] ACL proceedings paper in the American National Corpus

Nancy Ide ide at cs.vassar.edu
Mon Sep 30 17:14:51 UTC 2002


On Monday, September 30, 2002, at 08:14 AM, Adam Kilgarriff wrote:
> This sort of thing can only be avoided by not having too much
> data of any single specialised data type.  I would recommend a limit of
> 0.5% for linguistics papers in general, with no subspecialism (eg
> computational linguistics, or, worse still, parsing) taking more than
> a quarter of that, and a limit of 10,000 words from any single
> document.

Until we actually extract the papers from the ACL data, I have no idea
what the size of the portion included in the ANC will be. However, it
will certainly be a tiny percentage of the core corpus, if not the
entire ANC.

Our goal is to eventually produce a core corpus containing 100 million
words, comparable in representative distribution to the BNC (for
comparison purposes). However, note that unlike the BNC, the ANC will
include, beyond the 100 million word core, a "varied" component
consisting of whatever we can get our hands on. These texts will be
identified by source/genre, and can be used or discarded as desired by
the user. If in fact the ACL materials seem to comprise a larger
percentage of the core corpus than reasonable, the rest will be put
into the varied component.

>
> BNC used a sample size from a single document of 40,000 words as its
> default.  However most of these documents weren't too specialised so
> it didn't cause too many problems.  It's the combination of
> substantial samples with narrow text-types that is invidious.

We are certainly aware of this and working to ensure a broad sample. We
too are sampling texts, taking only a certain number of words from each.

>
> I've only referred to distorted frequency lists in the above.  They
> are the easiest effect of distortion to describe.  There will also be
> distortions of all sorts of other language-model components (bigrams,
> trigrams, grammars, induced lexicons etc)  - the problem is,
> it's hard to describe what or how and the distortions will usually
> go unnoticed, or even feature as "interesting discoveries about the
> lg".  That's why it's important to beware these balance issues when
> building a corpus in the first place.

We are. And we appreciate input such as yours, above, on how to best
achieve something reasonable.


> Technical language is an important part of
> language, and we are undermining an open-minded view of language if we
> exclude technical language wholesale.

Agreed!

> Maybe the corpus just needs to
> be much (MUCH) bigger so it can include substantial quantities of lots
> of different specialist text types, with none making up more than 0.1%
> of the whole (hey, I know a corpus like that, it's called the web ;-) )

This is really what the ANC hopes to be in the end. The rationale
behind the varied component is just that: put in what you can get, and
it should be possible to construct a sub-corpus from that data on the
basis of your own criteria, given that one can make a selection based
on text type/source.

As for the web, yes you have lots of specialized text types there--and
that is just the problem if one wants data that covers generalized
language usage. A small experiment we did and reported at LREC last May
suggested that web language on the whole is dramatically skewed toward
dense, academic-like prose (see Ide, N., Reppen, R., Suderman, K.
(2002). The American National Corpus: More Than the Web Can Provide.
Proceedings of the Third Language Resources and Evaluation Conference
(LREC), Las Palmas, Canary Islands, Spain, 839-44. Available at
http://www.cs.vassar.edu/~ide/papers/anc-lrec02.ps). We argue,
therefore, that no matter how much data you cull from the web, it will
be significantly skewed toward one end of the spectrum of "style" or
type.

A final point: The first release of 10 million words of the ANC, due
out in a month or so, will not be at all balanced--it will consist of
whatever data we have so far, as we are constrained by which texts have
been provided at what point and how much processing is required to put
them in a usable format. The intent of the first release is to provide
something quickly for our consortium members, so that they can test the
(very minimal) search and access interface and provide input for the
design of the final one. We assume, however, that many researchers will
also use it, whatever the content.

Nancy
======================================================

Nancy Ide

Professor and Chair
Department of Computer Science, Vassar College
Poughkeepsie, NY 12604-0520 USA
Tel: +1 845 437-5988 Fax: +1 845 437-7498
ide at cs.vassar.edu

Chercheur Associe
Equipe Langue et Dialogue, LORIA/CNRS
Campus Scientifique - BP 239
54506 Vandoeuvre-les-Nancy FRANCE
Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
ide at loria.fr

=======================================================


