[Corpora-List] What I came away with from the "What is a Corpus" discussion

amsler at cs.utexas.edu amsler at cs.utexas.edu
Sat Oct 6 16:08:25 UTC 2012


The simplest summary I came away with is that a corpus is a set of
texts that has a proposed purpose of study. At least one person must
have an intention for the collection to serve a purpose. The
unanswered question is whether a corpus even has to consist of texts,
or whether it can be a corpus of other types of data, such as a corpus
of lexical items, a corpus of musical recordings, or a corpus of video
clips.

This definition of a corpus means that it may not be recognized as a
corpus by anyone other than its collector/creator. It may appear
to be a random set of pages, a happenstance collection of books, etc.
unless you figure out what they share in common. And note that
'randomness' is a purpose. Some of the most important corpora are
those whose purpose is to be a random (or 'representative')
sample of something. The Brown Corpus tried to be representative by
being random. I suppose randomness requires that every instance of the
set collected from has an equal chance of being included, and
representativeness requires that enough items are collected to reflect
the properties of the set collected from. Ah... but what "properties", eh?
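
To make the distinction concrete, here is a minimal sketch in Python of
what I mean. It assumes the "set collected from" can be listed in full,
and everything in it (the text IDs, the "genre" label, the function
names) is invented for illustration. It is not how the Brown Corpus or
any other real corpus was built. The first function gives every item an
equal chance of inclusion; the second compares one chosen property
between the full set and the sample.

```python
import random


def draw_random_sample(population, k, seed=None):
    """Simple random sample without replacement.

    Every item in `population` has the same chance of being chosen.
    `seed` only makes the draw repeatable.
    """
    rng = random.Random(seed)
    return rng.sample(list(population), k)


def proportions(items, key):
    """Share of items carrying each value of one chosen property."""
    counts = {}
    for item in items:
        label = key(item)
        counts[label] = counts.get(label, 0) + 1
    total = len(items)
    return {label: n / total for label, n in counts.items()}


# Hypothetical population: 1,000 text IDs, each tagged with a made-up genre.
population = [
    ("text%04d" % i, "press" if i % 4 == 0 else "fiction")
    for i in range(1000)
]


def genre(text):
    return text[1]


sample = draw_random_sample(population, 100, seed=42)

# Compare the chosen property in the full set and in the sample.
print(proportions(population, genre))
print(proportions(sample, genre))
```

The sample can only be checked against properties somebody thought to
record, here one invented genre label, which is the "what properties"
problem again.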

This is why a corpus needs an explanation of its properties and of the
reason it is a corpus: to guarantee its recognition as a corpus
and its utility to others.

The discussion as to whether something deserves to be called a corpus
is picky.
As they say, we want a big tent that invites in as many as possible.

We should be discussing what constitutes "best practices" and not  
trying to deny membership in the set of corpora to collections that  
don't meet all the criteria. I'd be happier to learn of the levels of  
qualifications that a corpus should have. Good documentation.  
Availability. Size. "Representativeness" (of what?). Annotations.  
Indexes of elements (spellings, phrases, named entities,  
disambiguation of senses).

Working out how to make a corpus that adheres to "best practices" would
be more useful than deciding whether someone's purposeful collection of
text qualifies to be called a corpus by everyone.
