[Corpora-List] What I came away with from the "What is a Corpus" discussion

Ken Litkowski ken at clres.com
Sat Oct 6 16:29:38 UTC 2012


I think Robert puts it pretty well. My reaction was simply to look up 
'corpus' in the online Oxford dictionary, where 'corpus' has two senses: 
the main sense is "a collection of written texts, especially the entire 
works of a particular author or a body of writing on a particular 
subject (e.g., the Darwinian corpus)" and a *subsense*, "a collection of 
written or spoken material in machine-readable form, assembled for the 
purpose of linguistic research". I think these pretty well subsume and 
obviate all the points made in this discussion.

     Ken

On 10/6/2012 12:08 PM, amsler at cs.utexas.edu wrote:
> The simplest summary I came away with is that a corpus is a set of
> texts that has a proposed purpose of study. At least one person must
> have an intention for the collection to serve a purpose. The
> unanswered question is whether a corpus even has to consist of texts,
> or whether it can be a corpus of other types of data, such as a corpus
> of lexical items, a corpus of musical recordings, or a corpus of video clips.
>
> This definition of a corpus means that it may not be recognized as a
> corpus by anyone other than its collector/creator. It may appear
> to be a random set of pages, a happenstance collection of books, etc.,
> unless you figure out what they share in common. And note that
> 'randomness' is a purpose. Some of the most important corpora are
> those whose purpose is to be a random (or 'representative')
> sample of something. The Brown Corpus tried to be representative by
> being random. I suppose randomness requires that every instance of the
> set collected from had an equal chance of being included--and
> representativeness requires that enough items are collected to reflect
> the properties of the set collected from. Ah... but what "properties", eh?
>
> This is why a corpus needs an explanation of its properties, the
> reason for its being a corpus, to guarantee its recognition as a corpus
> and its utility to others.
>
> The discussion as to whether something deserves to be called a corpus
> is picky. As they say, we want a big tent that invites in as many as
> possible.
>
> We should be discussing what constitutes "best practices" and not 
> trying to deny membership in the set of corpora to collections that 
> don't meet all the criteria. I'd be happier to learn of the levels of 
> qualifications that a corpus should have. Good documentation. 
> Availability. Size. "Representativeness" (of what?). Annotations. 
> Indexes of elements (spellings, phrases, named entities, 
> disambiguation of senses).
>
> How to make a corpus that adheres to "best practices" would be more 
> useful than deciding whether someone's purposeful collection of 
> text qualifies to be called a corpus by everyone.
>
> ----- End forwarded message -----
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

-- 
Ken Litkowski                     TEL.: 301-482-0237
CL Research                       EMAIL: ken at clres.com
9208 Gue Road                     Home Page: http://www.clres.com
Damascus, MD 20872-1025 USA       Blog: http://www.clres.com/blog
