<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<tt>I think Robert puts it pretty well. My reaction was simply to
look up 'corpus' in the online Oxford dictionary, where 'corpus'
has two senses: the main sense is "<span class="definition">a
collection of written texts, especially the entire works of a
particular author or a body of writing on a particular subject
(e.g., the Darwinian corpus)" and a <b>subsense</b>, "<span
class="definition">a collection of written or spoken material
in machine-readable form, assembled for the purpose of
linguistic research". I think these pretty well subsume and
obviate all the points made in this discussion.<br>
<br>
Ken<br>
<br>
</span></span></tt>
<div class="moz-cite-prefix">On 10/6/2012 12:08 PM,
<a class="moz-txt-link-abbreviated" href="mailto:amsler@cs.utexas.edu">amsler@cs.utexas.edu</a> wrote:<br>
</div>
<blockquote
cite="mid:20121006110825.ibpw38lv4scg8kck@webmail.utexas.edu"
type="cite">The simplest summary I came away with is that a corpus
is a set of
<br>
texts that has a proposed purpose of study. At least one person
must
<br>
have an intention for the collection to serve a purpose. The
<br>
unanswered question is whether a corpus even has to be texts, or
whether
<br>
it can be a corpus of other types of data, such as a corpus of lexical
<br>
items, a corpus of musical recordings, or a corpus of video clips.
<br>
<br>
This definition of a corpus means that it may not be recognized as
a
<br>
corpus by anyone other than its collector/creator. It may
appear
<br>
to be a random set of pages, a happenstance collection of books, etc.,
<br>
unless you figure out what they have in common. And note that
<br>
'randomness' is a purpose. Some of the most important corpora are
<br>
those whose purpose is to be a random (or 'representative')
<br>
sample of something. The Brown Corpus tried to be representative
by
<br>
being random. I suppose randomness requires that every instance of the
set
<br>
collected from have an equal chance of being included--and
<br>
representativeness requires that enough items be collected to reflect
the
<br>
properties of the set collected from. Ah... but which "properties",
eh?
<br>
<br>
This is why a corpus needs an explanation of its properties, its
<br>
reason for being a corpus, to guarantee its recognition as a
corpus
<br>
and its utility to others.
<br>
<br>
The discussion as to whether something deserves to be called a
corpus is picky.
<br>
As they say, we want a big tent that invites in as many as possible.
<br>
<br>
We should be discussing what constitutes "best practices" and not
trying to deny membership in the set of corpora to collections
that don't meet all the criteria. I'd be happier to learn of the
levels of qualifications that a corpus should have. Good
documentation. Availability. Size. "Representativeness" (of
what?). Annotations. Indexes of elements (spellings, phrases,
named entities, disambiguation of senses).
<br>
<br>
Working out how to make a corpus that adheres to "best practices"
would be more useful than deciding whether someone's purposeful
collection of text qualifies to be called a corpus by everyone.
<br>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Ken Litkowski TEL.: 301-482-0237
CL Research EMAIL: <a class="moz-txt-link-abbreviated" href="mailto:ken@clres.com">ken@clres.com</a>
9208 Gue Road Home Page: <a class="moz-txt-link-freetext" href="http://www.clres.com">http://www.clres.com</a>
Damascus, MD 20872-1025 USA Blog: <a class="moz-txt-link-freetext" href="http://www.clres.com/blog">http://www.clres.com/blog</a>
</pre>
</body>
</html>