[Corpora-List] What is corpora and what is not?

Khalid CHOUKRI choukri at elda.org
Mon Oct 8 15:01:59 UTC 2012



Those of you interested in general concepts may want to read
Ole Norling-Christensen's article "Habeas corpus",
published in the ELRA newsletter in ... 1996.

http://www.elra.info/Newsletters-from-1996.html#1996
Best regards
Khalid



Laurence Anthony wrote, On 08/10/2012 16:46:
> On Mon, Oct 8, 2012 at 9:32 PM, Krishnamurthy, Ramesh
> <r.krishnamurthy at aston.ac.uk>  wrote:
>
>>> It seems to me that many corpus studies attempt to describe *language
>>> usage in some target domain* based on the analysis of a corpus.
>> Language description may have been the focus in earlier corpus linguistics.
>> The field has developed since then, and many corpus studies use language
>> description as part of the means to making statements about wider social
>> issues, eg forensics, pedagogy, politics, etc?
> Agreed. But as you write yourself, "many corpus studies use language
> description" to make statements about these wider issues. In this
> case, your "language description" is my "describe language usage" and
> your "wider issues" is my "target domain". So, your example simply
> supports my earlier statement rather than contradicts it.
>
>
>>> I assume that the implication here is that the corpus is in some way
>>> representative of the target domain (for a particular feature). If it
>>> isn't and the corpus is simply "a (digitized) collection of texts", it
>>> means that none of these authors can *assume* that their results are
>>> generalizable in any way.
>> As in all fields, the corpus/dataset we have collected is all we can actually analyse.
>> The representativeness or not of this dataset to some other notional dataset is part
>> of the claim being made by the researcher, and readers can evaluate the degree of
>> validity of the claim.
> Ahh, my point is more subtle. If a researcher intends to make claims
> about the target domain based on a corpus and at the same time the
> researcher makes no assumptions about the representativeness of the
> corpus itself, i.e., it's just "a collection of texts", then the claim
> itself is unfounded. Only when the researcher assumes (rightly or
> wrongly) that the corpus is representative, can the claim be made. Of
> course, readers can then assess that claim based on a comparison with
> other data (corpora) and the data (corpus) itself, but this is a
> separate issue. My point here is looking at the claim from the
> researcher's perspective.
>
>> Surely no researcher can *assume* anything? The generalizability or not of their
>> statements/results is again a matter for reader judgment?
> Yes they can! Researchers can assume A, and then derive B from A. We
> do it all the time in science. Do a concordance search of any math
> corpus and you'll find assumptions everywhere. Here's a concrete
> example,
>
> "Assuming the Brown corpus is representative of general English, we
> find that the most frequent word used in general English is 'the'."
>
> Of course, the assumption may be wrong, and that's what the reader can
> judge ("Is the Brown corpus really representative of general
> English?"). But, the audience can also judge the correctness of the
> statement *in the case that the assumption is true*. Here, the
> judgement would be, "Is the most frequent word used in general
> English, based on frequencies in the Brown Corpus, the word 'the'?"
>
> They are two different levels of question. The first leads to better
> assumptions and thus advancement in our understanding of the concept
> (e.g., 'general English'). The second leads to more accurate results
> based on assumptions (e.g., ways to count word frequencies).
>
>
>> #5 Laurence wrote:
>>
>>> In our field, the corpus is the starting point. By comparing the
>>> results of previous corpus studies, we build *better* corpora (for a
>>> particular language feature), and ultimately better models (of that
>>> language feature).
>>
>>
>> I'm not sure what you mean by 'language feature'. The corpus is collected
>> on external criteria, the 'language features' emerge from the analysis?
> Here, I mean things like "connectives", "past tense", etc. "Past
> tense" does not emerge from the analysis. If we want to investigate
> "past tense usage in research paper methods sections", we could
> collect methods sections in applied linguistics research papers. But,
> someone could build a better corpus by collecting methods sections in
> multiple disciplines. I hope that clarifies this point.
>
>
>
>> Its (corpus) appropriacy or not - for a subsequently specified purpose - is
>> an evaluation we make in response to its use as a research dataset and
>> the 'possibly related' observations made from its analysis. The reader judges
>> the correlation between the data and the findings, and the extent of the
>> predictive (probabilistic) power of the statements to other texts/datasets.
> Agreed. But, see my earlier point about the need for the researcher to
> have a basis for making a claim in the first place. I would argue that
> the author *must* start with the assumption (see the point above) that
> the corpus is representative of the target domain (rightly or
> wrongly), before any claims can be made.
>
> So perhaps a good definition of corpus is the following:
>
> "A corpus is a collection of (digitized) texts that is *assumed to be*
> representative of a target domain."
>
> To me this captures everything we have discussed in this thread and
> also addresses the issues of representativeness mentioned above. The
> "assumed to be" wording is critical because it is the foundation of
> all linguistic inquiry but it also addresses the reality that the
> assumption might be wrong.
>
> Sorry to drag this discussion on even longer!
>
> Laurence.
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

-- 
*Khalid Choukri *
ELRA General secretary & ELDA CEO
email: choukri at elda.org;
Web: www.elra.info www.elda.org
Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30

****************************************************
** Info on LREC 2012 : www.lrec-conf.org
****************************************************