[Corpora-List] What is corpora and what is not?

Wed Oct 3 20:04:07 UTC 2012

The Corpus Iuris Civilis is before my time, but
with the electronic texts currently available in
web pages, wikis, news articles, SMS messages,
emails, databases, and many other sources, the
problem today is to organize a corpus based on the
criteria you want to explore.

So given a buncha texts from these sources, the
issue is more about how to compartmentalize the
samples by whatever criteria you want to study,
partition them into equivalence classes according
to the annotations you give them, and THEN you
have a corpus (or even corpora).  But the
restriction to purely linguistic purposes is no
longer representative of usage; there are many,
many criteria you could use.
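
To make the partitioning idea concrete, here is a minimal Python sketch. It assumes each sample is a dict with hypothetical "source" and "text" fields, and that the criterion is any function mapping a sample to a label; the names partition, samples, and by_source are illustrative only and do not come from any particular corpus toolkit.

```python
from collections import defaultdict


def partition(samples, criterion):
    """Group samples into equivalence classes keyed by criterion(sample)."""
    classes = defaultdict(list)
    for sample in samples:
        classes[criterion(sample)].append(sample)
    return dict(classes)


# Hypothetical, made-up samples; real ones would carry richer annotations.
samples = [
    {"source": "email", "text": "Meeting moved to Friday."},
    {"source": "sms", "text": "running late"},
    {"source": "news", "text": "Markets closed higher today."},
    {"source": "email", "text": "Please find the report attached."},
]

# Each class is one candidate corpus; a different criterion gives different corpora.
by_source = partition(samples, lambda s: s["source"])
by_length = partition(samples, lambda s: "short" if len(s["text"]) < 20 else "long")
```

Swapping in a different criterion function (topic, sentiment, author, date) yields a different set of corpora from the same pile of texts, which is the point of the paragraph above.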

For example, reputation management is widely used
today by big companies wanting to know what their
community is saying about them.  A corpus could
just as well be organized by market influences,
patent texts, economic metrics, or many other
criteria.

It's gotten a lot less simple than just
linguistics,
-Rich

Sincerely,
Rich Cooper
EnglishLogicKernel.com
Rich AT EnglishLogicKernel DOT com
9 4 9 \ 5 2 5 - 5 7 1 2
-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Graham White
Sent: Wednesday, October 03, 2012 10:57 AM
To: corpora at uib.no
Subject: Re: [Corpora-List] What is corpora and what is not?

So the Corpus Iuris Civilis is not a corpus? This seems an unusual way
to define things, firstly because it unduly privileges the medium of
representation (and, as a computer scientist, that seems to me to be a
mistake), and, secondly, because it rather orphans corpora which happen
to be machine-readable: it ignores the considerable continuities between
what scholars do with machine-readable texts and what scholars do with
non-machine-readable texts. Why, after all, do we have machine-readable
corpora other than that we are interested in human linguistic practices?
Machine-readable corpora don't drop from space, after all.

Graham

On 03/10/12 18:43, WILLIAMS Geoffrey wrote:
> Are we not slightly reinventing the wheel?
>
> The nature of corpora has been discussed for years, EAGLES was about
> defining it. In 2005, John Sinclair enlarged upon the 1996 definition
> when he wrote:
>
>> A corpus is a collection of pieces of language text in electronic
>> format, selected according to external criteria to represent, as far
>> as possible, a language or language variety as a source of data for
>> linguistic research.
> Sinclair, J. McH. 2005. 'Corpus and Text: Basic Principles'. In Wynne,
> M (ed). 2005. pp. 1-16.  Wynne, M (ed). 2005. Developing Linguistic
> Corpora: A Guide to Good Practice. Oxford: AHDS 6 -
>
> It is also on the web!
>
> Surely anyone involved in corpora has read the seminal works and does
> not need reminding that corpora are machine-readable, maybe samples or
> whole works etc. What has changed is the rise of internet corpora, but
> here too Kilgarriff and others have commented on the situation in a way
> that both NLP and corpus linguistic users can feel at home with.
>
> Best
>
> Geoffrey
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
