[Corpora-List] Web/Corpora Questions

peetm peet.morris at comlab.ox.ac.uk
Mon Oct 20 14:37:02 UTC 2003


I, like a lot of people, am interested in the idea of using the web as a
data source for corpus construction.

 

That said, I have some basic questions that I'd really appreciate hearing
views on.

 

1.	What do (various groups of) users of corpora actually want, need or
wish for from a corpus, and would 'web-text' meet these requirements?
2.	What are users' selection criteria when choosing a corpus?
3.	Does anyone know what kinds of texts are available on the web, of
what quality, and in what quantities (is there any data on this)?
4.	How would one estimate the necessary size of a corpus (to be useful
for some purpose) built from web-texts, using sampling theory etc.?

 

If anyone knows of any papers on any/all of this - please do tell!

 

I'd also be interested in opinions on the response (in answer to '3') 'who
can tell?', i.e. that it's nonsensical to even ask '3' because, as the web is
constantly changing, what can really be said about the quantity, quality and
text-types available?  Does this also invalidate the second part of '1'?
If one cannot tell what one might find, how could one judge ahead of time
whether or not it would meet any requirement?

 

Lastly, I think that the web contains some text-types that are unique to it,
e.g., chat-room and blog texts.  However, I'm on a sticky wicket as I have
no proof that such text-types actually differ from texts found in
conventional corpora.  Does anyone know if there has been any examination of
this type of prose at all?  Or, if there hasn't, can someone suggest how
such an examination could be carried out?

 

Many thanks,

peetm

email: peet.morris at clg.ox.ac.uk
addr: Computational Linguistics Group
      University of Oxford
      The Clarendon Institute
      Walton Street
      Oxford
      OX1 2HG

