[Corpora-List] Web/Corpora Questions

Mon Oct 20 18:04:18 UTC 2003

Peet, you'll find several of these questions addressed (not necessarily answered satisfactorily) in papers on my website
http://kwicfinder.com/.  
Some of the papers I cite in the references will be useful as well.
(see esp. http://kwicfinder.com/AAACL2002whf.pdf i.a. Cavaglià and Kilgarriff, and Ide, Reppen and Suderman)

I haven't seen any recent estimates of the total number of pages on the Web, or of the distribution of text types and languages; for older figures, follow up the stale bibliography in 
http://kwicfinder.com/FletcherCLLT2001.pdf.
(I intend to search more assiduously for recent estimates for an update of that paper later this year, and have concrete plans to proceed with the linguistic search engine / web archive outlined in the TaLC paper during a sabbatical in 2004-05.)

Personally, I believe that for the major languages the Web is most useful for compiling ad-hoc corpora of texts dealing with specific domains or emerging usage, or else for answering specific questions such as the ongoing discussion about "personal price", where even large reference corpora such as the BNC have too few citations to give the whole picture. De Schryver makes a useful distinction between "Web as corpus" and "Web for (compiling a) corpus", in his case as a source of data for African languages with little if any machine-readable data.

(De Schryver, Gilles-Maurice, 2002. Web for / as Corpus: a Perspective for the African Languages. Nordic Journal of African Studies 11(2): 266-282.
http://www.up.ac.za/academic/libarts/afrilang/webtocorpus.pdf)

I'm looking forward to other responses to this posting!

Best regards,
Bill Fletcher

>>> "peetm" <peet.morris at comlab.ox.ac.uk> 10/20/03 10:37 AM >>>
I, like a lot of people, am interested in the idea of using the web as a
data source for corpus construction.

Saying that, I have some basic questions that I'd really appreciate hearing
views on.

1.	What do (various groups of) users of corpora actually want, need or
wish for from a corpus, and would 'web-text' meet these requirements?
2.	What are users' selection criteria in choosing a corpus?
3.	Does anyone know: what kinds of texts are available on the web, of
what quality, and in what quantities (is there any data on this)?
4.	How would one estimate the necessary size of a corpus (to be useful
for some purpose) built from web-texts using sampling theory etc?

If anyone knows of any papers on any/all of this - please do tell!

I'd also be interested in opinions on the statement (in answer to '3'), 'who
can tell?', i.e. it's nonsensical to even ask '3', because, as the web is
constantly changing, what can really be said about quantity, quality and the
text-types available etc.?  Does this also invalidate the second part of '1'
- if one cannot tell what one might find, how could one judge ahead of time
whether or not it'd meet 'any' requirement?

Lastly, I think that the web contains some text-types that are unique to it,
e.g., chat-room and blog texts.  However, I'm on a sticky wicket as I have
no proof that such text-types actually differ from texts found in
conventional corpora.  Does anyone know if there has been any examination of
this type of prose at all?  OR, if there hasn't, can someone suggest how
such an examination could be achieved?

Many thanks,

peetm

email: peet.morris at clg.ox.ac.uk
addr: Computational Linguistics Group
      University of Oxford
      The Clarendon Institute
      Walton Street
      Oxford
      OX1 2HG