[Corpora-List] Do we still need language corpora?
CRuehlemann at aol.com
Fri Feb 4 19:29:42 UTC 2011
Martin,
Many important points have already been raised in the discussion; cf. Janne
Bondi's posting which sums up most of the relevant advantages of language
corpora.
I'd like to add just a quick note to deepen a little what was said on
annotation. I fully agree that the availability of annotation in corpora gives
them a clear edge over the web-as-corpus though I'm aware that not
everybody will agree that annotation is vital or even helpful. Sinclair, for
example, argued against annotated corpora and, instead, for raw-text corpora,
trusting the text more than linguistic markup. But raw text alone (of which
there are masses on the web) - how far will it get us in our quest for
understanding language and its main use, communication? Raw-text analyses have
given us, and can still give us, invaluable insights into the nature and forms
of what Sinclair called the 'idiom principle', whereby words are co-selected
phraseologically rather than selected individually slot by slot. So, yes,
raw-text analyses can get us far in understanding how discourse is lexically
structured or pre-structured, and web-as-corpus analyses will probably not be
much worse at that. Raw-text analyses get us less far, though, when it comes to
understanding how discourse is structured in terms of lexis-independent
discourse units. To illustrate, an analysis which just crawls the surface of
the following text will probably miss the very point of it:
Did I tell you about her little one ... who had stomach pains? ... As
she come back she said Dad ... what? How long's our Mum going to be before
she comes in? Another hour. Oh. Why? Oh well I’ve got a bit of a stomach ache
and I want to talk to her you know it's women problems
What can lexically-driven 'idiom principle' analyses tell us about this?
What good will analyses of ngrams/concgrams, collocation, colligation,
collostruction, semantic preference, semantic prosody, textual colligation, etc.
be in terms of what is, quite obviously, going on here at the discourse level, viz.
a report of a sequence of speech turns? Not very much. If the same text is
annotated for reporting units (direct [MDD] and free direct [MDF], in this
case), analyses can be more revealing:
Did I tell you about her little one ... who had stomach pains? ... As
she come back she said [MDDDad] ... [MDFwhat?] [MDFHow long's our Mum going
to be before she comes in?] [MDFAnother hour.] [MDFOh.] [MDFWhy?] [MDFOh
well I’ve got a bit of a stomach ache and I want to talk to her you know
it's women problems.]
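To make the gain concrete, here is a minimal sketch of how such markup could be queried. It assumes the bracket notation used in the example above, with exactly the two three-letter tags MDD and MDF; the function name extract_reporting_units and the regular expression are illustrative only, not part of any existing tool or annotation scheme.

```python
import re
from collections import Counter

# Assumed tag inventory, taken from the example above:
# MDD = direct reporting unit, MDF = free direct reporting unit.
# The tag may be fused with the text or followed by a space.
UNIT_PATTERN = re.compile(r"\[(MDD|MDF)\s*([^\]]*)\]")


def extract_reporting_units(text):
    """Return (tag, utterance) pairs in order of occurrence."""
    # Collapse line breaks so units that wrap across lines stay intact.
    flat = " ".join(text.split())
    return [(tag, body.strip()) for tag, body in UNIT_PATTERN.findall(flat)]


annotated = """Did I tell you about her little one ... who had stomach pains? ... As
she come back she said [MDDDad] ... [MDFwhat?] [MDFHow long's our Mum going
to be before she comes in?] [MDFAnother hour.] [MDFOh.] [MDF Why?] [MDFOh
well I've got a bit of a stomach ache and I want to talk to her you know
it's women problems.]"""

units = extract_reporting_units(annotated)
counts = Counter(tag for tag, _ in units)

# List the reported turns in sequence, then the number of units per type.
for tag, body in units:
    print(tag, body)
print(dict(counts))
```

With the units extracted, questions such as how many reported turns follow a single reporting clause, or how often direct reports give way to free direct ones, become simple counts over the annotation rather than guesses from the text surface.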
To be sure, implementing that kind of discourse annotation not just in a
small text but in a larger corpus is hard and time-consuming work, but it is
worth it in that it can help us ask and answer questions that lie beyond the
scope of purely linear, surface-driven analyses. I'm therefore confident that
carefully annotated language corpora still have a future ahead of them, and a
particularly bright one for small corpora with markup targeted at specific
discourse phenomena whose annotation cannot, at present, be automated.
Best
Chris