[Corpora-List] Do we still need language corpora?

CRuehlemann at aol.com CRuehlemann at aol.com
Fri Feb 4 19:29:42 UTC 2011


Martin,
 
Many important points have already been raised in the discussion; cf.
Janne Bondi's posting, which sums up most of the relevant advantages of
language corpora.
 
I'd like to add just a quick note to deepen a little what was said on
annotation. I fully agree that the availability of annotation gives
corpora a clear edge over the web-as-corpus, though I'm aware that not
everybody will agree that annotation is vital or even helpful. Sinclair,
for example, argued against annotated corpora and for raw-text corpora
instead, trusting the text more than linguistic markup. But how far will
raw text alone (of which there are masses on the web) get us in our quest
for understanding language and its main use, communication? Raw-text
analyses have given us, and can still give us, invaluable insights into
the nature and forms of what Sinclair called the 'idiom principle',
whereby words are co-selected phraseologically rather than selected
individually slot by slot. So, yes, raw-text analyses can get us far in
understanding how discourse is structured, or pre-structured, lexically,
and web-as-corpus analyses will probably not be much worse at that.
Raw-text analyses get us less far, though, when it comes to understanding
how discourse is structured in terms of lexis-independent discourse
units. To illustrate, an analysis which just crawls the surface of the
following text will probably miss the very point of it:
 
 
Did I tell you about her little one ... who had stomach pains? ... As
she come back she said Dad ... what? How long's our Mum going to be
before she comes in? Another hour. Oh. Why? Oh well I’ve got a bit of a
stomach ache and I want to talk to her you know it's women problems

What can lexically driven 'idiom principle' analyses tell us about this?
What good will analyses of n-grams/concgrams, collocation, colligation,
collostruction, semantic preference, semantic prosody, textual
colligation, etc. be in terms of what is, quite obviously, going on here
discursively, viz. a report of a sequence of speech turns? Not very much.
If the same text is annotated for reporting units (direct [MDD] and free
direct [MDF], in this case), analyses can be more revealing:

Did I tell you about her little one ... who had stomach pains? ... As
she come back she said [MDDDad] ... [MDFwhat?] [MDFHow long's our Mum
going to be before she comes in?] [MDFAnother hour.] [MDFOh.] [MDF Why?]
[MDFOh well I’ve got a bit of a stomach ache and I want to talk to her
you know it's women problems.]
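
As a rough illustration of the kind of query such markup opens up, here
is a minimal Python sketch that pulls the reporting units out of the
annotated text and counts them by type. It is only a sketch under
assumptions: the names UNIT and reporting_units are made up for this
example, only the two tags shown above (MDD, MDF) are handled, and units
are assumed to be unnested and to contain no closing bracket.

```python
import re
from collections import Counter

# Assumption: a unit looks like "[MDD...]" or "[MDF...]", with the tag fused
# to (or separated by whitespace from) its content, and no nesting.
UNIT = re.compile(r"\[(MDD|MDF)\s*([^\]]*)\]")


def reporting_units(text):
    """Return (tag, content) pairs for each annotated unit, in text order."""
    return [(m.group(1), m.group(2).strip()) for m in UNIT.finditer(text)]


# The annotated part of the example, copied from the message.
sample = (
    "As she come back she said [MDDDad] ... [MDFwhat?] "
    "[MDFHow long's our Mum going to be before she comes in?] "
    "[MDFAnother hour.] [MDFOh.] [MDF Why?] "
    "[MDFOh well I've got a bit of a stomach ache and I want to talk "
    "to her you know it's women problems.]"
)

units = reporting_units(sample)
counts = Counter(tag for tag, _ in units)

for tag, content in units:
    print(tag, "|", content)
print(dict(counts))
```

On the sample this finds one direct (MDD) unit followed by six free
direct (MDF) units, i.e. precisely the sequence of reported turns that a
purely lexical pass over the surface text has no handle on.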

To be sure, implementing that kind of discourse annotation not just in a
small text but in a larger corpus is hard and time-consuming work. It is
worth it, though, in that it can help us ask and answer questions that
lie beyond the scope of purely linear, surface-driven analyses. I'm
therefore confident that carefully annotated language corpora still have
a future ahead of them, and a particularly bright one for small corpora
with markup targeted at specific discourse phenomena that cannot, at
present, be annotated automatically.
 
Best
Chris  
 
 