<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML xmlns:o = "urn:schemas-microsoft-com:office:office"><HEAD>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
<META content="MSHTML 6.00.6001.18542" name=GENERATOR></HEAD>
<BODY id=role_body style="FONT-SIZE: 10pt; COLOR: #000000; FONT-FAMILY: Arial" bottomMargin=7 leftMargin=7 topMargin=7 rightMargin=7><FONT id=role_document face=Arial color=#000000 size=2>
<DIV>Martin,</DIV>
<DIV> </DIV>
<DIV>Many important points have already been raised in the discussion; cf. Janne
Bondi's posting, which sums up most of the relevant advantages of language
corpora. </DIV>
<DIV> </DIV>
<DIV>I'd like to add just a quick note to expand a little on what was said about
annotation. I fully agree that the availability of annotation gives corpora
a clear edge over the web-as-corpus, though I'm aware that not
everybody will agree that annotation is vital or even helpful. Sinclair, for
example, argued against annotated corpora and, instead, for raw-text
corpora, trusting the text more than linguistic markup. But how far will raw
text alone (of which there are masses on the web) get us in our quest for
understanding language and its main use, communication? Raw-text analyses
have given us, and can still give us, invaluable insights into the nature and forms
of what Sinclair called the 'idiom principle', whereby words are co-selected
phraseologically rather than selected individually slot by slot. So, yes,
raw-text analyses can get us far in understanding how discourse is
lexically structured or pre-structured, and web-as-corpus analyses will
probably not be much worse at that. Raw-text analyses get us less far,
though, when it comes to understanding how discourse is
structured in terms of lexis-independent discourse units. To illustrate, an
analysis which just crawls the surface of the following text will probably miss
the very point of it:</DIV>
<DIV> </DIV>
<DIV>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><SPAN lang=EN-GB style="mso-ansi-language: EN-GB; mso-fareast-language: DE"><FONT size=3><FONT face=Calibri>Did I tell you about her little one<SPAN style="mso-spacerun: yes"> </SPAN>...<SPAN style="mso-spacerun: yes"> </SPAN>who had stomach pains? ...<SPAN style="mso-spacerun: yes"> </SPAN>As she come back she said <SPAN style="mso-bidi-font-weight: bold">Dad</SPAN> ... <SPAN style="mso-bidi-font-weight: bold">what? How long's our Mum going to be before
she comes in?</SPAN> <SPAN style="mso-bidi-font-weight: bold">Another
hour.</SPAN> <SPAN style="mso-bidi-font-weight: bold">Oh.</SPAN>
<SPAN style="mso-bidi-font-weight: bold">Why?</SPAN> <SPAN style="mso-bidi-font-weight: bold">Oh well I’ve got a bit of a stomach ache and
I want to talk to her you know it's women
problems</SPAN></FONT></FONT></SPAN></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><SPAN lang=EN-GB style="mso-ansi-language: EN-GB; mso-fareast-language: DE"><FONT size=3><FONT face=Calibri><SPAN style="mso-bidi-font-weight: bold"></SPAN></FONT></FONT></SPAN> </P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><SPAN lang=EN-GB style="mso-ansi-language: EN-GB; mso-fareast-language: DE"><FONT size=+0><SPAN style="mso-bidi-font-weight: bold">What can lexically driven 'idiom
principle' analyses tell us about this? What good will analyses of
n-grams/concgrams, collocation, colligation, collostruction, semantic preference,
semantic prosody, textual colligation, etc. be in capturing what is, quite
obviously, going on here discursively, viz. a report of a sequence of speech
turns? Not very much. If the same text is annotated for reporting units (direct
[MDD] and free direct [MDF], in this case), analyses can be more
revealing:</SPAN></FONT></SPAN></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><SPAN lang=EN-GB style="mso-ansi-language: EN-GB; mso-fareast-language: DE"><o:p><FONT face=Calibri size=3> </FONT></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><SPAN lang=EN-GB style="mso-ansi-language: EN-GB; mso-fareast-language: DE"><FONT size=3><FONT face=Calibri>Did I tell you about her little one<SPAN style="mso-spacerun: yes"> </SPAN>...<SPAN style="mso-spacerun: yes"> </SPAN>who had stomach pains? ...<SPAN style="mso-spacerun: yes"> </SPAN>As she come back she said
[<SUP>MDD</SUP><SPAN style="mso-bidi-font-weight: bold">Dad</SPAN>] ...<SPAN style="mso-spacerun: yes"> </SPAN>[<SUP>MDF</SUP><SPAN style="mso-bidi-font-weight: bold">what?</SPAN>] [<SUP>MDF</SUP><SPAN style="mso-bidi-font-weight: bold">How long's our Mum going to be before she
comes in?</SPAN>] [<SUP>MDF</SUP><SPAN style="mso-bidi-font-weight: bold">Another hour.</SPAN>] [<SUP>MDF</SUP><SPAN style="mso-bidi-font-weight: bold">Oh.</SPAN>] [<SUP>MDF</SUP><SPAN style="mso-bidi-font-weight: bold">Why?</SPAN>] [<SUP>MDF</SUP><SPAN style="mso-bidi-font-weight: bold">Oh well I’ve got a bit of a stomach ache and
I want to talk to her you know it's women problems.</SPAN>]
</FONT></FONT></SPAN></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><SPAN lang=EN-GB style="mso-ansi-language: EN-GB; mso-fareast-language: DE"><FONT size=3><FONT face=Calibri></FONT></FONT></SPAN> </P></DIV>
<DIV>To be sure, implementing that kind of discourse annotation not just in
a short text but across a larger corpus is hard and time-consuming work, but it is
worth it because it helps us ask and answer questions that lie beyond the scope of
purely linear, surface-driven analyses. I'm therefore confident that carefully
annotated language corpora still have a future, and a particularly
bright one for small corpora with markup targeted at specific
discourse phenomena that cannot, at present, be annotated
automatically.</DIV>
<DIV> </DIV>
<DIV>Best</DIV>
<DIV>Chris </DIV>
<DIV> </DIV>
<DIV> </DIV></FONT></BODY></HTML>