[Corpora-List] Do we still need language corpora?

Serge Sharoff s.sharoff at leeds.ac.uk
Sat Feb 5 10:35:59 UTC 2011


Dear Janne, Chris and all,

this all goes back to the old-ish argument (I just wanted to type "old",
but can we say that an argument running since 2003-2005 is old?) that you
cannot remain at the mercy of the major search engines to do linguistic
analysis.  However, this means that Web-as-Corpus needs to proceed by
collecting web pages, annotating them automatically (POS, parsing,
domain, genre, etc.) and using them through a language-aware query
interface.  Adam Kilgarriff was probably the first to put this argument
forward; Phil Resnik, Marco Baroni, myself and many other people have
since contributed corpora, tools and interfaces.  Uni Oslo's noWaC is a
case in point.  Marco and his colleagues created the two-billion-word
ukWaC and parsed it syntactically (using MaltParser), and I added a
BNC-like domain and genre annotation layer to it, so the web is at your
fingertips.

Serge


On Fri, 2011-02-04 at 19:29 +0000, CRuehlemann at aol.com wrote:
> Martin,
>  
> Many important points have already been raised in the discussion; cf.
> Janne Bondi's posting which sums up most of the relevant advantages of
> language corpora. 
>  
> I'd like to add just a quick note to deepen a little what was said on
> annotation. I fully agree that the availability of annotation in
> corpora gives them a clear edge over the web-as-corpus though I'm
> aware that not everybody will agree that annotation is vital or even
> helpful. Sinclair, for example, argued against annotated corpora and,
> instead, for raw-text corpora, trusting the text more than linguistic
> markup. But raw text alone (of which there are masses on the web) -
> how far will it get us in our quest for understanding language and its
> main use, communication? Raw-text analyses have given/can give us
> invaluable insights into the nature and forms of what Sinclair called
> the 'idiom principle', whereby words are co-selected phraseologically
> rather than selected individually slot by slot. So, yes, raw-text
> analyses can get us far in understanding how discourse is
> structured/pre-structured lexically and web-as-corpus analyses will
> probably not be much worse at that. Raw-text analyses can get us less
> far though when it comes to understanding how discourse is structured
> in terms of lexis-independent discourse units. To illustrate, an
> analysis which just crawls the following text surface will probably
> miss the very point of it:
>  
> Did I tell you about her little one  ...  who had stomach pains? ...
> As she come back she said Dad ... what? How long's our Mum going to be
> before she comes in? Another hour. Oh. Why? Oh well I’ve got a bit of
> a stomach ache and I want to talk to her you know it's women problems
> 
>  
> 
> What can lexically-driven 'idiom principle' analyses tell us about
> this? What good will analyses of ngrams/concgrams, collocation,
> colligation, collostruction, semantic preference, semantic prosody,
> textual colligation, etc. be in terms of what is, quite obviously,
> going on here discursively, viz. a report of a sequence of speech
> turns? Not very much. If the same text is annotated for reporting
> units (direct [MDD] and free direct [MDF], in this case), analyses can
> be more revealing:
> 
>  
> 
> Did I tell you about her little one  ...  who had stomach pains? ...
> As she come back she said [MDDDad] ...  [MDFwhat?] [MDFHow long's our
> Mum going to be before she comes in?] [MDFAnother hour.] [MDFOh.] [MDF
> Why?] [MDFOh well I’ve got a bit of a stomach ache and I want to talk
> to her you know it's women problems.] 
> 
>  
> 
> 
> To be sure, implementing that kind of discourse annotation in not just
> a small text but a larger corpus is hard and time-consuming work, but
> worth it in that it can help us ask and answer questions that lie
> beyond the scope of purely linear-surface driven analyses. I'm
> therefore confident that carefully annotated language corpora still
> have a future lying before them, which is bright particularly for
> small ones with markup targeted to specific discourse phenomena that
> can, at present, not be implemented automatically.
>  
> Best
> Chris  
>  
>  
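As a rough illustration of why the annotated version in Chris's message is
easier to query than the raw one, here is a minimal Python sketch that pulls
the reporting units out of the bracketed markup.  The tag format (a
three-letter label fused to the first word inside square brackets) is read
off the quoted example only; the regular expression, the function names and
the two-entry tag inventory are my assumptions, not part of any published
annotation scheme or tool.

```python
import re
from collections import Counter

# Assumed tag inventory: only the two labels that appear in the quoted example.
TAG_NAMES = {"MDD": "direct", "MDF": "free direct"}

# A unit is "[", a tag, optional whitespace, the reported words, then "]".
# Nested brackets are not handled; the example contains none.
UNIT_RE = re.compile(r"\[(MDD|MDF)\s*([^\[\]]*)\]")


def extract_reporting_units(text):
    """Return (tag, words) pairs in the order they occur in the text."""
    # Collapse line breaks first so units wrapped across lines stay intact.
    flat = " ".join(text.split())
    return [(m.group(1), m.group(2).strip()) for m in UNIT_RE.finditer(flat)]


def strip_annotation(text):
    """Return the text with the markup removed, i.e. the raw surface form."""
    flat = " ".join(text.split())
    return UNIT_RE.sub(lambda m: m.group(2).strip(), flat)


sample = (
    "As she come back she said [MDDDad] ...  [MDFwhat?] [MDFHow long's our\n"
    "Mum going to be before she comes in?] [MDFAnother hour.] [MDFOh.] [MDF\n"
    "Why?] [MDFOh well I've got a bit of a stomach ache and I want to talk\n"
    "to her you know it's women problems.]"
)

units = extract_reporting_units(sample)
counts = Counter(tag for tag, _ in units)

for tag, words in units:
    print(f"{TAG_NAMES[tag]:>11}: {words}")
print(dict(counts))
```

With the markup in place, a question such as "how many free direct units
follow a direct one?" becomes a simple count over `units`; on the stripped
text returned by `strip_annotation` the same question has no surface cue to
search for.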


