[Corpora-List] corpus linguistics and the real world

Bryar Family bryar at vermontel.net
Fri Sep 14 20:49:46 UTC 2007


I'm delighted to see that there is actually a discussion touching on the
purpose of corpora within the field of linguistic analysis. 

Many of us who originally joined this listserv came at the subject from
working or academic backgrounds in content management. Certainly we wanted
to know what corpora might be available. More importantly, we wanted to know
about developer/researcher experiences when using corpora to assess the
subject, relevance, authenticity, urgency, or other meaning to be
discovered in a given body of content.

We wanted to know:
What unique words or statistically significant linguistic clusters
within a corpus could identify the notable subject areas in a body of
content? And which statistically significant clusters within a body of
content could trigger some sort of response?
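
One common way to approach the "unique words" question is a keyness score:
compare each word's frequency in the content of interest against a reference
corpus and rank words by the log-likelihood statistic. The sketch below is
only an illustration under that assumption, not a description of what any
list member actually built; the function names and the toy sentences are
invented for the example.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Two-term log-likelihood (G2) for one word.

    a, b: frequency of the word in the target text and the reference corpus
    c, d: total token counts of the target text and the reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the target
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def keywords(target_tokens, reference_tokens, top_n=5):
    """Rank words that are relatively more frequent in the target text."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    c, d = sum(t.values()), sum(r.values())
    scored = []
    for word, a in t.items():
        b = r.get(word, 0)
        if a / c > b / d:  # keep only words overused in the target
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: -pair[1])[:top_n]

# Toy data, invented for illustration only.
target = "the interest rate rose as the bank raised the rate".split()
reference = "the cat sat on the mat and the dog sat on the rug".split()
print(keywords(target, reference))
```

With realistic corpus sizes, words scoring above a chosen threshold would be
the candidates for "subject areas of note"; the threshold itself is a
judgment call.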

We originally looked at corpora to help us with linguistic analysis. We
hoped such an approach would allow us to construct statistical models to:

* Distinguish the ORIGIN of a piece of content, as Americans, Britons, South
Asians and South Africans use the language differently.
* Distinguish the AUTHORITY of a given piece of content. In many cases,
political cranks and non-authorities in a given subject area can be
distinguished from actual authorities by analysis of their language
patterns.
* Distinguish the SUBJECT of a given piece of content. When assessing the
subject matter of a given body of text, statistical models can be
constructed from corpora to enhance or validate the results of other content
analysis approaches. Such an approach can yield a fairly astounding level of
precision. 

Similar models can be constructed from corpora to distinguish high priority
content from other less "actionable" materials. 
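
As a rough illustration of the kind of statistical model described above,
here is a minimal multinomial naive Bayes sketch with add-one smoothing,
trained on labelled texts and used to guess the origin of a new one. Naive
Bayes is my choice for the example, not necessarily what we or anyone else
deployed, and the labels, function names and training sentences are all
invented.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """labelled_docs: iterable of (label, list_of_tokens) pairs."""
    word_counts = defaultdict(Counter)  # per-label word frequencies
    doc_counts = Counter()              # number of documents per label
    for label, tokens in labelled_docs:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(tokens, model):
    """Return the label with the highest log-probability for the tokens."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for w in tokens:
            if w in vocab:  # ignore words never seen in training
                # add-one (Laplace) smoothing
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration only.
model = train([
    ("US", "the color of the theater program".split()),
    ("UK", "the colour of the theatre programme".split()),
])
print(classify("a colour programme".split(), model))
```

The same shape works for subject or priority labels; only the training
corpus changes. Real use would need far more data and held-out evaluation
before trusting any precision figure.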

From reading many CORPORA-LIST contributions, it seems clear that corpora
are being used to distinguish and identify idiomatic expressions and jargon
in order to map them to their equivalents in other languages -- a terrific
help when contemplating the construction of automated translation systems in
the real world.

Content categorizers, taxonomists, and NLP developers of all sorts have
found corpora to be extremely useful tools for modeling a wide variety of
linguistic analysis applications.

Given that fact, I have been surprised by how few "how do I make this work"
conversations occur on the Corpora list. There are even fewer "is this a
reasonable approach to [some problem]?" discussions.

I enjoy reading such conversations when they occur. I hope there will be
more of them. 

Jack Bryar
-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Khurshid Ahmad
Sent: Friday, September 14, 2007 11:08 AM
To: Yorick Wilks
Cc: Antoinette Renouf; corpora at uib.no
Subject: Re: [Corpora-List] corpus linguistics

Dear Antoinette
Congratulations on assuming the presidency of ICAME. I am afraid I cannot
agree with you.  You publish a book and you can post it on your LIST; you
have a new journal and you can post that on the list.  Such an act is
sometimes called product placement - public service broadcasters in the UK
are not allowed to do this kind of stuff.

And now we have something interesting and, in my view, mainstream corpus
linguistics stuff.  It is very rare that people come out of their silos
and start talking about grammar, syntax and EVIDENCE in corpora.

The spirit of any internet-based activity is that people vote with their
keyboards, and if there are only two protagonists, then people tend to get
bored.   Rob is replying to Yorick and then John Sowa joins in: I have
learnt a lot, thanks to your List, and long may it continue.

Yours truly

Khurshid
> Antoinette
> The problem with the Freeman-Sowa debate is that it actually was
> about corpus linguistics because Rob F has a view of how to extract
> patterns and significance from corpora
> that got tangled up with VERY abstract concepts. But those who cannot
> stand those abstract words should remember where the debate started
> and what Rob's motivation was--something quite close to home--just
> look at his website! I think if people can't follow the hard bits
> they should be a bit tolerant, and/or just not read them and wait for
> a bit they like to come by--as opposed to trying to shut people down.
> Best
> Yorick
>
>
> On 14 Sep 2007, at 14:09, Antoinette Renouf wrote:
>
>> Dear List members,
>> Corpora-list was set up by ICAME (International Computer Archive of
>> Modern and Mediaeval English) to impart information and discuss matters
>> of relevance to the field. We have been receiving complaints that
>> Corpora-list has recently been dominated by 3 or 4 people talking about
>> topics having little to do with corpus linguistics. Far be it from me to
>> spoil the fun or stifle debate, but we are all anyway swamped with
>> information beyond our capacity to process, so please do restrict your
>> comments to issues of more central relevance to the particular list
>> community, and/or consider whether private emails might be a more
>> appropriate mechanism for continuing debate once you have found a
>> conversation partner.
>> All good wishes
>> Antoinette Renouf
>> Chair of ICAME
>>
>> --------------------------
>> Antoinette Renouf
>> Professor of English Language and Linguistics
>> School of English
>> University of Central England in Birmingham
>>
>>
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>
>
>


Khurshid Ahmad

Professor of Computer Science
Department of Computer Science
Trinity College,
DUBLIN-2
IRELAND
Phone 00 353 1 896 8429

Web Page: http://people.tcd.ie/kahmad






