[Corpora-List] Bootcamp: 'Quantitative Corpus Linguistics with R'--re Louw's endorsement

Geoffrey Williams geoffrey.williams at univ-ubs.fr
Tue Aug 19 08:20:52 UTC 2008


Dear Linas,

It is nice that you also know who you are. It helps.

You have turned to corpora because the mentalist parsers failed to work, which
is quite normal, as language is fortunately a wealth of means of expression that
users have to interpret. This means that meaning is negotiated through
contextual features. If you do not understand the context, the macro-context of
culture or the micro-context of situation, you will simply end up with useless
chunks. If you are interested in understanding meaning, then I recommend the
writings of Patrick Hanks, as he is both a corpus linguist and a lexicographer.
The former studies language for what it is, a means of human communication; the
latter has to 'tame' it to illustrate meaning potentials in a dictionary. 'Do
word meanings exist?' (Computers & the Humanities, 2000, 34:205-215) is a good
starting point. If you want to understand corpus linguistics, John Sinclair's
'Corpus, Concordance, Collocation' (OUP, 1991) must remain your first port of
call. It has never been bettered in the simple way it introduces the
complexities of real language.

On the other hand, you might not be interested in corpus linguistics and meaning
at all if you are primarily concerned with extracting patterns. This would mean
that you might not need a corpus, in our terms, at all, but a mass of data. That
will be fine for extracting patterns, but it will tell you nothing about
meaning, for which the interplay of lexis and syntax in context is essential.
This is not a problem, as many a linguist would be delighted to test and refine
the tool once it becomes available on SourceForge.
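
To make the contrast concrete, here is a minimal Python sketch of my own (an
illustration, not taken from the bootcamp or from any tool mentioned above; the
toy sentence and the node word 'bank' are invented for the example). Bare
pattern extraction just counts recurrent strings, whereas even a crude KWIC
concordance keeps the co-text a reader needs to negotiate which meaning is in
play.

from collections import Counter

# A toy 'mass of data'; in practice this would be millions of words.
text = ("the bank raised interest rates while the river bank "
        "flooded after heavy rain on the bank holiday")
tokens = text.split()

# 1. Pattern extraction alone: bigram frequencies, with no context kept.
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams.most_common(3))

# 2. A crude KWIC concordance for one node word: each hit keeps its left
#    and right co-text, which is what shows which sense of 'bank' is in play.
def kwic(tokens, node, span=3):
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            yield f"{left:>25} | {node} | {right}"

for line in kwic(tokens, "bank"):
    print(line)

The point is not the code but what is retained: the frequency list tells you
nothing about meaning, while the concordance lines keep the interplay of lexis
and syntax in context.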

To say you don't need a corpus may sound confusing, but the sense of the word in
corpus linguistics is restricted. The title of a paper by Gunnel Engwall, 'Not
Chance, but Choice' (in Atkins & Zampolli 1994, Computational Approaches to the
Lexicon, Oxford, Clarendon Press), nicely sums up the situation. John Sinclair
gave the most widely used definition in the 1996 EAGLES recommendations (still
available online at: http://www.ilc.cnr.it/EAGLES96/home.html). He updated this
for the Developing Linguistic Corpora book (Wynne (ed.), 2005, AHDS), which can
be consulted online at:
http://ahds.ac.uk/creating/guides/linguistic-corpora/chapter1.htm

This boils down to Firth's insistence on the Context of Situation. Without
knowledge of the environment that produces and uses a text, your 'corpus' is
worthless.

The Bank of English was built following reproducible criteria. However, unlike
the BNC, it is a monitor corpus, that is to say it has grown over time to reach
its present size. This does mean that it contains language from a long period
of time, as opposed to the fixed period of corpora like the BNC. A corpus like
UKWaC, used by the Sketch Engine (http://www.sketchengine.co.uk), is truly vast
at 2,035,621,120 words, but restrictions apply and it is only available online,
as far as I know.

Copyright is always a major headache; the excuse of academic research does not
work, as the data holders are after money and know that some language
engineering firms will cough up. I used to be told that 'old news is no news',
but many newspapers believe they are sitting on an electronic gold mine and are
too stupid and rapacious to understand that the information systems they want
will come not from tentacular giants but from researchers, who will then be
plagiarized to death by the tentacular giants. But that is another story.

Mark is right in saying that you should build your own. There are web-as-corpus
tools that would let you control some of the parameters. You could easily get
the vast amount of data you seek, but I would hesitate to call it a corpus.
Obviously what you get will be flawed, as all web corpora suffer from the
deficiency that not all data types are readily available on the web, unless your
crawler breaks passwords and gets into the sites of newspapers and journals.
Admittedly, all corpora are flawed; a corpus linguist just has to be aware of
the limitations of the corpus.
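
If you do go down the web-as-corpus route, here is a minimal sketch of the sort
of thing involved (again my own illustration, in Python with the standard
library only; the URL is a placeholder, not a real source). The provenance
record is the important part: without noting where and when each text came
from, the context of situation is lost and what you have is data rather than a
corpus.

import urllib.request
from datetime import datetime, timezone
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style elements."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def fetch_document(url):
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    # Keep provenance alongside the text; without it the collection
    # is just a mass of data, not a corpus.
    return {
        "url": url,
        "retrieved": datetime.now(timezone.utc).isoformat(),
        "text": " ".join(parser.chunks),
    }

doc = fetch_document("https://example.com/")  # placeholder URL
print(doc["url"], doc["retrieved"], doc["text"][:80])

Real web-as-corpus tools do far more than this (boilerplate removal,
deduplication, language identification), but the principle is the same.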

Hope this clears up some issues, even if it does not get you the data you
desire.

Best

Geoffrey 





Quoting Linas Vepstas <linasvepstas at gmail.com>:

> 2008/8/18 Geoffrey Williams <geoffrey.williams at univ-ubs.fr>:
> >
> > However, I also know that I am a corpus linguist, I do not do Natural
> > Language Processing, nor Cognitive Linguistics, not because I am not
> interested,
> > nor because I consider them irrelevant, but because I am primarily
> interested in
> > language in the corpus. To quote John Sinclair, I « Trust the Text ».
> Trusting
> > the Text is what corpus linguistics is all about. It is instructive to go
> back
> > to the writings of Firth who refuted all mentalism.
> 
> Hmm.  I'm a mathematician by  vocation, and an utter
> novice in linguistics.  For me, everything looks like math,
> and of course, I naturally refute everything that isn't math.
> 
> A large corpus of text allows me to model that text with
> mathematical expressions. At the most basic level, these
> are low, vile statistical measures. At the next level, these
> statistical measures allow me to discern patterns and
> structures. For me, I perceive these patterns as "lambda
> expressions", but the linguistics community calls them
> "parsers". Olde-fashioned parsers were hand-built by
> means of "mentalism", and judged by means of
> mentalistically-annotated reference corpora.
> 
> Newer work tries to discover these patterns automatically,
> de novo, from text, with minimal a-priori assumptions.
> The cognitive folks, such as John Sowa, are trying to find
> patterns within patterns -- with the eventual goal of extracting
> meaning, in the sense of building a generally intelligent
> machine that can listen and talk -- talk properly, or hep-cat
> prosody.  Access to a large body of text is essential to this
> effort.
> 
> > However, saying that Cognitive linguist accepts corpus linguistics does
> sound
> > rather pretentious. I am glad they accept our existence, but saying so
> sounds a
> > bit like the so-called Unification Church that likes to take bits from a
> > variety of religions whilst respecting the basic tenets of none.
> 
> It doesn't just "sound like", but rather "it is", and very
> intentionally so.  The only basic tenet is "make it work",
> in the sense of  "all is fair in love and war". Using a
> hodge-podge of techniques, borrowed and bastardized
> "crown jewels" from some discipline or another, that's
> what it's about.  The theft is not only from various branches
> of linguistics -- it's agnostic to where the ideas come from,
> and they come from everywhere.
> 
> --linas
> 


-- 
Geoffrey Williams, MSc, PhD
Professor of Language Sciences
Director of the Document Engineering Department
Université de Bretagne Sud - Faculté LSHS
4 rue Jean Zay
BP 92116
56321 LORIENT CEDEX
FRANCE

Tel: 33 (0) 2 97 87 29 20

--------------------------------------------------------------------------------
Université de Bretagne sud                               http://www.univ-ubs.fr/
