[Corpora-List] Quantitive Corpus Linguistics

Dom Widdows widdows at google.com
Fri Aug 22 04:03:05 UTC 2008


Dear All,

I certainly agree that studying the relevant philosophy has been an
important part of many (if not most) successful scientific endeavours,
though it can also mislead if applied in the wrong contexts (the same
can be said of mathematics).

Peter Helias is not someone I'd come across before, and he's not the
easiest to find out about online - I have started a stub Wikipedia
article (at http://en.wikipedia.org/wiki/Peter_Helias), but his
contribution to the theory of substance and accident is still unclear
to me. Christian scholars often trace this through Aquinas (important
in the theory of transubstantiation - body and blood of Christ are
substance, bread and wine are accidents), and perhaps through
Augustine to Aristotle.
(I know most of this through dinner conversations with my father, so I
don't really know the references well). A more pluralistic story might
be to trace the influence of Aristotle through Averroes and al-Farabi,
who certainly wrote some fascinating things on the way words would
become reused, formally or informally, to refer to many different but
related concepts - perhaps anticipating generative lexicon theory.

I'm surprised to hear the notion that "collocation is everything"
coming through a voice in this tradition; I haven't yet found such
arch-empiricist quotes from Helias himself (but need to find more
corpus data here!). I think of this "data is everything, there is no
need for a mind" attitude as associated with David Hume and the Scottish
Enlightenment, sometimes described as a kind of reaction to Descartes'
"reason is everything" (or at least "I am a thing that thinks", as
contrasted with a thing that experiences and learns). Leibniz and Kant
are both supposed to have tried to find different middle-grounds
between these extremes. (Here I could probably find good quotes,
but it's getting late ... write to me if you want me to try and back
this up with sources.)

There are a couple of themes behind this ramble, honest ...

The first is that every branch and period of science struggles over
this learning vs. reasoning territory, and we are very much in the
midst of this struggle in computational linguistics. If we can learn
anything from the story of other sciences (even mechanics), corralling
one side or the other into putting their tools away never leads to the
full story.

Secondly, there is an Aristotelian theme throughout - Aristotle's
influence isn't opposed to Plato's; it emphasizes a framework for
learning in a world that still has a lot of underlying form to it.

On 8/21/08, Mike Maxwell <maxwell at umiacs.umd.edu> wrote:
> J Washtell wrote:
>  > I find it a bit optimistic (given my own intuitions of course. But I
>  > should say that I do not find it beyond the realms of possibility)
>  > that the evidence necessary to solve all of our linguistic and
>  > (unavoidably?!) cognitive-linguistic ponderings is to be found in the
>  > text (not in the brain, say, or in the extra-corporal context).

Hence I agree with this reservation - trying to find everything in the
text alone would be like Hume trying to find everything in the data
alone without any contribution from reasoning. (Please come out and
correct me, Hume scholars, if I'm out of line here.)

> Not to mention that if you limit yourself to studying things that
>  require large corpora, you rule out studying perhaps 99% of the
>  languages in the world.

This I'd disagree with - you can learn things about the structure of
language in general by considering available large corpora, and use
this knowledge to try and enhance what you can do with small datasets.
Linear B was a comparatively small corpus, but using knowledge of
classical Greek, it could be deciphered. Perhaps this is a canned
example since the languages are in a sense "the same" - but even for
completely unrelated languages, a good linguist uses information
learned about familiar languages to build expertise on language in
general, and can then apply this expertise and technique to fresh
languages with small amounts of data. It's only if corpus linguistics
explicitly rules out generalization that a strictly empiricist
approach leads to no cross-lingual extrapolation.

Best wishes,
Dominic
