[Corpora-List] Reducing n-gram output
J Washtell
lec3jrw at leeds.ac.uk
Wed Oct 29 10:49:30 UTC 2008
Quoting Yannick Versley <versley at sfs.uni-tuebingen.de>:
> ...since data-driven approaches like this can easily
> lead onto the slippery slope to cargo-cult science where people blindly use
> nontrivial tool X to solve a simple problem Y that actually has good
> solutions somewhere else (e.g., X=compression programs, Y=language modeling,
> where the speech community has been working for decades on n-gram- and
> syntax-based language models which also do a much better job at it).
I would whole-heartedly agree with Yannick that there is no sense in
applying any method blindly. It is blindness that often holds us back.
The sciences are full of very similar approaches to different tasks
(and very different approaches to similar tasks), developed entirely
independently in their respective domains, yet each remaining
more-or-less oblivious to the others and their respective users.
Sometimes these are all but equivalent: the generalized mean and
Minkowski distance; cosine distance and Pearson's correlation (the
short sketch below demonstrates both pairs). Sometimes they are just
surprisingly similar: efforts in language modelling and compression
being a case in point. Sometimes they are even in the same domain:
Nick Nolte and Gary Busey.
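To make the first pair of equivalences concrete, here is a minimal
sketch (in Python with NumPy; the data and variable names are my own,
purely illustrative). Pearson's correlation is simply the cosine
similarity of mean-centred vectors, and Minkowski distance is a
rescaled generalized (power) mean of the coordinate-wise absolute
differences:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 9.0])
y = np.array([1.0, 3.0, 7.0, 8.0])

# Pearson's r is the cosine similarity of the mean-centred vectors.
xc, yc = x - x.mean(), y - y.mean()
cosine_centred = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
pearson_r = np.corrcoef(x, y)[0, 1]
assert np.isclose(cosine_centred, pearson_r)

# Minkowski distance of order p is n**(1/p) times the generalized
# (power) mean of order p of the absolute coordinate differences.
p, n = 3, len(x)
minkowski = (np.abs(x - y) ** p).sum() ** (1 / p)
power_mean = (np.abs(x - y) ** p).mean() ** (1 / p)
assert np.isclose(minkowski, n ** (1 / p) * power_mean)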
I once viewed the move towards multidisciplinary research as some kind
of misplaced scientific political correctness. However, my present
explorations include the ("surprisingly obvious once you consider it")
application of a very simple sixty-year-old biogeographical method to
language modelling. I am now more inclined to think of
multidisciplinarity as the messiah of enlightenment (though I dare say
I am due for a revision). My take on this advice therefore might be
something like this: take X and Y, and try to ascertain their
individual advantages, limitations, and similarities with respect to
the problem. If neither X nor Y is ideal, consider whether they
suggest a third, better approach, Z. Check whether anything like Z has
already been discussed anywhere in the (wider) literature. If not, try
it out; if it has, repeat the process with X, Y, and now Z. Due to
inconsistent nomenclature,
and the general isolation of the disciplines in the literature, it
makes for very heavy-going research, but I believe that the dividends
are [more than] proportionately larger.
I would agree with Yannick that the slippery slope of which he warns
is a real danger (one can find one or two people whizzing past on it
in the literature). But I might suggest that it would be even less
flattering to the science if we were to take a diametrically opposed
stance. It would be remiss to imply that compression algorithms, for
example, are only deserving of limited investigation in light of NLP's
successes without them (and I would resolutely contend that they are
non-trivial in comparison with language modelling). Like many
established research areas, compression provides a set of broadly
applicable tools and knowledge which are readily accessible for
exploration and ripe for tearing-apart and re-synthesizing. As long as
there is a discerning intellect at the helm, this can only be a good
thing.
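To give a toy taste of that compression/language-modelling kinship
(a sketch of my own devising, using nothing beyond Python's standard
zlib; the corpora and names are invented), the extra bits a compressor
needs to encode a text appended to a corpus serve as a rough proxy for
the text's cost under the model the compressor has implicitly built
from that corpus:

import zlib

def extra_bits(corpus: bytes, text: bytes) -> int:
    # Additional bits needed once `text` is appended to `corpus`:
    # a crude stand-in for -log P(text | corpus).
    with_text = len(zlib.compress(corpus + text, 9))
    alone = len(zlib.compress(corpus, 9))
    return 8 * (with_text - alone)

english = b"the cat sat on the mat and the dog lay by the door " * 40
legalese = b"the party of the first part shall indemnify the rest " * 40

snippet = b"the dog sat by the door"

# The snippet should cost fewer extra bits under the more similar
# corpus; the compressor acts as a crude conditional language model.
print("vs english :", extra_bits(english, snippet))
print("vs legalese:", extra_bits(legalese, snippet))

This is essentially the intuition behind compression-based similarity
measures such as the normalized compression distance: crude next to a
purpose-built n-gram model, certainly, but a neat example of one
field's tool being torn apart and re-synthesized for another's task.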
Justin Washtell
University of Leeds
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora