[Corpora-List] Reducing n-gram output

J Washtell lec3jrw at leeds.ac.uk
Wed Oct 29 10:49:30 UTC 2008


Quoting Yannick Versley <versley at sfs.uni-tuebingen.de>:

> ...since data-driven approaches like this can easily
> lead onto the slippery slope to cargo-cult science where people blindly use
> nontrivial tool X to achieve a simple problem Y that actually has good
> solutions somewhere else (e.g, X=compression programs, Y=language modeling,
> where the speech community has been working for decades on n-gram- and
> syntax-based language models which also do a much better job at it).

I would wholeheartedly agree with Yannick that there is no sense in  
applying any method blindly. It is blindness that often holds us back.  
The sciences are full of very similar approaches to different tasks  
(and very different approaches to similar tasks), developed entirely  
independently in their respective domains, yet each remaining  
more-or-less oblivious to the others and to their respective users.  
Sometimes these are all but equivalent: the generalized mean and  
Minkowski distance; cosine distance and Pearson's correlation.  
Sometimes they  
are just surprisingly similar: efforts in language modelling and  
compression being a case in point. Sometimes they are even in the same  
domain: Nick Nolte and Gary Busey.
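
To make the first of those equivalences concrete, here is a minimal  
sketch (in Python, with made-up vectors purely for illustration)  
showing that Pearson's correlation is nothing more than the cosine  
similarity of mean-centred vectors:

    import numpy as np

    def cosine_similarity(a, b):
        # cosine of the angle between two vectors
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def pearson_via_cosine(x, y):
        # Pearson's r is the cosine similarity of the mean-centred vectors
        return cosine_similarity(x - x.mean(), y - y.mean())

    # toy data, purely illustrative
    x = np.array([1.0, 3.0, 2.0, 5.0])
    y = np.array([2.0, 4.0, 1.0, 6.0])

    print(pearson_via_cosine(x, y))   # ~0.902
    print(np.corrcoef(x, y)[0, 1])    # same value from the library routine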

I once viewed the move towards multidisciplinary research as some kind  
of misplaced scientific political correctness. However, my present  
explorations include the ("surprisingly obvious once you consider it")  
application of a very simple sixty-year-old biogeographical method to  
language modelling. I am now more inclined to think of  
multidisciplinarity as the messiah of enlightenment (though I dare say  
I am due for a revision). My take on this advice therefore might be  
something like: take X and Y. Try to ascertain their individual  
advantages, limitations and similarities with respect to the problem.  
If neither X nor Y is ideal, consider whether they suggest a third,  
better approach, Z. Check whether anything like Z has already been  
discussed anywhere in the (wider) literature. If not, try it out. If  
it has, repeat the process with X, Y and now Z. Due to inconsistent  
nomenclature, and the general isolation of the disciplines in the  
literature, it makes for very heavy-going research, but I believe that  
the dividends are [more than] proportionately larger.

I would agree with Yannick that the slippery slope of which he warns  
is a real danger (one can find one or two people whizzing past on it  
in the literature). But I might suggest that it would be even less  
flattering to the science if we were to take a diametrically opposed  
stance. It would be remiss to imply that compression algorithms, for  
example, deserve only limited investigation in light of NLP's  
successes without them (and I would resolutely contend that they are  
non-trivial in comparison to language modelling). Like many  
established research areas, compression provides a set of broadly  
applicable tools and knowledge which are readily accessible for  
exploration and ripe for tearing apart and re-synthesizing. As long as  
there is a discerning intellect at the helm, this can only be a good  
thing.

Justin Washtell
University of Leeds



