[Corpora-List] robust statistics

Justin Washtell lec3jrw at leeds.ac.uk
Sat Mar 27 01:32:53 UTC 2010


> The methods are used in diverse areas, I find. On the other hand, I was surprised not to find them in the NaturalLanguageProcessing task view of my favorite programming language, R.

Stefan Th. Gries may be the man to speak to about this.

> There is a difference between "Robust" and "Nonparametric". In general "Robust" is more apt to handle outliers.

As far as the nomenclature goes, I would say your observation is correct. However, methods falling into both classes are appropriate (and probably underused) in circumstances where one knows little about the nature or distribution of one's data. That, to me, is the interesting thing. There seem to be plenty of potential outlier-related "fixes" for inherently non-robust (including parametric) methods. These are not so interesting, and probably don't take us very far forward.
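
To make the outlier point concrete, here is a minimal sketch (in Python rather than R, using only the standard library). The numbers are invented toy values, not corpus data, and the mad() helper is my own illustrative name rather than a library function. It compares the mean and standard deviation with their usual robust counterparts, the median and the median absolute deviation, when a single value is replaced by a gross outlier.

```python
import statistics


def mad(values):
    """Median absolute deviation: a robust measure of spread (hypothetical helper)."""
    values = list(values)
    centre = statistics.median(values)
    return statistics.median(abs(v - centre) for v in values)


# Toy data (assumed for illustration only).
clean = [10, 11, 12, 13, 14]
# The same sample with its last value replaced by one gross outlier.
contaminated = clean[:-1] + [1000]

for name, data in (("clean", clean), ("contaminated", contaminated)):
    print(
        name,
        "mean=%.1f" % statistics.mean(data),
        "median=%.1f" % statistics.median(data),
        "stdev=%.1f" % statistics.stdev(data),
        "mad=%.1f" % mad(data),
    )
```

On this toy sample the mean and standard deviation move a long way once the outlier is introduced, while the median and the median absolute deviation stay where they were.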

>> corpus-driven NLP often deals with huge datasets
> Everybody claims this for his own industry. I found applications of robust statistics in data mining for an international bank (a huge number of customers).

OK, then let us say that in these various fields the datasets are all huge by virtue of the things that we try to do with them (i.e. as much as our computers will allow). Certain robust methods are markedly slower or more resource-intensive than their non-robust counterparts, so the problem remains. (I dare say international banks have more money than we do for offsetting such costs with extra computing resources, but I don't know.)

Justin
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


