[Corpora] [Corpora-List] Calculating statistical significant

Stefan Evert stefanML at collocations.de
Mon Nov 10 15:14:49 UTC 2014


On 10 Nov 2014, at 15:06, Angus Grieve-Smith <grvsmth at panix.com> wrote:

>> I'm struggling a bit with calculating the statistical significance of the difference between the output of two systems. Suppose I've got the following two results from two independent systems (performing a sequence labelling task):
>> 
>> System 01: 
>> precision:  81.57%; recall:  57.12%; FB1:  67.19%
>> 
>> System 02: 
>> precision:  84.07%; recall:  62.47%; FB1:  71.68%

>> Could someone point me to a way of calculating the statistical significance of the difference between them?

That depends very much on what your task looks like.  It might be easiest – and is often done in computational linguistics – to carry out a ten-fold cross-validation and apply a paired t-test to the quality measure of your choice (e.g. F-score).  To be precise, sample A would be the F-scores achieved by Sys 1 across the ten folds, and sample B the F-scores achieved by Sys 2 on _exactly the same_ folds (and in _exactly the same order_).

Despite valid concerns such as those raised by Angus (and purely mathematical issues such as whether the assumption of a Gaussian distribution of the individual F-scores is justified), this is a reasonable procedure to determine whether there is a significant difference between the two systems.
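
As a minimal sketch of the paired t-test on cross-validation folds, the following Python code computes the t statistic from per-fold F-scores using only the standard library. The function name and the fold scores are hypothetical and made up purely for illustration; they are not results from the systems discussed in this thread. The critical value 2.262 is the two-sided 5% threshold of Student's t distribution with 9 degrees of freedom, so it only applies to exactly ten folds.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, df) for a paired t-test on two equally long score lists.

    scores_a[i] and scores_b[i] must come from the same fold.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same folds")
    # Per-fold differences: this pairing is what makes the test "paired".
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-fold F-scores (invented for illustration only).
sys1_f = [66.1, 67.8, 68.0, 66.5, 67.2, 68.4, 66.9, 67.5, 66.3, 67.1]
sys2_f = [70.9, 71.5, 72.8, 70.2, 71.9, 72.1, 71.0, 72.4, 70.6, 71.4]

t, df = paired_t_statistic(sys1_f, sys2_f)

# Two-sided 5% critical value of Student's t with 9 degrees of freedom.
T_CRIT_DF9 = 2.262
significant = abs(t) > T_CRIT_DF9
print(f"t = {t:.2f}, df = {df}, significant at 5%: {significant}")
```

The sketch fails with a division by zero if all per-fold differences are identical. If SciPy is available, `scipy.stats.ttest_rel` performs the same test and also returns a p-value.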

With a single train/test split, you can only test significance under special circumstances that allow you to treat the test set as a random sample (from the population of all texts the systems are expected to process).  This is indeed the case for a tagging task, which your description as "sequence labelling" suggests.  However, in that case you would use accuracy as an evaluation criterion; precision/recall are only defined for a single label (unless you took a weighted average across labels).

For a tagging task evaluated in terms of accuracy, you can apply McNemar's test to the output of the two systems.  The sample consists of all tokens in the test set, and for each token there are two observed values: (i) whether Sys 1 is correct on this token and (ii) whether Sys 2 is correct on this token.  The test itself only depends on the tokens where the two systems disagree, i.e. where exactly one of them is correct.
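
A minimal sketch of McNemar's test in Python, assuming you already have one correct/incorrect flag per test token for each system. It uses the exact (binomial) variant of the test rather than the chi-squared approximation; the function name and the toy data are hypothetical and only serve as illustration.

```python
import math


def mcnemar_exact(correct_1, correct_2):
    """Two-sided exact McNemar test on paired per-token correctness flags.

    Returns (b, c, p): b = tokens only Sys 1 gets right,
    c = tokens only Sys 2 gets right, p = two-sided p-value.
    """
    if len(correct_1) != len(correct_2):
        raise ValueError("both systems must be evaluated on the same tokens")
    b = sum(1 for x, y in zip(correct_1, correct_2) if x and not y)
    c = sum(1 for x, y in zip(correct_1, correct_2) if y and not x)
    n = b + c  # only discordant tokens carry information
    if n == 0:
        return b, c, 1.0
    # Under H0 each discordant token favours either system with prob. 0.5,
    # so min(b, c) follows a Binomial(n, 0.5) distribution.
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data: 20 tokens both right, 1 only Sys 1 right,
# 9 only Sys 2 right, 5 both wrong.
sys1_ok = [True] * 20 + [True] * 1 + [False] * 9 + [False] * 5
sys2_ok = [True] * 20 + [False] * 1 + [True] * 9 + [False] * 5

b, c, p = mcnemar_exact(sys1_ok, sys2_ok)
print(f"b = {b}, c = {c}, p = {p:.4f}")
```

For large numbers of discordant tokens the chi-squared approximation gives nearly the same answer and is cheaper to compute; the exact version is used here because it needs no statistical library.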

Hope this helps,
Stefan


