[Corpora] [Corpora-List] Calculating statistical significance

K. Taraka Rama taraka at fripost.org
Tue Nov 11 14:04:21 UTC 2014


Hi. How about the Wilcoxon signed-rank test? It takes into account both the 
ranks of the differences between the two hypotheses and the sign of each 
difference. It is a test that subsumes the sign test and the paired t-test for 
sample sizes >= 45.  Grzegorz points out that the exact binomial test is very 
simple to compute, even in simple spreadsheet software.
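
For concreteness, here is a minimal sketch of the Wilcoxon signed-rank test on 
paired per-fold scores, using SciPy; the fold accuracies below are invented, 
purely for illustration:

    from scipy.stats import wilcoxon

    # hypothetical per-fold accuracies of two systems on the same ten folds
    sys1 = [0.812, 0.798, 0.825, 0.801, 0.817, 0.809, 0.793, 0.820, 0.805, 0.815]
    sys2 = [0.806, 0.791, 0.819, 0.799, 0.810, 0.804, 0.790, 0.812, 0.801, 0.808]

    # two-sided test on the paired differences sys1[i] - sys2[i]
    stat, p = wilcoxon(sys1, sys2)
    print("W = %.1f, p = %.4f" % (stat, p))

With only a handful of folds the test has little power, so a non-significant 
result should not be over-interpreted.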

Taraka Rama.

On 2014-11-11 12:00, corpora-request at uib.no wrote:
> Today's Topics:
>
>     1. Re:  Calculating statistical significance (Stefan Evert)
>     2. Re:  Calculating statistical significance (Noura Farra)
>     3. Re:  Calculating statistical significance (Stefan Evert)
>     4. Re:  Calculating statistical significance (Grzegorz Chrupała)
>     5.  RELEASE OF sar-graph 2.0 (Feiyu Xu)
>     6. Re:  Calculating statistical significance (Myroslava Dzikovska)
>     7. Re:  Annotation tool to map sentences to their logical form
>        (Graham Katz)
>     8.  Phrase similarity (Alexander Osherenko)
>     9. Re:  Annotation tool to map sentences to their logical form
>        (Kilian Evang)
>    10. Re:  Phrase similarity (Matthias Hartung)
>    11. Re:  Phrase similarity (Taher Pilehvar)
>    12. Re:  Phrase similarity (Yannick Versley)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 10 Nov 2014 19:01:22 +0100
> From: Stefan Evert <stefanML at collocations.de>
> Subject: Re: [Corpora-List] Calculating statistical significance
> To: Corpora Mailing List <corpora at uib.no>
>
>> This seems overly conservative to me. Suppose there is a lot of variance across the folds, but system 1 does exactly 0.5% better than system 2 on every fold. It seems like what you want to do is a t-test on the difference in performance.
> That's the _paired_ t-test I suggested.
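>
> For concreteness, a minimal sketch of that paired t-test on per-fold scores with SciPy (the fold accuracies are invented, purely for illustration):
>
>     from scipy.stats import ttest_rel
>
>     # hypothetical accuracy of each system on the same five folds
>     sys1 = [0.671, 0.664, 0.680, 0.659, 0.675]
>     sys2 = [0.665, 0.660, 0.674, 0.655, 0.669]
>
>     # paired t-test on the per-fold differences sys1[i] - sys2[i]
>     t, p = ttest_rel(sys1, sys2)
>     print("t = %.2f, p = %.4f" % (t, p))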
>
>> That said, there are definitely machine learning / stats papers that argue against computing variance across cross-validation folds. I can't find the exact reference I'm thinking of, but the related work section of Demsar (JMLR 2006) seems like a useful starting point.
>> http://machinelearning.wustl.edu/mlpapers/paper_files/Demsar06.pdf
> Thanks for the interesting reference.  I wonder in what sense variance is underestimated by the cross-validation procedure (except with respect to the dependency of the results on the training data, but that is something that is usually ignored in machine learning).
>
>> One could also apply a sign test in this case, which I personally find easier to understand. The trouble is that you may not have access to Sys 2's outputs on each instance (suppose you only know its reported accuracy); in this case, you can't apply the sign test or McNemar's test.
> Sign tests are intended for a situation where you have numerical (or at least ordinal) measurements.  If you enforce this by coding e.g. a correct tag as 1 and a wrong tag as 0, then the sign test should give you exactly the same result as McNemar's test.
>
> Best,
> Stefan
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 10 Nov 2014 12:28:54 -0500
> From: Noura Farra <noura at cs.columbia.edu>
> Subject: Re: [Corpora-List] Calculating statistical significance
> To: Zeljko Agic <zeljko.agic at gmail.com>
> Cc: corpora at uib.no
>
> Hi,
>
> I have used McNemar's test for calculating significance in the train set/
> test set scenario, evaluating for accuracy and weighted f-measure across
> labels. Here's a good calculator:
>
> http://vassarstats.net/propcorr.html
>
> You need to input 4 numbers:
> a: # times where both system 1 & system 2 are correct
> b: # times where system 1 is correct & system 2 is incorrect
> c: # times where system 2 is correct & system 1 is incorrect
> d: # times where both are incorrect
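>
> If you prefer to compute it offline, here is a minimal sketch with statsmodels; the counts a, b, c, d below are made up:
>
>     from statsmodels.stats.contingency_tables import mcnemar
>
>     # 2x2 table of paired outcomes: rows = system 1 correct/incorrect,
>     # columns = system 2 correct/incorrect
>     table = [[1530, 102],   # a, b
>              [  64, 304]]   # c, d
>
>     # exact=True applies the binomial distribution to the discordant counts b and c
>     result = mcnemar(table, exact=True)
>     print(result.statistic, result.pvalue)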
>
> Cheers,
> Noura
>
> On Mon, Nov 10, 2014 at 10:36 AM, Zeljko Agic <zeljko.agic at gmail.com> wrote:
>
>> On 2014-11-10 12:28, Jack Alan wrote:
>>
>>> <cut />
>>> Could someone point me to a way of calculating the statistical
>>> significance between them?
>>>
>> Hi,
>>
>> take a look at these two papers, and give the bootstrap test and
>> approximate randomization a try as well. :-)
>>
>> http://www.aclweb.org/anthology/D12-1091
>> http://www.aclweb.org/anthology/W/W14/W14-1601.pdf
>>
>> Bests,
>> Z.
>>
>>
>
> ------------------------------
>
> Message: 3
> Date: Mon, 10 Nov 2014 19:06:54 +0100
> From: Stefan Evert <stefanML at collocations.de>
> Subject: Re: [Corpora-List] Calculating statistical significance
> To: Corpora Mailing List <corpora at uib.no>
>
>> If you have the outputs of both systems on each instance, you may try bootstrap resampling, as done here: http://genomebiology.com/2008/9/S2/S2
> Indeed, if you have the full system outputs and if you believe that your test data form a random sample from the population of interest, you can apply bootstrap resampling in order to obtain confidence intervals for non-trivial evaluation criteria such as P, R and F-score.
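>
> A minimal sketch of such a paired bootstrap, here for the difference in token-level micro-F-score as a stand-in for whatever metric you actually use; gold, pred1 and pred2 are assumed to be per-instance gold labels and the two systems' outputs:
>
>     import numpy as np
>     from sklearn.metrics import f1_score
>
>     rng = np.random.default_rng(0)
>
>     def bootstrap_diff_ci(gold, pred1, pred2, n_boot=1000, alpha=0.05):
>         gold, pred1, pred2 = map(np.asarray, (gold, pred1, pred2))
>         n = len(gold)
>         diffs = []
>         for _ in range(n_boot):
>             idx = rng.integers(0, n, n)   # resample instances with replacement
>             d = (f1_score(gold[idx], pred1[idx], average="micro")
>                  - f1_score(gold[idx], pred2[idx], average="micro"))
>             diffs.append(d)
>         # percentile confidence interval for the F-score difference
>         return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])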
>
> If you just want to know whether there is a significant difference between the two systems, you can simply apply McNemar's test.  The bootstrap resampling, if implemented correctly, will give you the same answer at much greater computational cost.
>
> If you're satisfied with accuracy as an evaluation criterion, you can also compute (binomial) confidence intervals for the two systems directly without bootstrapping.  A confidence interval for the difference in accuracy can be derived from McNemar's test; I've implemented something along those lines for my PhD thesis long, long ago.
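>
> For accuracy, per-system binomial confidence intervals are straightforward with statsmodels (the counts are invented); note that a confidence interval for the difference still needs the paired counts from McNemar's table:
>
>     from statsmodels.stats.proportion import proportion_confint
>
>     # e.g. 1634 of 2000 test items tagged correctly
>     low, high = proportion_confint(count=1634, nobs=2000, alpha=0.05, method="wilson")
>     print("accuracy = %.3f, 95%% CI [%.3f, %.3f]" % (1634 / 2000, low, high))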
>
> Best,
> Stefan
>
> ------------------------------
>
> Message: 4
> Date: Mon, 10 Nov 2014 19:23:37 +0100
> From: Grzegorz Chrupała <pitekus at gmail.com>
> Subject: Re: [Corpora-List] Calculating statistical significance
> Cc: Corpora Mailing List <corpora at uib.no>
>
> It seems that the only reason to use McNemar's test these days is
> tradition. The exact binomial test can be computed in a fraction of a
> second with current hardware and software.
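>
> For instance, with SciPy, applied to the discordant counts b (only system 1 correct) and c (only system 2 correct); the numbers are invented:
>
>     from scipy.stats import binomtest   # SciPy >= 1.7; older versions have scipy.stats.binom_test
>
>     b, c = 102, 64
>     result = binomtest(b, n=b + c, p=0.5)
>     print(result.pvalue)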
> --
>
> Grzegorz Chrupała
> Communication and Information Sciences
> Tilburg University
> PO Box 90153
> 5000 LE Tilburg
> The Netherlands
>
> Web: grzegorz.chrupala.me
> Phone: +31 13 466 3106
> Email: g.chrupala at uvt.nl
>
>
> On Mon, Nov 10, 2014 at 7:06 PM, Stefan Evert <stefanML at collocations.de> wrote:
>>> If you have the outputs of both systems on each instance, you may try bootstrap resampling, as done here: http://genomebiology.com/2008/9/S2/S2
>> Indeed, if you have the full system outputs and if you believe that your test data form a random sample from the population of interest, you can apply bootstrap resampling in order to obtain confidence intervals for non-trivial evaluation criteria such as P, R and F-score.
>>
>> If you just want to know whether there is a significant difference between the two systems, you can simply apply McNemar's test.  The bootstrap resampling, if implemented correctly, will give you the same answer at much greater computational cost.
>>
>> If you're satisfied with accuracy as an evaluation criterion, you can also compute (binomial) confidence intervals for the two systems directly without bootstrapping.  A confidence interval for the difference in accuracy can be derived from McNemar's test; I've implemented something along those lines for my PhD thesis long, long ago.
>>
>> Best,
>> Stefan
>>
>>
>>
>
> ------------------------------
>
> Message: 5
> Date: Mon, 10 Nov 2014 19:31:05 +0100
> From: Feiyu Xu <feiyu at dfki.de>
> Subject: [Corpora-List] RELEASE OF sar-graph 2.0
> To: corpora <corpora at uib.no>, elsnet-list at elsnet.org,
> 	sigsem at aclweb.org,	siglex-board at googlegroups.com, nodali at helsinki.fi,
> 	meta-net-all at meta-net.eu, meta-technology-council at meta-net.eu,
> 	meta-net-lwwg at meta-net.eu, t4me-all at meta-net.eu, clef at dei.unipd.it,
> 	lt-all at dfki.de, UK SB <uk-sb at dfki.de>, mpi-all at mpi-sb.mpg.de,
> 	fbmit at cs.uni-sb.de, ontolog-forum at ontolog.cim3.net,
> 	semantic-web at w3.org
>
> Apologies for cross-posting
> Please forward this message to colleagues in the areas of interest
>
>
> =============================
>     RESOURCE ANNOUNCEMENT
>     RELEASE OF sar-graph 2.0
>     http://sargraph.dfki.de
> =============================
>
>
> ============================================================================
> Changes at a glance:
> ----------------------------------------------------------------------------
> - integration of WSD results into relation extraction patterns
> - annotation of word senses of content words in the patterns and sar-graphs
> - new representation of vertices
> - new Java API
> - WSD results and relevancy assessments of synsets wrt. semantic relations available as separate download
> ============================================================================
>
>
> The resource is available at http://sargraph.dfki.de.
>
> A sar-graph is a graph containing linguistic knowledge at the syntactic and lexical-semantic levels for a given language and target relation. A sar-graph for a target relation assembles many linguistic patterns that are used in texts to mention this relation.  The term "semantically associated relations" graph was chosen since the patterns may either express the target relation directly or express a semantically associated relation. The nodes in a sar-graph contain information from various levels of abstraction, including semantic arguments of a target relation, content words, word senses, etc., all of which are needed to express and recognize an instance of the target relation. The nodes are connected by two kinds of edges, syntactic dependency-structure relations and lexical-semantic relations; the edges are labelled accordingly, either with dependency-structure tags provided by a parser or with lexical-semantic relation tags. A definition can be found in (Uszkoreit and Xu, 2013). The individual patterns are assembled in one graph per target relation to ease the combination of mentions gathered across sentences, but all patterns could also be employed individually.
>
> For a more detailed description see:
>   From Strings to Things -- SAR-Graphs: A New Type of Resource for Connecting Knowledge and Language
>   Hans Uszkoreit and Feiyu Xu (2013)
>   In Proceedings of 1st International Workshop on NLP and DBpedia (NLP&DBPedia), volume 1064, Sydney, NSW, Australia, CEUR Workshop Proceedings, 10/2013
>
> The current sar-graph version 2.0 contains syntactic dependency relations between content words, word senses, and semantic arguments; future versions will also integrate lexical semantic relations between word senses.
>
> In the current release, the patterns have been automatically learned by the web-scale version (Krause et al., 2012) of the relation extraction system DARE (Xu et al., 2007) from dependency structures obtained by parsing sentential mentions of the target relation. The vertices in a sar-graph are either semantic arguments of a target relation or content words (more precisely, their word senses) needed to express/recognize an instance of the target relation. Several dependency parsers have been employed, but the current set of sar-graphs is built from parsing results of the MALT parser. In contrast to the first release in mid-2014, this release includes results from word-sense disambiguation on the source sentences of patterns and sar-graphs. This WSD information, together with target-relation relevancy assessments of BabelNet synsets, is made available as a separate download. Also, a new, more flexible API has been implemented, in particular with respect to future extensions of the sar-graph data structure. This includes a simplified XML format as well as updated GraphML export functionality.
>
> Applications of sar-graphs are information extraction, question answering and summarisation. The resource might also be useful for research on paraphrases, textual entailment and syntactic variation within a language.
>
> ============================================================================
> Release 2.0 has the following properties:
>
> * Language: English
> * Number of target relations: 25
> * Arity of relations: n-ary relations (2 ≤ n ≤ 5)
> * Domains of relations: biographic information, corporations, awards
> * Format of patterns: DARE patterns in lemon format and a specific XML schema (DTD provided)
> * Format of sar-graphs: specific XML schema (DTD provided)
> * API supports: reading and storing patterns and sar-graphs, accessing vertex
>   and edge information of DARE patterns and sar-graphs, pattern visualization
>
> Download: http://sargraph.dfki.de/download.html
> Statistics: http://sargraph.dfki.de/statistics.html
> More references: http://sargraph.dfki.de/publications.html
> Feedback via email: sargraph at dfki.de
> ============================================================================
>
>
> Sar-graphs were conceived and defined at DFKI LT-Lab Berlin and then realized in a collaboration between DFKI LT-Lab and the BabelNet group at Sapienza University of Rome.
>
> The development of sar-graphs is partially supported by
> * the German Federal Ministry of Education and Research (BMBF) through the project Deependance (contract 01IW11003)
> * the project LUcKY, a Google Focused Research Award in the area of Natural Language Understanding.
>
>
> Enjoy!
>
>
> Feiyu Xu
>
>
>
>
> ----------------------------------
> Dr. Feiyu Xu
>
> Senior Researcher
> DFKI Research Fellow
>
>
> DFKI  Projektbüro Berlin
> Alt Moabit 91c
> D-10559 Berlin
> Germany
> Phone +49-30-23895-1812
> Sek      +49-30-23895-1800
> Fax      +49-30-23895-1810
>
>
> E-mail: feiyu at dfki.de
>
> homepage: http://www.dfki.de/~feiyu
>
>
> ------------------------------------------------------------
>
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
>
> ------------------------------------------------------------
>
>
> ------------------------------
>
> Message: 6
> Date: Tue, 11 Nov 2014 00:02:07 +0000
> From: Myroslava Dzikovska <mdzikovs at inf.ed.ac.uk>
> Subject: Re: [Corpora-List] Calculating statistical significance
> To: Jack Alan <j.o.alan2012 at gmail.com>, corpora at uib.no
>
> No one seems to have suggested this paper yet:
>
> Alexander Yeh. 2000. More accurate tests for the statistical
> significance of result differences. COLING 2000 Volume 2: The 18th
> International Conference on Computational Linguistics,
> http://www.aclweb.org/anthology/C00-2137
>
> It gives a good explanation of the problems with the t-test as applied to
> precision in particular, and suggests a replacement, explained in enough
> detail to be implemented.
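>
> A minimal sketch of an approximate randomization test for an F-score difference, of the kind discussed in that literature; it assumes per-instance gold labels and both systems' outputs, and uses token-level micro-F1 as a stand-in for your actual metric:
>
>     import numpy as np
>     from sklearn.metrics import f1_score
>
>     rng = np.random.default_rng(0)
>
>     def approx_randomization_test(gold, pred1, pred2, n_iter=10000):
>         gold, pred1, pred2 = map(np.asarray, (gold, pred1, pred2))
>         def diff(a, b):
>             return abs(f1_score(gold, a, average="micro")
>                        - f1_score(gold, b, average="micro"))
>         observed = diff(pred1, pred2)
>         count = 0
>         for _ in range(n_iter):
>             # randomly swap the two systems' outputs on each instance
>             swap = rng.random(len(gold)) < 0.5
>             if diff(np.where(swap, pred2, pred1), np.where(swap, pred1, pred2)) >= observed:
>                 count += 1
>         return (count + 1) / (n_iter + 1)   # two-sided p-value, add-one smoothed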
>
> I also found the following presentation helpful
> http://masanjin.net/sigtest.pdf
>
>
> Myrosia
> On 10/11/14 11:28, Jack Alan wrote:
>> Hi folks,
>>
>> I'm struggling a bit with calculating the statistical significance between the
>> outputs of two systems. Suppose I've got the following two results from
>> two independent systems (performing a sequence labelling task):
>>
>> System 01:
>> precision:  81.57%; recall:  57.12%; FB1:  67.19%
>>
>> System 02:
>> precision:  84.07%; recall:  62.47%; FB1:  71.68%
>>
>>
>> Could someone point me to a way of calculating the statistical
>> significance between them?
>>
>> P.S. I haven't applied any folds (just one "training and test" run).
>>
>> J.
>>
>>
>



_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora



More information about the Corpora mailing list