[Corpora-List] How to measure n-grams (n>2)

Adam Kilgarriff adam at lexmasterclass.com
Thu Oct 25 09:32:33 UTC 2012


Dear Mihail,

I've been looking for a statistic that works well where n>2, and have
reviewed various papers making proposals over the last ten years, but have
not been convinced by any!

Here are two observations:

1) introducing some grammar helps a lot more than changing the statistic
2) once you have some grammar, it is not clear that you need any
statistics.  Oxford University Press lexicographers said they liked to see
(grammatical) collocations in plain frequency order (and we added a switch
to our word sketches, to switch between sorting on frequencies vs on
'salience': both are useful, depending on what the user is looking for)
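The frequency-vs-salience switch can be sketched in a few lines. This is
only an illustration, not Sketch Engine code: logDice (Rychly 2008) stands
in here as one possible salience measure, and the word pairs and counts
are invented.

```python
# Illustrative sketch: sorting collocation candidates by raw frequency
# vs by a salience score (logDice used as an example measure).
# All counts below are invented.
import math

def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice = 14 + log2(2 * f(x,y) / (f(x) + f(y)))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# (candidate pair, joint freq, freq of word 1, freq of word 2)
candidates = [
    ("strong tea",      80,   1500,    400),
    ("of the",        9000, 500000, 600000),
    ("rancid butter",    6,     10,    300),
]

by_frequency = sorted(candidates, key=lambda c: c[1], reverse=True)
by_salience  = sorted(candidates,
                      key=lambda c: log_dice(c[1], c[2], c[3]),
                      reverse=True)

print([c[0] for c in by_frequency])  # "of the" first
print([c[0] for c in by_salience])   # "strong tea" first
```

Frequency ranks "of the" on top; salience demotes it because its joint
frequency is tiny relative to its marginals, which is exactly why both
orderings are worth offering to the user.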

An excellent paper also reaching these conclusions is:

Joachim Wermter, Udo Hahn: You Can't Beat Frequency (Unless You Use
Linguistic Knowledge) - A Qualitative Evaluation of Association Measures
for Collocation and Term Extraction. ACL 2006.
<http://www.informatik.uni-trier.de/~ley/db/conf/acl/acl2006.html#WermterH06>

We recently wrote up our own work in this area in:

   - Finding Multiwords of More Than Two Words
     <http://trac.sketchengine.co.uk/attachment/wiki/AK/Papers/multiwords.pdf?format=raw>
   - Adam Kilgarriff, Pavel Rychly, Vojtech Kovar, Vit Baisa
   - in: *Proc. EURALEX*, Oslo, August.

One final thought: in our termfinding work, we often want to find the most
distinctive terms (for a given corpus, in contrast to a reference corpus)
of length 2, 3 and 4.  As we can't find a statistic that gives the same
range of values for each case, we are taking a different approach.  A
lexicographer friend worked through a term-candidate list and found around
400 interesting 2-word terms, 40 3-word terms and 4 4-word terms.  Based
on that evidence, we simply show the terminologist 90% of the
highest-salience 2-grams, 9% of the highest-salience 3-grams and 1% of the
highest-salience 4-grams.
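That selection scheme is simple enough to sketch. The helper name and the
candidate data below are hypothetical; only the 90/9/1 proportions come
from the observation above.

```python
# Sketch of the selection scheme described above: show the terminologist
# fixed shares of the highest-salience candidates per n-gram length.
# Proportions follow the 400 : 40 : 4 observation; everything else is
# invented for illustration.

QUOTA = {2: 0.90, 3: 0.09, 4: 0.01}

def select_candidates(scored, total_to_show):
    """scored: dict mapping n -> list of (term, salience) pairs.
    Returns the candidates to show, highest-salience first within
    each length, respecting each length's share of total_to_show."""
    shown = []
    for n, share in QUOTA.items():
        k = round(total_to_show * share)
        ranked = sorted(scored[n], key=lambda t: t[1], reverse=True)
        shown.extend(ranked[:k])
    return shown

# Hypothetical scored candidate lists of lengths 2, 3 and 4:
scored = {
    2: [(f"bi{i}", float(i)) for i in range(100)],
    3: [(f"tri{i}", float(i)) for i in range(20)],
    4: [(f"quad{i}", float(i)) for i in range(10)],
}
result = select_candidates(scored, 100)
print(len(result))  # 90 + 9 + 1 candidates shown
```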

Adam

On 24 October 2012 14:22, Mihail Kopotev <Mihail.Kopotev at helsinki.fi> wrote:

>  Dear Corpora-listers,
>
> As I see it, the commonly used approach to extracting collocations from
> n-grams (n>2) is to treat them somehow as pseudo-bigrams. I'm wondering
> whether there are more direct techniques that are not derivative of
> bigrams. Could you please point me to any articles/resources, especially
> those where these techniques are evaluated against data?
>
> Thank you,
> Mikhail Kopotev
>
> --
> Mikhail Kopotev, PhD, Adj.Prof.
> University Lecturer
> Department of Modern Languages
> University of Helsinki
> http://www.helsinki.fi/~kopotev
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>


-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director                                 Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow                 University of Leeds <http://leeds.ac.uk>

*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================