Dear Mihail,<div><br></div><div>I've been looking for a statistic that works well where n>2, and have reviewed various papers making proposals over the last ten years, but have not been convinced by any!</div><div><br>
</div><div>Here are two observations:</div><div><br></div><div>1) introducing some grammar helps a lot more than changing the statistic</div><div>2) once you have some grammar, it is not clear that you need any statistics. Oxford University Press lexicographers said they liked to see (grammatical) collocations in plain frequency order (and we added a toggle to our word sketches for switching between sorting by frequency and by 'salience': both are useful, depending on what the user is looking for)</div>
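To make the frequency-vs-salience contrast concrete, here is a toy sketch. All the counts are invented, and logDice is used as one concrete 'salience' measure; any association score would illustrate the same point:

```python
from math import log2

# Toy counts for collocates of a node word (say, "beer"); all figures invented.
f_node = 1000                 # frequency of the node word
candidates = {                # collocate -> (f_collocate, f_joint)
    "the":   (50000, 400),    # very frequent word: high raw co-occurrence count
    "cold":  (3000, 250),
    "lager": (200, 150),      # rare word, strongly associated with the node
}

def logdice(f_joint, f_node, f_coll):
    # logDice association score (Rychly 2008): 14 + log2(2*f_xy / (f_x + f_y))
    return 14 + log2(2 * f_joint / (f_node + f_coll))

by_freq = sorted(candidates, key=lambda w: candidates[w][1], reverse=True)
by_salience = sorted(
    candidates,
    key=lambda w: logdice(candidates[w][1], f_node, candidates[w][0]),
    reverse=True)

print(by_freq)      # ['the', 'cold', 'lager'] -- raw frequency favours 'the'
print(by_salience)  # ['lager', 'cold', 'the'] -- salience demotes 'the'
```

Both orderings are a single sort; the only question is which key the user wants at that moment.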
<div><br></div><div>An excellent paper also reaching these conclusions is </div><div>Joachim Wermter, <a href="http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/h/Hahn:Udo.html">Udo Hahn</a>: You Can't Beat Frequency (Unless You Use Linguistic Knowledge) - A Qualitative Evaluation of Association Measures for Collocation and Term Extraction. <a href="http://www.informatik.uni-trier.de/~ley/db/conf/acl/acl2006.html#WermterH06">ACL 2006</a><br>
<br>We recently wrote up our own work in this area in</div><ul><li><a class="attachment" href="http://trac.sketchengine.co.uk/attachment/wiki/AK/Papers/multiwords.pdf?format=raw" title="Attachment 'multiwords.pdf' in AK/Papers">Finding Multiwords of More Than Two Words</a><ul>
<li>Adam Kilgarriff, Pavel Rychly, Vojtech Kovar, Vit Baisa</li><li>in: <em>Proc. EURALEX</em>, Oslo, August.</li></ul></li></ul><div><div class="gmail_quote">One final thought: in our term-finding work, we often want to find the most distinctive terms (for a given corpus, in contrast to a reference corpus) of length 2, 3 and 4. As we can't find a statistic that gives the same range of values for each case, we are taking a different approach. A lexicographer friend worked through a term-candidate list and found around 400 interesting two-word terms, 40 three-word terms and 4 four-word terms. Based on that evidence, we simply make the list we show the terminologist 90% highest-salience 2-grams, 9% highest-salience 3-grams and 1% highest-salience 4-grams.</div>
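A minimal sketch of that 90/9/1 mixing. The function name and all the toy scores are my own invention; only the proportions come from the paragraph above:

```python
def mixed_term_list(scored, n_show=100,
                    quotas=((2, 0.90), (3, 0.09), (4, 0.01))):
    """scored maps n-gram length -> list of (term, salience) pairs.
    Since salience scores for different lengths are not on the same
    scale, rank each length separately and let it fill a fixed share
    of the final list: 90% 2-grams, 9% 3-grams, 1% 4-grams."""
    shown = []
    for n, share in quotas:
        ranked = sorted(scored.get(n, []), key=lambda ts: ts[1], reverse=True)
        shown.extend(term for term, _ in ranked[:round(n_show * share)])
    return shown

# Toy candidate lists (scores invented):
scored = {
    2: [("bigram_%d" % i, 100.0 - i) for i in range(200)],
    3: [("trigram_%d" % i, 50.0 - i) for i in range(40)],
    4: [("fourgram_%d" % i, 20.0 - i) for i in range(10)],
}
terms = mixed_term_list(scored, n_show=100)
print(len(terms))   # 90 + 9 + 1 = 100
```

The point of the quotas is exactly that no cross-length score comparison is ever made: each length competes only against itself.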
<div class="gmail_quote"><br></div><div class="gmail_quote">Adam</div><div class="gmail_quote"><br></div><div class="gmail_quote">On 24 October 2012 14:22, Mihail Kopotev <span dir="ltr"><<a href="mailto:Mihail.Kopotev@helsinki.fi" target="_blank">Mihail.Kopotev@helsinki.fi</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<font size="-1"><font face="Verdana">Dear Corpora<font size="-1">-lis<font size="-1">ters</font></font></font></font>, <br>
<br>
As I see it, the commonly used approach to extracting collocations from
n-grams (n>2) is to treat them, one way or another, as pseudo-bigrams. I'm
wondering whether there are more direct techniques that are
not derived from bigrams.<br>
Could you please point me to any articles/resources, especially those
where these techniques are evaluated against data?<br>
<br>
Thank you, <br>
Mikhail Kopotev<span class="HOEnZb"><font color="#888888"><br>
<pre cols="72">--
Mikhail Kopotev, PhD, Adj.Prof.
University Lecturer
Department of Modern Languages
University of Helsinki
<a href="http://www.helsinki.fi/~kopotev" target="_blank">http://www.helsinki.fi/~kopotev</a> </pre>
</font></span></div>
<br>_______________________________________________<br>
UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br>========================================<br><a href="http://www.kilgarriff.co.uk/" target="_blank">Adam Kilgarriff</a> <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a> <br>
Director <a href="http://www.sketchengine.co.uk/" target="_blank">Lexical Computing Ltd</a> <br>Visiting Research Fellow <a href="http://leeds.ac.uk" target="_blank">University of Leeds</a> <div>
<i><font color="#006600">Corpora for all</font></i> with <a href="http://www.sketchengine.co.uk" target="_blank">the Sketch Engine</a> </div><div> <i><a href="http://www.webdante.com" target="_blank">DANTE: <font color="#009900">a lexical database for English</font></a><font color="#009900"> </font> </i><div>
========================================</div></div><br>
</div>