Thanks, everybody, for the useful replies.<br><br>Ted, this is exactly what I needed for a preliminary experiment; many thanks for the links to those prepared lists of MWEs. I'm new to working with collocations, and it seems there are heated discussions out there about extracting different types of idioms!<br>

<br>There are two things I will probably need later in my research. The first is a method for scoring my sentences by their degree of idiosyncrasy. The second is a more advanced MWE extractor that can recognize idioms that do not necessarily appear as contiguous n-grams (e.g., "On one hand S1 On the other hand S2" or "break POS heart").<br>

There must be algorithms that detect such structures using rather different matching methods. If a candidate list of such idioms already exists in a slightly different format (regular expressions, perhaps), then matching them against the sentences of a given text would be an easy second step (for example with Linux grep, as Martin suggested).<br>
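<br>To make the regular-expression idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the template notation, the slot names POS and GAP, the small possessive list, and the function names are all hypothetical and do not come from any existing MWE resource or tool. Each template is turned into a regular expression, and the sentences that contain at least one idiom are returned.<br>

```python
import re

# Hypothetical slot notation (an assumption, not an existing standard):
#   POS = a possessive such as "my", "her", "one's", "John's"
#   GAP = any non-empty stretch of text inside the same sentence
SLOTS = {
    "POS": r"(?:my|your|his|her|its|our|their|\w+'s)",
    "GAP": r".+?",
}

# Hypothetical candidate list; a real one would be loaded from a file.
TEMPLATES = [
    "by and large",
    "break POS heart",
    "on one hand GAP on the other hand",
]


def compile_idiom(template):
    """Turn a whitespace-separated template into a compiled regex."""
    parts = []
    for token in template.split():
        # Slot names expand to sub-patterns; other tokens match literally.
        parts.append(SLOTS.get(token, re.escape(token)))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


def sentences_with_mwes(sentences, patterns):
    """Return (sentence, [matching templates]) for sentences with a hit."""
    hits = []
    for sentence in sentences:
        matched = [name for name, pat in patterns.items() if pat.search(sentence)]
        if matched:
            hits.append((sentence, matched))
    return hits


if __name__ == "__main__":
    patterns = {t: compile_idiom(t) for t in TEMPLATES}
    text = [
        "You will break my heart.",
        "On one hand it is cheap, on the other hand it is slow.",
        "Nothing idiomatic here.",
    ]
    for sentence, names in sentences_with_mwes(text, patterns):
        print(sentence, names)
```

This sketch matches surface forms only, so an inflected variant such as "broke her heart" would be missed unless the text is lemmatized first or the inflected forms are added to the template.<br>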

<br>Thanks, Rich. No, this is not what I am planning to do, though the method might be applicable to my work as well.<br><br>Best,<br>Fatemeh<br><br><div class="gmail_quote">On Mon, Jan 31, 2011 at 12:39 AM, Rich Cooper <span dir="ltr"><<a href="mailto:rich@englishlogickernel.com">rich@englishlogickernel.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div link="blue" vlink="purple" lang="EN-US">

<div>

<p class="MsoNormal"><font color="blue" size="3" face="Arial"><span style="font-size: 12pt; font-family: Arial; color: blue;">Hi Fatemeh,</span></font></p>

<p class="MsoNormal"><font color="blue" size="3" face="Arial"><span style="font-size: 12pt; font-family: Arial; color: blue;"> </span></font></p>

<p class="MsoNormal"><font color="blue" size="3" face="Arial"><span style="font-size: 12pt; font-family: Arial; color: blue;">Are you interested in developing a corpus

of issued patents as recorded by the USPTO, which contain numerous large

columns with unstructured text?  I have tools that will help you do that.  </span></font></p>

<p class="MsoNormal"><font color="blue" size="3" face="Arial"><span style="font-size: 12pt; font-family: Arial; color: blue;"> </span></font></p>

<p class="MsoNormal"><font color="blue" size="3" face="Arial"><span style="font-size: 12pt; font-family: Arial; color: blue;">If you have done so, you can then use

another tool (in alpha condition now) called Linguistics Lab which text mines

for exactly such MWEs in string format.  </span></font></p>

<p class="MsoNormal"><font color="blue" size="3" face="Arial"><span style="font-size: 12pt; font-family: Arial; color: blue;"> </span></font></p>

<p class="MsoNormal"><font color="blue" size="3" face="Arial"><span style="font-size: 12pt; font-family: Arial; color: blue;">Would that help you?</span></font></p>

<p class="MsoNormal"><font color="blue" size="3" face="Arial"><span style="font-size: 12pt; font-family: Arial; color: blue;"> </span></font></p>

<p class="MsoNormal"><font color="blue" size="3" face="Arial"><span style="font-size: 12pt; font-family: Arial; color: blue;">-Rich</span></font></p>

<div>

<p class="MsoNormal"><font color="black" size="3" face="Times New Roman"><span style="font-size: 12pt; color: black;"> </span></font></p>

<p class="MsoNormal"><font color="black" size="3" face="Times New Roman"><span style="font-size: 12pt; color: black;"> </span></font></p>

<p class="MsoNormal"><font color="black" size="3" face="Times New Roman"><span style="font-size: 12pt; color: black;">Sincerely,</span></font></p>

<p class="MsoNormal"><font color="black" size="3" face="Times New Roman"><span style="font-size: 12pt; color: black;">Rich Cooper</span></font></p>

<p class="MsoNormal"><font color="black" size="3" face="Times New Roman"><span style="font-size: 12pt; color: black;">EnglishLogicKernel.com</span></font><font color="blue"><span style="color: blue;"></span></font></p>

<p class="MsoNormal"><font color="black" size="3" face="Times New Roman"><span style="font-size: 12pt; color: black;">Rich AT EnglishLogicKernel DOT com</span></font><font color="blue"><span style="color: blue;"></span></font></p>

<p class="MsoNormal"><font color="black" size="3" face="Times New Roman"><span style="font-size: 12pt; color: black;">9 4 9 \ 5 2 5 - 5 7 1 2</span></font></p>

</div>

<div>

<div class="MsoNormal" style="text-align: center;" align="center"><font size="3" face="Times New Roman"><span style="font-size: 12pt;">

<hr align="center" size="2" width="100%">

</span></font></div>

<p class="MsoNormal"><b><font size="2" face="Tahoma"><span style="font-size: 10pt; font-family: Tahoma; font-weight: bold;">From:</span></font></b><font size="2" face="Tahoma"><span style="font-size: 10pt; font-family: Tahoma;">

<a href="mailto:corpora-bounces@uib.no" target="_blank">corpora-bounces@uib.no</a> [mailto:<a href="mailto:corpora-bounces@uib.no" target="_blank">corpora-bounces@uib.no</a>] <b><span style="font-weight: bold;">On Behalf Of </span></b>Fatemeh Torabi Asr<br>

<b><span style="font-weight: bold;">Sent:</span></b> Sunday, January 30, 2011

5:39 AM<br>

<b><span style="font-weight: bold;">To:</span></b> <a href="mailto:corpora@uib.no" target="_blank">corpora@uib.no</a><br>

<b><span style="font-weight: bold;">Subject:</span></b> [Corpora-List] MWE

extraction from a desired text</span></font></p>

</div><div><div></div><div class="h5">

<p class="MsoNormal"><font size="3" face="Times New Roman"><span style="font-size: 12pt;"> </span></font></p>

<div>

<p class="MsoNormal"><font size="3" face="Times New Roman"><span style="font-size: 12pt;"><br>

Dear all,<br>

<br>

I wonder if anyone knows of software that takes a text as input and outputs a

list of the sentences in it that contain common multiword expressions (MWEs).

I have already found some tools, but the underlying algorithm is also important

to me. I don't want the algorithm to rely on frequencies in the

input text; instead, it should [probably] have a ready-made offline list of MWEs (or a

similar data structure) against which it matches the text. Any kind of

idiomatic expression (unusual ones, e.g., "by and large", or well-formed

ones, e.g., "break one's heart") is acceptable.<br>

<br>

Best,<br>

<font color="#888888"><span style="color: rgb(136, 136, 136);">Fatemeh</span></font></span></font></p>

</div>

<p class="MsoNormal"><font size="3" face="Times New Roman"><span style="font-size: 12pt;"><br>

<br clear="all">

<br>

-- <br>

Fatemeh</span></font></p>

</div></div></div>

</div>

</blockquote></div><br><br clear="all"><br>-- <br>Fatemeh<br>