<table cellspacing="0" cellpadding="0" border="0" ><tr><td valign="top" style="font: inherit;"><P style="MARGIN: 0cm 0cm 0pt" class=MsoNormal><SPAN style="COLOR: black"><FONT size=3><FONT face="Times New Roman">Hi Fatemeh,<o:p></o:p></FONT></FONT></SPAN></DIV>
<P style="MARGIN: 0cm 0cm 0pt" class=MsoNormal><SPAN style="COLOR: black"><FONT size=3><FONT face="Times New Roman"> <o:p></o:p></FONT></FONT></SPAN></DIV>
<P style="MARGIN: 0cm 0cm 0pt" class=MsoNormal><SPAN style="COLOR: black"><FONT size=3><FONT face="Times New Roman">I have a tool that efficiently extracts noun phrases of up to four words from text files. The extraction procedure itself uses no statistical information and therefore does not depend on text size. Afterwards you can score the extracted NPs and keep only the key ones, based on the parameters of interest or a combination of them (standard statistical scoring measures as well as others, such as location and length). The tool has an interface for administering the system's knowledge, so that you can train it on particular domains or extract other parts of speech and phrase types, e.g. verbs or VPs. It is also portable across languages. <o:p></o:p></FONT></FONT></SPAN></DIV>
<P style="MARGIN: 0cm 0cm 0pt" class=MsoNormal><SPAN style="COLOR: black"><FONT size=3><FONT face="Times New Roman"> <o:p></o:p></FONT></FONT></SPAN></DIV>
<P style="MARGIN: 0cm 0cm 0pt" class=MsoNormal><SPAN style="COLOR: black"><FONT size=3 face="Times New Roman">If you think you might use it in your research, please visit the site </FONT><A href="http://www.lanaconsult.com/"><FONT size=3 face="Times New Roman">http://www.lanaconsult.com</FONT></A><o:p></o:p></SPAN></DIV>
<P style="MARGIN: 0cm 0cm 0pt" class=MsoNormal><SPAN style="COLOR: black"><FONT size=3><FONT face="Times New Roman"> <o:p></o:p></FONT></FONT></SPAN></DIV>
<P style="MARGIN: 0cm 0cm 0pt" class=MsoNormal><SPAN style="COLOR: black"><FONT size=3><FONT face="Times New Roman">Regards,<o:p></o:p></FONT></FONT></SPAN></DIV>
<P style="MARGIN: 0cm 0cm 0pt" class=MsoNormal><SPAN style="COLOR: black"><FONT size=3><FONT face="Times New Roman"><SPAN style="mso-spacerun: yes"> </SPAN><SPAN style="mso-spacerun: yes"> </SPAN>Svetlana<o:p></o:p></FONT></FONT></SPAN></DIV><BR><BR>--- On <B>Sun 30.1.11, Fatemeh Torabi Asr <I>&lt;torabiasr@gmail.com&gt;</I></B> wrote:<BR>
<BLOCKQUOTE style="BORDER-LEFT: rgb(16,16,255) 2px solid; PADDING-LEFT: 5px; MARGIN-LEFT: 5px"><BR>From: Fatemeh Torabi Asr &lt;torabiasr@gmail.com&gt;<BR>Subject: Re: [Corpora-List] MWE extraction from a desired text<BR>To: "Rich Cooper" &lt;rich@englishlogickernel.com&gt;<BR>Cc: corpora@uib.no<BR>Date: Sunday, 30 January 2011, 22:13<BR><BR>
<DIV id=yiv1601569546>Thanks, everybody, for the useful replies.<BR><BR>Ted, this is exactly what I needed for a preliminary experiment, so many thanks for the links to those prepared lists of MWEs. I'm new to working with collocations, and it seems there are heated discussions out there about extracting different types of idioms!<BR><BR>There are two requirements that I will probably pursue later in my research. The first is a method for scoring my sentences by their degree of idiosyncrasy. The second is a more advanced MWE extractor that can recognize idioms that do not necessarily appear as contiguous n-grams (e.g., "On one hand S1 On the other hand S2" or "break POS heart").<BR>There must be algorithms that detect such structures using rather different matching methods. If a candidate list of such idioms is already available in a slightly different format (maybe regular expressions), then the second step,
matching them against the sentences of a desired text, would also be easy (using Linux grep, as Martin suggested).<BR><BR>Thanks, Rich. No, this is not what I'm going to do, though the method might be applicable to my work as well.<BR><BR>Best,<BR>Fatemeh<BR><BR>
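<DIV>A minimal sketch of the list-based matching described above, assuming the candidate MWEs are stored as regular expressions. The pattern list, the names <CODE>MWE_PATTERNS</CODE>, <CODE>split_sentences</CODE> and <CODE>find_mwe_sentences</CODE>, and the naive sentence splitter are all illustrative assumptions, not part of any existing tool or lexicon mentioned in this thread.</DIV>
<PRE>
```python
import re

# Hypothetical candidate list: each MWE is written as a regular expression.
# The entries are illustrative assumptions, not taken from a published lexicon.
MWE_PATTERNS = {
    # Fixed, contiguous expression.
    "by and large": r"\bby and large\b",
    # Expression with a possessive slot: "break my heart", "broke Mary's heart".
    "break POS heart": (
        r"\b(?:break|breaks|broke|broken|breaking)\s+"
        r"(?:my|your|his|her|its|our|their|one's|\w+'s)\s+hearts?\b"
    ),
    # Discontinuous pair; the gap may hold any material inside one sentence.
    "on one hand ... on the other hand": (
        r"\bon (?:the )?one hand\b.*\bon the other hand\b"
    ),
}

COMPILED = {
    name: re.compile(rx, re.IGNORECASE) for name, rx in MWE_PATTERNS.items()
}


def split_sentences(text):
    """Naive splitter: a sentence is a run of text up to '.', '!' or '?'."""
    pieces = re.findall(r"[^.!?]+[.!?]*", text)
    return [p.strip() for p in pieces if p.strip()]


def find_mwe_sentences(text):
    """Return (sentence, [matched MWE names]) for each sentence with a match."""
    hits = []
    for sentence in split_sentences(text):
        names = [name for name, rx in COMPILED.items() if rx.search(sentence)]
        if names:
            hits.append((sentence, names))
    return hits


if __name__ == "__main__":
    sample = (
        "By and large the results were fine. It broke Mary's heart. "
        "On one hand the tool is fast, but on the other hand it misses "
        "rare idioms. Nothing idiomatic here."
    )
    for sentence, names in find_mwe_sentences(sample):
        print(names, "|", sentence)
```
</PRE>
<DIV>Because the sketch matches one sentence at a time, it would miss an "on one hand" pair whose two halves fall in separate sentences; matching over a window of adjacent sentences would be one way to relax that. No frequencies from the input text are used, only the offline list.</DIV><BR>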
<DIV class=yiv1601569546gmail_quote>On Mon, Jan 31, 2011 at 12:39 AM, Rich Cooper <SPAN dir=ltr><<A href="http://fr.mc295.mail.yahoo.com/mc/compose?to=rich@englishlogickernel.com" rel=nofollow target=_blank ymailto="mailto:rich@englishlogickernel.com">rich@englishlogickernel.com</A>></SPAN> wrote:<BR>
<BLOCKQUOTE style="BORDER-LEFT: rgb(204,204,204) 1px solid; MARGIN: 0pt 0pt 0pt 0.8ex; PADDING-LEFT: 1ex" class=yiv1601569546gmail_quote>
<DIV lang=EN-US>
<DIV>
<P class=yiv1601569546MsoNormal><FONT color=blue size=3 face=Arial><SPAN style="FONT-FAMILY: Arial; COLOR: blue; FONT-SIZE: 12pt">Hi Fatemeh,</SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=blue size=3 face=Arial><SPAN style="FONT-FAMILY: Arial; COLOR: blue; FONT-SIZE: 12pt"> </SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=blue size=3 face=Arial><SPAN style="FONT-FAMILY: Arial; COLOR: blue; FONT-SIZE: 12pt">Are you interested in developing a corpus of issued patents as recorded by the USPTO, which contain numerous large columns with unstructured text? I have tools that will help you do that. </SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=blue size=3 face=Arial><SPAN style="FONT-FAMILY: Arial; COLOR: blue; FONT-SIZE: 12pt"> </SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=blue size=3 face=Arial><SPAN style="FONT-FAMILY: Arial; COLOR: blue; FONT-SIZE: 12pt">Once you have done so, you can use another tool, Linguistics Lab (currently in alpha), which text-mines for exactly such MWEs in string format. </SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=blue size=3 face=Arial><SPAN style="FONT-FAMILY: Arial; COLOR: blue; FONT-SIZE: 12pt"> </SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=blue size=3 face=Arial><SPAN style="FONT-FAMILY: Arial; COLOR: blue; FONT-SIZE: 12pt">Would that help you?</SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=blue size=3 face=Arial><SPAN style="FONT-FAMILY: Arial; COLOR: blue; FONT-SIZE: 12pt"> </SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=blue size=3 face=Arial><SPAN style="FONT-FAMILY: Arial; COLOR: blue; FONT-SIZE: 12pt">-Rich</SPAN></FONT></DIV>
<DIV>
<P class=yiv1601569546MsoNormal><FONT color=black size=3 face="Times New Roman"><SPAN style="COLOR: black; FONT-SIZE: 12pt"> </SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=black size=3 face="Times New Roman"><SPAN style="COLOR: black; FONT-SIZE: 12pt"> </SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=black size=3 face="Times New Roman"><SPAN style="COLOR: black; FONT-SIZE: 12pt">Sincerely,</SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=black size=3 face="Times New Roman"><SPAN style="COLOR: black; FONT-SIZE: 12pt">Rich Cooper</SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=black size=3 face="Times New Roman"><SPAN style="COLOR: black; FONT-SIZE: 12pt">EnglishLogicKernel.com</SPAN></FONT><FONT color=blue><SPAN style="COLOR: blue"></SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=black size=3 face="Times New Roman"><SPAN style="COLOR: black; FONT-SIZE: 12pt">Rich AT EnglishLogicKernel DOT com</SPAN></FONT><FONT color=blue><SPAN style="COLOR: blue"></SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><FONT color=black size=3 face="Times New Roman"><SPAN style="COLOR: black; FONT-SIZE: 12pt">9 4 9 \ 5 2 5 - 5 7 1 2</SPAN></FONT></DIV></DIV>
<DIV>
<DIV style="TEXT-ALIGN: center" class=yiv1601569546MsoNormal align=center><FONT size=3 face="Times New Roman"><SPAN style="FONT-SIZE: 12pt">
<HR align=center SIZE=2 width="100%">
</SPAN></FONT></DIV>
<P class=yiv1601569546MsoNormal><B><FONT size=2 face=Tahoma><SPAN style="FONT-FAMILY: Tahoma; FONT-SIZE: 10pt; FONT-WEIGHT: bold">From:</SPAN></FONT></B><FONT size=2 face=Tahoma><SPAN style="FONT-FAMILY: Tahoma; FONT-SIZE: 10pt"> <A href="http://fr.mc295.mail.yahoo.com/mc/compose?to=corpora-bounces@uib.no" rel=nofollow target=_blank ymailto="mailto:corpora-bounces@uib.no">corpora-bounces@uib.no</A> [mailto:<A href="http://fr.mc295.mail.yahoo.com/mc/compose?to=corpora-bounces@uib.no" rel=nofollow target=_blank ymailto="mailto:corpora-bounces@uib.no">corpora-bounces@uib.no</A>] <B><SPAN style="FONT-WEIGHT: bold">On Behalf Of </SPAN></B>Fatemeh Torabi Asr<BR><B><SPAN style="FONT-WEIGHT: bold">Sent:</SPAN></B> Sunday, January 30, 2011 5:39 AM<BR><B><SPAN style="FONT-WEIGHT: bold">To:</SPAN></B> <A href="http://fr.mc295.mail.yahoo.com/mc/compose?to=corpora@uib.no" rel=nofollow target=_blank ymailto="mailto:corpora@uib.no">corpora@uib.no</A><BR><B><SPAN
style="FONT-WEIGHT: bold">Subject:</SPAN></B> [Corpora-List] MWE extraction from a desired text</SPAN></FONT></DIV></DIV>
<DIV>
<DIV></DIV>
<DIV class=yiv1601569546h5>
<P class=yiv1601569546MsoNormal><FONT size=3 face="Times New Roman"><SPAN style="FONT-SIZE: 12pt"> </SPAN></FONT></DIV>
<DIV>
<P class=yiv1601569546MsoNormal><FONT size=3 face="Times New Roman"><SPAN style="FONT-SIZE: 12pt"><BR>Dear all,<BR><BR>I wonder if anyone knows of software that takes a text as input and outputs the sentences in it that contain common multiword expressions (MWEs). I have already found some tools, but the underlying algorithm is also important to me. I don't want the algorithm to rely on frequencies in the input text; instead, it should [probably] have a ready-made offline list of MWEs (or a similar data structure) against which it parses the text. Any kind of idiomatic expression is acceptable, whether unusual (e.g., "by and large") or well-formed (e.g., "break one's heart").<BR><BR>Best,<BR><FONT color=#888888><SPAN style="COLOR: rgb(136,136,136)">Fatemeh</SPAN></FONT></SPAN></FONT></DIV></DIV>
<P class=yiv1601569546MsoNormal><FONT size=3 face="Times New Roman"><SPAN style="FONT-SIZE: 12pt"><BR><BR clear=all><BR>-- <BR>Fatemeh</SPAN></FONT></DIV></DIV></DIV></DIV></DIV></BLOCKQUOTE></DIV><BR><BR clear=all><BR>-- <BR>Fatemeh<BR></DIV><BR>-----The associated attachment follows-----<BR><BR>
<DIV class=plainMail>_______________________________________________<BR>Corpora mailing list<BR><A href="http://fr.mc295.mail.yahoo.com/mc/compose?to=Corpora@uib.no" ymailto="mailto:Corpora@uib.no">Corpora@uib.no</A><BR><A href="http://mailman.uib.no/listinfo/corpora" target=_blank>http://mailman.uib.no/listinfo/corpora</A><BR></DIV></BLOCKQUOTE></td></tr></table><br>