<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#ffffff">

    Hi Fatemeh, <br>

    <br>

    I wrote a Perl program, find-compounds.pl, to find the longest

    compound words in a text. <br>

    It is part of the Text-NSP package; its description is at the

    following link. <br>

    <br>

    <a

href="http://search.cpan.org/%7Etpederse/Text-NSP-1.21/bin/utils/find-compounds.pl">http://search.cpan.org/~tpederse/Text-NSP-1.21/bin/utils/find-compounds.pl</a><br>

    <br>

    <span class="Apple-style-span" style="border-collapse: separate;

      color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-style:

      normal; font-variant: normal; font-weight: normal; letter-spacing:

      normal; line-height: normal; orphans: 2; text-indent: 0px;

      text-transform: none; white-space: normal; widows: 2;

      word-spacing: 0px; font-size: medium;"><span

        class="Apple-style-span" style="font-family: arial,sans-serif;">

        <p>Suppose the original text contains "This is the new york city"

          and the compound word list has these entries:</p>

        <p>new_york <br>

          new_york_city</p>

        <p>find-compounds.pl finds the longest match, so after the

          compound words are replaced the text is "This is the

          new_york_city".</p>

      </span></span><br>
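    <br>
    To illustrate the longest-match idea, here is a rough Python sketch. It is
    not the find-compounds.pl source: the function name replace_compounds and
    the plain whitespace tokenization are assumptions made for this example, and
    unlike the real tool it does not deal with punctuation attached to words.
    At each position it tries the longest candidate window first and shrinks it
    until an entry of the list matches. <br>
    <pre>
```python
def replace_compounds(text, compounds):
    """Greedy longest-match replacement (illustrative sketch only).

    compounds holds underscore-joined entries such as "new_york_city".
    """
    # Store each entry as a tuple of lowercased words for exact lookup.
    table = {tuple(c.lower().split("_")) for c in compounds}
    longest = max((len(c) for c in table), default=1)

    tokens = text.split()
    out = []
    i = 0
    while i != len(tokens):
        # Try the longest window first, then shrink down to two words.
        for n in range(min(longest, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in table:
                # Join the original tokens so capitalization is kept.
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No compound starts here; copy the word unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(replace_compounds("This is the new york city",
                        ["new_york", "new_york_city"]))
```
    </pre>
    With the two-entry list above, the call should print "This is the
    new_york_city", because the three-word window is tried before the
    two-word one. <br>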

    find-compounds.pl takes as input a prepared (offline) list of the

    compound words you are interested in. <br>

    The output is the text with the compound words replaced. To pick

    out the sentences <br>

    that contain compound words, you need to process the output text

    further. I hope this is helpful. <br>
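    <br>
    One possible way to do that further processing is sketched below in Python.
    This is only an assumption about how you might do it, not part of Text-NSP:
    the function name sentences_with_compounds is made up for the example, and
    the sentence splitter is deliberately naive (it breaks after ".", "!" or
    "?", so abbreviations will confuse it). It keeps every sentence of the
    replaced text that contains a token from your compound list. <br>
    <pre>
```python
import re


def sentences_with_compounds(replaced_text, compounds):
    """Return sentences containing at least one listed compound.

    replaced_text is text in which compounds are already joined with
    underscores; compounds holds entries such as "broken_heart".
    """
    wanted = {c.lower() for c in compounds}
    hits = []
    # Naive sentence split: a run of text plus its closing punctuation.
    for sentence in re.findall(r"[^.!?]+[.!?]*", replaced_text):
        # Tokens may contain letters, digits, underscores and hyphens.
        tokens = re.findall(r"[\w-]+", sentence.lower())
        if any(tok in wanted for tok in tokens):
            hits.append(sentence.strip())
    return hits


text = ("My friend Winston_Churchill took me to see an analog_computer. "
        "I love analog computers!! "
        "Next week I shall tell you about my broken_heart.")
for s in sentences_with_compounds(
        text, ["winston_churchill", "analog_computer", "broken_heart"]):
    print(s)
```
    </pre>
    Here the middle sentence is skipped, since "analog computers" was not
    joined into a compound in the replaced text. <br>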

    <br>

    Thanks,<br>

    Ying<br>

    <br>


    On 2011/1/30 10:43, Ted Pedersen wrote:

    <blockquote

      cite="mid:AANLkTikgFVhwGJGXf8U=aXkKduxxMT+PFVPFLna953__@mail.gmail.com"

      type="cite">

      <pre wrap="">Hi Fatemeh,

Let me suggest a few simple tools that might at least represent a

starting point for you in identifying MWEs in text (from a

pre-existing list, which is often what I want to do too)...

In WordNet::Similarity there is a utility called "compounds.pl" that

will generate a list of all the compounds found in WordNet. Compounds

are the WordNet approximations of MWEs, at least in a narrow sense.

<a class="moz-txt-link-freetext" href="http://wn-similarity.sourceforge.net/">http://wn-similarity.sourceforge.net/</a>

In any case, if you have this module installed, you can run the

following command to generate a list for your version of WordNet (I am

using 3.0). Note that I'm using the Linux command line.

ted@marimba:~$ compounds.pl > wn-30-compounds.txt

ted@marimba:~$ wc wn-30-compounds.txt

  64331   64331 1031846 wn-30-compounds.txt

If you don't have this module or WordNet, I've taken the liberty of

putting that list of compounds here :

<a class="moz-txt-link-freetext" href="http://marimba.d.umn.edu/WordNet_Compounds/">http://marimba.d.umn.edu/WordNet_Compounds/</a>

Now...you have a list of compounds (either that you generated or

downloaded). You can provide that list to the find-compounds.pl

utility which is a fairly recent addition to the Ngram Statistics

Package, and look up those compounds in your text.

<a class="moz-txt-link-freetext" href="http://ngram.sourceforge.net/">http://ngram.sourceforge.net/</a>

Again, from the Linux command line, here is a short input text and then

the command to find the compounds in it...

ted@marimba:~$ cat test.txt

My friend Winston Churchill took me to Churchill Downs in the

United States of America to see an analog computer. By and

large I love analog computers!! Suffice to say this was a

red-letter day in my life!! Next week I shall tell you about

my broken heart.

ted@marimba:~$ find-compounds.pl test.txt wn-30-compounds.txt

My friend Winston_Churchill took me to Churchill_Downs in the

United_States_of_America to see an analog_computer. By_and_large I

love analog computers!! Suffice to say this was a red-letter_day in my

life!! Next week I shall tell you about my broken_heart.

Now, this is just a simple string lookup, so it will miss

morphological variants (note that analog_computer is found, but not

analog computers), but it does do some nice things like not get fooled

by line boundaries, capitalization or punctuation.  So, this might at

least be a bit of a starting point that  I hope is easy enough to use.

The possibly nice thing about this is you could add to your compound

list to customize this to your needs.

Anyway, I posted this to the list overall as it seems to tie in

somewhat with our recent discussion of compounds, and might be generic

enough to be of some general interest. I'm also interested in knowing

about more sophisticated MWE or term mapping tools that might be

similarly easy to use.

I hope this helps.

Enjoy,

Ted

On Sun, Jan 30, 2011 at 7:38 AM, Fatemeh Torabi Asr <a class="moz-txt-link-rfc2396E" href="mailto:torabiasr@gmail.com">&lt;torabiasr@gmail.com&gt;</a> wrote:

</pre>

      <blockquote type="cite">

        <pre wrap="">

Dear all,

I wonder if anyone knows of software that takes a text as input and outputs a

list of the sentences in which common Multi Word Expressions (MWEs)

appear. I have already found some tools, but the underlying algorithm is also

important to me. I don't want the algorithm to work based on the

frequencies in the input text; instead, it should [probably] have an offline, ready-made

list of MWEs (or a similar data structure) based on which it parses the

text. Any kind of idiomatic expression (unusual ones, e.g., "by and large", or

well-formed ones, e.g., "break one's heart") is acceptable.

Best,

Fatemeh

--

Fatemeh

_______________________________________________

Corpora mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>

<a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>

</pre>

      </blockquote>


    </blockquote>

    <br>

  </body>

</html>