<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#ffffff">
Hi Fatemeh, <br>
<br>
I wrote a Perl program, find-compounds.pl, to find the longest
compound words in a text. <br>
It is part of the Text-NSP package; it is described at the
following link. <br>
<br>
<a
href="http://search.cpan.org/%7Etpederse/Text-NSP-1.21/bin/utils/find-compounds.pl">http://search.cpan.org/~tpederse/Text-NSP-1.21/bin/utils/find-compounds.pl</a><br>
<br>
<span class="Apple-style-span" style="border-collapse: separate;
color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-style:
normal; font-variant: normal; font-weight: normal; letter-spacing:
normal; line-height: normal; orphans: 2; text-indent: 0px;
text-transform: none; white-space: normal; widows: 2;
word-spacing: 0px; font-size: medium;"><span
class="Apple-style-span" style="font-family: arial,sans-serif;">
<p>Suppose the original text contains "This is the new york city"
and the compound word list has</p>
      <p>new_york <br>
new_york_city</p>
      <p>find-compounds.pl will find the longest match. After
replacing the compound words, the text becomes "This is the
new_york_city".</p>
</span></span><br>
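The longest-match step can be sketched roughly as follows. This is a minimal Python sketch for illustration only, not the actual find-compounds.pl implementation; it assumes whitespace tokenization and a pre-built set of underscore-joined compounds:

```python
# Sketch of greedy longest-match compound replacement.
# Assumes whitespace tokenization and a set of compounds whose
# member words are joined by underscores (as in the example above).

def join_compounds(text, compounds, max_len=5):
    """At each position, replace the longest known compound
    with its underscore-joined token."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        match = None
        # Try the longest candidate first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = "_".join(words[i:i + n])
            if candidate in compounds:
                match = candidate
                i += n
                break
        if match is None:
            out.append(words[i])
            i += 1
        else:
            out.append(match)
    return " ".join(out)

compounds = {"new_york", "new_york_city"}
print(join_compounds("This is the new york city", compounds))
# prints: This is the new_york_city
```

Trying the longest candidate first is what makes "new_york_city" win over "new_york" at the same position.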
The script takes as input a ready-made offline list of the compound
words you are interested in. <br>
The output is the text with the compound words replaced. In order
to pick out the sentences <br>
that contain compound words, you will need to process the output
text further. I hope this is helpful. <br>
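For that further step, one simple approach is to keep only the sentences that contain a joined compound. This is a hedged sketch, not part of Text-NSP; it assumes a naive period-based sentence split and that compounds are marked by underscores in the output:

```python
# Sketch: from find-compounds.pl output, keep only sentences that
# contain at least one joined compound (marked by an underscore).
# Assumes a naive split on sentence-final punctuation.

import re

def sentences_with_compounds(text):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # \w includes the underscore, so word_word matches a joined compound.
    return [s for s in sentences if re.search(r"\w+_\w+", s)]

output = "This is the new_york_city. Nothing here. I saw an analog_computer."
for s in sentences_with_compounds(output):
    print(s)
# prints: This is the new_york_city.
#         I saw an analog_computer.
```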
<br>
Thanks,<br>
Ying<br>
<br>
On 2011/1/30 10:43, Ted Pedersen wrote:
<blockquote
cite="mid:AANLkTikgFVhwGJGXf8U=aXkKduxxMT+PFVPFLna953__@mail.gmail.com"
type="cite">
<pre wrap="">Hi Fatemeh,
Let me suggest a few simple tools that might at least represent a
starting point for you in identifying MWEs in text (from a
pre-existing list, which is often what I want to do too)...
In WordNet::Similarity there is a utility called "compounds.pl" that
will generate a list of all the compounds found in WordNet. Compounds
are the WordNet approximations of MWEs, at least in a narrow sense.
<a class="moz-txt-link-freetext" href="http://wn-similarity.sourceforge.net/">http://wn-similarity.sourceforge.net/</a>
In any case, if you have this module installed, you can run the
following command to generate a list for your version of WordNet (I am
using 3.0). Note that I'm using the Linux command line.
ted@marimba:~$ compounds.pl > wn-30-compounds.txt
ted@marimba:~$ wc wn-30-compounds.txt
64331 64331 1031846 wn-30-compounds.txt
If you don't have this module or WordNet, I've taken the liberty of
putting that list of compounds here :
<a class="moz-txt-link-freetext" href="http://marimba.d.umn.edu/WordNet_Compounds/">http://marimba.d.umn.edu/WordNet_Compounds/</a>
Now...you have a list of compounds (either that you generated or
downloaded). You can provide that list to the find-compounds.pl
utility which is a fairly recent addition to the Ngram Statistics
Package, and look up those compounds in your text.
<a class="moz-txt-link-freetext" href="http://ngram.sourceforge.net/">http://ngram.sourceforge.net/</a>
Again, from the Linux command line, a short input text and then the
command to find the compounds in it...
ted@marimba:~$ cat test.txt
My friend Winston Churchill took me to Churchill Downs in the
United States of America to see an analog computer. By and
large I love analog computers!! Suffice to say this was a
red-letter day in my life!! Next week I shall tell you about
my broken heart.
ted@marimba:~$ find-compounds.pl test.txt wn-30-compounds.txt
My friend Winston_Churchill took me to Churchill_Downs in the
United_States_of_America to see an analog_computer. By_and_large I
love analog computers!! Suffice to say this was a red-letter_day in my
life!! Next week I shall tell you about my broken_heart.
Now, this is just a simple string lookup, so it will miss
morphological variants (note that analog_computer is found, but not
analog computers), but it does do some nice things like not getting
fooled by line boundaries, capitalization, or punctuation. So, this
might at least be a bit of a starting point that I hope is easy enough to use.
A possibly nice thing about this is that you could add to your
compound list to customize it to your needs.
Anyway, I posted this to the list overall as it seems to tie in
somewhat with our recent discussion of compounds, and might be generic
enough to be of some general interest. I'm also interested in knowing
about more sophisticated MWE or term mapping tools that might be
similarly easy to use.
I hope this helps.
Enjoy,
Ted
On Sun, Jan 30, 2011 at 7:38 AM, Fatemeh Torabi Asr <a class="moz-txt-link-rfc2396E" href="mailto:torabiasr@gmail.com"><torabiasr@gmail.com></a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="">
Dears,
I wonder if anyone knows of software that takes a text as input and outputs a
list of the sentences in which common Multi-Word Expressions (MWEs)
appear. I have already found some tools, but the underlying algorithm is also
important for me. I don't want the algorithm to work based on
frequencies in the input text; rather, it should [probably] have a ready-made
offline list of MWEs (or a similar data structure) based on which it parses
the text. Any kind of idiomatic expression (unusual ones, e.g., "by and large",
or well-formed ones, e.g., "break one's heart") is acceptable.
Best,
Fatemeh
--
Fatemeh
_______________________________________________
Corpora mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>
<a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>
</pre>
</blockquote>
<pre wrap="">
</pre>
</blockquote>
<br>
</body>
</html>