[Corpora-List] MWE extraction from a desired text

Ted Pedersen tpederse at d.umn.edu
Sun Jan 30 16:43:35 UTC 2011


Hi Fatemeh,

Let me suggest a few simple tools that might at least represent a
starting point for you in identifying MWEs in text (from a
pre-existing list, which is often what I want to do too)...

In WordNet::Similarity there is a utility called "compounds.pl" that
will generate a list of all the compounds found in WordNet. Compounds
are the WordNet approximations of MWEs, at least in a narrow sense.

http://wn-similarity.sourceforge.net/

In any case, if you have this module installed, you can run the
following command to generate a list for your version of WordNet (I am
using 3.0). Note that I'm using the Linux command line.

ted@marimba:~$ compounds.pl > wn-30-compounds.txt

ted@marimba:~$ wc wn-30-compounds.txt
  64331   64331 1031846 wn-30-compounds.txt

If you don't have this module or WordNet, I've taken the liberty of
putting that list of compounds here:

http://marimba.d.umn.edu/WordNet_Compounds/
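
If you want a quick look at what is in the list before using it, something
like the following Python sketch might help. It assumes one entry per line
with the words joined by underscores (the wc output above, with equal line
and word counts, at least suggests one whitespace-free entry per line). The
function name compound_length_histogram is made up for illustration and is
not part of WordNet::Similarity.

```python
from collections import Counter


def compound_length_histogram(lines):
    """Count how many compounds have 2 words, 3 words, and so on.

    Assumes each non-empty line is one compound whose words are joined
    by underscores (an assumption about the list format).
    """
    hist = Counter()
    for line in lines:
        entry = line.strip()
        if entry:
            hist[len(entry.split("_"))] += 1
    return hist


if __name__ == "__main__":
    # A tiny in-memory sample; for the real list you could pass an open
    # file handle on wn-30-compounds.txt instead.
    sample = ["analog_computer\n", "by_and_large\n", "broken_heart\n"]
    for n_words, count in sorted(compound_length_histogram(sample).items()):
        print(n_words, count)
```

The longest entry length is useful to know if you later write your own
lookup, since it bounds how many words you need to look ahead.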

Now you have a list of compounds (one that you either generated or
downloaded). You can provide that list to the find-compounds.pl
utility, which is a fairly recent addition to the Ngram Statistics
Package, and look up those compounds in your text.

http://ngram.sourceforge.net/

Again from the Linux command line, here is a short input text and then
the command to find the compounds in it:

ted@marimba:~$ cat test.txt
My friend Winston Churchill took me to Churchill Downs in the
United States of America to see an analog computer. By and
large I love analog computers!! Suffice to say this was a
red-letter day in my life!! Next week I shall tell you about
my broken heart.

ted@marimba:~$ find-compounds.pl test.txt wn-30-compounds.txt
My friend Winston_Churchill took me to Churchill_Downs in the
United_States_of_America to see an analog_computer. By_and_large I
love analog computers!! Suffice to say this was a red-letter_day in my
life!! Next week I shall tell you about my broken_heart.
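
Since the original question asked for the sentences in which MWEs appear,
output like the above could be filtered with a few lines of Python. This is
only a rough sketch: the function name sentences_with_compounds is
hypothetical, the sentence splitting is deliberately naive (it breaks after
".", "!" or "?" followed by whitespace), and it assumes that underscores
occur in the text only where a compound was marked.

```python
import re


def sentences_with_compounds(marked_text):
    """Return the sentences that contain at least one marked compound.

    Assumes compounds were marked by joining their words with underscores
    and that the text has no other underscores (both are assumptions).
    """
    # Collapse line breaks so they do not cut sentences in two.
    flat = " ".join(marked_text.split())
    # Naive sentence split: after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    return [s for s in sentences if "_" in s]


if __name__ == "__main__":
    marked = ("I saw an analog_computer. It was old.\n"
              "By_and_large I liked it!")
    for sentence in sentences_with_compounds(marked):
        print(sentence)
```

Abbreviations such as "e.g." would confuse this splitter, so a proper
sentence segmenter would be a better choice for real text.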

Now, find-compounds.pl is just a simple string lookup, so it will miss
morphological variants (note that analog_computer is found, but not
analog computers), but it does do some nice things, like not getting
fooled by line boundaries, capitalization, or punctuation. So this
might at least be a bit of a starting point that I hope is easy enough
to use. A possibly nice thing about this is that you could add to your
compound list to customize it to your needs.
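
To make the idea of a simple list lookup concrete, here is a minimal Python
sketch of a greedy longest-match search. It is not the find-compounds.pl
code (which is Perl) and may differ from it in details. The names
load_compounds and mark_compounds are invented for this example, and the
underscore-joined list format is an assumption.

```python
import re


def load_compounds(lines):
    """Turn underscore-joined entries into a set of lowercase word tuples."""
    compounds = set()
    for line in lines:
        entry = line.strip()
        if entry:
            compounds.add(tuple(entry.lower().split("_")))
    return compounds


def mark_compounds(text, compounds):
    """Join the words of each listed compound with underscores."""
    # Splitting on any whitespace makes line boundaries irrelevant.
    tokens = text.split()
    # Compare on lowercase forms with surrounding punctuation stripped,
    # so capitalization and trailing '.' or '!!' do not block a match.
    keys = [re.sub(r"^\W+|\W+$", "", t).lower() for t in tokens]
    max_len = max((len(c) for c in compounds), default=1)
    out = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(keys[i:i + n]) in compounds:
                out.append("_".join(tokens[i:i + n]))
                i += n
                matched = True
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    compounds = load_compounds(["analog_computer", "by_and_large"])
    # Customizing the list: add a plural form by hand.
    compounds.add(("analog", "computers"))
    print(mark_compounds("By and\nlarge I love analog computers!!", compounds))
```

Adding entries by hand, as in the last lines, is one crude way around the
missing morphological variants. Lemmatizing the text before the lookup
would be the more general fix.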

Anyway, I posted this to the list overall as it seems to tie in
somewhat with our recent discussion of compounds, and might be generic
enough to be of some general interest. I'm also interested in knowing
about more sophisticated MWE or term mapping tools that might be
similarly easy to use.

I hope this helps.

Enjoy,
Ted

On Sun, Jan 30, 2011 at 7:38 AM, Fatemeh Torabi Asr <torabiasr at gmail.com> wrote:
>
> Dears,
>
> I wonder if anyone knows a software that takes a text as input and outputs a
> list of included sentences in which common Multi Word Expressions (MWE)
> appear. I have already found some tools but the underlying algorithm is also
> important for me. I don't want the algorithm to work based on the
> frequencies in the input text but [probably] it should have an offline ready
> list of MWEs (or a similar data structure) based on which it parses the
> text. Any kind of idiomatic expression (unusual ones, e.g., "by and large", or
> well-formed ones, e.g., "break one's heart") is acceptable.
>
> Best,
> Fatemeh
>
>
>
> --
> Fatemeh
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>


-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse



