[Corpora-List] MWE extraction from a desired text

Ying Liu liux0395 at umn.edu
Sun Jan 30 17:38:00 UTC 2011


Hi Fatemeh,

I wrote a Perl program, find-compounds.pl, that finds the longest compound
words in a text.
It is part of the Text-NSP package. The following link gives its description.

http://search.cpan.org/~tpederse/Text-NSP-1.21/bin/utils/find-compounds.pl 

Suppose the original text contains "This is the new york city" and the
compound word list contains

new_york
new_york_city

find-compounds.pl finds the longest match. After the compound words are
replaced, the text is "This is the new_york_city".
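
As a rough illustration of the longest-match idea, here is a minimal Python
sketch. It is not the find-compounds.pl source: the function name
replace_compounds is hypothetical, and the sketch splits on whitespace only,
so it does not handle punctuation the way the real utility does.

```python
def replace_compounds(text, compounds):
    """Join the longest listed compound at each position with underscores."""
    # Store each compound as a tuple of lowercase words, e.g. ("new", "york").
    table = {tuple(c.lower().split("_")) for c in compounds}
    max_len = max((len(c) for c in table), default=1)

    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window first and shrink until a compound matches.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in table:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No compound starts here; keep the single token unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(replace_compounds("This is the new york city",
                        ["new_york", "new_york_city"]))
```

Because the longer window is tried first, "new york city" is matched as a
whole before "new york" gets a chance.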


The program requires as input a prepared (offline) list of the compound
words you are interested in.
The output is the text file with the compound words replaced. To pick out
the sentences that contain the compound words, you need to process the
output text further. I hope this is helpful.
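
One possible way to do that further processing, sketched in Python under
some assumptions: the function name sentences_with_compounds is
hypothetical, sentences are split naively after ".", "!" or "?", and any
token containing an underscore is taken to be a replaced compound (so
underscores already present in the original text would be false positives).

```python
import re


def sentences_with_compounds(text):
    """Return the sentences that contain at least one underscore-joined token."""
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Keep a sentence if an underscore sits between two word characters.
    return [s for s in sentences if re.search(r"\w_\w", s)]


sample = ("This is the new_york_city. It is big. "
          "I love analog computers!! That was a red-letter_day.")
for sentence in sentences_with_compounds(sample):
    print(sentence)
```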

Thanks,
Ying





On 2011/1/30 10:43, Ted Pedersen wrote:
> Hi Fatemeh,
>
> Let me suggest a few simple tools that might at least represent a
> starting point for you in identifying MWEs in text (from a
> pre-existing list, which is often what I want to do too)...
>
> In WordNet::Similarity there is a utility called "compounds.pl" that
> will generate a list of all the compounds found in WordNet. Compounds
> are the WordNet approximations of MWEs, at least in a narrow sense.
>
> http://wn-similarity.sourceforge.net/
>
> In any case, if you have this module installed, you can run the
> following command to generate a list for your version of WordNet (I am
> using 3.0). Note that I'm using the Linux command line.
>
> ted@marimba:~$ compounds.pl > wn-30-compounds.txt
>
> ted@marimba:~$ wc wn-30-compounds.txt
>    64331   64331 1031846 wn-30-compounds.txt
>
> If you don't have this module or WordNet, I've taken the liberty of
> putting that list of compounds here:
>
> http://marimba.d.umn.edu/WordNet_Compounds/
>
> Now...you have a list of compounds (either that you generated or
> downloaded). You can provide that list to the find-compounds.pl
> utility which is a fairly recent addition to the Ngram Statistics
> Package, and look up those compounds in your text.
>
> http://ngram.sourceforge.net/
>
> Again, from the Linux command line, here is a short input text and
> then the command to find the compounds in it...
>
> ted@marimba:~$ cat test.txt
> My friend Winston Churchill took me to Churchill Downs in the
> United States of America to see an analog computer. By and
> large I love analog computers!! Suffice to say this was a
> red-letter day in my life!! Next week I shall tell you about
> my broken heart.
>
> ted@marimba:~$ find-compounds.pl test.txt wn-30-compounds.txt
> My friend Winston_Churchill took me to Churchill_Downs in the
> United_States_of_America to see an analog_computer. By_and_large I
> love analog computers!! Suffice to say this was a red-letter_day in my
> life!! Next week I shall tell you about my broken_heart.
>
> Now, this is just a simple string lookup, so it will miss
> morphological variants (note that analog_computer is found, but not
> analog computers), but it does do some nice things, like not getting
> fooled by line boundaries, capitalization or punctuation. So, this
> might at least be a bit of a starting point that I hope is easy enough
> to use. The possibly nice thing about this is that you could add to
> your compound list to customize it to your needs.
>
> Anyway, I posted this to the list overall as it seems to tie in
> somewhat with our recent discussion of compounds, and might be generic
> enough to be of some general interest. I'm also interested in knowing
> about more sophisticated MWE or term mapping tools that might be
> similarly easy to use.
>
> I hope this helps.
>
> Enjoy,
> Ted
>
> On Sun, Jan 30, 2011 at 7:38 AM, Fatemeh Torabi Asr <torabiasr at gmail.com> wrote:
>> Dears,
>>
>> I wonder if anyone knows of software that takes a text as input and outputs a
>> list of the sentences in which common Multi Word Expressions (MWEs)
>> appear. I have already found some tools, but the underlying algorithm is also
>> important to me. I don't want the algorithm to work based on the
>> frequencies in the input text; instead, it should [probably] have an offline ready
>> list of MWEs (or a similar data structure) based on which it parses the
>> text. Any kind of idiomatic expression (unusual ones, e.g., "by and large", or
>> well-formed ones, e.g., "break one's heart") is acceptable.
>>
>> Best,
>> Fatemeh
>>
>>
>>
>> --
>> Fatemeh
>>
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>>
>
