[Corpora-List] MWE extraction from a desired text

Svetlana Sheremetyeva linklana at yahoo.com
Mon Jan 31 07:19:01 UTC 2011


Hi Fatemeh,
 
I have a tool that efficiently extracts noun phrases of up to four words from text files. The extraction procedure itself uses no statistical information and so does not depend on the text size. Afterwards you can score the candidates and keep only the key NPs, based on any parameter of interest or a combination of them (standard statistical scoring measures and others, such as location, length, etc.). The tool has an interface for administering the system's knowledge, so it can be trained on particular domains or set up to extract other units, e.g. verbs or VPs. It is also portable across languages.
 
If you think you might use it in your research, please visit http://www.lanaconsult.com
 
Regards,
                       Svetlana

--- On Sun, 30.1.11, Fatemeh Torabi Asr <torabiasr at gmail.com> wrote:


From: Fatemeh Torabi Asr <torabiasr at gmail.com>
Subject: Re: [Corpora-List] MWE extraction from a desired text
To: "Rich Cooper" <rich at englishlogickernel.com>
Cc: corpora at uib.no
Date: Sunday, January 30, 2011, 22:13


Thanks everybody for the useful replies.

Ted, this is exactly what I needed for a preliminary experiment, so many thanks for the links to those prepared lists of MWEs. I'm new to working with collocations, and it seems there are heated discussions out there about the extraction of different types of idioms!

There are two things I will probably need later in my research. The first is a method for scoring my sentences by their degree of idiosyncrasy. The second is a more advanced MWE extractor that can recognize idioms which do not necessarily appear as contiguous n-grams (e.g., "On one hand S1 On the other hand S2" or "break POS heart").
There must be algorithms that detect such structures with rather different matching methods. If a candidate list of such idioms is already available in a slightly different format (maybe regular expressions), then matching them against the sentences of a given text would be an easy second step (for instance with Linux grep, as Martin suggested).
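
A minimal sketch of the regular-expression idea, in Python rather than grep. It assumes that "POS" in "break POS heart" stands for a possessive (my, her, John's, ...) and that S1 and S2 are arbitrary stretches of text. The pattern labels, the POSSESSIVE word list, and the function name are hypothetical, chosen only for illustration; they do not come from any existing MWE resource.

```python
import re

# Assumption: "POS" means a possessive determiner or an 's genitive.
POSSESSIVE = r"(?:my|your|his|her|its|our|their|one's|\w+'s)"

# Hypothetical hand-written patterns for two non-contiguous idioms.
PATTERNS = {
    "break POS heart": re.compile(
        rf"\b(?:break|breaks|broke|broken|breaking)\s+{POSSESSIVE}\s+hearts?\b",
        re.IGNORECASE,
    ),
    "on one hand ... on the other hand": re.compile(
        # Lazy gap (.*?) stands in for S1; DOTALL lets it span line breaks.
        r"\bon\s+(?:the\s+)?one\s+hand\b.*?\bon\s+the\s+other\s+hand\b",
        re.IGNORECASE | re.DOTALL,
    ),
}


def find_gapped_mwes(sentence):
    """Return the labels of all patterns that match the sentence."""
    return [label for label, rx in PATTERNS.items() if rx.search(sentence)]
```

Listing the inflected verb forms by hand is the crudest option; a lemmatizer would generalize better, at the cost of an extra dependency.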

Thanks, Rich. No, this is not what I'm planning to do, though the method might be applicable to my work as well.

Best,
Fatemeh







On Mon, Jan 31, 2011 at 12:39 AM, Rich Cooper <rich at englishlogickernel.com> wrote:




Hi Fatemeh,
 
Are you interested in developing a corpus of issued patents as recorded by the USPTO, which contain numerous large columns with unstructured text?  I have tools that will help you do that.  
 
Once you have done so, you can use another tool (currently in alpha) called Linguistics Lab, which text-mines for exactly such MWEs in string format.
 
Would that help you?
 
-Rich

 
 
Sincerely,
Rich Cooper
EnglishLogicKernel.com
Rich AT EnglishLogicKernel DOT com
9 4 9 \ 5 2 5 - 5 7 1 2




From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Fatemeh Torabi Asr
Sent: Sunday, January 30, 2011 5:39 AM
To: corpora at uib.no
Subject: [Corpora-List] MWE extraction from a desired text



 


Dear all,

I wonder if anyone knows of software that takes a text as input and outputs a list of the sentences in which common multiword expressions (MWEs) appear. I have already found some tools, but the underlying algorithm also matters to me. I don't want the algorithm to rely on frequencies in the input text; rather, it should [probably] have a ready-made offline list of MWEs (or a similar data structure) against which it parses the text. Any kind of idiomatic expression is acceptable, whether unusual (e.g., "by and large") or well-formed (e.g., "break one's heart").
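
A minimal sketch of the list-based approach described above, assuming the offline MWE list is simply a set of fixed strings. The three entries in MWE_LIST, the naive sentence splitter, and the function names are all hypothetical placeholders; a real list would be loaded from a prepared resource. No frequency information from the input text is used.

```python
import re

# Hypothetical offline list; a real one would be read from a file.
MWE_LIST = ["by and large", "kick the bucket", "in spite of"]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentences_with_mwes(text, mwes=MWE_LIST):
    """Return (sentence, matched MWEs) pairs for sentences containing a listed MWE."""
    compiled = [
        # Whole-word, case-insensitive match; any whitespace between words.
        (m, re.compile(r"\b" + r"\s+".join(map(re.escape, m.split())) + r"\b",
                       re.IGNORECASE))
        for m in mwes
    ]
    hits = []
    for sentence in split_sentences(text):
        found = [m for m, rx in compiled if rx.search(sentence)]
        if found:
            hits.append((sentence, found))
    return hits
```

Fixed-string lookup of this kind only finds MWEs in their listed form; inflecting or gapped expressions such as "break one's heart" would need patterns or lemmatization on top.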

Best,
Fatemeh



-- 
Fatemeh





_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora



      