[Corpora-List] MWE extraction from a desired text

Mon Jan 31 06:13:30 UTC 2011

Thanks everybody for the useful replies.

Ted, this is exactly what I needed for a preliminary experiment, so many
thanks for the links to those prepared lists of MWEs. I'm new to
working with collocations, and it seems there are lively discussions out
there about extracting different types of idioms!

There are two requirements that I will probably pursue later in my
research. The first is a method for scoring my sentences by their
degree of idiosyncrasy. The second is a more advanced MWE extractor that
can recognize idioms which do not necessarily appear as
sequential n-grams (e.g., "On one hand S1 On the other hand S2" or "break POS
heart").
There must be algorithms to detect such structures using rather different
matching methods. If a candidate list of such idioms is already available in a
slightly different format (maybe regular expressions), then the second job,
matching them against the sentences in a desired text, would also be
easy (just as Martin suggested, using Linux grep).
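
A minimal sketch of that matching step in Python, under stated assumptions:
the three patterns below are hand-written illustrations (not taken from any
existing MWE list), the names MWE_PATTERNS, split_sentences and find_mwes are
hypothetical, and the sentence splitter is deliberately naive, so a
discontinuous idiom is only found when both halves fall in one sentence.

```python
import re

# Hypothetical, hand-written pattern list for illustration only.
# A real list would come from a prepared MWE lexicon.
MWE_PATTERNS = {
    "on one hand ... on the other hand": re.compile(
        r"\bon (?:the )?one hand\b.+?\bon the other hand\b",
        re.IGNORECASE | re.DOTALL,
    ),
    # "POS" is read here as a possessive slot (his, her, one's, John's, ...).
    "break POS heart": re.compile(
        r"\b(?:break|breaks|broke|broken|breaking)\s+"
        r"(?:my|your|his|her|its|our|their|\w+'s)\s+hearts?\b",
        re.IGNORECASE,
    ),
    "by and large": re.compile(r"\bby and large\b", re.IGNORECASE),
}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mwes(text):
    """Return (sentence, [matched MWE names]) for sentences with a match."""
    hits = []
    for sentence in split_sentences(text):
        found = [name for name, pat in MWE_PATTERNS.items() if pat.search(sentence)]
        if found:
            hits.append((sentence, found))
    return hits


if __name__ == "__main__":
    sample = (
        "By and large, the plan worked. She broke his heart. "
        "On one hand it is cheap, on the other hand it is slow. "
        "Nothing idiomatic here."
    )
    for sentence, names in find_mwes(sample):
        print(names, "<-", sentence)
```

Keeping the patterns in a plain dictionary means the list can be swapped for
one loaded from a file without touching the matching code; an idiom whose two
halves span separate sentences would need matching over a window of
sentences instead.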

Thanks, Rich. No, this is not what I'm planning to do, though the method might
be applicable to my work as well.

Best,
Fatemeh

On Mon, Jan 31, 2011 at 12:39 AM, Rich Cooper
<rich at englishlogickernel.com> wrote:

>  Hi Fatemeh,
>
>
>
> Are you interested in developing a corpus of issued patents as recorded by
> the USPTO, which contain numerous large columns with unstructured text?  I
> have tools that will help you do that.
>
>
>
> If you have done so, you can then use another tool (in alpha condition now)
> called Linguistics Lab which text mines for exactly such MWEs in string
> format.
>
>
>
> Would that help you?
>
>
>
> -Rich
>
>
>
>
>
> Sincerely,
>
> Rich Cooper
>
> EnglishLogicKernel.com
>
> Rich AT EnglishLogicKernel DOT com
>
> 9 4 9 \ 5 2 5 - 5 7 1 2
>   ------------------------------
>
> *From:* corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] *On Behalf
> Of *Fatemeh Torabi Asr
> *Sent:* Sunday, January 30, 2011 5:39 AM
> *To:* corpora at uib.no
> *Subject:* [Corpora-List] MWE extraction from a desired text
>
>
>
>
> Dears,
>
> I wonder if anyone knows of software that takes a text as input and outputs
> a list of the sentences in which common Multi Word Expressions (MWEs)
> appear. I have already found some tools, but the underlying algorithm is also
> important to me. I don't want the algorithm to work based on
> frequencies in the input text; instead, it should [probably] have an offline,
> ready-made list of MWEs (or a similar data structure) against which it parses
> the text. Any kind of idiomatic expression (unusual ones, e.g., "by and large",
> or well-formed ones, e.g., "break one's heart") is acceptable.
>
> Best,
> Fatemeh
>
>
>
>
> --
> Fatemeh
>

-- 
Fatemeh
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora