[Corpora-List] MWE extraction from a desired text

Martin Reynaert reynaert at uvt.nl
Sun Jan 30 15:03:11 UTC 2011


Dear Fatemeh,

It seems to me that the Unix/Linux text utility 'grep' might do wonders
for you. Its -f option reads search patterns from a file, one per line,
so you can give it your ready-made offline list of MWEs. Because grep
returns whole matching lines, getting back only the sentences that match
requires that your corpus be split into one sentence per line first.
This is quite often done by tools called 'tokenizers'.
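
As a rough sketch of what this could look like, assuming a hypothetical
list file 'mwes.txt' with one expression per line and a hypothetical
corpus file 'corpus.txt' that has already been split into one sentence
per line (both file names and their contents are made up here purely
for illustration):

```shell
# Work in a throwaway directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical MWE list, one expression per line (no blank lines:
# an empty pattern would match every line of the corpus).
printf '%s\n' 'by and large' 'kick the bucket' > mwes.txt

# Hypothetical corpus, already split into one sentence per line.
printf '%s\n' \
  'By and large, the results were good.' \
  'The weather was fine.' \
  'He did not kick the bucket after all.' > corpus.txt

# -F: treat each pattern as a fixed string, not a regular expression
# -i: ignore case, so "By and large" matches "by and large"
# -f: read the patterns from the given file
matches=$(grep -F -i -f mwes.txt corpus.txt)

# Print only the sentences that contain at least one listed MWE.
printf '%s\n' "$matches"
```

Note that fixed-string matching only finds an expression exactly as it
is written in the list. An entry such as "break one's heart" would not
match "broke her heart", so inflected or variable expressions would
need extra list entries or regular-expression patterns.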

With some searching on the web you will dig up Windows versions of
'grep' and other indispensable text utilities ('tr', 'sed', etc.).

Welcome,

Martin Reynaert
ILK
Tilburg University


Fatemeh Torabi Asr wrote:
>
> Dears,
>
> I wonder if anyone knows of software that takes a text as input and
> outputs a list of the sentences in it in which common Multi Word
> Expressions (MWE) appear. I have already found some tools but the
> underlying algorithm is also important for me. I don't want the
> algorithm to work based on the frequencies in the input text but
> [probably] it should have an offline ready list of MWEs (or a similar
> data structure) based on which it parses the text. Any kind of
> idiomatic expression (unusual ones, e.g. "by and large", or well-formed
> ones, e.g. "break one's heart") is acceptable.
>
> Best,
> Fatemeh
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>   
