[Corpora-List] corpora with regular expression engine (syntactic pattern)
Serge Heiden
slh at ens-lyon.fr
Tue Mar 5 09:15:28 UTC 2013
Hi Olivier and Kevin,
A recent alternative to Xkwic, as a wrapper to the CQP search engine,
is TXM - http://sourceforge.net/projects/txm:
- it runs on Windows, Mac OS X and Linux
- its graphical user interface is available in English, Russian and French
- it is also available as web portal software (letting you give online
access to your own corpora, with built-in access control)
- it embeds the R software, so you can apply any statistical
model you can think of to CQP extractions
- it works hard to process all kinds of data formats: Unicode raw text,
XML, various flavours of TEI P5, Transcriber speech transcriptions,
TMX aligned corpora, native CWB...
- it runs TreeTagger for you on the fly when importing corpora
- it can comfortably handle corpora of up to 10 million words (currently)
- it is free and open-source
For more info:
- in English see <http://wiki.tei-c.org/index.php/TXM>
- a full one-day introductory tutorial screencast (in French)
<http://txm.sourceforge.net/enregistrement_atelier_initiation_TXM_fr.html>
- the scientific project background
<http://textometrie.ens-lyon.fr/?lang=en>
A last remark concerning the power of the CQP search engine.
It combines two different levels of regular expressions over words:
- a first level on the properties of a single word: its Part Of Speech
tag value, graphical form, lemma...
- a second level on word sequences
For example, an expression like: [pos="V.*"]+
can express any sequence of verbs of any length:
- "V.*" is at the first level (Part Of Speech tag value): any tag
beginning with the letter 'V' (ignoring sub-categories)
- [...]+ is at the second level (sequence of words): one, two, three...
adjacent verbs
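
To make the two levels concrete, here is a minimal Python sketch that
emulates the query above on a toy tagged sentence. It is not how CQP is
implemented: the token list, the tag names and the find_runs helper are
all made up for illustration. It also assumes the tag pattern must match
the whole tag value, and it reports maximal runs only, whereas CQP's own
matching strategy may report matches differently.

```python
import re

# Hypothetical toy corpus: (word, pos) pairs with illustrative tags.
TOKENS = [
    ("she", "PP"), ("might", "VM"), ("have", "VH"), ("been", "VB"),
    ("sleeping", "VVG"), ("when", "WRB"), ("he", "PP"), ("left", "VVD"),
]


def find_runs(tokens, tag_pattern):
    """Emulate [pos="<tag_pattern>"]+ by returning maximal runs of matching words."""
    tag_re = re.compile(tag_pattern)
    # Level 1: a regex over each token's tag value (whole-value match assumed).
    flags = "".join("1" if tag_re.fullmatch(pos) else "0" for _, pos in tokens)
    # Level 2: a regex over the token sequence; "1+" plays the role of [...]+.
    return [
        [word for word, _ in tokens[m.start():m.end()]]
        for m in re.finditer(r"1+", flags)
    ]


runs = find_runs(TOKENS, r"V.*")
print(runs)  # [['might', 'have', 'been', 'sleeping'], ['left']]
```

The second run contains a single verb, which shows that + at the sequence
level also accepts a sequence of length one.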
Best,
Serge
On 02/24/2013 08:02 PM, Kevin B. Cohen wrote:
> Hi, Olivier,
>
> If you're OK with English, the tgrep and Xkwic programs will allow
> you to do this. Both should work on a Mac. If you have trouble
> using them, two of my students wrote nice tutorials for them this
> past semester.
>
> Kev
>
> On Sun, Feb 24, 2013 at 5:29 AM, Olivier Austina
> <olivier.austina at gmail.com> wrote:
>> Hi,
>>
>> Is there a corpora which can be queried using Part Of Speech tags
>> in a regular expression? -- Regards Austina
>>
>>
>> _______________________________________________ UNSUBSCRIBE from
>> this page: http://mailman.uib.no/options/corpora Corpora mailing
>> list Corpora at uib.no http://mailman.uib.no/listinfo/corpora
>>
>
>