[Corpora-List] corpora with regular expression engine (syntactic pattern)

Serge Heiden slh at ens-lyon.fr
Tue Mar 5 09:15:28 UTC 2013


Hi Olivier and Kevin,

A recent alternative to Xkwic, as a wrapper to the CQP search engine,
is TXM - http://sourceforge.net/projects/txm:
- it runs on Windows, Mac OS X and Linux
- its graphical user interface is available in English, Russian and French
- it is also available as web portal software (allowing you to give online
   access to your own corpora, with access control built in)
- it embeds the R software, so you can apply any statistical
   model you can imagine to CQP extractions
- it works hard to process all kinds of data formats: Unicode raw text,
   XML, various flavours of TEI P5, Transcriber speech transcriptions,
   TMX aligned corpora, native CWB...
- it runs TreeTagger for you on the fly when importing corpora
- it can currently handle corpora of up to 10 million words reasonably well
- it is free and open-source

For more info:
- in English see <http://wiki.tei-c.org/index.php/TXM>
- a full one-day introductory tutorial screencast (in French):
<http://txm.sourceforge.net/enregistrement_atelier_initiation_TXM_fr.html>
- the scientific background of the project:
<http://textometrie.ens-lyon.fr/?lang=en>

A last remark concerning the power of the CQP search engine.
It combines two different levels of regular expressions over words:
- a first level on part-of-speech tag values, word forms,
lemmas...
- a second level on word sequences
For example, an expression like: [pos="V.*"]+
can express a sequence of verbs of any length:
- "V.*" is at the first level (part-of-speech tag value): any tag
beginning with the letter 'V' (ignoring sub-categories)
- [...]+ is at the second level (sequence of words): one, two, three...
adjacent verbs
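
To make the two levels concrete, here is a minimal Python sketch of the idea.
It is not how CQP is implemented, and it does not try to reproduce CQP's
matching strategies: it simply collects maximal runs of adjacent matching
tokens. The toy corpus, its tags and the find_runs helper are all hypothetical,
and the use of a whole-value match on the tag is an assumption of the sketch.

```python
import re

# Hypothetical toy corpus: each token carries a word form and a POS tag.
# The tags are illustrative only.
TOKENS = [
    {"word": "She", "pos": "PP"},
    {"word": "might", "pos": "MD"},
    {"word": "have", "pos": "VH"},
    {"word": "been", "pos": "VBN"},
    {"word": "sleeping", "pos": "VVG"},
    {"word": ".", "pos": "SENT"},
    {"word": "Dogs", "pos": "NNS"},
    {"word": "bark", "pos": "VVP"},
]


def find_runs(tokens, attr, pattern):
    """Return maximal runs of adjacent tokens whose `attr` matches `pattern`."""
    # First level: a regular expression applied to one attribute value
    # of a single token (assumed here to match the whole value).
    token_re = re.compile(pattern)
    runs, current = [], []
    for tok in tokens:
        if token_re.fullmatch(tok[attr]):
            # Second level: the "+" quantifier, i.e. extend the current
            # sequence for as long as adjacent tokens keep matching.
            current.append(tok["word"])
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


# Rough analogue of the query [pos="V.*"]+ on the toy corpus.
verb_runs = find_runs(TOKENS, "pos", r"V.*")
print(verb_runs)
```

Changing the attribute name or the pattern (for instance matching on "word"
instead of "pos") changes only the first level; the sequence logic stays
the same.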

Best,
Serge

On 02/24/2013 08:02 PM, Kevin B. Cohen wrote:
> Hi, Olivier,
 >
 > If you're OK with English, the tgrep and Xkwic programs will allow
 > you to do this. Both should work on a Mac. If you have trouble
 > using them, two of my students wrote nice tutorials for them this
 > past semester.
 >
 > Kev
 >
 > On Sun, Feb 24, 2013 at 5:29 AM, Olivier Austina
 > <olivier.austina at gmail.com> wrote:
 >> Hi,
 >>
 >> Is there a corpora which can be queried using Part Of Speech tags
 >> in a regular expression? -- Regards Austina
 >>
 >>
 >> _______________________________________________ UNSUBSCRIBE from
 >> this page: http://mailman.uib.no/options/corpora Corpora mailing
 >> list Corpora at uib.no http://mailman.uib.no/listinfo/corpora
 >>
 >
 >

