[Corpora-List] Using WebCorp

Andrew Kehoe Andrew.Kehoe at bcu.ac.uk
Tue May 13 09:54:37 UTC 2008


Hi Imtiaz
 
The current version of WebCorp (http://www.webcorp.org.uk/) relies on standard search engines such as Google to access the web, adding layers of refinement specifically for linguistic analysis.  This means that the 'corpus' you are searching is not Part-of-Speech tagged and, thus, you cannot run the type of search you suggest.
 
However, we are currently working on the new WebCorp Linguist's Search Engine, which crawls the web, downloading texts and building structured, POS-tagged corpora. Using this system it is possible to search for your pattern as:
 
'the {ADJ*} man and woman'
 
You can see a screenshot of the first 20 results from a small test corpus at http://www.webcorp.org.uk/WebCorpLSE.gif
 
The WebCorp LSE prototype is currently being beta tested by volunteers from the community.  For more information please visit http://wse1.webcorp.org.uk/preview/
 
Best wishes
 
Andrew Kehoe
Research & Development Unit for English Studies
Birmingham City University
http://rdues.bcu.ac.uk/
 
http://www.webcorp.org.uk/ 

________________________________

From: corpora-bounces at uib.no on behalf of Khan, I. H.
Sent: Mon 12/05/2008 2:54 PM
To: corpora at uib.no
Subject: [Corpora-List] Using WebCorp


Hi
 
Does anyone know how to use regular expressions (with Part-of-Speech tags) in the WebCorp corpus? 
For example, to extract phrases of the form 'the Adj man and woman' from the BNC, I can use the following regexp in SkE:
 
[word = "the"]  [tag = "AJ.*"] [word = "man"] [word = "and"] [word = "woman"];
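
To make concrete what a query of this shape asks for, here is a minimal Python sketch. It is an illustration only, not how SkE or WebCorp work internally. It assumes a corpus already held as (word, tag) pairs with BNC-style tags, in which adjective tags begin with "AJ", and that each regular expression must match the whole word or tag. The names TAGGED, PATTERN and find_matches are hypothetical, and the sample sentence is invented.

```python
import re

# Hypothetical hand-tagged sample: (word, tag) pairs with BNC-style tags.
TAGGED = [
    ("the", "AT0"), ("old", "AJ0"), ("man", "NN1"), ("and", "CJC"),
    ("woman", "NN1"), ("saw", "VVD"), ("the", "AT0"), ("man", "NN1"),
    ("and", "CJC"), ("woman", "NN1"),
]

# One (attribute, regex) constraint per token position, mirroring the query.
PATTERN = [
    ("word", "the"), ("tag", "AJ.*"), ("word", "man"),
    ("word", "and"), ("word", "woman"),
]

def find_matches(tokens, pattern):
    """Return the word sequences where every position meets its constraint."""
    hits = []
    # Slide a window of the pattern's length over the token list.
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        ok = all(
            # Assumption: the regex must match the entire word or tag.
            re.fullmatch(regex, word if attr == "word" else tag)
            for (word, tag), (attr, regex) in zip(window, pattern)
        )
        if ok:
            hits.append(" ".join(word for word, _ in window))
    return hits

print(find_matches(TAGGED, PATTERN))
```

On this sample only the first five tokens match; the second "the man and woman" is skipped because no adjective stands between "the" and "man". Without tags on the tokens, the second constraint cannot be checked at all.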
 
Any help?
 
Regards
 
 
Imtiaz
