[Corpora-List] XML concordancing query

Thu May 12 10:51:50 UTC 2011

Hi Ciaran,

In Corpus Workbench (CWB: http://cwb.sourceforge.net/ ) you would handle this sort of thing by conceiving of it in terms of different streams of annotation. You are essentially treating the wordform minus the marked-up chunks as a kind of "lemma" that groups multiple wordforms. This lemma can be considered an annotation on the raw form of the tokens.

You would code your corpus so that the "ignored" data was *included* in the main word, but *excluded* in an annotation (which we'll call "lemma" just for the sake of argument although it may not be quite the same thing). Your example data could be represented as follows in CWB columnar format:

...
abc	abc
def	def
gxyzh	gh
pqijkl	ijkl
...

(that's a relatively easy transformation with either XSLT or a little reformatting script using regex or whatever)
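
For what it's worth, a minimal sketch of such a reformatting script in Python might look like the following. It uses a regex over the raw text rather than a real XML parser, and it assumes the ignored strings contain no whitespace and the spans are not nested. The names `to_columns` and `IGNORE_SPAN` and the two placeholder characters are just illustrative choices, not anything CWB requires.

```python
import re

# Matches the markup from the original question; the class name "ignore"
# comes from that example and would need adjusting for real data.
IGNORE_SPAN = re.compile(r'<span\s+class="ignore">(.*?)</span>', re.DOTALL)

# Private-use placeholder characters, assumed not to occur in the text.
START, END = "\ue000", "\ue001"


def to_columns(text):
    """Return (word, lemma) pairs: word keeps the ignored material, lemma drops it."""
    # Swap the tags for single-character markers so that the space inside
    # <span class="ignore"> cannot split a token.
    marked = IGNORE_SPAN.sub(lambda m: START + m.group(1) + END, text)
    pairs = []
    for token in marked.split():
        word = token.replace(START, "").replace(END, "")
        lemma = re.sub(START + ".*?" + END, "", token)
        pairs.append((word, lemma))
    return pairs


sample = ('abc   def   g<span class="ignore">xyz</span>h   '
          '<span class="ignore">pq</span>ijkl')

# One token per line, columns separated by a tab.
for word, lemma in to_columns(sample):
    print(word + "\t" + lemma)
```

A token consisting entirely of ignored material would come out with an empty second column, so you would need to decide how to treat that case before indexing.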

Once that is indexed into CWB, if you use CWB's Corpus Query Processor (CQP) to run the query 

   [lemma="ijkl" | lemma="gh"] 

then the concordance result would look something like this:

... abc def <<gxyzh>> pqijkl ....
... def gxyzh <<pqijkl>> ....

(you can also set up CQP to display the annotation in the concordance instead/as well, if you prefer)
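
To make the matching logic concrete, here is a rough plain-Python emulation of what that query does over the two columns: match on the "lemma" column, display the word column. It is only an illustration of the idea, not how CQP works internally, and the function name `concordance` and the `width` parameter are made up for the example.

```python
def concordance(tokens, lemmas, width=2):
    """tokens: list of (word, lemma) pairs; lemmas: set of lemma values to match.

    Returns one line per hit, showing up to `width` words of context on
    each side and the matched word form inside << >>.
    """
    lines = []
    for i, (word, lemma) in enumerate(tokens):
        if lemma in lemmas:
            left = [w for w, _ in tokens[max(0, i - width):i]]
            right = [w for w, _ in tokens[i + 1:i + 1 + width]]
            lines.append(" ".join(left + ["<<" + word + ">>"] + right))
    return lines


tokens = [("abc", "abc"), ("def", "def"), ("gxyzh", "gh"), ("pqijkl", "ijkl")]

for line in concordance(tokens, {"ijkl", "gh"}):
    print(line)
```

The hits are found via the second column, but the context is built from the first, which is why the ignored material still shows up in the concordance lines.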

Hope that helps

best

Andrew.

Andrew Hardie
Linguistics & English Language
County South
Lancaster University
Lancaster LA1 4YL
United Kingdom

http://www.ling.lancs.ac.uk/staff/hardie

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Ciarán Ó Duibhín
Sent: 12 May 2011 10:48
To: Corpora list
Subject: [Corpora-List] XML concordancing query

Hi. I hope someone can save me doing a little research into concordance 
programs!  I'm looking for one which can do this.

My XML-tagged text will have many short strings tagged in a particular way, 
it might possibly be <span class="ignore">xyz</span>

The thing about these tagged strings is that they are to be dropped when the 
text is being divided into tokens.  So, if the text contains

           abc   def   g<span class="ignore">xyz</span>h   <span class="ignore">pq</span>ijkl

then the tokens are:   abc def gh ijkl   and these should figure as 
concordance headwords.

The ignored material should however be included in the contexts (without the 
tagging), so this piece of text would give a token of "gxyzh" appearing 
under type "gh", and a token of "pqijkl" appearing under type "ijkl".

Thanks for your help,

Ciarán Ó Duibhín.

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
