[Corpora-List] XML concordancing query
Hardie, Andrew
a.hardie at lancaster.ac.uk
Thu May 12 10:51:50 UTC 2011
Hi Ciaran,
In Corpus Workbench (CWB: http://cwb.sourceforge.net/ ) you would handle this sort of thing by conceiving of it in terms of different streams of annotation. You are essentially treating the wordform minus the marked-up chunks as a kind of "lemma" that groups multiple wordforms. This lemma can be considered an annotation on the raw form of the tokens.
You would code your corpus so that the "ignored" data was *included* in the main word, but *excluded* in an annotation (which we'll call "lemma" just for the sake of argument although it may not be quite the same thing). Your example data could be represented as follows in CWB columnar format:
...
abc abc
def def
gxyzh gh
pqijkl ijkl
...
(That is a relatively easy transformation, using either XSLT or a short reformatting script based on regular expressions.)
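For what it's worth, here is a minimal sketch in Python of the kind of reformatting script meant above. It is only an illustration under some assumptions: the markup is exactly <span class="ignore">...</span>, ignored strings never contain whitespace, and tokens are separated by whitespace. The function name to_vertical and the sentinel characters are made up for the example. It prints one token per line with tab-separated columns, which is what CWB's encoder expects as input.

```python
import re

# Matches the ignore markup, allowing a line break inside the opening tag.
IGNORE_SPAN = re.compile(r'<span\s+class="ignore">(.*?)</span>', re.DOTALL)

# Private-use characters that temporarily mark ignored material
# (assumed never to occur in the real text).
START, END = "\ue000", "\ue001"
MARKED = re.compile(START + ".*?" + END)


def to_vertical(text):
    """Return (word, lemma) pairs: word keeps ignored material, lemma drops it."""
    # Replace each tagged span with sentinel-wrapped content so that
    # whitespace inside the tag itself cannot split a token.
    marked = IGNORE_SPAN.sub(lambda m: START + m.group(1) + END, text)
    rows = []
    for chunk in marked.split():
        word = chunk.replace(START, "").replace(END, "")
        lemma = MARKED.sub("", chunk)
        rows.append((word, lemma))
    return rows


sample = ('abc def g<span class="ignore">xyz</span>h <span\n'
          'class="ignore">pq</span>ijkl')

for word, lemma in to_vertical(sample):
    print(f"{word}\t{lemma}")
```

A token that consists entirely of ignored material would come out with an empty second column here; you would need to decide how to treat that case before indexing.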
Once that is indexed into CWB, if you use CWB's Corpus Query Processor (CQP) to run the query
[lemma="ijkl" | lemma="gh"]
then the concordance result would look something like this:
... abc def <<gxyzh>> pqijkl ....
... def gxyzh <<pqijkl>> ....
(You can also set up CQP to display the annotation in the concordance, either instead of or alongside the word form, if you prefer.)
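To make the logic of that query concrete outside CWB, here is a small Python sketch of the same idea: match on the "lemma" column, display the word column. This is not CQP and not how CWB works internally; the function name kwic and the width parameter are invented for the illustration.

```python
def kwic(rows, lemmas, width=2):
    """Build concordance lines for tokens whose lemma is in `lemmas`.

    rows  : list of (word, lemma) pairs in text order
    width : number of context tokens shown on each side
    """
    lines = []
    for i, (word, lemma) in enumerate(rows):
        if lemma in lemmas:
            # Context is drawn from the word column, so ignored
            # material stays visible around the match.
            left = [w for w, _ in rows[max(0, i - width):i]]
            right = [w for w, _ in rows[i + 1:i + 1 + width]]
            lines.append(" ".join(left + [f"<<{word}>>"] + right))
    return lines


rows = [("abc", "abc"), ("def", "def"), ("gxyzh", "gh"), ("pqijkl", "ijkl")]

for line in kwic(rows, {"gh", "ijkl"}):
    print(line)
```

With the four example tokens this produces the same two matches as the concordance shown above.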
Hope that helps
best
Andrew.
Andrew Hardie
Linguistics & English Language
County South
Lancaster University
Lancaster LA1 4YL
United Kingdom
http://www.ling.lancs.ac.uk/staff/hardie
-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Ciarán Ó Duibhín
Sent: 12 May 2011 10:48
To: Corpora list
Subject: [Corpora-List] XML concordancing query
Hi. I hope someone can save me doing a little research into concordance
programs! I'm looking for one which can do this.
My XML-tagged text will have many short strings tagged in a particular way,
it might possibly be <span class="ignore">xyz</span>
The thing about these tagged strings is that they are to be dropped when the
text is being divided into tokens. So, if the text contains
abc def g<span class="ignore">xyz</span>h <span
class="ignore">pq</span>ijkl
then the tokens are: abc def gh ijkl and these should figure as
concordance headwords.
The ignored material should however be included in the contexts (without the
tagging), so this piece of text would give a token of "gxyzh" appearing
under type "gh", and a token of "pqijkl" appearing under type "ijkl".
Thanks for your help,
Ciarán Ó Duibhín.
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora