[Corpora-List] Search tool for XCES-encoded parallel corpora?
Mickel Grönroos
mickel.gronroos at masterin.com
Fri Sep 23 12:11:22 UTC 2005
Hello!
I am looking for a corpus search tool that could be used for querying a
parallel corpus tagged in XCES format. All operating systems and programming
languages will do. Does anybody know if such a tool exists, or do I need to
code it myself?
Basically what I want to be able to do is say something like: "Look for the
word X in language A using my set of sentence align files N. Show me all
sentences in language A and language B where X occurs."
What I have is three files, one file with the text in language A, another
with the text in language B and finally a file with the alignment markup
aligning the A sentences with the B sentences.
This is what it looks like:
exampledoc_A.xml:
[...]
<p id="p1">
<s id="p1s1">Aktia nostaa Prime-korkoaan.</s>
<s id="p1s2">Aktia Säästöpankki Oyj:n johtoryhmä on tänään päättänyt
nostaa Prime-korkoa 0,5 prosenttiyksiköllä.</s>
</p>
[...]
exampledoc_B.xml:
[...]
<p id="p1">
<s id="p1s1">Aktia höjer sin Prime-ränta.</s>
<s id="p1s2">Aktia Sparbank Abp:s ledningsgrupp har i dag beslutat att
höja Prime-räntan med 0,5 procentenheter.</s>
</p>
[...]
examplealign.xml:
[...]
<translations>
<translation trans.loc="exampledoc_A.xml" wsd="iso-8859-1" lang="fi"
xml:lang="fi" n="1" />
<translation trans.loc="exampledoc_B.xml" wsd="iso-8859-1" lang="sv"
xml:lang="sv" n="2" />
</translations>
[...]
<linkList>
<linkGrp targType="s">
<link>
<align xlink:href="#p1s1" />
<align xlink:href="#p1s1" />
</link>
<link>
<align xlink:href="#p1s2" />
<align xlink:href="#p1s2" />
</link>
</linkGrp>
</linkList>
[...]
I want to be able to say:
xces_search --searchlanguage=sv 'höjer' examplealign.xml
What I want to get is:
Aktia höjer sin Prime-ränta.
Aktia nostaa Prime-korkoaan.
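For illustration, a minimal Python sketch of such a lookup, using only the
standard library. It is not an existing tool, and it rests on several
assumptions that the samples above do not confirm: the xlink prefix is bound
to the standard XLink namespace, the <align> children of each <link> appear
in the same order as the <translation> elements, the three files are
well-formed XML, and sentence ids are unique within a file. The names
sentences_by_id, search and load are hypothetical.

```python
import os
import re
import xml.etree.ElementTree as ET

# Assumption: xlink is bound to the standard XLink namespace URI.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def sentences_by_id(root):
    """Map each <s> id to its text, with line breaks collapsed to spaces."""
    return {
        el.get("id"): " ".join("".join(el.itertext()).split())
        for el in root.iter()
        if local(el.tag) == "s"
    }


def search(align_root, docs, search_lang, word):
    """Return one list per matching link: the hit sentence, then its translations.

    docs maps a language code ("fi", "sv") to the parsed root of that text file.
    """
    # Assumption: <align> order inside a <link> follows <translation> order.
    langs = [el.get("lang") for el in align_root.iter()
             if local(el.tag) == "translation"]
    sents = {lang: sentences_by_id(docs[lang]) for lang in langs}
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")  # whole-word match
    hits = []
    for link in align_root.iter():
        if local(link.tag) != "link":
            continue
        ids = [a.get(XLINK_HREF, "").lstrip("#")
               for a in link if local(a.tag) == "align"]
        by_lang = dict(zip(langs, ids))
        source = sents[search_lang].get(by_lang.get(search_lang), "")
        if pattern.search(source):
            others = [sents[lang].get(by_lang.get(lang), "")
                      for lang in langs if lang != search_lang]
            hits.append([source] + others)
    return hits


def load(align_path):
    """Parse the alignment file and the text files named in trans.loc."""
    align_root = ET.parse(align_path).getroot()
    base = os.path.dirname(align_path)
    docs = {
        el.get("lang"): ET.parse(os.path.join(base, el.get("trans.loc"))).getroot()
        for el in align_root.iter()
        if local(el.tag) == "translation"
    }
    return align_root, docs
```

With the three example files in one directory, the query above would be
search(*load("examplealign.xml"), "sv", "höjer"), printing each returned
group line by line. Matching is case-sensitive and on whole words only, so
"höj" alone would not match "höjer".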
Any ideas?
Best regards,
Mickel Grönroos
--
Mickel Grönroos, project manager, mickel.gronroos at masterin.com,
+358 9 2517 4562
Master's Innovations Ltd., Tekniikantie 14, FIN-02150 Espoo, Finland,
www.masterin.com