[Corpora-List] Search tool for XCES-encoded parallel corpora?

Joerg Tiedemann tiedeman at let.rug.nl
Fri Sep 23 16:04:21 UTC 2005


In OPUS we use the Corpus Workbench (CWB) from IMS Stuttgart. Our sentence 
alignments are also stored in XCES, and querying the data (two or more 
languages) works fine. Have a look at the OPUS homepage: http://logos.uio.no/opus
and the query form: http://logos.uio.no/cgi-bin/opus/opuscqp.pl
There are also some scripts you might want to use for converting your XML 
data to CWB input. Check the CVS repository linked from the OPUS homepage and 
look in the scripts directory. Don't hesitate to ask if you have any questions.
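
To give a rough idea of what "CWB input" means: CWB reads verticalized text, 
that is, one token per line with XML-style tags on lines of their own. The 
sketch below is not one of the OPUS scripts. The function name to_vertical 
and the naive regular-expression tokenizer are assumptions made purely for 
illustration, and it expects a complete, well-formed document with <s> 
elements and no default XML namespace.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Naive tokenizer (an assumption, not the OPUS one): words, optionally joined
# by '-', ':' or ',' (Prime-ränta, Abp:s, 0,5), or single punctuation marks.
TOKEN = re.compile(r"\w+(?:[-:,]\w+)*|[^\w\s]")


def to_vertical(source):
    """Return the <s> elements of one document as verticalized text."""
    root = ET.parse(source).getroot()
    lines = []
    for s in root.iter("s"):
        # Keep the sentence id so that alignments can refer to it later.
        lines.append("<s id=%s>" % quoteattr(s.get("id", "")))
        lines.extend(TOKEN.findall("".join(s.itertext())))
        lines.append("</s>")
    return "\n".join(lines) + "\n"
```

Output of this shape is what the CWB encoding tools expect as input. The 
sentence alignment has to be converted and encoded in a separate step, which 
is not shown here.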

best,


Jörg Tiedemann

***********/\/\/\/\/\/\/\/\/\/\/\************************************
**  Jörg Tiedemann                 tiedeman at let.rug.nl             **
**  Alfa-Informatica               http://www.let.rug.nl/~tiedeman **  
**  Rijksuniversiteit Groningen     Harmoniegebouw, room 1311-429  **
**  Oude Kijk in 't Jatstraat 26    phone: +31 (0)50-363 5935      **
**  9712 EK Groningen               fax:   +31 (0)50-363 6855      **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********

On Fri, 23 Sep 2005, Mickel Grönroos wrote:

> Hello!
> 
> I am looking for a corpus search tool that could be used for querying a
> parallel corpus tagged in XCES format. All operating systems and programming
> languages will do. Does anybody know if such a tool exists or do I need to
> code it myself?
> 
> Basically what I want to be able to do is say something like: "Look for the
> word X in language A using my set of sentence align files N. Show me all
> sentences in language A and language B where X occurs."
> 
> What I have is three files, one file with the text in language A, another
> with the text in language B and finally a file with the alignment markup
> aligning the A sentences with the B sentences.
> 
> This is what it looks like:
> 
> exampledoc_A.xml:
> [...]
> <p id="p1">
>   <s id="p1s1">Aktia nostaa Prime-korkoaan.</s>
>   <s id="p1s2">Aktia Säästöpankki Oyj:n johtoryhmä on tänään päättänyt
> nostaa Prime-korkoa 0,5 prosenttiyksiköllä.</s>
> </p>
> [...]
> 
> exampledoc_B.xml:
> [...]
> <p id="p1">
>   <s id="p1s1">Aktia höjer sin Prime-ränta.</s>
>   <s id="p1s2">Aktia Sparbank Abp:s ledningsgrupp har i dag beslutat att
> höja Prime-räntan med 0,5 procentenheter.</s>
> </p>
> [...]
> 
> examplealign.xml:
> [...]
> <translations>
>   <translation trans.loc="exampledoc_A.xml" wsd="iso-8859-1" lang="fi"
> xml:lang="fi" n="1" />
>   <translation trans.loc="exampledoc_B.xml" wsd="iso-8859-1" lang="sv"
> xml:lang="sv" n="2" />
> </translations>
> [...]
> <linkList>
>   <linkGrp targType="s">
>     <link>
>       <align xlink:href="#p1s1" />
>       <align xlink:href="#p1s1" />
>     </link>
>     <link>
>       <align xlink:href="#p1s2" />
>       <align xlink:href="#p1s2" />
>     </link>
>   </linkGrp>
> </linkList>
> [...]
> 
> I want to be able to say:
> 
> xces_search --searchlanguage=sv 'höjer' examplealign.xml
> 
> What I want to get is:
> Aktia höjer sin Prime-ränta.
> Aktia nostaa Prime-korkoaan.
> 
> Any ideas?
> 
> Best regards,
> 
> Mickel Grönroos
> 
> --
> Mickel Grönroos, project manager, mickel.gronroos at masterin.com, +358 9 2517
> 4562
> Master's Innovations Ltd., Tekniikantie 14, FIN-02150 Espoo, Finland,
> www.masterin.com
> 
> 
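
For a small corpus, the lookup described in the quoted message can also be 
sketched directly in Python, without CWB. This is only a minimal sketch under 
several assumptions: the three files are complete, well-formed XML without a 
default namespace; the xlink prefix is declared with the standard XLink 
namespace URI; and the order of the <align> elements inside each <link> 
follows the n attribute of the <translation> elements. The names xces_search, 
load_sentences and open_doc are hypothetical.

```python
import string
import xml.etree.ElementTree as ET

# Attribute name of xlink:href once the prefix is resolved by the parser.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def load_sentences(source):
    """Map sentence id -> whitespace-normalized text for each <s> element."""
    root = ET.parse(source).getroot()
    return {s.get("id"): " ".join("".join(s.itertext()).split())
            for s in root.iter("s")}


def xces_search(align_source, word, search_lang, open_doc=lambda loc: loc):
    """Return aligned sentence groups whose search_lang side contains word.

    open_doc turns a trans.loc value into something ET.parse accepts
    (by default the value is used as a file name).
    """
    root = ET.parse(align_source).getroot()
    # Documents ordered by n; assumed to match the order of <align> elements.
    docs = sorted(root.iter("translation"), key=lambda t: int(t.get("n")))
    langs = [d.get("lang") for d in docs]
    texts = [load_sentences(open_doc(d.get("trans.loc"))) for d in docs]
    pos = langs.index(search_lang)

    hits = []
    for link in root.iter("link"):
        ids = [a.get(XLINK_HREF).lstrip("#") for a in link.findall("align")]
        sents = [texts[i].get(sid, "") for i, sid in enumerate(ids)]
        # Exact token match after stripping surrounding punctuation.
        tokens = [t.strip(string.punctuation) for t in sents[pos].split()]
        if word in tokens:
            # Searched language first, then the other language(s).
            hits.append([sents[pos]] + sents[:pos] + sents[pos + 1:])
    return hits
```

With complete versions of the three example files, a call such as 
xces_search("examplealign.xml", "höjer", "sv") would return the Swedish 
sentence followed by its Finnish counterpart. Everything is held in memory 
and only exact word forms are matched, so for larger corpora or real query 
syntax an indexed tool such as CWB is the better fit.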


