[Corpora-List] Search tool for XCES-encoded parallel corpora?

Lars Nygaard lars.nygaard at iln.uio.no
Fri Sep 23 13:06:18 UTC 2005


Mickel,

You can use the excellent Corpus Workbench 
(http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/). It takes a 
bit of work to learn, but it is very flexible and powerful.

To convert your XCES files to CWB format, use the attached Perl scripts, 
from Jörg Tiedemann's equally excellent UPLUG package.


best,
  Lars Nygaard,
  The Text Laboratory, University of Oslo


Mickel Grönroos wrote:
> Hello!
> 
> I am looking for a corpus search tool that could be used for querying a
> parallel corpus tagged in XCES format. All operating systems and programming
> languages will do. Does anybody know if such a tool exists, or do I need to
> code it myself?
> 
> Basically what I want to be able to do is say something like: "Look for the
> word X in language A using my set of sentence align files N. Show me all
> sentences in language A and language B where X occurs."
> 
> What I have is three files, one file with the text in language A, another
> with the text in language B and finally a file with the alignment markup
> aligning the A sentences with the B sentences.
> 
> This is what it looks like:
> 
> exampledoc_A.xml:
> [...]
> <p id="p1">
>   <s id="p1s1">Aktia nostaa Prime-korkoaan.</s>
>   <s id="p1s2">Aktia Säästöpankki Oyj:n johtoryhmä on tänään päättänyt
> nostaa Prime-korkoa 0,5 prosenttiyksiköllä.</s>
> </p>
> [...]
> 
> exampledoc_B.xml:
> [...]
> <p id="p1">
>   <s id="p1s1">Aktia höjer sin Prime-ränta.</s>
>   <s id="p1s2">Aktia Sparbank Abp:s ledningsgrupp har i dag beslutat att
> höja Prime-räntan med 0,5 procentenheter.</s>
> </p>
> [...]
> 
> examplealign.xml:
> [...]
> <translations>
>   <translation trans.loc="exampledoc_A.xml" wsd="iso-8859-1" lang="fi"
> xml:lang="fi" n="1" />
>   <translation trans.loc="exampledoc_B.xml" wsd="iso-8859-1" lang="sv"
> xml:lang="sv" n="2" />
> </translations>
> [...]
> <linkList>
>   <linkGrp targType="s">
>     <link>
>       <align xlink:href="#p1s1" />
>       <align xlink:href="#p1s1" />
>     </link>
>     <link>
>       <align xlink:href="#p1s2" />
>       <align xlink:href="#p1s2" />
>     </link>
>   </linkGrp>
> </linkList>
> [...]
> 
> I want to be able to say:
> 
> xces_search --searchlanguage=sv 'höjer' examplealign.xml
> 
> What I want to get is:
> Aktia höjer sin Prime-ränta.
> Aktia nostaa Prime-korkoaan.
> 
> Any ideas?
> 
> Best regards,
> 
> Mickel Grönroos
> 
> --
> Mickel Grönroos, project manager, mickel.gronroos at masterin.com, +358 9 2517
> 4562
> Master's Innovations Ltd., Tekniikantie 14, FIN-02150 Espoo, Finland,
> www.masterin.com
> 
> 
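
For readers who only need the simple lookup described in the quoted message, 
here is a minimal Python sketch of it. It is not the attached Perl scripts and 
not part of CWB or UPLUG. Several things are assumptions: the function names 
(search, sentences_by_id), the root elements <doc> and <cesAlign> used in the 
demo data, an xlink namespace declared on the alignment root, no default 
namespace on the elements, and the convention that the k-th <align> in each 
<link> refers to the <translation> with n="k".

```python
import re
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def sentences_by_id(root):
    """Map each <s> element's id to its whitespace-normalised text."""
    return {s.get("id"): " ".join("".join(s.itertext()).split())
            for s in root.iter("s")}


def search(align_root, docs, search_lang, word):
    """Yield (hit sentence, aligned sentence) pairs.

    docs maps a trans.loc file name to the parsed root of that file.
    Assumes exactly two <translation> entries, and that the k-th <align>
    in every <link> refers to the <translation> with n="k".
    """
    translations = sorted(align_root.iter("translation"),
                          key=lambda t: int(t.get("n")))
    langs = [t.get(XML_LANG) for t in translations]
    texts = [sentences_by_id(docs[t.get("trans.loc")]) for t in translations]
    src = langs.index(search_lang)
    tgt = 1 - src
    for link in align_root.iter("link"):
        ids = [a.get(XLINK_HREF).lstrip("#") for a in link.findall("align")]
        hit = texts[src].get(ids[src], "")
        # Whole-word, case-insensitive match on the searched side.
        if word.lower() in re.findall(r"\w+", hit.lower()):
            yield hit, texts[tgt].get(ids[tgt], "")


# Demo data: the excerpts from the message, wrapped in hypothetical roots.
DOCS = {
    "exampledoc_A.xml": ET.fromstring(
        '<doc><p id="p1">'
        '<s id="p1s1">Aktia nostaa Prime-korkoaan.</s>'
        '<s id="p1s2">Aktia Säästöpankki Oyj:n johtoryhmä on tänään päättänyt\n'
        'nostaa Prime-korkoa 0,5 prosenttiyksiköllä.</s>'
        '</p></doc>'),
    "exampledoc_B.xml": ET.fromstring(
        '<doc><p id="p1">'
        '<s id="p1s1">Aktia höjer sin Prime-ränta.</s>'
        '<s id="p1s2">Aktia Sparbank Abp:s ledningsgrupp har i dag beslutat att\n'
        'höja Prime-räntan med 0,5 procentenheter.</s>'
        '</p></doc>'),
}
ALIGN_ROOT = ET.fromstring(
    '<cesAlign xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<translations>'
    '<translation trans.loc="exampledoc_A.xml" xml:lang="fi" n="1" />'
    '<translation trans.loc="exampledoc_B.xml" xml:lang="sv" n="2" />'
    '</translations>'
    '<linkList><linkGrp targType="s">'
    '<link><align xlink:href="#p1s1" /><align xlink:href="#p1s1" /></link>'
    '<link><align xlink:href="#p1s2" /><align xlink:href="#p1s2" /></link>'
    '</linkGrp></linkList>'
    '</cesAlign>')

for hit, aligned in search(ALIGN_ROOT, DOCS, "sv", "höjer"):
    print(hit)
    print(aligned)
```

For real files, the roots would come from ET.parse(path).getroot() instead of 
the inline strings. This scans every link linearly, which is fine for small 
corpora; for anything large, an indexed tool such as CWB is the better fit.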

-------------- next part --------------
A non-text attachment was scrubbed...
Name: align2cwb.pl
Type: text/x-perl
Size: 4517 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20050923/3c02d474/attachment-0002.pl>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xml2cwb.pl
Type: text/x-perl
Size: 5760 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20050923/3c02d474/attachment-0003.pl>

