[Corpora-List] Aligner for ParaConc? - summary

Tue Sep 3 09:20:13 UTC 2002

Dear all,

Some time ago I asked for an aligner that could be used with ParaConc. I 
got two replies and a request for a summary. Unfortunately I do not have 
time for a proper summary; instead, I have attached the original message and 
the replies I got. I would like to thank Martin Wynne
and Raphael Salkie for their assistance.

By the way, after I had sent my request I learned about a free TM 
program called Wordfast. The program is fully integrated into MS Word and 
it seems to me an excellent tool, considering it is freeware. (This is 
not a paid advertisement, just my personal opinion!) Wordfast has an 
add-on called +Tools, which includes an aligner, also based on MS Word. The 
aligner automates some things that you would otherwise do manually in Word 
(such as breaking text into sentences and line numbering), but I am afraid 
the aligning method is not too intelligent: a lot of work must still be done 
manually. However, it is one possibility worth mentioning. And, for other 
corpus fans and enthusiasts, Wordfast also comes with a pretty fast but 
modest concordancer :-)  Both Wordfast and +Tools can be downloaded 
from the following URL:  http://www.champollion.net/

sincerely,
sampo

The original message below:
-------------------------------------------------------------------------------------------------------------------------------
I wonder if there are any (freely available) alignment tools to be used with 
ParaConc? That is, the aligner should let users save the original and 
target texts into separate files. I know there is an aligner in the WS 
Tools pack, but for some reason the program tends to "re-join" the 
sentences you already "un-joined"... Well, you can use the WSTools Aligner 
if you get the job done at once, in one go, without saving and re-opening 
the files. (I don't know whether it's my fault - I cannot use the program 
correctly - or there's a bug in the prog.) I also know there are alignment 
tools for "filling up" translation memories (e.g. Trans Suite 2000 Align, 
which is distributed freely), but they seem not to have an option of saving 
the source and the target texts into separate files. OK, I could save the 
output file as a text file with a separator between the segments, then open 
it in Excel using these separators as column separators, and, finally, save 
each column as a separate text file... but this makes a simple task too 
complicated, IMHO.  So, could someone help me find an aligner 
(preferably Windows GUI, to be used in a classroom) that would simply split 
the texts into sentences and let the user correct the alignment by joining 
and unjoining sentences? The program should then save the files into 
separate (ASCII) text files. Many thanks in advance for your tips and advice!
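
The Excel detour described above (split on a separator, save each column) 
could also be scripted. What follows is only a minimal Python sketch, 
assuming the aligner's text export has one segment pair per line with a tab 
between source and target. The function name split_columns and the tab 
separator are illustrative assumptions, not features of any tool mentioned 
here.

```python
def split_columns(text, separator="\t"):
    """Split 'source<sep>target' lines into two parallel lists."""
    sources, targets = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines so both lists stay in step
        # partition() splits on the first separator only
        source, _, target = line.partition(separator)
        sources.append(source.strip())
        targets.append(target.strip())
    return sources, targets


# Hypothetical two-pair export used for illustration
sample = "The house is red.\tDas Haus ist rot.\nI like tea.\tIch mag Tee.\n"
sources, targets = split_columns(sample)
```

Each list can then be written out with one segment per line, so that line n 
of the source file corresponds to line n of the target file.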
----------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------
From: Martin Wynne <martin.wynne at ota.ahds.ac.uk>
To: "'Sampo Nevalainen'" <samponev at cc.joensuu.fi>
-----------------------------------------------------------------------------------------------------
I have used a simple Perl aligner written by Pernilla Danielsson and Daniel 
Ridings. When I taught with Pernilla on a course at the Tuscan Word Centre 
we used this program (which she calls the "vanilla aligner") to align texts 
specifically to use with ParaConc, so I know it can do this job. We may 
have done a bit of tweaking on the output. You can contact her on 
pernilla at ccl.bham.ac.uk.
best,
Martin

-----------------------------------------------------------------------------------------------------------------
From: R.M.Salkie at bton.ac.uk
To: samponev at cc.joensuu.fi
------------------------------------------------------------------------------------------------------------------
I've been struggling with the same problem, including using Trans Suite 
2000 Align. I don't have a good answer, just two suggestions.
Firstly, it's possible to use the replace function in Word using the output 
of Trans Suite, saved in TMX format. This is what a typical pair of
sentences looks like:
<tu
creationdate="20020723T151150Z"
creationid="TS2!ALIGN"
changedate="20020723T151150Z"
 >
<tuv lang="EN-GB">
<seg>World consumption has expanded at an unprecedented pace over the 20th 
century, with private and public consumption expenditures reaching $24 
trillion in 1998, twice the level of 1975 and six times that of 1950. </seg>
</tuv>
<tuv lang="DE-DE">
<seg>Der weltweite Konsum hat sich im Verlauf des 20. Jahrhundert in 
beispiellosem Tempo ausgeweitet. 1998 erreichen die privaten und
öffentlichen Konsumausgaben 24 Billionen Dollar, sie sind damit doppelt so 
hoch wie 1975 und sechsmal so hoch wie 1950. </seg>
</tuv>
</tu>
The aim is to remove all the English sentences, leaving the German ones in 
place. Load the document into Word, choose "Replace", then tick "Use 
wildcards". In the "Find what" box paste in:
\<tuv lang="EN-GB"\>*\</tuv\>
(Notice that the < and > characters need a backslash before them so that 
Word does not treat them as wildcards.) If you leave the "Replace with" box 
empty and choose "Replace all", this will now delete all the English 
sentences. Then use "Save as" to save the file as German only. To create 
the English file, do the same thing to the 
original file but change the language code in the "Find what" box to 
"DE-DE". You can then use some similar techniques to remove the remaining 
XML codes and the creation dates. I realise that this is even more 
elaborate than your suggestion of using Excel, but it's something that 
students could perhaps manage. I agree entirely that it would be better if 
students didn't have to do this.
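
For anyone who would rather script this step than do it in Word, here is a 
minimal Python sketch of the same idea: read the TMX pairs and collect each 
language separately. It is an illustration under stated assumptions, not a 
tested converter for Trans Suite output. The names split_tmx, write_columns 
and SAMPLE_TMX are hypothetical, the sample is a shortened version of the 
pair shown above, and the <tmx> and <body> wrapper elements are assumed. 
The code accepts the plain lang attribute seen in the example as well as 
xml:lang.

```python
import xml.etree.ElementTree as ET

# How ElementTree exposes an xml:lang attribute, in case a file uses it
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened, hypothetical sample in the shape of the pair shown above
SAMPLE_TMX = """<tmx version="1.1">
<body>
<tu creationdate="20020723T151150Z" creationid="TS2!ALIGN">
<tuv lang="EN-GB"><seg>World consumption has expanded. </seg></tuv>
<tuv lang="DE-DE"><seg>Der weltweite Konsum hat sich
ausgeweitet. </seg></tuv>
</tu>
<tu>
<tuv lang="EN-GB"><seg>Second sentence.</seg></tuv>
<tuv lang="DE-DE"><seg>Zweiter Satz.</seg></tuv>
</tu>
</body>
</tmx>"""


def split_tmx(tmx_text, languages):
    """Return {language: [segment, ...]} with one entry per <tu>."""
    root = ET.fromstring(tmx_text)
    columns = {lang: [] for lang in languages}
    for tu in root.iter("tu"):
        found = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get("lang") or tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang in columns and seg is not None:
                # Join the text and collapse line breaks and extra spaces
                found[lang] = " ".join("".join(seg.itertext()).split())
        for lang in languages:
            # An empty string keeps the files aligned if one side is missing
            columns[lang].append(found.get(lang, ""))
    return columns


def write_columns(columns, prefix):
    """Write one text file per language, one segment per line."""
    for lang, segments in columns.items():
        with open(f"{prefix}.{lang}.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(segments) + "\n")


columns = split_tmx(SAMPLE_TMX, ["EN-GB", "DE-DE"])
```

Calling write_columns(columns, "report") would then produce 
report.EN-GB.txt and report.DE-DE.txt, with line n of one file 
corresponding to line n of the other.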

Suggestion 2: Write to Mike Barlow and suggest that he adds to ParaConc the 
ability to handle files which are in this typical translation memory format 
where the source and target sentences are in pairs. Presumably this is a 
simpler task for a computer programme than relating texts in two separate 
files: as long as the computer knows which is the source language, then it 
would have to produce the sentence (or KWIC) containing the source word, 
along with the sentence which follows. For searches in the target language
it would be the sentence that precedes. I couldn't write a programme to do 
this, but I think a programmer could. I hope that someone comes up with a 
better solution, and I'd be grateful if you could publicise anything useful.
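
The lookup described in Suggestion 2 might be sketched roughly as follows 
in Python. This is not how ParaConc works; it only illustrates the idea 
under the assumption that the file has been reduced to alternating lines 
(source sentence, then its translation). The function name search_pairs 
and the sample sentences are hypothetical.

```python
import re


def search_pairs(lines, word, in_source=True):
    """Return (source, target) pairs whose searched side contains word.

    lines alternate: source sentence, target sentence, source, target...
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    # Pair every even-numbered line with the line that follows it
    for source, target in zip(lines[0::2], lines[1::2]):
        searched = source if in_source else target
        if pattern.search(searched):
            hits.append((source, target))
    return hits


# Hypothetical interleaved sentences used for illustration
lines = [
    "The house is red.",
    "Das Haus ist rot.",
    "I like tea.",
    "Ich mag Tee.",
]
source_hits = search_pairs(lines, "house")
target_hits = search_pairs(lines, "Tee", in_source=False)
```

A search in the source language returns the matching sentence with the one 
that follows it; a search in the target language returns the matching 
sentence with the one that precedes it, as the suggestion describes.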

Best wishes. - Raphael
-------------------------------------------------------------------------------------------------------------------

( : ============================================= : )

Sampo Nevalainen, M.A.
Researcher
University of Joensuu
Savonlinna School of Translation Studies
P.O.Box 48
FIN-57101 Savonlinna
FINLAND

tel     +358-15-511 70      (operator)
         +358-15-511 7704
fax     +358-15-515 096
email   samponev at cc.joensuu.fi
http://www.joensuu.fi/slnkvl/