[Corpora-List] Summary: fast string replacement

Jörg Schuster joerg.schuster at gmail.com
Tue Mar 15 13:08:33 UTC 2005


Hello,

thanks to all who participated in this discussion. 

First I have to apologize for my original posting: I asked for
programs for transducing strings. I wrote 'strings (!)' to indicate
that I really meant strings (and not regular expressions or tokens).
Yet the examples I gave misled some people, because they did not
include cases of transduction of multi-word lexemes.

In the remainder of this mail I will give an overview of the
suggested solutions. The solution that I like best is Paul Bijnens' C
program (12).

For brevity, I will mostly omit the names of the people who pointed
me to the sites.

(1) Max Silberztein: http://www.nyu.edu/pages/linguistics/intex/
(2) Helmut Schmid: http://www.ims.uni-stuttgart.de/~schmid
(3) Stephan Kanthak:
http://www-i6.informatik.rwth-aachen.de/~kanthak/fsa.html
(4) Gertjan van Noord: http://grid.let.rug.nl/~vannoord/Fsa/fsa.html
(5) Arnaud Adant: http://membres.lycos.fr/adant/tfe/
(6) ISI: http://www.isi.edu/licensed-sw/carmel/
(7) MIT: http://people.csail.mit.edu/people/ilh/fst/

Comments: (1)-(6) all look like serious programs, but I considered
them too complicated for my purposes.

(7) is not available at the moment.

(8)  ?: ftp://ftp.gnu.org/non-gnu/flex/
      Comment: good, but overkill for my purposes.

(9) Songlin Piao pointed me to a Java tool of his:
      http://www.lancs.ac.uk/staff/piaosl/research/download/download.htm

      Comment: I tried to use it, but it did not work:
      $ java -jar mlct_concordance.jar
      Invalid or corrupt jarfile mlct_concordance.jar

(10) Leif Arda Nielsen advised me to use sed.
        Comment: too slow for my purposes.

(11) Damon Allen Davison advised me to use SQL.
        Comment: I did not quite understand Damon's mail.

(12) Paul Bijnens pointed me to a C program of his:
        http://torvald.aksis.uib.no/corpora/repl.zip
        Comment: This program is great. 
        - It worked immediately. (No fumbling around with paths,
           (versions of) compilers and the like.)
        - It doesn't seem to care about the size of the rewrite
          dictionary (except that you need enough RAM, of course).
        - It is quite fast: I gave it a rewrite dictionary of 1 million
          entries. It transduced about 50MB per minute on an Athlon 2600+.
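
To illustrate what I mean by transducing strings with a rewrite
dictionary, here is a minimal Python sketch. It is not Paul's program,
and I make no claim about how his program works internally. The names
build_trie and transduce are made up for this example, and the choice
of "longest match, left to right" is my own assumption about a sensible
behaviour when entries overlap (e.g. 'New' and 'New York').

```python
# Hypothetical sketch of dictionary-based string replacement.
# NOT the code from repl.zip; all names here are invented for illustration.

def build_trie(rewrite_dict):
    """Build a character trie from a {source: target} rewrite dictionary.

    A node is a plain dict keyed by single characters; the empty-string
    key "" marks the end of an entry and holds its replacement.
    """
    root = {}
    for source, target in rewrite_dict.items():
        if not source:
            continue  # an empty source string could never be matched usefully
        node = root
        for ch in source:
            node = node.setdefault(ch, {})
        node[""] = target
    return root


def transduce(text, trie):
    """Scan text left to right, replacing the longest entry at each position.

    Works on raw strings, not tokens, so multi-word entries and matches
    inside words are both handled.
    """
    out = []
    i = 0
    n = len(text)
    while i < n:
        node = trie
        j = i
        best_end = -1
        best_target = None
        # Walk the trie as far as the text allows, remembering the
        # longest complete entry seen so far.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if "" in node:
                best_end = j
                best_target = node[""]
        if best_target is not None:
            out.append(best_target)
            i = best_end
        else:
            out.append(text[i])  # no entry starts here: copy one character
            i += 1
    return "".join(out)


if __name__ == "__main__":
    rules = build_trie({"New York": "NY", "New": "new"})
    print(transduce("New York is New", rules))  # prints: NY is new
```

With a trie, the work per input position depends on the length of the
longest entry that can start there, not on the number of dictionary
entries, which is why a large rewrite dictionary mainly costs memory.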

Jörg Schuster


