[Corpora-List] fast string replacement

Stefan Evert evert at IMS.Uni-Stuttgart.DE
Fri Mar 11 15:28:50 UTC 2005


> I am looking for a program that
>
> - takes as input a string (!) rewriting dictionary and a corpus
> - applies all rewriting rules to all lines of the corpus
> - is fast, stable and free
> - works under Linux
>

Two further questions:

- What exactly do you mean by "fast"?

Perl is very good at this sort of thing and is usually quite fast.
Whether it is a feasible option, however, depends on your answer to
my second question: Perl is good at word replacement but fairly slow
at string replacement.

- Do you mean string replacement (arbitrary substrings in a line of
text) or word replacement?

If you do string replacement then

  Eunice from the bookstand.

would become

  Eunice/adj from the books/v:3:pres;n:plurtand.

after transduction. If you work on white-space delimited words, on the
other hand, you can split lines in Perl, look up each word in a hash
that stores rewriting rules, and insert the replacement if applicable.
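
To make the difference concrete, here is a minimal sketch of both
approaches. It is written in Python rather than Perl, with a dict
standing in for the Perl hash; the function names and the RULES table
are made up for illustration (the two rules are the ones from the
example in the question).

```python
# Hypothetical rule table, taken from the example in the question.
RULES = {
    "books": "books/v:3:pres;n:plur",
    "nice": "nice/adj",
}


def replace_substrings(line, rules):
    """Naive string replacement: rewrite every occurrence of every key,
    regardless of word boundaries."""
    for source, target in rules.items():
        line = line.replace(source, target)
    return line


def replace_words(line, rules):
    """Word replacement: split on white space and look each token up
    in the dict; tokens without a rule are kept unchanged."""
    return " ".join(rules.get(token, token) for token in line.split())


print(replace_substrings("Eunice from the bookstand.", RULES))
print(replace_words("Eunice from the bookstand.", RULES))
print(replace_words("John reads nice books", RULES))
```

The first call rewrites inside "Eunice" and "bookstand"; the second
leaves the line alone. Note that the word-based version treats "books."
(with the full stop attached) as a different token from "books", so
punctuation would have to be split off first, and that joining on a
single space normalises the original white space.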

If you're really interested in string replacement (probably with some
additional code to identify word boundaries), you should be looking at
finite-state transducers. Two open-source solutions I know are Helmut
Schmid's FST toolkit (see http://www.ims.uni-stuttgart.de/~schmid) and
Steve Abney's cascaded parser CASS (you'll have to search Google for
the source code).
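
For a small dictionary, string replacement with word-boundary checks
can also be approximated without a transducer. The sketch below is not
an FST and has nothing to do with the toolkits named above: it simply
compiles all keys into one regular-expression alternation (Python's
standard re module) and rewrites each match in a single pass. The
function names and the RULES table are made up for illustration, and it
assumes every key begins and ends with a word character, since \b only
behaves as a word boundary in that case.

```python
import re

# Hypothetical rule table, taken from the example in the question.
RULES = {
    "books": "books/v:3:pres;n:plur",
    "nice": "nice/adj",
}


def compile_rules(rules):
    """Build one pattern that matches any key as a whole word."""
    # Longest keys first, so a longer key wins over a key that is its prefix.
    keys = sorted(rules, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    return re.compile(r"\b(?:" + alternation + r")\b")


def rewrite(line, rules, pattern):
    """Replace every whole-word match with its rewriting in one pass."""
    return pattern.sub(lambda match: rules[match.group(0)], line)


pattern = compile_rules(RULES)
print(rewrite("John reads nice books.", RULES, pattern))
print(rewrite("Eunice from the bookstand.", RULES, pattern))
```

Because the line is scanned only once, replacement text is never
rewritten again by a later rule. With a very large dictionary the
alternation may become slow, which is where a real finite-state
transducer should pay off.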

Cheers,
Stefan.

> Example:
>
> Some rewriting rules:
>
>  books, books/v:3:pres;n:plur
>  nice, nice/adj
>
> A "corpus" before transduction:
>
>  John reads nice books.
>
> The same corpus after transduction:
>
>  John reads nice/adj books/v:3:pres;n:plur
>
> Does anyone know such a program?
>
> Jörg Schuster
>


