[Corpora-List] fast string replacement

Paul Bijnens paul.bijnens at xplanation.com
Mon Mar 14 13:00:37 UTC 2005


Jörg Schuster wrote:

> I mean really REALLY fast. The size of my rewriting dictionary is 1
> million lines at the moment. (But it will grow larger). The size of my
> corpus is 80GB. And I would like to be able to tag often.

Attached you'll find a little C program that replaces fixed strings.
I wrote it about 15 years ago, and I'm still using it.

[ attachment: http://torvald.aksis.uib.no/corpora/repl.zip ]

I've never tried it on a replacement set of 1 million lines,
but I'm very interested to see how it behaves on such large input.  :-)

There is no man page, but the source contains some more information.

Quick start:

Make a file with the following syntax:

========== cut here ===========
# This is a comment
/search/replace/

# when several search strings match, the longest one is replaced
/searchsomethingelse/replace this too/

# blank lines are ignored

# The first non-alphabetic char is the separator:
!/this/contains/slashes!/THIS/CONTAINS/SLASHES/!

# A search or replacement string can contain newlines
# or any bytes (including null, which is better encoded as \000)
/some
line/some line/

/need to split/need
to split/

# You can encode bytes with backslash notation like
#  \n, \t, etc., \007 (octal) or \xC4 (hexadecimal)
/élève/\xe9l\xe8ve/
========== cut here ===========
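
To make the table syntax above concrete, here is a rough Python sketch of
how such a file could be read. It is not taken from the repl source; the
names parse_table and decode_escapes are made up for illustration. It
assumes that every entry starts with its separator character, that
comment lines start with '#' in the first column, and that a separator
cannot be escaped inside a string.

```python
import re

# Backslash escapes named in the table comments: \n, \t, octal, hex.
_SIMPLE = {b"n": b"\n", b"t": b"\t", b"r": b"\r", b"\\": b"\\"}
_ESC = re.compile(rb"\\(x[0-9A-Fa-f]{1,2}|[0-7]{1,3}|.)", re.DOTALL)


def decode_escapes(raw: bytes) -> bytes:
    """Turn backslash notation into the bytes it stands for."""
    def sub(match):
        body = match.group(1)
        if body[:1] == b"x" and len(body) > 1:      # \xC4 (hexadecimal)
            return bytes([int(body[1:], 16)])
        if body[:1] in b"01234567":                 # \007 (octal)
            return bytes([int(body, 8) & 0xFF])
        return _SIMPLE.get(body, body)              # \n, \t, ... or literal
    return _ESC.sub(sub, raw)


def parse_table(data: bytes):
    """Return a list of (search, replacement) byte-string pairs."""
    rules = []
    i, n = 0, len(data)
    while i < n:
        first = data[i:i + 1]
        if first in b"\r\n":                        # blank line: ignore
            i += 1
            continue
        if first == b"#":                           # comment: skip the line
            j = data.find(b"\n", i)
            i = n if j < 0 else j + 1
            continue
        sep = first                                 # assumed: entry starts with its separator
        a = data.find(sep, i + 1)                   # end of search string
        b = data.find(sep, a + 1) if a >= 0 else -1 # end of replacement
        if b < 0:
            raise ValueError("unterminated entry at byte %d" % i)
        rules.append((decode_escapes(data[i + 1:a]),
                      decode_escapes(data[a + 1:b])))
        j = data.find(b"\n", b)                     # ignore the rest of the line
        i = n if j < 0 else j + 1
    return rules
```

Because the search for the separator runs over the whole buffer and not
over single lines, entries that span a newline (like "/some" + "line/")
come out as one search string without any special handling.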


Execute with:

$ repl /name/of/repl/table infile > outfile

You can also specify replacements on the command line:

$ repl -e '/\r\n/\n/' infile > outfile


At least the program is very simple... (and fast for me!)

If really needed, the tree implementation could be replaced
by a trie implementation to make it even faster, at the expense of
being more complicated (that's probably what the commercial progs do).
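
For readers who want to see what the trie variant might look like, here
is a small Python sketch of longest-match replacement over a trie. It is
only an illustration of the idea, not the code in repl.zip (which uses a
tree), and the names build_trie and replace_all are invented for this
example. It assumes the rules are given as (search, replacement) pairs of
byte strings.

```python
def build_trie(rules):
    """Build a nested-dict trie; the key None marks the end of a search string."""
    root = {}
    for search, repl in rules:
        if not search:
            continue                      # an empty search string can never match
        node = root
        for byte in search:
            node = node.setdefault(byte, {})
        node[None] = repl
    return root


def replace_all(data: bytes, root) -> bytes:
    """Scan data once, replacing the longest search string found at each position."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        node, j = root, i
        best_end, best_repl = -1, None
        # Walk down the trie as far as the input allows,
        # remembering the last (= longest) complete match.
        while j < n and data[j] in node:
            node = node[data[j]]
            j += 1
            if None in node:
                best_end, best_repl = j, node[None]
        if best_repl is not None:
            out += best_repl
            i = best_end                  # continue after the matched text
        else:
            out.append(data[i])           # no match here: copy one byte
            i += 1
    return bytes(out)
```

With the two rules from the example table (/search/replace/ and
/searchsomethingelse/replace this too/), the input "a searchsome" still
yields "a replacesome": the walk goes past "search" looking for the
longer string, fails, and falls back to the shorter match. The lookup
cost per input position depends on the length of the longest search
string, not on the number of rules, which is why a trie is attractive
for a very large table.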


--
Paul Bijnens, Xplanation                            Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/          email:  Paul.Bijnens at xplanation.com
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, F6, *
* quit,  ZZ, :q, :q!,  M-Z, ^X^C,  logoff, logout, close, bye,  /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* kill -9 1,  Alt-F4,  Ctrl-Alt-Del,  AltGr-NumLock,  Stop-A,  ...    *
* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
***********************************************************************


