[Corpora-List] fast string replacement
Paul Bijnens
paul.bijnens at xplanation.com
Mon Mar 14 13:00:37 UTC 2005
Jörg Schuster wrote:
> I mean really REALLY fast. The size of my rewriting dictionary is 1
> million lines at the moment. (But it will grow larger). The size of my
> corpus is 80GB. And I would like to be able to tag often.
Attached you'll find a little C program that replaces fixed strings.
I wrote it about 15 years ago, and I still use it.
[ attachment: http://torvald.aksis.uib.no/corpora/repl.zip ]
I've never tried it on a replacement set of 1 million lines,
but I'm very interested to see how it behaves on such large input. :-)
There is no man page, but in the source there is some more information.
Quick getting started:
Make a file with the following syntax:
====cut here=====
# This is a comment
/search/replace/
# when several search strings match, the longest one is replaced
/searchsomethingelse/replace this too/
# blank lines are ignored
# The first non-alphabetic char is the separator:
!/this/contains/slashes!/THIS/CONTAINS/SLASHES/!
# A search or replacement string can contain newlines
# or any bytes (including null, though it is better to encode that as \000)
/some
line/some line/
/need to split/need
to split/
# You can encode bytes with backslash notation like
# \n, \t, ...etc, \007 (octal) or \xC4 (hexadecimal)
/élève/\xe9l\xe8ve/
========== cut here ===========
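To make the table syntax above concrete, here is a minimal sketch of how one rule could be parsed. It is not taken from the repl source: the names `struct rule`, `parse_rule` and `decode_escapes`, the 256-byte limits, and the choice to read at most two hex digits after \x are all assumptions made for illustration. It also assumes the separator cannot be escaped inside a string, and that the caller skips blank lines.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule record; the fixed 256-byte limits are an assumption. */
struct rule {
    unsigned char search[256], replace[256];
    size_t search_len, replace_len;
};

/* Decode backslash escapes from src[0..n) into dst and return the decoded
   length. Decoding never grows the text, so dst needs room for n bytes. */
static size_t decode_escapes(const char *src, size_t n, unsigned char *dst)
{
    size_t i = 0, out = 0;
    while (i < n) {
        unsigned char c = (unsigned char)src[i++];
        if (c != '\\' || i == n) {          /* plain byte or lone trailing backslash */
            dst[out++] = c;
            continue;
        }
        c = (unsigned char)src[i++];
        if (c == 'x') {                     /* \xHH: at most two hex digits (assumed) */
            unsigned v = 0;
            for (int k = 0; k < 2 && i < n && isxdigit((unsigned char)src[i]); k++) {
                int h = tolower((unsigned char)src[i++]);
                v = v * 16 + (unsigned)(isdigit(h) ? h - '0' : h - 'a' + 10);
            }
            dst[out++] = (unsigned char)v;
        } else if (c >= '0' && c <= '7') {  /* \ooo: at most three octal digits */
            unsigned v = (unsigned)(c - '0');
            for (int k = 1; k < 3 && i < n && src[i] >= '0' && src[i] <= '7'; k++)
                v = v * 8 + (unsigned)(src[i++] - '0');
            dst[out++] = (unsigned char)v;
        } else if (c == 'n') {
            dst[out++] = '\n';
        } else if (c == 't') {
            dst[out++] = '\t';
        } else if (c == 'r') {
            dst[out++] = '\r';
        } else {
            dst[out++] = c;                 /* "\\" and unknown escapes keep the byte */
        }
    }
    return out;
}

/* Parse one rule "<sep>search<sep>replace<sep>" from text[0..n).
   Returns the number of bytes consumed, or 0 if this is a comment,
   a malformed rule, or a rule too long for struct rule. */
static size_t parse_rule(const char *text, size_t n, struct rule *r)
{
    assert(text != NULL && r != NULL);
    if (n < 3 || text[0] == '#' || isalpha((unsigned char)text[0]))
        return 0;
    char sep = text[0];
    const char *end = text + n;
    const char *s = text + 1;
    const char *mid = memchr(s, sep, (size_t)(end - s));
    if (mid == NULL)
        return 0;
    const char *last = memchr(mid + 1, sep, (size_t)(end - (mid + 1)));
    if (last == NULL)
        return 0;
    size_t slen = (size_t)(mid - s), rlen = (size_t)(last - mid - 1);
    if (slen > sizeof r->search || rlen > sizeof r->replace)
        return 0;
    r->search_len = decode_escapes(s, slen, r->search);
    r->replace_len = decode_escapes(mid + 1, rlen, r->replace);
    return (size_t)(last - text) + 1;
}
```

Because the lengths are carried explicitly, a rule whose strings contain newlines or a \000 byte is handled the same way as any other rule.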
Execute with:
$ repl /name/of/repl/table infile > outfile
You can also specify replacements on the command line:
$ repl -e '/\r\n/\n/' infile > outfile
At least the program is very simple... (and fast for me!)
If really needed, the tree implementation could be replaced
by a trie implementation to make it even faster, at the expense of
being more complicated (that's probably what the commercial progs do).
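
For readers wondering what the trie variant might look like, here is a minimal sketch of longest-match replacement over a byte-wise trie. It is not the code in repl.zip; `struct node`, `trie_add` and `trie_replace` are hypothetical names, and the 256-pointer node layout is chosen for brevity (it would need a much more compact representation for a table of 1 million entries). The sketch never frees its nodes and aborts if allocation fails.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One trie node: a child slot per byte value, simple but memory-hungry. */
struct node {
    struct node *child[256];
    const char *repl;        /* non-NULL if a search string ends at this node */
    size_t repl_len;
};

static struct node *node_new(void)
{
    struct node *n = calloc(1, sizeof *n);
    assert(n != NULL);       /* sketch only: abort on allocation failure */
    return n;
}

/* Register key[0..klen) with its replacement. The caller keeps repl alive. */
static void trie_add(struct node *root, const char *key, size_t klen,
                     const char *repl, size_t rlen)
{
    assert(klen > 0 && repl != NULL);
    struct node *n = root;
    for (size_t i = 0; i < klen; i++) {
        unsigned char c = (unsigned char)key[i];
        if (n->child[c] == NULL)
            n->child[c] = node_new();
        n = n->child[c];
    }
    n->repl = repl;
    n->repl_len = rlen;
}

/* Copy in[0..n) to out, replacing the longest search string that starts at
   each position. Returns the number of bytes written; out must be big enough. */
static size_t trie_replace(const struct node *root, const char *in, size_t n,
                           char *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        const struct node *cur = root, *best = NULL;
        size_t best_len = 0;
        /* Walk down the trie, remembering the deepest node that ends a key. */
        for (size_t j = i; j < n; j++) {
            cur = cur->child[(unsigned char)in[j]];
            if (cur == NULL)
                break;
            if (cur->repl != NULL) {
                best = cur;
                best_len = j - i + 1;
            }
        }
        if (best != NULL) {
            memcpy(out + o, best->repl, best->repl_len);
            o += best->repl_len;
            i += best_len;
        } else {
            out[o++] = in[i++];  /* no key starts here: copy the byte through */
        }
    }
    return o;
}
```

With the two example rules from the table (`search` and `searchsomethingelse`), the inner loop keeps walking past the end of `search` and only falls back to the shorter match when the longer one fails, which gives the longest-match behaviour described in the table comments.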
--
Paul Bijnens, Xplanation Tel +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM Fax +32 16 397.512
http://www.xplanation.com/ email: Paul.Bijnens at xplanation.com
***********************************************************************
* I think I've got the hang of it now: exit, ^D, ^C, ^\, ^Z, ^Q, F6, *
* quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown, *
* kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ... "Are you sure?" ... YES ... Phew ... I'm out *
***********************************************************************