[Corpora-List] Re: fast string replacement

stahl at germanistik.uni-wuerzburg.de
Tue Mar 15 12:10:44 UTC 2005


Jörg Schuster wrote:
 
> I mean really REALLY fast. The size of my rewriting dictionary is 1
> million lines at the moment. (But it will grow larger). The size of my
> corpus is 80GB. And I would like to be able to tag often.

To manipulate really large files I use the "TUebingen System of
TExt processing Programs" (Tustep), which contains a module that
can be used, among many other things, to replace many source strings
with new target strings. You can find information about Tustep at this URL:
   http://www.uni-tuebingen.de/zdv/tustep

To answer your question I created a test file holding 1048576 lines and
a script file with two strings to replace.

The test file contains about 1 million lines, each with the text:
   Dies ist eine Datei mit 1 Million Zeilen.             

A script with the lines
   #create,test2,confirm=-
   #copy,test,test2,-,+,*
   xx        .datei.file file file file.
   xx        .zeilen.lines.
   *eof
creates a new target file (test2), copies test into test2 and
replaces the string "datei" with "file file file file" and
the string "zeilen" with "lines". Please excuse the simplicity
of my text. Executing the script took 2 seconds.

Each line in the target file test2 then looks like this:
   Dies ist eine file file file file mit 1 Million lines.

Copying and manipulating 2 million lines took 4 seconds.

The Tustep replacement strings look much like regular
expressions, which you can extend with exceptions and abstract
patterns. And you can replace thousands of strings in a single script.
Maybe this can give you some ideas.
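
For readers without Tustep, the same idea (many source strings replaced
in one pass over each line) can be sketched in Python. This is only an
illustration, not how Tustep works internally. The names build_replacer
and rules are made up for this example, and the case-insensitive matching
is an assumption taken from the test above, where "datei" in the script
matched "Datei" in the text.

```python
import re

def build_replacer(table):
    """Return a function that applies every rule in `table` in one pass."""
    # Assumption: matching ignores case, so keys are stored in lowercase.
    lowered = {src.lower(): dst for src, dst in table.items()}
    # Put longer source strings first so they win over their own prefixes.
    keys = sorted(lowered, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys), re.IGNORECASE)

    def replace(line):
        # Look up the replacement for whichever source string matched.
        return pattern.sub(lambda m: lowered[m.group(0).lower()], line)

    return replace

# The two rules from the test script above.
rules = {"datei": "file file file file", "zeilen": "lines"}
replace = build_replacer(rules)

print(replace("Dies ist eine Datei mit 1 Million Zeilen."))
```

A plain regex alternation like this is fine for a handful of rules. For a
dictionary of a million entries, a dedicated multi-pattern matcher such as
the Aho-Corasick algorithm would likely be a better fit.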

Best regards
Peter Stahl
University of Wuerzburg


