Corpora: sgml detagger

Danko Sipka sipkadan at main.amu.edu.pl
Tue Apr 16 18:31:35 UTC 2002


Hi:
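This Perl script should do the job. It asks for an input file name and writes a copy with the tags removed to a file of the same name plus ".out":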

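use strict;
use warnings;

print "What is your input file name:\n";
chomp(my $infile = <STDIN>);
# Three-argument open: characters in the file name cannot be taken as a mode.
open(my $in, '<', $infile) or die "No file, no fun! ($infile: $!)";
open(my $out, '>', "$infile.out") or die "No file, no fun! ($infile.out: $!)";
while (my $line = <$in>) {
    # Remove every <...> tag on this line; a tag split across lines is not matched.
    $line =~ s/<.+?>//g;
    print $out $line;
}
close($in) or die "D'oh! ($!)";
close($out) or die "D'oh! ($!)";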
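
Since the script above works line by line, a tag that starts on one line and ends on the next is left in the output. A minimal sketch of one way around that, assuming the file is small enough to hold in memory as one string: strip_tags is a hypothetical helper name and the sample text is made up for illustration. It also assumes that no ">" occurs inside an attribute value or a comment, which real SGML does allow.

```perl
use strict;
use warnings;

# Hypothetical helper: remove <...> tags from a whole document held in
# one string, so that a tag broken across lines is removed as well.
sub strip_tags {
    my ($text) = @_;
    # [^>]* also matches newlines, so a multi-line tag is one match.
    $text =~ s/<[^>]*>//g;
    return $text;
}

# Made-up sample with an opening tag split across two lines.
my $sample = "<p id=\"a\"\n   type=\"x\">Hello <hi>world</hi>.</p>\n";
print strip_tags($sample);
```

To apply this to a file, the whole content would first have to be read into a single string (for example with "local $/;" before reading from the handle) instead of going through it with a while loop.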

Best,

Danko Sipka
sipkadan at main.amu.edu.pl | Danko.Sipka at asu.edu
http://main.amu.edu.pl/~sipkadan | http://www.public.asu.edu/~dsipka


  ----- Original Message ----- 
  From: Tine & Colleen 
  To: CORPORA at HD.UIB.NO 
  Sent: Tuesday, April 16, 2002 8:13 PM
  Subject: Corpora: sgml detagger


  Hi
  I am compiling a corpus for research purposes and some of the texts are SGML-tagged.
  Does anybody know an easy way to remove the tags and save the texts as 'raw' .txt files?
  Maybe a Perl script?

  Thanks in advance

  Tine Lassen
  Copenhagen