<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<body bgcolor="#FFFFFF">
The script below will work for most tags, but may fail in the following
more complicated cases:
<p>1. A tag spans more than one line (typically comment tags and
tags with attribute/value pairs).
<p>2. A tag has an attribute value that has a ">" in it.
<p>3. A comment tag has a ">" embedded in it.
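<p>A minimal sketch of one way to cover the three cases above, offered as
an illustration rather than a tested replacement for the script below. The
helper name <tt>strip_tags</tt> and the sample string are hypothetical. It
assumes the whole text fits in memory as a single string and that quotes
inside tags are balanced; it is not a full SGML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the quoted script): remove SGML/HTML
# tags from a string that holds the whole document, not a single line.
sub strip_tags {
    my ($text) = @_;

    # Case 3: drop comments first, so a ">" inside a comment cannot end
    # a match early. The /s modifier lets "." match newlines (case 1).
    $text =~ s/<!--.*?-->//gs;

    # Cases 1 and 2: a tag is "<", then any mix of quoted strings and
    # characters other than ">" and quotes, then ">". A quoted attribute
    # value is consumed whole, so a ">" inside it is skipped, and the
    # negated character classes also match newlines.
    $text =~ s/<(?:"[^"]*"|'[^']*'|[^>"'])*>//g;

    return $text;
}

# Made-up sample covering all three cases at once.
my $sample = qq{<p\nclass="x">Line one<img alt="a > b" src="f.gif">}
           . qq{ and <!-- note: x > y -->two</p>\n};

print strip_tags($sample);    # prints "Line one and two"
```

<p>To apply the same idea to a file, the contents would be read into one
string before calling the helper (for example by setting <tt>$/</tt> to
<tt>undef</tt> locally), since matching line by line is what causes case 1
to fail.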
<p>I have encountered all three cases in HTML files of journal articles
downloaded from the web. Thanks.
<p>-Alex Yeh
<br>
<p>Danko Sipka wrote:
<blockquote TYPE=CITE>
Hi:
<br>This Perl script should do the job:
<br>print "What is your input file name:\n";
<br>chomp($infile=&lt;STDIN&gt;);
<br>open IN, $infile or die "No file, no fun!";
<br>open OUT, ">$infile.out" or die "No file, no fun!";
<br>while (&lt;IN&gt;) {
<br> $_=~s/\&lt;.+?\&gt;//g;
<br> print OUT "$_";
<br> }
<br>close (IN) or die "D'oh!";
<br>close (OUT) or die "D'oh!";
<br>Best,
<br>Danko Sipka
<br><a href="mailto:sipkadan@main.amu.edu.pl">sipkadan@main.amu.edu.pl</a>
| <a href="mailto:Danko.Sipka@asu.edu">Danko.Sipka@asu.edu</a>
<br><a href="http://main.amu.edu.pl/~sipkadan">http://main.amu.edu.pl/~sipkadan</a>
| <a href="http://www.public.asu.edu/~dsipka">http://www.public.asu.edu/~dsipka</a>
<blockquote dir=ltr
style="PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000000 2px solid; MARGIN-RIGHT: 0px">
<div style="FONT: 10pt arial">----- Original Message -----</div>
<div
style="BACKGROUND: #e4e4e4; FONT: 10pt arial; font-color: black"><b>From:</b>
<a href="mailto:tine.lassen@tdcadsl.dk" title="tine.lassen@tdcadsl.dk">Tine
&amp; Colleen</a></div>
<div style="FONT: 10pt arial"><b>To:</b> <a href="mailto:CORPORA@HD.UIB.NO" title="CORPORA@HD.UIB.NO">CORPORA@HD.UIB.NO</a></div>
<div style="FONT: 10pt arial"><b>Sent:</b> Tuesday, April 16, 2002 8:13
PM</div>
<div style="FONT: 10pt arial"><b>Subject:</b> Corpora: sgml detagger</div>
<font face="Arial"><font color="#800000"><font size=-1>Hi,
<br>I am compiling a corpus for research purposes, and some of the texts are SGML-tagged.
<br>Does anybody know an easy way to remove the tags and save the texts as 'raw'
.txt files?
<br>Maybe a Perl script?
<br>Thanks in advance,
<br>Tine Lassen
<br>Copenhagen</font></font></font></blockquote>
</blockquote>
</body>
</html>