Corpora: sgml detagger
Alexander S. Yeh
asy at mitre.org
Tue Apr 16 18:43:59 UTC 2002
The script below will work for most tags, but because it reads the input
one line at a time and ends each match at the first ">", it may fail in
the following more complicated cases:
1. A tag is spread out over more than one line (usual cases: comment tags,
tags with attribute/value pairs).
2. A tag has an attribute value that has a ">" in it.
3. A comment tag has a ">" embedded in it.
I have encountered these in HTML files of journal articles downloaded from
the web. Thanks.
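
For what it is worth, here is a rough sketch of one way to cover the three
cases listed above. It assumes the whole document fits in memory as a single
string, so that a tag split across lines can be matched like any other.
The name strip_tags and the sample text are made up for illustration and are
not part of the quoted script. This is still not a real SGML parser, so
marked sections, CDATA and malformed quoting may trip it up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the quoted script): removes tags from a
# document held in one string, so line breaks inside a tag do not matter.
sub strip_tags {
    my ($text) = @_;
    $text =~ s{
        <!--.*?-->          # a comment, which may contain ">" or newlines
      |
        <                   # any other tag
        (?: "[^"]*"         #   a double-quoted attribute value
          | '[^']*'         #   a single-quoted attribute value
          | [^'">]          #   any other character except a quote or ">"
        )*
        >
    }{}gsx;                 # /s lets "." match newlines, /x allows comments
    return $text;
}

# Made-up sample exercising all three cases: a tag spread over two lines,
# a ">" inside an attribute value, and a ">" inside a comment.
my $sample = <<'END';
<p class="a>b"
   id='x'>Some <!-- a > in a comment --> text</p>
END

print strip_tags($sample);
```

To apply this to a file, the file has to be read in one piece instead of
line by line, for example by undefining $/ (with "local $/;") before reading
from the filehandle.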
-Alex Yeh
Danko Sipka wrote:
> Hi:
>
> This Perl script should do the job:
>
> print "What is your input file name:\n";
> chomp($infile=<STDIN>);
> open IN, $infile or die "No file, no fun!";
> open OUT, ">$infile.out" or die "No file, no fun!";
> while (<IN>) {
> $_=~s/\<.+?\>//g;
> print OUT "$_";
> }
> close (IN) or die "D'oh!";
> close (OUT) or die "D'oh!";
>
> Best,
> Danko Sipka
> sipkadan at main.amu.edu.pl | Danko.Sipka at asu.edu
> http://main.amu.edu.pl/~sipkadan | http://www.public.asu.edu/~dsipka
>
> ----- Original Message -----
> From: Tine & Colleen
> To: CORPORA at HD.UIB.NO
> Sent: Tuesday, April 16, 2002 8:13 PM
> Subject: Corpora: sgml detagger
> Hi
>
> I am compiling a corpus for research reasons and some of
> the texts are sgml-tagged. Does anybody know an easy way to
> remove the tags and save the texts as 'raw' .txt files? Maybe
> a PERL script?
>
> Thanks in advance
> Tine Lassen
> Copenhagen
>