Corpora: sgml detagger

Alexander S. Yeh asy at mitre.org
Tue Apr 16 18:43:59 UTC 2002


The script below will work for most tags, but because it reads the input
one line at a time and ends each match at the first ">", it may fail in
the following more complicated cases:

1. A tag is spread out over more than one line (usual cases: comment tags,
tags with attribute/value pairs).

2. A tag has an attribute value that has a ">" in it.

3. A comment tag has a ">" embedded in it.

I have encountered all three in HTML files of journal articles downloaded
from the web. Thanks.
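
One possible way to cover the three cases above is to read the whole file
into a single string and strip comments and tags with patterns that may
span lines. What follows is only a minimal sketch under that assumption,
not a real SGML parser; the name strip_tags is a hypothetical helper and
is not part of the quoted script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_tags: hypothetical helper name, chosen for this sketch only.
sub strip_tags {
    my ($text) = @_;

    # Case 3: remove comments first. The /s modifier lets "." match
    # newlines, so a comment may span lines and may contain ">".
    $text =~ s/<!--.*?-->//gs;

    # Cases 1 and 2: remove the remaining tags. A quoted attribute value
    # is consumed as one unit, so a ">" inside quotes does not end the
    # tag, and the negated character class also matches newlines.
    $text =~ s/<(?:[^>"']|"[^"]*"|'[^']*')*>//g;

    return $text;
}

# When file names are given on the command line, slurp them whole
# (instead of line by line) and print the detagged text.
if (@ARGV) {
    local $/;            # undefined record separator: read everything at once
    my $text = <>;
    print strip_tags($text);
}
```

Saved under a name of your choosing (say detag.pl), it could be run as
"perl detag.pl input.sgml > input.txt". Two known limits of this sketch:
a tag containing an unbalanced quote is left in place, and a literal "<"
in running text may be treated as the start of a tag.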

-Alex Yeh


Danko Sipka wrote:

> Hi:
>
> This Perl script should do the job:
>
> print "What is your input file name:\n";
> chomp($infile=<STDIN>);
> open IN, $infile or die "No file, no fun!";
> open OUT, ">$infile.out" or die "No file, no fun!";
> while (<IN>) {
>     $_=~s/\<.+?\>//g;
>     print OUT "$_";
>     }
> close (IN) or die "D'oh!";
> close (OUT) or die "D'oh!";
>
> Best,
> Danko Sipka
> sipkadan at main.amu.edu.pl | Danko.Sipka at asu.edu
> http://main.amu.edu.pl/~sipkadan | http://www.public.asu.edu/~dsipka
>
>      ----- Original Message -----
>      From: Tine & Colleen
>      To: CORPORA at HD.UIB.NO
>      Sent: Tuesday, April 16, 2002 8:13 PM
>      Subject: Corpora: sgml detagger
>      Hi
>
>      I am compiling a corpus for research reasons and some of
>      the texts are sgml-tagged. Does anybody know an easy way to
>      remove the tags and save the texts as 'raw' .txt files? Maybe
>      a PERL script?
>
>      Thanks in advance
>      Tine Lassen
>      Copenhagen
>