[Corpora-List] jumk java
Michael Betsch
michael.betsch at uni-tuebingen.de
Mon Jun 27 05:11:06 UTC 2005
> I've had this problem on several occasions - I convert html files to txt and
> strip out the html as best I can (this last time I used beautifulsoup) only
> to find large chunks of what appears to be java code still perched inside
> many of the texts.
>
> I've tried writing code to strip it out, but it is pretty resistant. At
> present I'm looking for duplicate chunks of it and will try to use these as
> templates to erase the stuff but it is not a happy process and is certain to
> leave non-duplicate occurrences.
(You mean JavaScript, not Java.)
It is difficult to strip the HTML tags first and only then look for specific
content. JavaScript in an HTML file is enclosed in
<script type="text/javascript"> (javascript) </script>
so it can easily be identified and removed before the HTML tags are cut, but
not afterwards, once the markers around it are gone.
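
As a rough illustration of removing the scripts before the tags are cut, here is a minimal sketch in Python. It uses only the standard library's html.parser, which is my choice for the example and not necessarily what anyone in this thread uses; the names TextExtractor and html_to_text are hypothetical. It assumes the scripts sit inside <script> elements, and it also skips <style> elements, which goes beyond what is described above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping everything inside <script> or <style>."""

    SKIPPED = {"script", "style"}  # assumption: these elements never hold wanted text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering a skipped element: ignore data until it is closed again.
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only while we are outside any skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with scripts removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())
```

Calling html_to_text() on the contents of a file gives the text with the script bodies dropped. Note that this sketch does not insert spaces between adjacent block elements, so text from neighbouring paragraphs may run together; a real converter would handle that.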
For instance, you can use a program that understands HTML to do the
HTML-to-text conversion. Lynx can "dump" the text:
lynx -dump html-file(s) > textfile
or any other SGML-to-SGML conversion will do, as long as it allows you to
specify a treatment for specific SGML elements.
Michael Betsch