[Corpora-List] junk java

Alexander S. Yeh asy at mitre.org
Mon Jun 27 21:27:05 UTC 2005


Michael Betsch wrote:
>>I've had this problem on several occasions - I convert html files to txt and
>>strip out the html as best I can (this last time I used beautifulsoup) only
>>to find large chunks of what appears to be java code still perched inside
>>many of the texts.
>>
>>I've tried writing code to strip it out, but it is pretty resistant.  At
>>present I'm looking for duplicate chunks of it and will try to use these as
>>templates to erase the stuff but it is not a happy process and is certain to
>>leave non-duplicate occurrences.
> 
> 
> (You mean javascript scripts)

Possibly related: when I tried to convert html to txt a few years ago, I
would find large comment tags that went across several lines (new
lines within the comment tag). It turns out that these tags had
javascript embedded within them. Embedding the javascript within a
comment tag meant that a browser which could not deal with javascript
would just ignore it.

To strip out such tags, somebody wrote a tag stripper that could handle 
tags where the tag start ("<") and tag end (">") were not on the same line.
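
A minimal sketch of that kind of stripper, in Python. This is not the
original program, which I don't have to hand. The names strip_html,
SCRIPT_RE, COMMENT_RE and TAG_RE are made up for illustration. It assumes
reasonably well-formed html: script blocks and comments are removed first,
because javascript often contains ">" characters that would otherwise end
a "tag" too early, and only then are the remaining tags removed, including
ones that run over several lines.

```python
import re

# <script ...> ... </script>, case-insensitive; DOTALL lets "." cross newlines.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                       re.DOTALL | re.IGNORECASE)
# <!-- ... --> comments, which may span lines and hide javascript.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Any remaining tag; [^>] also matches newlines, so "<" and ">" may sit
# on different lines.
TAG_RE = re.compile(r"<[^>]*>")


def strip_html(text):
    """Remove script blocks, comments and tags, then tidy whitespace."""
    text = SCRIPT_RE.sub(" ", text)
    text = COMMENT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    sample = ('<p>Hello\n<!--\nvar x = 1;\nif (x > 0) { go(); }\n-->\n'
              '<a\n href="x">world</a></p>')
    print(strip_html(sample))
```

Regular expressions will still trip over badly broken pages (for example
a comment that is never closed), so a real html parser is the safer
choice for messy input.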

-Alex Yeh


> 
> It is difficult to first strip html tags and then look for specific
> content. Javascript scripts in an html file are tagged with
> 
> <script type="text/javascript"> (javascript) </script>
> 
> so they can be easily seen and removed before html tags are cut, but not
> after that moment.
> 
> For instance, you can use a program that understands html for the
> conversion html => text. Lynx can "dump" the text:
> 
> lynx -dump html-file(s) > textfile
> 
> or any other sgml-to-sgml conversion will do, if it lets you specify a
> treatment for specific sgml-elements.
> 
> Michael Betsch