[Corpora-List] junk java

j_kurjian at hotmail.com j_kurjian at hotmail.com
Sun Jun 26 20:41:02 UTC 2005


Hi all,

I've run into this problem several times: I convert HTML files to plain
text and strip out the markup as best I can (most recently with
BeautifulSoup), only to find large chunks of what appears to be Java code
still embedded in many of the texts.
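
If the leftover code comes from <script> or <style> elements (an
assumption; the pages themselves are not shown here), one option is to drop
the contents of those elements during conversion rather than afterwards. The
sketch below uses Python's standard-library html.parser instead of
BeautifulSoup, and the names TextExtractor and html_to_text are made up for
illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping anything inside <script> or <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string without script or style contents."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

This will not catch code that sits in the page as ordinary text, or code
left exposed by badly broken markup such as an unclosed script element.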

I've tried writing code to strip it out, but it is pretty resistant.  At
present I'm looking for duplicate chunks of it, which I will try to use as
templates to erase the rest, but it is not a happy process and is certain
to miss occurrences that are not duplicated.
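
The duplicate-chunk idea could be sketched roughly as follows, assuming a
"chunk" is a single line and that a line counts as junk once it appears in
at least min_docs different files. Both assumptions, and the function names,
are illustrative only.

```python
from collections import Counter


def find_repeated_lines(texts, min_docs=2):
    """Return the set of lines that occur in at least min_docs texts."""
    doc_counts = Counter()
    for text in texts:
        # Use a set so each distinct line is counted once per document.
        lines = {line.strip() for line in text.splitlines() if line.strip()}
        doc_counts.update(lines)
    return {line for line, n in doc_counts.items() if n >= min_docs}


def strip_repeated_lines(text, repeated):
    """Drop every line of text whose stripped form is in repeated."""
    kept = [line for line in text.splitlines() if line.strip() not in repeated]
    return "\n".join(kept)
```

As noted above, this leaves code that occurs only once, and it will also
delete legitimate text that happens to repeat across files (navigation
labels, standard headings), so the threshold needs checking against a
sample of the corpus.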

Has anyone else had this problem?  Has anyone managed to overcome it
satisfactorily?

Jerry



