[Corpora-List] jumk java

Marco Baroni baroni at sslmit.unibo.it
Mon Jun 27 08:33:28 UTC 2005


Do you mean javascript?

I use vilistextum:

http://bhaak.dyndns.org/vilistextum/

and it seems to do a good job at removing javascript and html code.
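
For anyone who just wants to see the basic idea, here is a minimal sketch
in Python of stripping script blocks and HTML tags from a page. This is not
how vilistextum works internally; it only uses the standard library's
html.parser, and the names TextOnly and strip_markup are made up for
illustration.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while we are inside <script>/<style>
        self.chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_markup(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling strip_markup() on a page string gives back the text fragments with
the tags and script bodies dropped. A real tool also has to cope with
encodings, entities in odd places, and badly broken markup, which this
sketch does not attempt.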

Also, BTE (part of the Hyppia project):

http://smi.ucd.ie/hyppia/

recommended to me on this list, tries to guess what the "interesting"
content of a page is, and removes everything else (thus, not only html and
javascript, but any text it believes to be boilerplate). If your goal is
precision rather than recall (i.e., it's ok to occasionally throw away
good content as long as what you keep is consistently good content), it
does an excellent job. It's a bit slow, though.
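
To illustrate the precision-over-recall trade-off, here is a toy filter in
Python. It is not BTE's method, just a hypothetical heuristic (the function
name and the min_words threshold are my own assumptions): keep a text block
only if it is long enough and ends like a sentence.

```python
def keep_likely_content(blocks, min_words=10):
    """Keep blocks that look like running text.

    A block is kept only if it has at least `min_words` words and ends
    with sentence-final punctuation. Short but genuine paragraphs are
    discarded too, which is the price paid for higher precision.
    """
    kept = []
    for block in blocks:
        words = block.split()
        ends_like_sentence = block.rstrip().endswith((".", "!", "?"))
        if len(words) >= min_words and ends_like_sentence:
            kept.append(block)
    return kept
```

With blocks such as a navigation bar ("Home | About | Contact"), a full
paragraph, and a three-word sentence, only the full paragraph survives: the
navigation bar is correctly dropped, and the short sentence is lost even
though it was good content.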

Regards,

Marco


