[Corpora-List] jumk java
Marco Baroni
baroni at sslmit.unibo.it
Mon Jun 27 08:33:28 UTC 2005
Do you mean javascript?
I use vilistextum:
http://bhaak.dyndns.org/vilistextum/
and it seems to do a good job at removing javascript and html code.
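For anyone who would rather script this step, here is a minimal sketch of the same kind of cleanup using only the Python standard library. It is not how vilistextum works internally (I am not assuming anything about its implementation); the names TextExtractor and strip_html are made up for illustration. It keeps the visible text and drops tags plus the contents of script and style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments seen so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with a space and collapse runs of whitespace.
        # Note: this also puts a space where a tag splits a word.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling strip_html on a page source returns one whitespace-normalised string; real-world pages with broken markup may need a more forgiving parser.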
Also, BTE (part of the Hyppia project):
http://smi.ucd.ie/hyppia/
recommended to me on this list, tries to guess what is the "interesting"
content of a page, and removes everything else (thus, not only html and
javascript, but any text it believes to be boilerplate). If your goal is
precision rather than recall (i.e., it's ok to occasionally throw away
good content as long as what you keep is consistently good content), it
does an excellent job. It's a bit slow, though.
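To make the precision/recall trade-off concrete, here is a rough sketch of one heuristic in this spirit: pick the single contiguous stretch of the page that has the most words and the fewest tags. I am not claiming this is what BTE actually does; the function name main_content and the tokenising regex are my own assumptions. Scoring each word +1 and each tag -1 turns the problem into a maximum-sum span search.

```python
import re

# A token is either a whole tag (<...>) or a run of non-space, non-"<" characters.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def main_content(html):
    """Return the words of the contiguous span richest in text (illustrative)."""
    tokens = TOKEN_RE.findall(html)

    best_sum, best_start, best_end = 0, 0, 0   # best span so far (empty)
    cur_sum, cur_start = 0, 0                  # span ending at current token

    for i, tok in enumerate(tokens):
        score = -1 if tok.startswith("<") else 1   # tags cost, words pay
        if cur_sum <= 0:
            # A non-positive running sum cannot help: start a new span here.
            cur_sum, cur_start = 0, i
        cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1

    words = [t for t in tokens[best_start:best_end] if not t.startswith("<")]
    return " ".join(words)
```

Because only one span is kept, good text sitting in a separate, tag-heavy part of the page is thrown away along with the navigation links, which is the kind of lost recall described above.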
Regards,
Marco