[Corpora-List] using a program such as boilerpipe to automatically get the MAIN text

Siddhartha Jonnalagadda sid.kgp at gmail.com
Fri Aug 5 21:49:51 UTC 2011


boilerpipe [1] served my purposes for a while. When I tried more serious
work such as reading news pages, it failed. For example, on
 http://www.innovations-report.com/html/reports/studies/report-92130.html
it pulls only the text that is not relevant,

and on http://www.highbeam.com/doc/1P1-160189301.html
it pulls text from related articles, which I'm not interested in.

[1] Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl,
"Boilerplate Detection using Shallow Text Features"
<http://www.l3s.de/%7Ekohlschuetter/publications/wsdm187-kohlschuetter.pdf>,
WSDM 2010 -- The Third ACM International Conference on Web Search and Data
Mining, New York City, NY, USA.

Any suggestions on tools or addons?

Sincerely,
Siddhartha Jonnalagadda, Ph.D.
sjonnalagadda.wordpress.com