[Corpora-List] using a program such as boilerpipe to automatically get the MAIN text
Radim Rehurek
RadimRehurek at seznam.cz
Fri Aug 5 22:28:14 UTC 2011
Hello Siddhartha,
You can try Honza Pomikalek's "justext" tool: http://code.google.com/p/justext/
That page also has a small demo where you can check whether it does what you need, tweak the parameters, etc.
Best,
Radim
> ------------ Original message ------------
> From: Siddhartha Jonnalagadda <sid.kgp at gmail.com>
> Subject: [Corpora-List] using a program such as boilerpipe to automatically get
> the MAIN text
> Date: 06.8.2011 00:17:40
> ----------------------------------------
> boilerpipe [1] served my purposes for a while. When I tried to do more
> serious work, such as extracting news articles, it failed. For example, for
> http://www.innovations-report.com/html/reports/studies/report-92130.html
> it pulls only the parts that are not relevant,
>
> and for http://www.highbeam.com/doc/1P1-160189301.html
> it pulls text from related articles, which I'm not interested in.
>
> [1] Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl,
> "Boilerplate Detection using Shallow Text Features",
> <http://www.l3s.de/%7Ekohlschuetter/publications/wsdm187-kohlschuetter.pdf>,
> WSDM 2010 -- The Third ACM International Conference on Web Search and Data
> Mining, New York City, NY, USA.
>
> Any suggestions on tools or addons?
>
> Sincerely,
> Siddhartha Jonnalagadda, Ph.D.
> sjonnalagadda.wordpress.com