[Corpora-List] using a program such as boilerpipe to automatically get the MAIN text

Orion Montoya orion at mdcclv.com
Fri Aug 5 22:53:36 UTC 2011


I forgot to reply-all on this:

On Fri, Aug 5, 2011 at 2:49 PM, Siddhartha Jonnalagadda
<sid.kgp at gmail.com> wrote:

> boilerpipe [1] served my purposes temporarily. When I tried to do more
> serious work, such as reading the news, it failed. For example, for
> http://www.innovations-report.com/html/reports/studies/report-92130.html
> it pulls only the text that is not relevant,
>
> and for http://www.highbeam.com/doc/1P1-160189301.html
> it pulls text from related articles, which I'm not interested in.
>
> [1] Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl,
> "Boilerplate Detection using Shallow Text Features",
> WSDM 2010: The Third ACM International Conference on Web Search and Data
> Mining, New York City, NY, USA.
> <http://www.l3s.de/%7Ekohlschuetter/publications/wsdm187-kohlschuetter.pdf>
>
> Any suggestions on tools or addons?
>
> Sincerely,
> Siddhartha Jonnalagadda, Ph.D.
> sjonnalagadda.wordpress.com
>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20110805/55653be1/attachment.htm>

