[Corpora-List] using a program such as boilerpipe to automatically get the MAIN text

Siddhartha Jonnalagadda sid.kgp at gmail.com
Fri Aug 5 23:07:28 UTC 2011


Thanks Orion,

That might work in most of my cases, but there are some cases where the
largest content is not necessarily the MAIN content. For the highbeam.com
page, I would have been more satisfied if it returned:

*ROBYN BECK
Getty Images
01-20-2009
Incoming US First Lady Michelle Obama (L) and daughters Malia...*

*Full Size JPG (909 KB)*

*Incoming US First Lady Michelle Obama (L) and daughters Malia (2nd L) and
Sasha watch Barack Obama being … *

or if it said that there is no MAIN content.
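
A minimal sketch of that second option, in Python rather than boilerpipe's
own Java API. It assumes the page has already been split into text blocks by
some other step; the function name `pick_main_text` and the `min_words`
threshold are hypothetical, chosen only for illustration. The idea is to keep
the largest block but report "no MAIN content" (here, `None`) when even that
block is too short to be an article body.

```python
from typing import Optional, Sequence


def pick_main_text(blocks: Sequence[str], min_words: int = 50) -> Optional[str]:
    """Return the block with the most words, or None if it is too short.

    `blocks` is assumed to be the page's text blocks, already extracted.
    `min_words` is an arbitrary cutoff (an assumption, not a tuned value).
    """
    if not blocks:
        return None  # nothing was extracted at all

    # Largest block by word count, mirroring the "largest content" idea.
    largest = max(blocks, key=lambda block: len(block.split()))

    # Refuse to call a short teaser or caption the MAIN content.
    if len(largest.split()) < min_words:
        return None
    return largest
```

With a cutoff like this, a page that only offers a caption and a subscription
teaser would come back as `None` instead of unrelated text. The right
threshold would need to be tuned per site.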


Any thoughts?


Sincerely,
Siddhartha Jonnalagadda, Ph.D.
sjonnalagadda.wordpress.com




On Fri, Aug 5, 2011 at 3:33 PM, Orion Montoya <orion at mdcclv.com> wrote:

>
>
> On Fri, Aug 5, 2011 at 2:49 PM, Siddhartha Jonnalagadda <sid.kgp at gmail.com
> > wrote:
>
>> boilerpipe [1] served my purposes temporarily. When I tried to do more
>> serious stuff such as reading the news, it fails. For example, consider:
>>  http://www.innovations-report.com/html/reports/studies/report-92130.html
>> it pulls only what is not relevant
>>
>
> This works when you use LargestContentExtractor:
>
> http://boilerpipe-web.appspot.com/extract?url=http%3A%2F%2Fwww.innovations-report.com%2Fhtml%2Freports%2Fstudies%2Freport-92130.html&extractor=LargestContentExtractor&output=htmlFragment
>>
>
>>
>> or http://www.highbeam.com/doc/1P1-160189301.html
>> it pulls stuff from related articles, which I'm not interested in.
>>
>
> I can't see any interesting content on this page at all—it's trying to
> upsell me for a trial subscription.
>