[Corpora-List] using a program such as boilerpipe to automatically get the MAIN text
Siddhartha Jonnalagadda
sid.kgp at gmail.com
Fri Aug 5 23:07:28 UTC 2011
Thanks Orion,
That might work in most of my cases, but there are some cases where the
largest content is not necessarily the MAIN content. For the highbeam.com
page, I would have been more satisfied if it returned:
*ROBYN BECK
Getty Images
01-20-2009
Incoming US First Lady Michelle Obama (L) and daughters Malia...*
*Full Size JPG (909 KB)*
*Incoming US First Lady Michelle Obama (L) and daughters Malia (2nd L) and
Sasha watch Barack Obama being … *
or if it said that there is no MAIN content.
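
To make the second option concrete, here is a rough Python sketch of what I
have in mind. The endpoint and the url/extractor/output parameters are copied
from the boilerpipe-web URL quoted below; the function names and the min_words
cutoff are hypothetical placeholders of mine, not part of boilerpipe, and the
cutoff value is an arbitrary assumption that would need tuning.

```python
from urllib.parse import urlencode

# Endpoint copied from the boilerpipe-web URL quoted in this thread.
BOILERPIPE_WEB = "http://boilerpipe-web.appspot.com/extract"


def build_extract_url(page_url, extractor="LargestContentExtractor",
                      output="htmlFragment"):
    """Build a boilerpipe-web request URL for a page and an extractor."""
    query = urlencode({"url": page_url, "extractor": extractor,
                       "output": output})
    return f"{BOILERPIPE_WEB}?{query}"


def main_text_or_none(extracted_text, min_words=50):
    """Hypothetical guard: treat a very short extraction as 'no MAIN content'.

    min_words is an assumed, untuned threshold, not a boilerpipe setting.
    """
    text = extracted_text.strip()
    if len(text.split()) < min_words:
        return None  # signal that no MAIN content was found
    return text
```

A caller would fetch build_extract_url(...) with whatever HTTP client it
prefers and pass the response text to main_text_or_none; a result of None
would stand for "there is no MAIN content".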
Any thoughts?
Sincerely,
Siddhartha Jonnalagadda, Ph.D.
sjonnalagadda.wordpress.com
On Fri, Aug 5, 2011 at 3:33 PM, Orion Montoya <orion at mdcclv.com> wrote:
>
>
> On Fri, Aug 5, 2011 at 2:49 PM, Siddhartha Jonnalagadda <sid.kgp at gmail.com> wrote:
>
>> boilerpipe [1] served my purposes temporarily. When I tried to do more
>> serious stuff such as reading the news, it fails. For example, consider:
>> http://www.innovations-report.com/html/reports/studies/report-92130.html
>> it pulls only what is not relevant
>>
>
> This works when you use LargestContentExtractor:
>
> http://boilerpipe-web.appspot.com/extract?url=http%3A%2F%2Fwww.innovations-report.com%2Fhtml%2Freports%2Fstudies%2Freport-92130.html&extractor=LargestContentExtractor&output=htmlFragment
>
>
>
>>
>> or http://www.highbeam.com/doc/1P1-160189301.html
>> it pulls stuff from related articles, which I'm not interested in.
>>
>
> I can't see any interesting content on this page at all—it's trying to
> upsell me for a trial subscription.
>