[Corpora-List] How do we extract actual text in html?

Siddhartha Jonnalagadda sid.kgp at gmail.com
Mon Aug 2 11:24:22 UTC 2010


Thanks all for your replies. I am trying BoilerPipe now; will also look into
the other things mentioned.

thanks again,
siddhartha

On Mon, Aug 2, 2010 at 2:51 AM, Wouter Weerkamp <w.weerkamp at uva.nl> wrote:

> In 2007 there was a workshop on content extraction from web pages. You
> could have a look at the papers presented there:
> http://cleaneval.sigwac.org.uk/
>
> If you intend to follow feeds, and need to extract content from these, you
> can use a learning approach. For each feed you collect a certain number of
> pages, and you learn which part of the page changes, and which parts don't.
> From that it shouldn't be hard to determine "real" content.
>
> You could also have a look at fivefilters; it works pretty well given the
> simple approach it uses:
> http://fivefilters.org/content-only/
> (following a few links, you can get to the (php) code).
>
> Wouter
>
>
>
> On 8/1/10 8:08 PM, Beatrice Alex wrote:
>
>> You might want to check out Boilerpipe:
>>
>> http://code.google.com/p/boilerpipe/
>>
>> Best,
>>
>> Bea
>>
>> ------------------
>> Beatrice Alex
>> Research Fellow and Project Manager at the School of Informatics,
>> University of Edinburgh.
>>
>>
>> On 1 Aug 2010, at 01:26, Siddhartha Jonnalagadda wrote:
>>
>>> Is it trivial to extract the title and relevant text (ignoring the ads
>>> and other irrelevant stuff)? For example, in the website:
>>> http://tvnz.co.nz/world-news/chelsea-clinton-marries-in-ny-3680168
>>>
>>> I am only interested in extracting the title: "Chelsea Clinton marries in
>>> NY"
>>> and the subject below. How easy is this?
>>>
>>> "Bill and Hillary Clinton's daughter married her long-time boyfriend in
>>> the picturesque New York village of Rhinebeck today in what has been dubbed
>>> America's royal wedding.
>>> Chelsea Clinton - the only child of the former US president and the US
>>> secretary of state - wed Marc Mezvinsky at Astor Courts, an historic 50-acre
>>> (20-hectare) estate on the Hudson River, about 160 km north of New York
>>> City.
>>>
>>> "Today, we watched with great pride and overwhelming emotion as Chelsea
>>> and Marc wed in a beautiful ceremony at Astor Courts, surrounded by family
>>> and their close friends," Bill and Hillary Clinton said in a statement.
>>>
>>> "We could not have asked for a more perfect day to celebrate the
>>> beginning of their life together, and we are so happy to welcome Marc into
>>> our family," the statement said.
>>>
>>> "On behalf of the newlyweds, we want to give special thanks to the people
>>> of Rhinebeck for welcoming us and to everyone for their well-wishes on this
>>> special day."
>>>
>>> The statement, sent just after 7:30 pm (12:30pm NZT today), did not
>>> indicate exactly when the nuptials took place.
>>>
>>> On Friday night, Bill and Hillary Clinton waved to crowds of onlookers as
>>> they arrived at the historic Beekman Arms Inn in the center of Rhinebeck for
>>> a late-night cocktail party for some of the wedding guests.
>>>
>>>
>>>
>>> Apart from the parents of the bride, the only other high profile guests
>>> seen in Rhinebeck have been Bill Clinton's former secretary of state,
>>> Madeleine Albright, actors Ted Danson and Mary Steenburgen and fashion
>>> designer Vera Wang.
>>>
>>> Also spotted was real estate scion and movie producer billionaire Steve
>>> Bing. Bing lent Bill Clinton his jet to fly to North Korea in August of last
>>> year to bring home American journalists Laura Ling and Euna Lee after they
>>> spent four months imprisoned in the reclusive communist state.
>>>
>>> Guests boarded buses in Rhinebeck to be taken to Astor Courts about 5 pm
>>> EDT (10am NZT)
>>>
>>> Chelsea Clinton, 30, and Mezvinsky, 32, have known each other since they
>>> were teenagers. He is an investment banker, whose parents Marjorie
>>> Margolies-Mezvinsky and Edward Mezvinsky were once Democratic US House of
>>> Representatives members.
>>>
>>> Chelsea Clinton, who worked at a New York hedge fund and has more
>>> recently studied health policy at Columbia University, has kept a low
>>> profile since her father left the White House in January 2001, although she
>>> campaigned for her mother during her failed run for the 2008 Democratic
>>> presidential nomination.
>>>
>>> Signs and pictures congratulating the newlyweds hang in many shop windows
>>> in Rhinebeck, which has been swarmed by media around the world for an event
>>> that experts estimate to have cost between $US3 million and $US5 million.
>>>
>>> Airspace above Rhinebeck has been closed for 12 hours from 3 pm EDT (7pm
>>> NZT) today for the wedding and media were kept well away from the entrance
>>> to Astor Courts. Security in the area was comparable to that surrounding
>>> state visits.
>>>
>>> The guest list was reported to be between 400 and 500, but did not
>>> include a very understanding President Barack Obama.
>>>
>>> "Hillary and Bill properly want to keep this as a thing for Chelsea and
>>> her soon-to-be husband," Obama said on The View talk show on Thursday. "It
>>> would be tough enough to have one president at a wedding. You don't want two
>>> presidents."
>>>
>>> "
>>> _______________________________________________
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>
>>
>>
>>
>>
>>
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>
>
> --
> ISLA * University of Amsterdam * http://ilps.science.uva.nl
>
>
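
The feed-based idea quoted above (collect several pages from the same feed
and see which parts change) could be sketched roughly as follows. This is
only an illustration under simple assumptions, not anyone's actual
implementation: pages are compared as exact text chunks, and the names
TextBlocks, text_blocks, varying_blocks and the max_share threshold are all
made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TextBlocks(HTMLParser):
    """Collect the text chunks of a page, skipping script/style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and not self._skip_depth:
            self.blocks.append(text)


def text_blocks(html):
    """Return the list of text chunks found in one HTML page."""
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks


def varying_blocks(pages, max_share=0.5):
    """For each page, keep only chunks seen in at most max_share of all pages.

    Chunks repeated across most pages (menus, footers) are treated as
    boilerplate; chunks that differ from page to page are kept as content.
    """
    per_page = [text_blocks(page) for page in pages]
    # Count, for every chunk, the number of pages it occurs in.
    seen_in = Counter(b for blocks in per_page for b in set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if seen_in[b] <= limit] for blocks in per_page]
```

Calling varying_blocks() on a list of HTML strings from one feed returns,
per page, the chunks that are not shared with most of the other pages. Real
pages would likely need fuzzier matching than exact string equality, since
ads and timestamps also change from page to page.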