[Corpora-List] How do we extract actual text in html?

Anil Singh anil at research.iiit.ac.in
Mon Aug 2 14:13:31 UTC 2010


CleanEval is a good place to find out about the problems and many of the
solutions. However, my experience is that it ultimately depends on your exact
needs. The methods can be broadly categorized into two classes: deterministic
and learning-based. Unless you want to work on data with completely arbitrary
formats, learning doesn't seem to be a good idea.

There is some code for text extraction from HTML documents and one or two
utilities in Sanchay <http://sanchay.co.in>, but there is no documentation
and it is not connected to the current public GUI. The available code
will have to be slightly modified for specific formats: this means writing
some simple code that uses the HTML parser library to effectively create a
template for extracting a specific format. For a single format, this is not
very time-consuming.
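
To make the template idea concrete, here is a minimal sketch of a
deterministic, per-format extractor. It is not the Sanchay code; it uses
Python's standard html.parser module, and the tag and class names in TEMPLATE
("headline", "story") are hypothetical and would have to be read off the
actual pages of the site in question. It also assumes reasonably well-nested
markup (for example, paragraphs that are explicitly closed).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text content is never part of the article.
SKIP_TAGS = {"script", "style"}


class TemplateExtractor(HTMLParser):
    """Collect the text found inside the elements named by a template.

    template maps a field name to (tag, attribute, value), e.g.
    {"title": ("h1", "class", "headline")}.
    """

    def __init__(self, template):
        super().__init__(convert_charrefs=True)
        self.template = template
        self.fields = {name: [] for name in template}
        self._active = None   # field currently being captured, if any
        self._depth = 0       # nesting depth inside the captured element
        self._skip = False    # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS:
            self._skip = True
        if self._active is not None:
            self._depth += 1
            return
        attr_map = dict(attrs)
        for name, (want_tag, attr, value) in self.template.items():
            # Match on one whitespace-separated token, so class="story main"
            # still matches the value "story".
            tokens = (attr_map.get(attr) or "").split()
            if tag == want_tag and value in tokens:
                self._active = name
                self._depth = 1
                return

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS:
            self._skip = False
        if self._active is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._active = None

    def handle_data(self, data):
        text = data.strip()
        if self._active is not None and not self._skip and text:
            self.fields[self._active].append(text)


def extract(html, template):
    """Return {field name: extracted text} for one page."""
    parser = TemplateExtractor(template)
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


# Hypothetical template and page; real sites will use different markup.
TEMPLATE = {"title": ("h1", "class", "headline"),
            "body": ("div", "class", "story")}

SAMPLE = """
<html><head><title>News</title><script>var x = 1;</script></head><body>
<div class="nav"><a href="/">Home</a> <a href="/ads">Ads</a></div>
<h1 class="headline">Chelsea Clinton marries in NY</h1>
<div class="story main">
  <p>First paragraph<br>continued.</p>
  <script>track();</script>
  <p>Second <b>bold</b> paragraph.</p>
</div>
<div class="footer">Copyright</div>
</body></html>
"""

if __name__ == "__main__":
    for field, text in extract(SAMPLE, TEMPLATE).items():
        print(field, ":", text)
```

Supporting another site then only means writing another TEMPLATE dictionary,
which is why a single format takes little time. The price is that the template
breaks whenever the site changes its markup.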

On Mon, Aug 2, 2010 at 4:54 PM, Siddhartha Jonnalagadda
<sid.kgp at gmail.com> wrote:

> Thanks all for your replies. I am trying BoilerPipe now; will also look
> into the other things mentioned.
>
> thanks again,
> siddhartha
>
> On Mon, Aug 2, 2010 at 2:51 AM, Wouter Weerkamp <w.weerkamp at uva.nl> wrote:
>
>> In 2007 there was a workshop on content extraction from web pages. You
>> could have a look at the papers presented there:
>> http://cleaneval.sigwac.org.uk/
>>
>> If you intend to follow feeds, and need to extract content from these, you
>> can use a learning approach. For each feed you collect a certain number of
>> pages, and you learn which part of the page changes, and which parts don't.
>> From that it shouldn't be hard to determine "real" content.
>>
>> You could also have a look at fivefilters; it works pretty well given the
>> simple approach it uses:
>> http://fivefilters.org/content-only/
>> (following a few links, you can get to the (php) code).
>>
>> Wouter
>>
>>
>>
>> On 8/1/10 8:08 PM, Beatrice Alex wrote:
>>
>>> You might want to check out Boilerpipe:
>>>
>>> http://code.google.com/p/boilerpipe/
>>>
>>> Best,
>>>
>>> Bea
>>>
>>> ------------------
>>> Beatrice Alex
>>> Research Fellow and Project Manager at the School of Informatics,
>>> University of Edinburgh.
>>>
>>>
>>> On 1 Aug 2010, at 01:26, Siddhartha Jonnalagadda wrote:
>>>
>>>> Is it trivial to extract the title and relevant text (ignoring the ads
>>>> and other irrelevant stuff)? For example, in the website:
>>>> http://tvnz.co.nz/world-news/chelsea-clinton-marries-in-ny-3680168
>>>>
>>>> I am only interested in extracting the title: "Chelsea Clinton marries in
>>>> NY"
>>>> and the subject below. How easy is this?
>>>>
>>>> "Bill and Hillary Clinton's daughter married her long-time boyfriend in
>>>> the picturesque New York village of Rhinebeck today in what has been dubbed
>>>> America's royal wedding.
>>>> Chelsea Clinton - the only child of the former US president and the US
>>>> secretary of state - wed Marc Mezvinsky at Astor Courts, an historic 50-acre
>>>> (20-hectare) estate on the Hudson River, about 160 km north of New York
>>>> City.
>>>>
>>>> "Today, we watched with great pride and overwhelming emotion as Chelsea
>>>> and Marc wed in a beautiful ceremony at Astor Courts, surrounded by family
>>>> and their close friends," Bill and Hillary Clinton said in a statement.
>>>>
>>>> "We could not have asked for a more perfect day to celebrate the
>>>> beginning of their life together, and we are so happy to welcome Marc into
>>>> our family," the statement said.
>>>>
>>>> "On behalf of the newlyweds, we want to give special thanks to the
>>>> people of Rhinebeck for welcoming us and to everyone for their well-wishes
>>>> on this special day."
>>>>
>>>> The statement, sent just after 7:30 pm (12:30pm NZT today), did not
>>>> indicate exactly when the nuptials took place.
>>>>
>>>> On Friday night, Bill and Hillary Clinton waved to crowds of onlookers
>>>> as they arrived at the historic Beekman Arms Inn in the center of Rhinebeck
>>>> for a late-night cocktail party for some of the wedding guests.
>>>>
>>>>
>>>>
>>>> Apart from the parents of the bride, the only other high profile guests
>>>> seen in Rhinebeck have been Bill Clinton's former secretary of state,
>>>> Madeleine Albright, actors Ted Danson and Mary Steenburgen and fashion
>>>> designer Vera Wang.
>>>>
>>>> Also spotted was real estate scion and movie producer billionaire Steve
>>>> Bing. Bing lent Bill Clinton his jet to fly to North Korea in August of last
>>>> year to bring home American journalists Laura Ling and Euna Lee after they
>>>> spent four months imprisoned in the reclusive communist state.
>>>>
>>>> Guests boarded buses in Rhinebeck to be taken to Astor Courts about 5 pm
>>>> EDT (10am NZT)
>>>>
>>>> Chelsea Clinton, 30, and Mezvinsky, 32, have known each other since they
>>>> were teenagers. He is an investment banker, whose parents Marjorie
>>>> Margolies-Mezvinsky and Edward Mezvinsky were once Democratic US House of
>>>> Representatives members.
>>>>
>>>> Chelsea Clinton, who worked at a New York hedge fund and has more
>>>> recently studied health policy at Columbia University, has kept a low
>>>> profile since her father left the White House in January 2001, although she
>>>> campaigned for her mother during her failed run for the 2008 Democratic
>>>> presidential nomination.
>>>>
>>>> Signs and pictures congratulating the newlyweds hang in many shop
>>>> windows in Rhinebeck, which has been swarmed by media around the world for
>>>> an event that experts estimate to have cost between $US3 million and $US5
>>>> million.
>>>>
>>>> Airspace above Rhinebeck has been closed for 12 hours from 3 pm EDT (7pm
>>>> NZT) today for the wedding and media were kept well away from the entrance
>>>> to Astor Courts. Security in the area was comparable to that surrounding
>>>> state visits.
>>>>
>>>> The guest list was reported to be between 400 and 500, but did not
>>>> include a very understanding President Barack Obama.
>>>>
>>>> "Hillary and Bill properly want to keep this as a thing for Chelsea and
>>>> her soon-to-be husband," Obama said on The View talk show on Thursday. "It
>>>> would be tough enough to have one president at a wedding. You don't want two
>>>> presidents."
>>>>
>>>> "
>>>> _______________________________________________
>>>> Corpora mailing list
>>>> Corpora at uib.no
>>>> http://mailman.uib.no/listinfo/corpora
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>
>> --
>> ISLA * University of Amsterdam * http://ilps.science.uva.nl
>>
>