[Corpora-List] How do we extract actual text in html?

Beatrice Alex balex at staffmail.ed.ac.uk
Sun Aug 1 18:08:23 UTC 2010


You might want to check out Boilerpipe:

http://code.google.com/p/boilerpipe/
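
If you would rather see the general idea before pulling in a library, here is a rough sketch in Python (standard library only). To be clear, this is not Boilerpipe's API or its actual algorithm; it is just a crude heuristic in a similar spirit: split the page into text blocks, then keep blocks that have enough words and are not mostly link text. The names `BlockCollector` and `extract_article`, the tag sets, and both thresholds are my own assumptions and would need tuning on real pages.

```python
from html.parser import HTMLParser

# Tags whose content is never article text.
SKIP_TAGS = {"script", "style", "noscript"}
# Tags treated as block boundaries (an assumed, non-exhaustive set).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol",
              "h1", "h2", "h3", "br", "body"}


class BlockCollector(HTMLParser):
    """Collects the <title> text and splits the rest into text blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.blocks = []      # (text, word_count, words_inside_links)
        self._words = []
        self._link_words = 0
        self._in_title = False
        self._link_depth = 0
        self._skip_depth = 0

    def _flush(self):
        # Close the current block, if it holds any words.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "title":
            self._in_title = False
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title += data
            return
        words = data.split()
        self._words.extend(words)
        if self._link_depth:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_article(html, min_words=10, max_link_density=0.3):
    """Return (title, kept_blocks) using word count and link density."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = [text for text, n, linked in parser.blocks
            if n >= min_words and linked / n <= max_link_density]
    return " ".join(parser.title.split()), kept
```

Calling `extract_article(html_string)` on a downloaded page returns the title and a list of candidate paragraphs. Short captions and navigation menus tend to fall below the word threshold, and link-heavy ad blocks tend to exceed the link-density threshold, but expect misses on pages with unusual markup.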

Best,

Bea

------------------
Beatrice Alex
Research Fellow and Project Manager at the School of Informatics, University of Edinburgh.


On 1 Aug 2010, at 01:26, Siddhartha Jonnalagadda wrote:

> Is it trivial to extract the title and relevant text (ignoring the ads and other irrelevant stuff)? For example, in the website: http://tvnz.co.nz/world-news/chelsea-clinton-marries-in-ny-3680168
> 
> I am only interested in extracting the title: "Chelsea Clinton marries in NY"
> and the subject below. How easy is this?
> 
> "Bill and Hillary Clinton's daughter married her long-time boyfriend in the picturesque New York village of Rhinebeck today in what has been dubbed America's royal wedding.
> Chelsea Clinton - the only child of the former US president and the US secretary of state - wed Marc Mezvinsky at Astor Courts, an historic 50-acre (20-hectare) estate on the Hudson River, about 160 km north of New York City.
> 
> "Today, we watched with great pride and overwhelming emotion as Chelsea and Marc wed in a beautiful ceremony at Astor Courts, surrounded by family and their close friends," Bill and Hillary Clinton said in a statement.
> 
> "We could not have asked for a more perfect day to celebrate the beginning of their life together, and we are so happy to welcome Marc into our family," the statement said.
> 
> "On behalf of the newlyweds, we want to give special thanks to the people of Rhinebeck for welcoming us and to everyone for their well-wishes on this special day."
> 
> The statement, sent just after 7:30 pm (12:30pm NZT today), did not indicate exactly when the nuptials took place.
> 
> On Friday night, Bill and Hillary Clinton waved to crowds of onlookers as they arrived at the historic Beekman Arms Inn in the center of Rhinebeck for a late-night cocktail party for some of the wedding guests.
> 
> 
>  
> Apart from the parents of the bride, the only other high profile guests seen in Rhinebeck have been Bill Clinton's former secretary of state, Madeleine Albright, actors Ted Danson and Mary Steenburgen and fashion designer Vera Wang.
> 
> Also spotted was real estate scion and movie producer billionaire Steve Bing. Bing lent Bill Clinton his jet to fly to North Korea in August of last year to bring home American journalists Laura Ling and Euna Lee after they spent four months imprisoned in the reclusive communist state.
> 
> Guests boarded buses in Rhinebeck to be taken to Astor Courts about 5 pm EDT (10am NZT)
> 
> Chelsea Clinton, 30, and Mezvinsky, 32, have known each other since they were teenagers. He is an investment banker, whose parents Marjorie Margolies-Mezvinsky and Edward Mezvinsky were once Democratic US House of Representatives members.
> 
> Chelsea Clinton, who worked at a New York hedge fund and has more recently studied health policy at Columbia University, has kept a low profile since her father left the White House in January 2001, although she campaigned for her mother during her failed run for the 2008 Democratic presidential nomination.
> 
> Signs and pictures congratulating the newlyweds hang in many shop windows in Rhinebeck, which has been swarmed by media around the world for an event that experts estimate to have cost between $US3 million and $US5 million.
> 
> Airspace above Rhinebeck has been closed for 12 hours from 3 pm EDT (7pm NZT) today for the wedding and media were kept well away from the entrance to Astor Courts. Security in the area was comparable to that surrounding state visits.
> 
> The guest list was reported to be between 400 and 500, but did not include a very understanding President Barack Obama.
> 
> "Hillary and Bill properly want to keep this as a thing for Chelsea and her soon-to-be husband," Obama said on The View talk show on Thursday. "It would be tough enough to have one president at a wedding. You don't want two presidents."
> 
> "
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora


