[Corpora-List] How do we extract actual text in html?

Siddhartha Jonnalagadda sid.kgp at gmail.com
Sun Aug 1 00:26:48 UTC 2010


Is it trivial to extract the title and the relevant text of a web page
(ignoring the ads and other irrelevant material)? For example, on this page:
http://tvnz.co.nz/world-news/chelsea-clinton-marries-in-ny-3680168

I am only interested in extracting the title, "Chelsea Clinton marries in NY",
and the article text below. How easy is this?

"Bill and Hillary Clinton's daughter married her long-time boyfriend in the
picturesque New York village of Rhinebeck today in what has been dubbed
America's royal wedding.

Chelsea Clinton - the only child of the former US president and the US
secretary of state - wed Marc Mezvinsky at Astor Courts, an historic 50-acre
(20-hectare) estate on the Hudson River, about 160 km north of New York
City.

"Today, we watched with great pride and overwhelming emotion as Chelsea and
Marc wed in a beautiful ceremony at Astor Courts, surrounded by family and
their close friends," Bill and Hillary Clinton said in a statement.

"We could not have asked for a more perfect day to celebrate the beginning
of their life together, and we are so happy to welcome Marc into our
family," the statement said.

"On behalf of the newlyweds, we want to give special thanks to the people of
Rhinebeck for welcoming us and to everyone for their well-wishes on this
special day."

The statement, sent just after 7:30 pm (12:30pm NZT today), did not indicate
exactly when the nuptials took place.

On Friday night, Bill and Hillary Clinton waved to crowds of onlookers as
they arrived at the historic Beekman Arms Inn in the center of Rhinebeck for
a late-night cocktail party for some of the wedding guests.

Apart from the parents of the bride, the only other high profile guests seen
in Rhinebeck have been Bill Clinton's former secretary of state, Madeleine
Albright, actors Ted Danson and Mary Steenburgen and fashion designer Vera
Wang.

Also spotted was real estate scion and movie producer billionaire Steve
Bing. Bing lent Bill Clinton his jet to fly to North Korea in August of last
year to bring home American journalists Laura Ling and Euna Lee after they
spent four months imprisoned in the reclusive communist state.

Guests boarded buses in Rhinebeck to be taken to Astor Courts about 5 pm EDT
(10am NZT)

Chelsea Clinton, 30, and Mezvinsky, 32, have known each other since they
were teenagers. He is an investment banker, whose parents Marjorie
Margolies-Mezvinsky and Edward Mezvinsky were once Democratic US House of
Representatives members.

Chelsea Clinton, who worked at a New York hedge fund and has more recently
studied health policy at Columbia University, has kept a low profile since
her father left the White House in January 2001, although she campaigned for
her mother during her failed run for the 2008 Democratic presidential
nomination.

Signs and pictures congratulating the newlyweds hang in many shop windows in
Rhinebeck, which has been swarmed by media around the world for an event
that experts estimate to have cost between $US3 million and $US5 million.

Airspace above Rhinebeck has been closed for 12 hours from 3 pm EDT (7pm
NZT) today for the wedding and media were kept well away from the entrance
to Astor Courts. Security in the area was comparable to that surrounding
state visits.

The guest list was reported to be between 400 and 500, but did not include a
very understanding President Barack Obama.

"Hillary and Bill properly want to keep this as a thing for Chelsea and her
soon-to-be husband," Obama said on The View talk show on Thursday. "It would
be tough enough to have one president at a wedding. You don't want two
presidents."
"
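
To make the question concrete, here is a minimal sketch of a naive baseline,
using only Python's standard html.parser module. It assumes that the title
sits in the <title> element and that the article body is made of <p> elements
that are longer than the surrounding navigation and ad snippets. The names
ArticleExtractor, extract and min_words are hypothetical, the word-count
threshold is an arbitrary guess, and nothing here has been checked against
the actual markup of the page above.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collect the <title> text and the text of each <p> element."""

    SKIP = {"script", "style"}  # elements whose content is never article text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._skip_depth = 0
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buf.append(data)

    def _flush(self):
        # Store the buffered paragraph with whitespace runs collapsed.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []
        self._in_p = False


def extract(html, min_words=8):
    """Return (title, paragraphs); paragraphs under min_words are dropped."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    body = [p for p in parser.paragraphs if len(p.split()) >= min_words]
    return parser.title.strip(), body
```

Calling extract(page_source) on the downloaded HTML would return the title
string and a list of paragraph strings. A length filter like this will drop
short but genuine paragraphs and keep long boilerplate ones, and the <title>
of a real page often carries the site name as well, so this is only a
starting point for the question rather than an answer to it.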
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

