[Corpora-List] To segment HTML document?

Delip Rao deliprao at yahoo.com
Tue Oct 25 16:34:52 UTC 2005


Try j-spider for crawling
http://j-spider.sourceforge.net/

But for HTML segmentation and extraction from HTML
documents you may want to look at the Wrapper work by
Stephen Soderland.
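
For the second part of the original question (connecting to a web page
from Java given its URL), here is a minimal sketch using only the standard
library. The class name HtmlSegmentSketch, the method names fetch and
segment, the 10-second timeouts, and the UTF-8 assumption are all
illustrative choices of mine. The segment method is only a naive split on
block-level tags to show the idea; it is not Soderland's wrapper approach
and will not cope with badly malformed pages.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class HtmlSegmentSketch {

    // Download the page at an http(s) URL and return its body as one string.
    // Assumption: the page is UTF-8. A real crawler would read the charset
    // from the Content-Type header instead.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return sb.toString();
    }

    // Naive segmentation: cut the page at block-level tags, strip any
    // remaining inline tags, and keep the non-empty pieces of text.
    static List<String> segment(String html) {
        String[] parts = html.split(
                "(?i)</?(p|div|h[1-6]|li|ul|ol|table|tr|td|br|hr)\\b[^>]*>");
        List<String> segments = new ArrayList<>();
        for (String part : parts) {
            String text = part.replaceAll("<[^>]*>", " ")   // drop inline tags
                              .replaceAll("\\s+", " ")      // collapse whitespace
                              .trim();
            if (!text.isEmpty()) {
                segments.add(text);
            }
        }
        return segments;
    }

    public static void main(String[] args) throws IOException {
        // The default URL is a placeholder; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "http://example.com/";
        for (String s : segment(fetch(url))) {
            System.out.println(s);
        }
    }
}
```

Run it as "java HtmlSegmentSketch <url>" after compiling. For anything
beyond a quick experiment, an existing crawler plus a proper HTML parser
is likely the better choice, as Chris says below.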

--- Chris Jordan <cjordan at cs.dal.ca> wrote:

> Hey Imen,
> 
> Sounds like you are writing a crawler in Java. If so why reinvent the
> wheel? There are plenty of open source ones lying around.
> 
> ismi.touati wrote:
> 
> > Dear all,
> >  
> > Does anyone know of:
> >    - a program to segment HTML documents (web pages),
> >    - a Java command that can connect to a web page on the internet,
> >      given its URL?
> >  
> > Thanks
> >  
> > All the best
> >  
> > Imen.
> >  
> > //****************************//
> > Imen Touati
> > Master's student at the Faculty of Economic Science and Management
> > of Sfax, Tunisia.
> > LARIS laboratory
> > Address: LARIS, FSEGS, BP 1088, 3018 Sfax, Tunisia
> > Tel: (216) 74 27 87 77
> > e-mail: ismi.touati at laposte.net
> >
> >
> 
> 
> 



	
	
		


