Corpora: Help please - downloading text from the Web

Ralf Steinberger ralf.steinberger at jrc.it
Mon Mar 27 10:44:12 UTC 2000


I have a related question. What tools do you use once you have downloaded
the HTML files to (batch-)convert them in reasonably clean "plain" text? 


HTML TO TEXT CONVERTERS

Recently, I had the same need: a tool that batch-converts HTML files to TXT format. There are a number of free HTML to TEXT converters, some of them reasonable. See, for instance, http://www.softseek.com/Internet/Web_Publishing_Tools/HTML_Conversion.

However, none of the tools I found could deal with JavaScript in the HTML source document, so the presence of scripts led to gibberish in the TXT file. I finally found the software HTML2TEXT from the company TENMAX, which costs 80 USD but works perfectly. See http://www.tenmax.com/tools/home.htm. Originally, it interpreted many accented characters as word breaks, but TENMAX support staff changed the software so that it can now deal with all (?) European languages.
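
For anyone who would rather script this step, the two problems described above (script code leaking into the output, and accented characters being split) can be illustrated with a minimal sketch in Python using only the standard library. This is not how HTML2TEXT or any of the tools mentioned here work; the names TextExtractor, html_to_text and convert_dir, the list of block-level tags, and the default UTF-8 encoding are all assumptions made for the example.

```python
from html.parser import HTMLParser
from pathlib import Path

# Assumed set of tags that should act as word/line boundaries.
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6", "title"}


class TextExtractor(HTMLParser):
    """Collect visible text, dropping <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into the
        # real character before handle_data sees them, so accented
        # letters do not split a word.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Text inside <script>/<style> is ignored entirely.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string on a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace (including newlines) to one space.
    return " ".join("".join(parser.parts).split())


def convert_dir(src_dir, dst_dir, encoding="utf-8"):
    """Convert every *.html file in src_dir to a .txt file in dst_dir.

    The encoding is an assumption; older pages are often ISO-8859-1.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.html")):
        html = src.read_text(encoding=encoding, errors="replace")
        (dst / (src.stem + ".txt")).write_text(html_to_text(html),
                                               encoding="utf-8")
```

Calling convert_dir("pages", "text") would then write one .txt file per downloaded page; both directory names are placeholders. Real-world pages with badly broken markup may still need a more forgiving parser than this sketch provides.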


SOFTWARE TO DOWNLOAD WEB PAGES

TENMAX also sells the downloading tool Teleport Pro for 40 USD, which works rather well, even though it has some restrictions (http://www.tenmax.com/teleport/home.htm).

I hope this helps.

Ralf


Ralf Steinberger (ralf.steinberger at jrc.it)
European Commission, Joint Research Center (http://www.jrc.cec.eu.int/jrc)


 -----Original Message-----
From: 	owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no]  On Behalf Of Jean Veronis
Sent:	27 March 2000 10:10
To:	corpora at hd.uib.no
Subject:	Re: Corpora: Help please - downloading text from the Web

This list of tools is very interesting (thanks!).

I have a related question. What tools do you use once you have downloaded
the HTML files to (batch-)convert them in reasonably clean "plain" text? 

