Corpora: Help please - downloading text from the Web

Mark Davies mdavies at ilstu.edu
Mon Mar 27 16:37:00 UTC 2000


Here are some suggestions on the components involved in creating large
corpora from web-based materials, which you might find useful.  I've used
these to create corpora of Spanish and Portuguese newspapers of 35,000,000
and 25,000,000 words, respectively
(http://mdavies.for.ilstu.edu/personal/texts.htm). I also
presented a paper detailing some of the steps in creating large
multi-million word web-based corpora at the "North American Symposium on
Corpora in Linguistics and Language Teaching" at the Univ. of Michigan in
May 1999, and would be happy to send the handout from that talk to anyone
who is interested.

I'm sure that everyone has their own system and preferred software, but
here's mine:

DOWNLOADING
Re. tools for downloading, I've been using Grab-A-Site
(http://www.bluesquirrel.com). One of the nice features of this program
(which may be shared by others; I'm not sure) is that you can maintain the
directory structure of the site from which you're downloading.  This is
particularly useful in the case of newspapers, where you can store
different days or different sections of the newspaper in separate
directories.  Several times I've set things up to download 5-6 newspapers
during the night, and come back to find 100-150MB of files waiting
patiently for me, which has been really nice.
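
For anyone who would rather script this step, here is a minimal Python
sketch of the "keep the site's directory structure" idea. It is not how
Grab-A-Site works internally; the function names, the "mirror" root
directory and the example URL are all made up for illustration, and it
fetches one page at a time rather than crawling a whole site.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def local_path_for(url: str, root: str = "mirror") -> Path:
    """Map a URL to a local path that mirrors the site's directory layout."""
    parts = urlparse(url)
    path = parts.path
    # A URL with no path, or one ending in "/", is treated as an index page.
    if not path or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so nothing is written outside root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return Path(root, parts.netloc, *segments)


def save_page(url: str, root: str = "mirror") -> Path:
    """Download one page and store it under its mirrored local path."""
    target = local_path_for(url, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response:
        target.write_bytes(response.read())
    return target
```

With a hypothetical address such as
http://example.com/1999/05/12/sports/story1.htm, local_path_for() yields
mirror/example.com/1999/05/12/sports/story1.htm, so each day and each
section ends up in its own directory.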

HTML to ASCII
Re. converting HTML to ASCII, I've found HTMASC32
(http://www.bitenbyte.com/index.htm) to work very nicely.  I've converted
up to 5000 HTML files at one time, as well as single 20MB HTML files
(created by concatenating thousands of smaller webpages), and it's never
had any problem.
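
As a rough illustration of what this conversion step involves, the sketch
below strips tags with Python's standard html.parser module. It is an
assumption-laden stand-in, not a description of HTMASC32: the class name,
the list of block-level tags and the whitespace handling are my own
choices, and real newspaper pages will usually need more cleanup than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping scripts and styles."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &aacute; becomes "á", etc.
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")  # block-level tags start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return the page text, one non-empty line per block of content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Note that the output keeps accented characters, so for Spanish or
Portuguese it is plain text rather than strict 7-bit ASCII.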

MACROS, BATCH FILES, ETC.
I'd also recommend a nice text editor that can do macros, including
conditional looping.  You'll want something like this to clean up the text
files, even after the HTML to ASCII conversion. To write these macros, I use
the old tried-and-true WP 5.1 for DOS, which has a very nice macro language
and can handle files up to 10MB without much problem. Of course it's a DOS
program, so there are problems with 8.3 filenames, etc. In addition, you'll
want to come up to speed (if not already there) on batch files (and using
macros to create these). When you're dealing with hundreds of thousands of
files, you need some way to automate file manipulation.
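
The same kind of job (loop over every file, apply a cleanup rule, rewrite
only the files that changed) can also be sketched in Python instead of
macros and batch files. Everything here is illustrative: the cleanup rules,
the "*.txt" pattern and the utf-8 default are assumptions, and files
downloaded in another encoding (Latin-1, say) would need that passed in
explicitly.

```python
import re
from pathlib import Path


def clean_text(text: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    # Three or more newlines in a row means two or more blank lines;
    # reduce them to a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
    return cleaned + "\n" if cleaned else ""


def clean_tree(root: str, pattern: str = "*.txt", encoding: str = "utf-8") -> int:
    """Clean every matching file under root; return how many were rewritten."""
    changed = 0
    for path in sorted(Path(root).rglob(pattern)):
        original = path.read_text(encoding=encoding)
        cleaned = clean_text(original)
        if cleaned != original:  # only touch files that need it
            path.write_text(cleaned, encoding=encoding)
            changed += 1
    return changed
```

Calling clean_tree() on the top-level directory walks all the
subdirectories, so the per-day and per-section layout from the download
step is left as it was.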

Anyway, just my $.02 worth.

Mark Davies


=======================================
Mark Davies, Associate Professor, Spanish Linguistics
Dept. of Foreign Languages, Illinois State University
Normal, IL 61790-4300

Voice:309/438-7975      email:mdavies at ilstu.edu
Fax:309/438-8038          http://mdavies.for.ilstu.edu/personal/
=======================================