[Corpora-List] workstation advice for corpus linguistics work

Justin Washtell lec3jrw at leeds.ac.uk
Mon Jan 17 21:17:32 UTC 2011


Hi Don,

I don't know a great deal about these particular software packages or what you will be trying to do with them.

At the risk of stating the obvious, what I can tell you - with a moderate degree of confidence - is that for many kinds of corpus work the most important thing is to have access to a lot of RAM! Trying to work with data structures which exceed your available RAM will plunge you into virtual memory hell, and things which ought to take a few minutes can end up taking a few days. Co-occurrence matrices are a good case in point, but increasingly there are corpora which are very large in their own right.
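To give a rough feel for the numbers (just back-of-the-envelope R here, and the 100,000-word vocabulary is only an illustrative assumption):

    # A dense co-occurrence matrix over a 100,000-word vocabulary,
    # stored as double-precision numbers (8 bytes each):
    vocab <- 100000
    bytes <- vocab * vocab * 8
    bytes / 1024^3    # roughly 75 GiB - far more RAM than any desktop has

In practice you would of course use a sparse representation, but even those can grow well past a few gigabytes once you move beyond modest window sizes and corpora.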

With 32-bit processors there is a hard limit of 4GB of RAM, some of which is inevitably claimed by the operating system. Windows is particularly greedy, but some judicious jiggling of the system settings can get you an extra gigabyte or so beyond what you can access "out of the box".

With 64-bit processors and an appropriate operating system (e.g. 64-bit Linux builds, and certain versions of Windows Vista and Windows 7, excluding the "Starter" and "Home" versions), the RAM that you can access is effectively (for the time being) unlimited... so I would put those two features (processor and operating system) very high on the list.
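If you end up using R for the heavy lifting, you can check from within R whether you are actually running a 64-bit build (the same idea applies to your other tools):

    .Machine$sizeof.pointer   # 8 on a 64-bit build of R, 4 on a 32-bit one
    R.version$arch            # e.g. "x86_64" on 64-bit systems
    memory.limit()            # Windows builds only: the current limit in MB

A 64-bit machine running a 32-bit build of your software still leaves you stuck at the old limits, so it is worth checking.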

Video cards are irrelevant for most corpus work... with one notable exception. If you are working with a lot of matrices, or otherwise parallelizable code, then running it on your GPU (as opposed to your CPU) becomes a very tempting possibility. In that case you want the best CUDA- or OpenCL-compatible GPU you can get your hands on. There are extensions for R and Matlab which will help you take advantage of your GPU in this way. Depending on the nature of the code you can get orders of magnitude more throughput than from a comparable CPU implementation. You don't necessarily need a top-end graphics card to see these gains either - a couple of hundred dollars' worth of card may prove worthwhile. I've not actually done this myself yet, mind you, so by all means do your research. I'm sure some people on this list have experience of working in this way.
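For what it's worth, the sort of thing I have in mind looks roughly like the sketch below. I haven't run this myself; it assumes the gputools package (one of the CUDA-backed extensions for R), so take the exact names with a pinch of salt and check the package documentation:

    # Sketch: compare a large matrix multiplication on the CPU and the GPU.
    library(gputools)   # CUDA-backed R extension (assumed installed)

    n <- 2000
    a <- matrix(rnorm(n * n), n, n)
    b <- matrix(rnorm(n * n), n, n)

    system.time(cpu <- a %*% b)           # ordinary CPU multiplication
    system.time(gpu <- gpuMatMult(a, b))  # the same operation on the CUDA device

The speed-up you actually see will depend heavily on the card, the matrix sizes, and how much time is lost shuttling data to and from the GPU.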

Hard disk storage is big and cheap, and getting bigger and cheaper, so I wouldn't worry too much about that factor. Unless you have some very serious needs (mirroring substantial chunks of the World Wide Web, say), just buy as many hundreds of gigabytes as you can afford, and buy a few hundred more as and when you need them.

Justin Washtell
University of Leeds


________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Donald E Hardy [donhardy at unr.edu]
Sent: 17 January 2011 19:54
To: CORPORA at UIB.NO
Subject: [Corpora-List] workstation advice for corpus linguistics work

Dear all,

I’m looking for advice on purchasing a workstation for corpus work.

This is the software I will be using and the operating systems I think I will need:

R (e.g., for multiple runs of Fisher’s exact test)
Word
Windows
Linux
Perl programs (multiple text manipulation programs)
Excel
Access
Perhaps other SQL applications
XAIRA
ICECUP 3.1

I’m sure there will be other software packages added to the list.

Corpora include data gathered from the Corpus of Contemporary American English, the Corpus of Historical American English, the BNC, Treebank, ICE-GB, Brown, and Frown.

I’m looking at Dell workstations.

Recommendations I’m looking for are operating system(s), CPU, RAM, video card, hard disk, and RAID.

I am relatively computer literate (I program in Perl and manage a server), and I do have expert technicians locally for help and advice. However, I don’t have anyone locally to advise on the best system setup for corpus linguistics work.

Thanks very much,

Don Hardy


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


