[Corpora-List] What is best for text processing Perl or Python?

Oliver Mason O.Mason at bham.ac.uk
Wed Mar 5 16:25:52 UTC 2008


Given that most linguists have not learnt programming from early on, I
would say that C/C++ is not suitable for most text/corpus processing
tasks, especially research-oriented work, where there is a lot of
exploratory 'hacking'.

The main reason is that C/C++ is too powerful for most users.  You can
do everything that you want, but most of these things you wouldn't
want to do in the first place.  So it is very easy to have plenty of
errors in your software, and finding those is a very time-consuming
job.  The time you save in running (due to speed) you are most likely
to waste in staring at your code trying to find where things went
wrong.  And while the software runs you can have a cup of tea or a
nap, but you can't sleep when finding errors.

A few years ago I ported a large corpus access framework (15+K lines)
from C to Java, and in the process most bugs got eliminated without me
doing anything.  Not having to worry about memory management speeds up
coding tremendously and reduces the scope for many errors.  And Java
*has* become faster.  For most applications in language processing I
would dispute the claim that it is 10 times slower than C.  Perhaps
graphical desktop applications still lag, but for most processing I do
I find it is plenty fast enough.

Scripting languages are probably best for the occasional user, as
development is easier.  And as a beginner you shouldn't worry too much
about run-time performance.  Speed of development and robustness are
more important.  What benefit is it if your program runs 10 times
faster but continuously crashes for unknown reasons?
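
To illustrate the point about development speed, here is a minimal
sketch (in Python) of a word-frequency count.  The function name
word_frequencies, the sample sentence and the crude regex tokeniser
are assumptions made up for this illustration; a real corpus would
need a more careful tokeniser.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased word tokens in a string (hypothetical helper)."""
    # Crude tokenisation: runs of ASCII letters, allowing internal apostrophes.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


# Made-up sample text, for illustration only.
sample = "The cat sat on the mat. The mat didn't mind."
freqs = word_frequencies(sample)

# Print the two most frequent words with their counts.
for word, count in freqs.most_common(2):
    print(word, count)
```

There is no memory management and no buffer handling to get wrong
here, which is the kind of robustness that matters more to a beginner
than raw speed.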

My personal opinion of Perl is that it's too cryptic... but I haven't
worked with Perl for a long time.  I guess it'd still be better than C
for the purpose.  The best solution is probably not to program at all,
but to use the Unix text tools.

If you want performance, the real way forward is Erlang.  Easy
concurrency allows you to run your programs on multi-core
architectures (or multiple connected computers) which is the direction
computing will go in the future.  But I wouldn't recommend learning
Erlang to a beginner...

Oliver



