[Corpora-List] What is best for text processing, Perl or Python?

Steve Finch s.finch at daxtra.com
Wed Mar 5 11:31:26 UTC 2008


OK, not to start a religious war, but I would say that no-one has really 
stuck up for Perl.  I would say its OO support is sufficient (and 
characteristically flexible), and that many of its optimised built-in 
features and syntactic sugar (e.g. sorting, tokenizing, pattern matching, 
filtering, iteration) make many of the individual computational operations 
you want to do in language processing expressible in little more than one 
line, and effortlessly well implemented.
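
As a rough illustration of the kind of near-one-line operations meant here 
(tokenizing, sorting, filtering, pattern matching), here is a minimal sketch. 
It is written in Python rather than Perl purely for illustration; the sample 
text and all variable names are made up and do not come from the original 
code.

```python
import re
from collections import Counter

# Hypothetical sample text; any corpus string would do.
text = "The cat sat on the mat. The dog sat on the log."

# Tokenizing: lowercase the text, then pull out runs of letters.
tokens = re.findall(r"[a-z]+", text.lower())

# Counting and sorting: most frequent first, ties broken alphabetically.
freqs = Counter(tokens)
ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))

# Filtering: keep only the words that occur more than once.
repeated = [word for word, count in ranked if count > 1]

# Pattern matching: distinct words ending in "at".
at_words = sorted({t for t in tokens if re.search(r"at$", t)})

print(ranked[:3])
print(repeated)
print(at_words)
```

Each step is a single expression, which is the point being made about 
scripting languages; the Perl equivalents would use split, sort, grep and 
regex matches in much the same way.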

I replicated more or less the entire algorithm set I used for my PhD (corpus 
collection from the web, distributional analysis, clustering, Patricia, and 
HMM POS tagging) in a little over a weekend, from scratch, in fewer than 1300 
not very dense lines.  I certainly could not have done that in C++ or Java.  
Not being a Pythonite, I cannot comment on Python.

One issue I would say *is* important, especially for large experiments, is 
performance.  Although I have not done any controlled tests, my gut feeling 
is that Perl is probably at least 10x slower than optimised and carefully 
implemented C on the same hardware.  I have heard tell that Python's 
performance is not any better than Perl's (in fact I have heard it is worse: 
http://furryland.com/~mikec/bench).  Java-ites, like the Lispers of 1990, 
will go on about how built-in optimisers make their language close in 
performance to C.  It was not true for Lisp (the Xerox POS tagger was at 
least 10x slower than a C POS tagger), and is almost certainly not true for 
Java unless you are very careful or clever and know the right coding 
contortions to go through to be efficient.

So the bottom line is that if you need non-standard large data structures 
(e.g. sparse arrays, compressed bitmap representations of indexes), or if you 
have huge data sets requiring expensive statistical calculations, or have a 
non-trivial context-free parser on an ambiguous grammar which you need to 
apply to a large dataset, it will pay to bite the bullet and go for C or C++, 
at least for some parts of your application, since of all standard high-level 
languages, C/C++ simply cannot be beaten for speed, even when you're not 
trying!
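
For readers unfamiliar with the data structures named above, here is a 
minimal sketch of a sparse array and an (uncompressed) bitmap index. It is in 
Python only to show the idea; the class name, function names and sample 
documents are all hypothetical, no compression is attempted, and at the 
scales discussed here the argument above is that such structures belong in C 
or C++.

```python
from collections import defaultdict


class SparseMatrix:
    """Stores only the non-zero cells, keyed by (row, col)."""

    def __init__(self):
        self.cells = defaultdict(float)

    def add(self, row, col, value=1.0):
        self.cells[(row, col)] += value

    def get(self, row, col):
        # .get() does not create an entry for a missing cell.
        return self.cells.get((row, col), 0.0)


def build_bitmap_index(docs):
    """Map each term to an int whose bit d is set if doc d contains it."""
    index = defaultdict(int)
    for doc_id, doc in enumerate(docs):
        for term in set(doc.split()):
            index[term] |= 1 << doc_id
    return index


def docs_with_all(index, terms, n_docs):
    """Return ids of documents containing every term (bitwise AND)."""
    bits = (1 << n_docs) - 1  # start with all documents selected
    for term in terms:
        bits &= index.get(term, 0)
    return [d for d in range(n_docs) if (bits >> d) & 1]


# Hypothetical toy data.
docs = ["the cat sat", "the dog sat", "a cat ran"]
index = build_bitmap_index(docs)
print(docs_with_all(index, ["cat"], len(docs)))

m = SparseMatrix()
m.add(1000000, 2000000)  # one cell in a notionally huge matrix
print(len(m.cells))
```

The bitmap makes a conjunctive query a handful of AND operations, and the 
sparse matrix costs memory only for cells actually touched; both are where 
interpreter overhead starts to matter on large data.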

- Steve.


On Tuesday 04 March 2008 16:45, Darren Pearce wrote:
> > I heard good things about Python. I don't think PHP is suitable for
> > text processing; it is more for presentation (webpages).
>
> PHP was originally deployed only as a way to serve up dynamic web pages.
> However, it now has a standalone mode. It has full regular expression
> support and is pretty fast (it needs to be in a web context). Given that
> PHP 5 has pretty good object-oriented programming support (similar to
> Java), it's not a bad choice at all. Its OOP support is (IMHO) certainly
> better than Perl's.
>
> :Darren.

-- 
Steven Finch
Daxtra Technologies
Tel: +44 (0)131 653 1250
Email: s.finch at daxtra.com

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


