[Corpora-List] What is best for text processing Perl or Python?
Steve Finch
s.finch at daxtra.com
Wed Mar 5 11:31:26 UTC 2008
OK, not to get into a religious war, but I would say that no-one has really
stuck up for Perl. I would say its OO support is sufficient (and
characteristically flexible), and that many of its optimised built-in features
and syntactic sugar (e.g. sorting, tokenizing, pattern matching, filtering,
iteration) make many of the individual computational operations you want to
do in language processing expressible in little more than one line, and
effortlessly well implemented.
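As a rough illustration of the kind of one-line operations meant here, the
following is a minimal sketch in Python rather than Perl (Perl's split, grep,
sort and regex operators would play the same roles). The sample sentence and
all variable names are made up for the example; it is not taken from any of
the code mentioned in this thread.

```python
import re
from collections import Counter

# Hypothetical sample text, for illustration only.
text = "The cat sat on the mat. The dog sat on the log."

# Tokenizing: lower-case the text and pull out runs of letters.
tokens = re.findall(r"[a-z]+", text.lower())

# Pattern matching + filtering: keep tokens ending in "at".
at_words = [t for t in tokens if re.search(r"at$", t)]

# Counting and sorting: word frequencies, most frequent first,
# ties broken alphabetically.
freqs = Counter(tokens)
ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))

# Iteration: print a small tab-separated frequency table.
for word, count in ranked:
    print(f"{word}\t{count}")
```

Each step is a single expression, which is the point being made: tokenizing,
filtering, counting and sorting need no hand-written loops or data structures.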
I replicated more or less the entire algorithm set I used for my PhD (corpus
collection from the web, distributional analysis, clustering, Patricia tries,
and HMM POS tagging) in a little over a weekend from scratch in fewer than
1300 not very dense lines. I certainly could not have done that in C++ or Java.
Not being a Pythonite, I cannot comment on Python.
One issue I would say *is* important, especially for large experiments, is
performance. Although I have not done any controlled tests, my gut feeling is
that Perl is probably at least 10x slower than optimised and carefully
implemented C on the same hardware. I have heard tell that Python's
performance is not any better than Perl's (in fact I have heard it is worse -
http://furryland.com/~mikec/bench). Java-ites - like the Lispers of 1990 -
will go on about how built-in optimisers make their language close in
performance to C. It was not true for Lisp (the Xerox POS tagger was at least
10x slower than a C POS tagger), and is almost certainly not true for Java
unless you are very careful or clever and know the right coding contortions
to go through to be efficient.
So the bottom line is that if you need non-standard large data structures
(e.g. sparse arrays, compressed bitmap representations of indexes), or if you
have huge data sets requiring expensive statistical calculations, or have a
non-trivial context-free parser on an ambiguous grammar which you need to
apply to a large dataset, it will pay to bite the bullet and go for C or C++,
at least for some parts of your application, since of all standard high-level
languages, C/C++ simply cannot be beaten for speed, even when you're not
trying!
- Steve.
On Tuesday 04 March 2008 16:45, Darren Pearce wrote:
> > I heard good things about Python. I don't think PHP is suitable for
> > text processing; it is more for presentation (webpages).
>
> PHP was originally deployed only as a way to serve up dynamic web pages.
> However, it now has a standalone mode. It has full regular expression
> support and is pretty fast (it needs to be in a web context). Given that
> PHP 5 has pretty good object-oriented programming support (similar to
> Java), it's not a bad choice at all. Its OOP support is (IMHO) certainly
> better than Perl's.
>
> :Darren.
--
Steven Finch
Daxtra Technologies
Tel: +44 (0)131 653 1250
Email: s.finch at daxtra.com