[Corpora-List] What is best for text processing, Perl or Python?
Chris Jordan
chris.jordan at acm.org
Wed Mar 5 16:15:23 UTC 2008
On 5-Mar-08, at 11:40 AM, maxwell at umiacs.umd.edu wrote:
> Steve Finch wrote:
>> OK, not to get into a religious war I would say that no-one has
>> really stuck up for Perl.
>> ...many of the individual computational operations you want to
>> do in language processing [are] expressible in little more than 1
>> line, and effortlessly well implemented.
>> ...
>> Not being a Pythonite, I cannot comment on Python.
>
> Being a Pythonite (and as a morphologist, I approve of that word!)
> myself, I guess I would say that the ability of Perl to express such
> operations in one line is both its benefit and its shortcoming.
> Personally, I can't understand my own code a month after I've written
> it (comes with olde age), and I would much rather take a dozen lines
> and comments to code some operation, so I can understand later what I
> did.
>
> Of course, if you want to do a one-off (maybe on the command line),
> then Perl is the way to go. (Unix/Linux commands like sed, grep, awk
> etc. will often also get you there, although if you use the non-ASCII
> portion of Unicode heavily, using these programs can get quite messy.)
>
> Mike Maxwell
> CASL/ U MD
>
I think the choice between Perl and Python depends largely on the text
processing task you are trying to do and on the availability of
existing packages for it.
As for my own personal rant about Perl, there are two really huge
advantages to using it for NLP/IR:
1 - cpan & perldoc - OK, so this is more of an advantage of using Perl
for developing scripts and software in general. cpan and perldoc are
two incredible tools for developing software quickly. cpan lets you
search for and install Perl modules from the CPAN repository. perldoc
lets you search Perl's documentation, which is full of function
explanations and code examples, many of which were authored by Larry
Wall himself.
2 - RSPerl (http://www.omegahat.org/RSPerl/) - This is a Perl
interface to R and S. In other words, any statistical calculations
that you want to do can be offloaded to R or S, which are amazing
statistical applications in terms of both functionality and
performance. The combination of Perl's regular expression
functionality and R's statistical capabilities allows for rapid
development of statistical NLP experiments/scripts.
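For the Python side of this thread, the same regex-then-statistics
workflow can be roughly sketched with the standard library alone. This
is not RSPerl and makes no claim to match R's statistical depth; the
tokenizer pattern and the function names token_frequencies and
frequency_summary are made up for illustration.

```python
import re
import statistics
from collections import Counter

# Assumption: a token is a run of word characters. Real tokenizers
# for NLP work are usually more careful than this.
TOKEN_RE = re.compile(r"\w+")


def token_frequencies(text):
    """Extract lowercase tokens with a regex and count them."""
    return Counter(TOKEN_RE.findall(text.lower()))


def frequency_summary(freqs):
    """Compute a few simple statistics over the token counts."""
    counts = list(freqs.values())
    return {
        "types": len(counts),       # number of distinct tokens
        "tokens": sum(counts),      # total number of tokens
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
    }


if __name__ == "__main__":
    freqs = token_frequencies("The cat saw the dog. The dog ran.")
    print(freqs.most_common(2))
    print(frequency_summary(freqs))
```

The point is only the division of labour: regular expressions do the
extraction, and a separate statistics layer does the arithmetic.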
Now Perl has a serious disadvantage: Perl hashes eat memory like water
through an exponential garden hose... that metaphor doesn't really
work, but basic hashes in Perl were not well thought out, and I have
not found an easy way to deallocate hashes that are no longer used or
to reduce fragmentation. This only matters if you are dealing with a
large data set, which is fairly common in statistical NLP. You can
work around the hash problem by using an SQL database and treating a
table like a hash. However, you will take a performance hit unless you
tune your SQL DBMS to keep as much as possible in memory and to cache
in a smart way, which can be painful to figure out.
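The table-as-hash workaround might look roughly like the sketch below.
It is written in Python with the bundled sqlite3 module rather than
Perl, purely as an illustration; the class name SqlCounter, the table
layout, and the cache size are all assumptions, not a tested tuning
recipe.

```python
import sqlite3


class SqlCounter:
    """Hash-like token counter backed by an SQLite table.

    Hypothetical helper for illustration only.
    """

    def __init__(self, path=":memory:", cache_kib=65536):
        self.conn = sqlite3.connect(path)
        # A negative cache_size asks SQLite for a page cache of that
        # many KiB. The default here is an arbitrary assumption.
        self.conn.execute("PRAGMA cache_size = -%d" % int(cache_kib))
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS counts ("
            "token TEXT PRIMARY KEY, n INTEGER NOT NULL)"
        )

    def add(self, token, amount=1):
        """Increment the count for token, like $h{$token} += $amount."""
        cur = self.conn.execute(
            "UPDATE counts SET n = n + ? WHERE token = ?", (amount, token)
        )
        if cur.rowcount == 0:
            # No existing row was updated, so this is a new key.
            self.conn.execute(
                "INSERT INTO counts (token, n) VALUES (?, ?)", (token, amount)
            )

    def get(self, token):
        """Return the stored count, or 0 for an unseen token."""
        row = self.conn.execute(
            "SELECT n FROM counts WHERE token = ?", (token,)
        ).fetchone()
        return row[0] if row else 0

    def __len__(self):
        return self.conn.execute("SELECT COUNT(*) FROM counts").fetchone()[0]

    def close(self):
        self.conn.commit()
        self.conn.close()


if __name__ == "__main__":
    counts = SqlCounter()
    for tok in "the cat saw the dog".split():
        counts.add(tok)
    print(counts.get("the"), len(counts))
    counts.close()
```

Every lookup is now a query, which is where the performance hit comes
from. Passing a file path instead of ":memory:" moves the table to
disk, so the data set no longer has to fit in the process's memory.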
Sorry if I repeated anything that was brought up earlier in the
thread; I am only half paying attention while I FINALLY finish the PhD.
--
Chris Jordan
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora