[Corpora-List] What is best for text processing: Perl or Python?

Chris Jordan chris.jordan at acm.org
Wed Mar 5 16:15:23 UTC 2008


On 5-Mar-08, at 11:40 AM, maxwell at umiacs.umd.edu wrote:

> Steve Finch wrote:
>> OK, not to get into a religious war I would say that no-one has
>> really stuck up for Perl.
>> ...many of the individual computational operations you want to
>> do in language processing [are] expressible in little more than 1
>> line, and effortlessly well implemented.
>> ...
>> Not being a Pythonite, I cannot comment on Python.
>
> Being a Pythonite (and as a morphologist, I approve of that word!)
> myself, I guess I would say that the ability of Perl to express such
> operations in one line is both its benefit and its shortcoming.
> Personally, I can't understand my own code a month after I've written
> it (comes with olde age), and I would much rather take a dozen lines
> and comments to code some operation, so I can understand later what I
> did.
>
> Of course, if you want to do a one-off (maybe on the command line),
> then Perl is the way to go.  (Unix/Linux commands like sed, grep, awk
> etc. will often also get you there, although if you use the non-ASCII
> portion of Unicode heavily, using these programs can get quite messy.)
>
>   Mike Maxwell
>   CASL/ U MD
>


I think the choice between Perl and Python depends largely on the text
processing task you are trying to do and on which existing packages
are available for it.

As for my own personal rant about Perl, there are two really huge  
advantages to using it for NLP/IR:
1 - cpan & perldoc - OK, so this is more of an advantage of using Perl
for developing scripts and software in general. cpan and perldoc are
two incredible tools for helping you develop software quickly. cpan
lets you search for and install Perl modules from the CPAN repository.
perldoc lets you look up Perl's documentation, which is full of
function explanations and code examples, many of which were authored
by Larry Wall himself.
2 - RSPerl (http://www.omegahat.org/RSPerl/) - This is a Perl
interface to R and S. In other words, you can offload any statistical
calculations you want to do to R or S, which are amazing statistical
applications in terms of both functionality and performance. The
combination of Perl's regular expression functionality and R's
statistical capabilities allows for rapid development of statistical
NLP experiments and scripts.

Now Perl has a serious disadvantage: Perl hashes eat memory like water
through an exponential garden hose... that metaphor doesn't really
work, but basic hashes in Perl were not well thought out, and I have
not found an easy way to deallocate hashes that are no longer used or
to reduce fragmentation. This only matters if you are dealing with a
large data set, which is fairly common in statistical NLP. You can
work around the hash problem by using an SQL database and treating a
table like a hash. However, you will take a performance hit unless you
tune your SQL DBMS to keep as much as possible in memory and to cache
in a smart way, which can be painful to figure out.
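
To make the "table as a hash" idea concrete, here is a rough sketch.
It is written in Python with the standard sqlite3 module rather than
in Perl, purely for illustration; the class name SqliteCounts, the
table name counts and the 64 MiB cache size are all made up for this
example, and SQLite is just one possible DBMS choice.

```python
import sqlite3
from collections.abc import MutableMapping


class SqliteCounts(MutableMapping):
    """Dict-like term -> count store backed by one SQLite table.

    Hypothetical helper for illustration only.
    """

    def __init__(self, path=":memory:"):
        # Pass a filename instead of ":memory:" to keep the data on disk.
        self.conn = sqlite3.connect(path)
        # Assumed tuning knob: a negative cache_size is a size in KiB,
        # so this asks SQLite for roughly 64 MiB of page cache.
        self.conn.execute("PRAGMA cache_size = -65536")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS counts "
            "(term TEXT PRIMARY KEY, n INTEGER NOT NULL)"
        )

    def __getitem__(self, term):
        row = self.conn.execute(
            "SELECT n FROM counts WHERE term = ?", (term,)
        ).fetchone()
        if row is None:
            raise KeyError(term)
        return row[0]

    def __setitem__(self, term, n):
        self.conn.execute(
            "INSERT OR REPLACE INTO counts (term, n) VALUES (?, ?)",
            (term, n),
        )

    def __delitem__(self, term):
        cur = self.conn.execute(
            "DELETE FROM counts WHERE term = ?", (term,)
        )
        if cur.rowcount == 0:
            raise KeyError(term)

    def __iter__(self):
        # Stream keys from the table instead of loading them all at once.
        for (term,) in self.conn.execute("SELECT term FROM counts"):
            yield term

    def __len__(self):
        return self.conn.execute(
            "SELECT COUNT(*) FROM counts"
        ).fetchone()[0]


# Usage: count tokens exactly as you would with an ordinary hash.
counts = SqliteCounts()
for token in "the cat sat on the mat".split():
    counts[token] = counts.get(token, 0) + 1
```

Every lookup and update here is a separate SQL statement, which is
where the performance hit comes from. With an on-disk file you would
also need to call counts.conn.commit() to make the changes permanent.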

Sorry if I repeated anything that was brought up earlier in the
thread; I am only half paying attention while I FINALLY finish the PhD.
-- 
Chris Jordan

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


