[Corpora-List] Python Benchmarks Using Lots of Memory

Fri Nov 12 18:14:31 UTC 2010

Dear Carl Friedrich,

Most of the software I use for CPU- or memory-intensive computation
contains bits of Cython (so it would be awesome if PyPy could talk to
cpdef functions in Cython modules and automagically optimize away the
boxing/unboxing at the PyPy/Cython boundary). Still, here are two
examples of code that will probably fit your bill, in that they read
in data and use more memory the more data you feed them:

* The DECCA toolkit looks at sequences of POS tags and of words:
  http://decca.osu.edu/software.php
* NLTK is a pure-Python toolkit that contains implementations of a number of
  useful NLP algorithms, and includes sample datasets that show what it does
  with them:
  http://www.nltk.org/
  The NLTK book should be a very useful source of examples that can be
  used directly on the sample data:
  http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html

NLTK is probably a very good testbed for PyPy, since
* it comes with its own data, so there's no need to hunt for datasets
  or produce synthetic data
* it's actually written with clarity in mind and probably contains less
  squeezing-the-last-drops-of-performance-out-of-CPython code than
  other projects.
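
To give a rough idea of the kind of workload I mean, here is a minimal
sketch of an n-gram counter in plain Python. It is not taken from DECCA
or NLTK; the function name count_ngrams and the sample sentence are made
up for illustration, and whitespace splitting stands in for a real
tokenizer. The table of counts grows with the number of distinct n-grams
in the input, which is why memory use goes up as you feed it more text.

```python
from collections import Counter


def count_ngrams(tokens, n=2):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # zip over n shifted copies of the token list yields the n-grams as tuples;
    # zip stops at the shortest copy, so no partial n-grams are produced.
    return Counter(zip(*(tokens[i:] for i in range(n))))


if __name__ == "__main__":
    # Made-up sample text; a real run would read a corpus file instead.
    text = "the cat sat on the mat and the cat slept"
    bigrams = count_ngrams(text.split(), n=2)
    print(len(bigrams), "distinct bigrams")
    print(bigrams.most_common(1))
```

Since every key is a tuple of strings, a larger corpus keeps many small
string and tuple objects alive at once, which seems close to the
string-heavy, memory-hungry behaviour you asked about.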

Best wishes,
Yannick Versley

On Fri, Nov 12, 2010 at 6:52 PM, Carl Friedrich Bolz <cfbolz at gmx.de> wrote:
> Hello all,
>
> I'm a computer science PhD student at the University of Düsseldorf working
> on improving the memory behavior (and also performance) of a Python
> implementation [1]. For that reason I am looking for "real-world" Python
> programs that consume a lot of memory when running (around 50MB RAM or
> above) and ideally do a lot of string manipulations. Possible examples could
> be corpus analysis tools like concordances or n-gram/collocation analyzers.
>
> So if you have written such a Python program or script and want to send it
> to me, I would be very grateful (and would treat it confidentially if you
> wish). The benefit for you would be that future Python implementations might
> become optimized for precisely your use case :-).
>
> Thanks a lot and best wishes,
>
> Carl Friedrich Bolz
>
>
> [1] http://pypy.org
>
> --
> Carl Friedrich Bolz
> Institut für Informatik
> Heinrich-Heine-Universität Düsseldorf
> Universitätsstr. 1
> 40225 Düsseldorf
> Germany
> +49 211 8110537
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
