[Corpora-List] Simple instructions to scale a java application?
John K Pate
j.k.pate at sms.ed.ac.uk
Sat May 21 23:37:20 UTC 2011
On Sat, 2011-05-21 at 15:14 -0700, Siddhartha Jonnalagadda wrote:
> I have a single-threaded Java (NLP) application that processes 1000
> sentences in 1 hour. I obviously can't wait 1000 hours to process
> a million sentences. Are there any simple instructions to make my
> program run on 100 servers at a time? This involves migrating the
> project workspace onto each of them (or creating them from a snapshot
> that contains it) and concatenating the output that each server
> produces.
>
> Any quick pointers, please? I spent a couple of hours browsing through
> the Amazon MapReduce documentation, but that didn't take me very far...
Hadoop provides an open-source implementation of MapReduce:
http://hadoop.apache.org/
but I haven't used it myself.
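
Independent of any framework, the split / process / concatenate pattern
described in the question can be sketched in plain Java. This is only an
illustration under stated assumptions, not the Hadoop API: the names
ShardSketch, split, processShard and concatenate are hypothetical, and
String::toUpperCase stands in for the real per-sentence NLP step. In a
real run each shard would be handed to a different server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class ShardSketch {
    // Split the input into numShards contiguous chunks whose sizes differ by at most one.
    public static <T> List<List<T>> split(List<T> items, int numShards) {
        List<List<T>> shards = new ArrayList<>();
        int base = items.size() / numShards;
        int extra = items.size() % numShards;
        int start = 0;
        for (int i = 0; i < numShards; i++) {
            int size = base + (i < extra ? 1 : 0);
            shards.add(new ArrayList<>(items.subList(start, start + size)));
            start += size;
        }
        return shards;
    }

    // "Map" step: what a single server would do with its own shard.
    public static List<String> processShard(List<String> shard,
                                            Function<String, String> perSentence) {
        List<String> out = new ArrayList<>();
        for (String sentence : shard) {
            out.add(perSentence.apply(sentence));
        }
        return out;
    }

    // "Reduce" step: concatenate the shard outputs in shard order,
    // so the final output lines up with the original input order.
    public static List<String> concatenate(List<List<String>> shardOutputs) {
        List<String> all = new ArrayList<>();
        for (List<String> output : shardOutputs) {
            all.addAll(output);
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of("a b", "c d e", "f", "g h", "i");
        List<List<String>> outputs = new ArrayList<>();
        for (List<String> chunk : split(sentences, 3)) {
            // Assumption: toUpperCase is a placeholder for the real NLP processing.
            outputs.add(processShard(chunk, String::toUpperCase));
        }
        System.out.println(concatenate(outputs));
    }
}
```

Because the shards are contiguous and are concatenated in shard order,
the combined output keeps the order of the input sentences.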
> Since I don't own shares in Amazon, I am open to non-Amazon solutions
> too.
I have personally used actors in Scala to parallelize my NLP
applications. Since Scala compiles to JVM bytecode, you can simply call
your Java library from your Scala code. The Scala standard library has
its own implementation of local and remote actors, or you can use Akka
actors:
http://akka.io/
Akka actors can also be used directly in Java code, without having to
write any Scala yourself (see:
http://akka.io/docs/akka/1.1/java/remote-actors.html and
http://akka.io/docs/akka/1.1/java/untyped-actors.html).
If you're interested, I have some Scala examples of using actors for
this purpose.
Here is a library (EM for PCFGs) where I used Scala standard library
actors to parallelize across arbitrarily many machines and cores:
http://github.com/jpate/ShakesEM/
and here is a library (EM for a few different kinds of DBNs) where I use
Akka actors to parallelize across arbitrarily many cores (and call MALLET
Java code):
http://github.com/jpate/prosodicParsing/
Both of these send sentences to local or remote actors, which produce
expectations of the appropriate type for each sentence and send them
back to the manager actor for the maximization step.
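
The manager/worker shape described above can be sketched without any
actor library. The following is a minimal illustration that swaps in a
plain java.util.concurrent thread pool for actors, so it only covers the
cores of one machine; it is not the code from either repository. The
names ManagerWorkerSketch, expectations and collect are hypothetical,
and simple token counts stand in for the real per-sentence expectations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ManagerWorkerSketch {
    // Worker: placeholder for a per-sentence E-step.
    // Assumption: token counts stand in for real model expectations.
    public static Map<String, Double> expectations(String sentence) {
        Map<String, Double> counts = new HashMap<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1.0, Double::sum);
            }
        }
        return counts;
    }

    // Manager: submits one task per sentence, then sums the results
    // that the workers send back.
    public static Map<String, Double> collect(List<String> sentences, int numWorkers)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        try {
            List<Future<Map<String, Double>>> pending = new ArrayList<>();
            for (String sentence : sentences) {
                Callable<Map<String, Double>> task = () -> expectations(sentence);
                pending.add(pool.submit(task));
            }
            Map<String, Double> totals = new HashMap<>();
            for (Future<Map<String, Double>> result : pending) {
                // get() blocks until that worker has finished its sentence.
                result.get().forEach((key, value) -> totals.merge(key, value, Double::sum));
            }
            return totals;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Double> totals =
                collect(List.of("the dog barks", "the cat sleeps"), 4);
        System.out.println(totals.get("the")); // prints 2.0
    }
}
```

The summed totals are what a maximization step would then consume.
Spreading the workers across machines needs a remoting layer such as the
remote actors mentioned earlier, which this sketch does not attempt.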
Hope this helps.
John
==
John K Pate
Student, PhD Informatics, The University of Edinburgh
http://homepages.inf.ed.ac.uk/s0930006/