[Corpora-List] Amazon MapReduce instructions for a simple java functionality

Ted Pedersen tpederse at d.umn.edu
Thu May 26 00:00:42 UTC 2011


Hi Siddhartha,

I have not used the Amazon MapReduce framework, so I can't help you
there. However, I did take the plunge into Hadoop/MapReduce earlier
this year (via a semester-long class I was teaching), and I must say
that while I really enjoyed Hadoop/MapReduce, I found the Java aspects
of it just hateful. It is hard to do even simple things with Java in
Hadoop/MapReduce unless you are using streaming (which I would
strongly encourage, unless you have a specific need for some of the
built-in API functionality).

But if you really must use Java via the API, then I think you will
just need to be patient and work through lots of mind-numbing detail
and practice exercises. I found the Hadoop Book by Tom White quite
helpful in terms of the Java mechanics (although it was still hateful).
There are also chapters on a few alternatives to Java, as well as a
decent discussion of streaming.

http://www.hadoopbook.com/
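
Just to give a flavor of what the API route looks like, here is a
rough, untested sketch of a mapper written against the
org.apache.hadoop.mapreduce API (the Hadoop 0.20-style API). The class
name and the hard-coded dictionary are made up purely for illustration,
and note that you still need a Reducer and a driver class that
configures the Job before any of this runs, which is where much of the
mind-numbing detail lives:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical example class: emits (token, 1) for every input token
// that appears in a small in-memory dictionary.
public class DictionaryMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Set<String> dictionary = new HashSet<String>();

    @Override
    protected void setup(Context context) {
        // A real job would load the dictionary from HDFS or the
        // DistributedCache; hard-coded here to keep the sketch short.
        dictionary.add("corpus");
        dictionary.add("annotation");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line on whitespace and emit matching tokens.
        for (String token : value.toString().split("\\s+")) {
            if (dictionary.contains(token)) {
                context.write(new Text(token), ONE);
            }
        }
    }
}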

Otherwise, I'm not sure that Java is really your only or best choice
for developing NLP applications - it's obviously popular, although, as
the poll you cited also shows, C/C++ is quite popular, as are Python
and, to a lesser extent, my old friend Perl. I'd say that any of them
are good choices for various sorts of NLP (as are many other
languages), and you ought not to fall into the trap of thinking you
can only use MapReduce with Java.

The only reason I would seriously consider using Java with MapReduce
is if I had a ton of time for development, and if squeezing every last
bit of performance out of the code were highly important. In fact, it
has been my observation that if you are a mediocre Java programmer,
it's very likely that your MapReduce programs will run surprisingly
slowly even if you work hard on them for a long time. So I think using
streaming is a very good option (and in fact any language that can
read standard input and write standard output can be used this way,
which is most programming languages).
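
By the way, the streaming route doesn't even require abandoning Java -
a streaming mapper is just any executable that reads lines from
standard input and writes lines to standard output. Here's a rough,
untested sketch (the class name, jar name, and paths below are only
placeholders) of an ordinary Java program that could be run that way:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// A plain stdin-to-stdout filter that Hadoop streaming can run as a
// mapper. It echoes each input line that contains a dictionary word.
public class StreamFilter {
    public static void main(String[] args) throws Exception {
        Set<String> dictionary =
                new HashSet<String>(Arrays.asList("corpus", "annotation"));
        BufferedReader in =
                new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String token : line.split("\\s+")) {
                if (dictionary.contains(token)) {
                    System.out.println(line);
                    break;
                }
            }
        }
    }
}

You would then launch it with something roughly like this (the
location of the streaming jar varies by Hadoop version, and /bin/cat
just acts as an identity reducer):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -input /data/sentences -output /data/filtered \
      -mapper 'java -cp simple.jar StreamFilter' \
      -reducer /bin/cat \
      -file simple.jar

and Hadoop looks after splitting the input across machines and
stitching the output back together.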

There are also languages built on top of MapReduce that speed up
development without sacrificing too much in terms of performance. The
one I've had good luck with is Pig, which is quite easy to learn and
develop in. More info here: http://pig.apache.org

To summarize, as a relative novice to MapReduce/Hadoop, I think what
I've learned so far is the following:

1) Avoid Java unless you are really good at it already, *and* have
significant time to devote to development.
2) If you have existing code that you don't want to rewrite, consider
streaming. It's very possible you can get things running on lots of
machines without many modifications.
3) Consider alternatives built on top of MapReduce (like Pig) that
will speed development without sacrificing too much in terms of
performance.

I hope this helps.

Cordially,
Ted

On Wed, May 25, 2011 at 4:58 PM, Siddhartha Jonnalagadda
<sid.kgp at gmail.com> wrote:
> Hi All,
>
> Thanks for confirming that MapReduce is the way to go, and for the tutorials!
> I was trying to go through some of the tutorials, but they lack specific
> details about using a Java project. So I changed my question. Please excuse
> me if you consider this discussion inappropriate for this list and ignore
> the rest. I thought this is a problem that many of us would be facing. Java
> is the most popular language for NLP
> (http://nlpers.blogspot.com/2009/03/programming-language-of-choice.html), and
> we all need to map to clusters and reduce our processing. Further, Amazon
> servers are the way to go for many who don't have access to personal HPC
> clusters.
>
> Wondering if someone could help me with precise instructions for using Amazon
> MapReduce for the simple Java program below? It has one class that takes an
> input, has a dictionary, and produces an output (basically, it outputs
> whatever is in the input if it is present in the dictionary). I would use
> that as a template for my Java application. I need MapReduce because I want
> to decrease the time taken for a complex application n-fold.
>
> I'm kind of lost trying to learn different things. It is easier to do it the
> other way, I guess. Someone, please?
>
> Here is the tested code:
> http://dl.dropbox.com/u/6777654/Simple.zip
>
> I greatly appreciate you spending 5-10 minutes giving simple instructions
> that a Java programmer with knowledge of MapReduce and familiarity with
> Amazon servers could use.
>
> Thanks.
>
> Sincerely,
> Siddhartha Jonnalagadda,
> Text mining Researcher, Lnx Research, LLC, Orange, CA
> sjonnalagadda.wordpress.com
>
>
> Confidentiality Notice:
>
> This e-mail message, including any attachments, is for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the sender
> by reply e-mail and destroy all copies of the original message.
>
>
>
>
> On Sat, May 21, 2011 at 3:14 PM, Siddhartha Jonnalagadda <sid.kgp at gmail.com>
> wrote:
>>
>> I have a single-threaded Java (NLP) application that processes 1000
>> sentences in 1 hour. I obviously can't wait 1000 hours to process a
>> million sentences. Are there any simple instructions for making my program
>> run on 100 servers at a time? This involves migrating the project workspace
>> onto each of them (or creating them from a snapshot that contains it) and
>> concatenating the output that each server produces.
>>
>> Any quick pointers, please? I spent a couple of hours browsing through the
>> Amazon MapReduce documentation, but that didn't take me very far...
>>
>> Since I don't own shares in Amazon, I am open to non-Amazon solutions too.
>>
>> Sincerely,
>> Siddhartha Jonnalagadda,
>> Text mining Researcher, Lnx Research, LLC, Orange, CA
>> sjonnalagadda.wordpress.com
>>
>
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


