[Corpora-List] High-performance Computing and NLP

Oliver Mason O.Mason at bham.ac.uk
Wed Mar 17 14:09:00 UTC 2010


I have not done much work on parallel/high-performance NLP, as it is
not my main area, but I was exploring some issues recently, as we have
a high-performance cluster available at Birmingham.

The easiest way to exploit the computing power was, ironically, to run
the same (sequential) procedure multiple times in parallel. I was
looking at the behaviour of words, so each process had a different
list of words to work through. As there were no dependencies between
the individual processes, they needed no coordination. It felt rather
like cheating, but it meant I could use my existing Java programs
without modification.
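
As a rough illustration of that setup (not the original programs), the
word list can be dealt out into one chunk per job, with each chunk then
handed to a separate run of the unchanged sequential program. The class
name WordListSplitter, the method split, and the sample words are
made up for this sketch; how the jobs are submitted to the cluster is
left out because it depends on the local scheduler.

```java
import java.util.ArrayList;
import java.util.List;

public class WordListSplitter {
    // Deal the words out round-robin so that each job gets a roughly equal share.
    // The chunks are independent: no job needs to see another job's words.
    public static List<List<String>> split(List<String> words, int jobs) {
        if (jobs < 1) {
            throw new IllegalArgumentException("jobs must be >= 1");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < jobs; i++) {
            chunks.add(new ArrayList<>());
        }
        for (int i = 0; i < words.size(); i++) {
            chunks.get(i % jobs).add(words.get(i));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Hypothetical word list; in practice this would be read from a file.
        List<String> words = List.of("run", "walk", "cause", "happen", "set");
        List<List<String>> chunks = split(words, 2);
        for (int i = 0; i < chunks.size(); i++) {
            // Each chunk would be written out and passed to one sequential job.
            System.out.println("job " + i + ": " + chunks.get(i));
        }
    }
}
```

Round-robin dealing is only one choice; if some words are much more
frequent than others, balancing the chunks by expected workload may
give more even run times.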

I then looked at Erlang, which is ideal for developing concurrent
programs, and easily managed to write a simple RTN (recursive
transition network) parser which replaces the stack with concurrent
processes. This is more of a toy project, but it worked really well
(and the program was amazingly short). Due to its Prolog origins,
Erlang is a nice language to work with in this area. String handling
is often cited as a weak point, but I didn't find it particularly
bad. The big plus is that it is very easy to spread any sequential
task over multiple processes (up to hundreds of thousands, or even
more than that).
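
To make the idea of trading the stack for concurrent units of work a
little more concrete, here is a minimal sketch in Java rather than
Erlang, and not the parser described above. It assumes a toy grammar
format (a map from network name to alternative paths of symbols) and
uses CompletableFuture tasks as a stand-in for Erlang processes. The
class ConcurrentRtn and its methods are hypothetical, and the sketch
only recognises input; it builds no parse tree and would not terminate
on a left-recursive network.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ConcurrentRtn {
    // Toy grammar (an assumption of this sketch): each network name maps to its
    // alternative paths. A symbol that is itself a network name is a call to that
    // sub-network; any other symbol must match the next input word.
    private final Map<String, List<List<String>>> networks;

    public ConcurrentRtn(Map<String, List<List<String>>> networks) {
        this.networks = networks;
    }

    // Returns every input position at which `network` can finish when entered at
    // `start`. Each alternative path runs as its own asynchronous task.
    Set<Integer> ends(String network, List<String> input, int start) {
        List<CompletableFuture<Set<Integer>>> tasks = networks.get(network).stream()
            .map(path -> CompletableFuture.supplyAsync(() -> follow(path, input, start)))
            .collect(Collectors.toList());
        Set<Integer> result = new HashSet<>();
        for (CompletableFuture<Set<Integer>> task : tasks) {
            result.addAll(task.join());
        }
        return result;
    }

    // Walks one path, tracking the set of positions reachable after each symbol.
    private Set<Integer> follow(List<String> path, List<String> input, int start) {
        Set<Integer> positions = Set.of(start);
        for (String symbol : path) {
            Set<Integer> next = new HashSet<>();
            for (int pos : positions) {
                if (networks.containsKey(symbol)) {
                    next.addAll(ends(symbol, input, pos)); // call the sub-network
                } else if (pos < input.size() && input.get(pos).equals(symbol)) {
                    next.add(pos + 1);                     // consume one word
                }
            }
            positions = next;
        }
        return positions;
    }

    // The input is accepted if the top network can finish exactly at its end.
    public boolean accepts(String top, List<String> input) {
        return ends(top, input, 0).contains(input.size());
    }
}
```

Java tasks are far heavier than Erlang processes, so this only shows
the shape of the approach, not its cost.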

For text processing the main weakness of the Erlang system seems to be
poor I/O performance, but it is perfectly possible to implement a
concordancer in it, or to compute collocations. A task that needs to
be performed over a large number of concordance lines (such as
collecting candidates for collocates) seems a good match for
map/reduce, which is easy in Erlang.
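
The map/reduce shape of the collocate task can be sketched as follows,
again in Java rather than Erlang and only as an illustration. The map
step turns each concordance line into the tokens found within a fixed
span of the node word; the reduce step sums the counts per token. The
class CollocateCount, the whitespace tokenisation, the lower-casing
and the span parameter are all assumptions of this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class CollocateCount {
    // Map step: emit every token within `span` positions of each occurrence
    // of the node word in one concordance line.
    static Stream<String> candidates(String line, String node, int span) {
        String[] tokens = line.trim().toLowerCase().split("\\s+");
        return IntStream.range(0, tokens.length)
            .filter(i -> tokens[i].equals(node))
            .boxed()
            .flatMap(i -> IntStream
                .rangeClosed(Math.max(0, i - span), Math.min(tokens.length - 1, i + span))
                .filter(j -> j != i)
                .mapToObj(j -> tokens[j]));
    }

    // Reduce step: group the emitted candidates and count them. The parallel
    // stream lets the lines be processed independently of one another.
    public static Map<String, Long> count(List<String> lines, String node, int span) {
        return lines.parallelStream()
            .flatMap(line -> candidates(line, node, span))
            .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical concordance lines for the node word "tea".
        List<String> lines = List.of("strong tea is nice", "he drinks strong tea daily");
        System.out.println(count(lines, "tea", 1));
    }
}
```

The raw counts are only candidates; deciding which of them are real
collocates still needs an association measure applied afterwards.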

But as I said, I only dipped into it briefly, and have not done any
serious comparisons between different implementations.

Oliver

>> Good day,
>>
>> My research group is investigating the use of high-performance
>> computing facilities in NLP. By this we mostly mean clustered
>> environments, in which many (usually identical) computers are
>> networked in a single location, and used as a single computing entity
>> through libraries like MPI / OpenMP, MapReduce, etc. and/or using UIMA
>> or other frameworks in environments like that. Grid methods are less
>> of interest to us but I'd also like to hear about them. Pure machine
>> learning research that might be applied to NLP would also be welcome.
>>
>> If you're doing or aware of work like this, please let me know.
>>
>> Many thanks,
>> Sean Igo
>> University of Utah
>> Center for High Performance Computing / Biomedical Informatics Dept.
>>
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora



-- 
Dr Oliver Mason
Technical Director of the Centre for Corpus Research
School of English, Drama, and ACS
The University of Birmingham
Birmingham B15 2TT

To arrange a meeting time see http://meetwith.me/ojmason
