[Corpora-List] Quick (but possibly clueless) software question

Borislav Iordanov borislav.iordanov at gmail.com
Sat Nov 7 21:25:54 UTC 2009


William,

Yes, I did develop something quite close to what you are looking
for. It is open-source, but it was never officially released and is
undocumented. If you are interested in getting it to work for you and
extending it to suit your needs, I will make it a priority to document
it and help you. In a nutshell, the system, called Disko, works as
follows:

1) The overall paradigm is Flow-Based Programming; see
http://en.wikipedia.org/wiki/Flow-based_programming.
2) Nodes in a dataflow network are high-level processes, e.g. a
sentence detector, a parser, a semantic relation extractor, or a WSD
(word-sense disambiguation) component. They are input-output black
boxes, and there is no global shared memory.
3) You'd write an NLP pipeline by developing a dataflow network,
focusing only on the algorithms (see the sketch after this list).
4) The framework then offers facilities to turn your pipeline into a
job-processing entity, so that, for example, you can submit documents
to be indexed for a search engine as jobs that are queued and handled
as soon as resources become available.
5) The framework also lets you parallelize and distribute your
processing nodes: you can take any node and make N copies of it, which
to the outside world behave as a single node; you can also take any
node and send it to another machine.
6) A pipeline using RelEx + Link Grammar + OpenNLP + a custom WSD
component has been developed and tested, and it works.
7) The database used for that pipeline is HyperGraphDB (see
http://code.google.com/p/hypergraphdb).
8) HyperGraphDB is required to run the distributed processing
framework; it is a BerkeleyDB-based Java database for storing
hypergraphs.
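
To make (2)-(5) concrete, here is a minimal Java sketch in the same
spirit. Disko itself is undocumented, so every name below (Node,
SentenceDetector, runCopies) is invented for illustration and is not
the actual Disko API:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class PipelineSketch {

        // A node is an input-output black box (point 2): it sees only
        // its input and produces an output; no global shared memory.
        interface Node<I, O> {
            O process(I input) throws Exception;
        }

        // Naive stand-in for a sentence detector such as OpenNLP's.
        static class SentenceDetector implements Node<String, String[]> {
            public String[] process(String doc) {
                return doc.split("(?<=[.!?])\\s+");
            }
        }

        // Point 5: run n identical copies of a node off one input
        // queue; to the rest of the network they act as a single node.
        static <I, O> void runCopies(Node<I, O> node, int n,
                                     BlockingQueue<I> in,
                                     BlockingQueue<O> out) {
            for (int i = 0; i < n; i++) {
                Thread worker = new Thread(() -> {
                    try {
                        while (true) out.put(node.process(in.take()));
                    } catch (Exception e) {
                        Thread.currentThread().interrupt();
                    }
                });
                worker.setDaemon(true);
                worker.start();
            }
        }

        public static void main(String[] args) throws Exception {
            BlockingQueue<String> docs = new LinkedBlockingQueue<>();
            BlockingQueue<String[]> sentences = new LinkedBlockingQueue<>();
            runCopies(new SentenceDetector(), 4, docs, sentences);
            docs.put("Nodes are black boxes. They get wired into a network.");
            System.out.println(String.join(" | ", sentences.take()));
        }
    }

The point is that the node code knows nothing about queues, copies, or
machines; all of that is wiring supplied by the framework.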

The reason for (8) is mainly to reuse the effort already spent making
HyperGraphDB distributed, together with the fact that this is what I
use to store my data. The networking code is based on the XMPP
(OpenFire server) and FIPA (agent communication language) standards.
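
To give a flavor of the FIPA part: the nodes exchange speech-act style
messages. Purely as an illustration (the agent names and content below
are invented; the s-expression syntax follows the FIPA ACL string
encoding), a request from one node to another might look like this:

    // Hypothetical example of a FIPA-ACL request such as Disko's
    // agents might exchange; it would travel as the body of an XMPP
    // message routed through the OpenFire server.
    String aclRequest =
        "(request\n" +
        "  :sender (agent-identifier :name sentence-detector@disko.local)\n" +
        "  :receiver (set (agent-identifier :name parser@disko.local))\n" +
        "  :content \"(process-document :id doc-42)\"\n" +
        "  :language fipa-sl\n" +
        "  :protocol fipa-request)";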

My next steps are:

A) Automate (5) above based on runtime monitoring: the system should
be able to decide which nodes to parallelize, by how much, and where
to activate them. Currently, one needs to configure the exact topology
of the dataflow network explicitly (see the sketch after this list).

B) Make it easy to install on any computer for idle-time use, in the
style of SETI@home and similar projects.
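
Re (A): none of this is implemented yet, but the feedback loop I have
in mind is along these lines (a sketch with invented names and
placeholder thresholds, nothing more):

    // Sketch of the planned runtime-monitoring heuristic in (A): grow
    // a node's replica count when its input queue backs up, shrink it
    // when the node sits idle.
    class AutoScaler {
        static int decideCopies(int copies, int queueLength, int maxCopies) {
            if (queueLength > copies * 10 && copies < maxCopies) {
                return copies + 1;   // work is piling up: add a replica
            }
            if (queueLength == 0 && copies > 1) {
                return copies - 1;   // node is idle: free resources
            }
            return copies;
        }
    }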

Best,
Boris

On Sat, Nov 7, 2009 at 11:23 AM, Linas Vepstas <linasvepstas at gmail.com> wrote:
> 2009/11/4 Spruiell, William C <sprui1wc at cmich.edu>:
>> Are there any available corpus analysis tools
>
> You'd have to narrow what kind of tool you are talking about.
>
>> that work by “farming” texts
>> out to client programs on multiple computers (workstation cluster,
>> Beowulf, or just widely distributed) and then collating the results
>
> I suspect that various semantic search companies have solved
> this in proprietary ways for parsing text.
>
> For open source, I know that Boris Iordanov has done this for
> parsing text, to provide Dade County (Miami, Florida) with a simple
> web-based question-answering service for local government
> services.  I believe that there are now other local governments
> also looking at this (??? in California??).  His system is tailored
> for the RelEx + Link Grammar parser combo.
>
>> (like the
>> screensaver freeware that the SETI project distributed so that anyone
>> interested could volunteer to do some of their signal analysis for them)?
>
> Well, there's also BOINC and other distributed systems that
> allow anyone to "easily" write distributed applications.
>
> --linas
>



-- 
"Frozen brains tell no tales."

-- Buckethead
