[Corpora-List] [NLP2RDF] Announcement: NLP Interchange Format(NIF)

Rich Cooper rich at englishlogickernel.com
Fri Dec 9 20:02:38 UTC 2011


Dear Siddhartha,

 

Metadata in RDBMSs is usually stored in tables (as
you may already know) that can be queried or
updated.  I show how to do so in a program that
represents metadata in tables in my patent
7,209,923:

 

http://www.englishlogickernel.com/Patent-7-209-923-B1.pdf

 

That view of the database tables, columns and rows
is ideal for reasoning tasks and still takes full
advantage of the RDBMS features, as shown in the
text and figures of that patent.  Figure 2 shows a
generic view of metadata arranged in tables for
representing symbols and text strings, including
tokens and phrases.  A copy is below, if it gets
through the email:

 



 

Relationships can be modeled most obviously by
creating tables named for the relationship, with
rows that contain the constants and variables you
want to place into the relationship.  I use text
names for constants and variables, with variables
distinctively starting with underscores ("_") much
like Prolog does.  
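
As a minimal sketch of the relationship-table idea above, the snippet below uses Python's built-in sqlite3 module. The relationship name "cites", the column names arg1/arg2, and the sample values are all hypothetical, chosen only for illustration; they are not taken from the patent or the program.

```python
import sqlite3

# Hypothetical relationship "cites": one table named for the relationship,
# one row per tuple. Table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cites (arg1 TEXT, arg2 TEXT)")
conn.executemany(
    "INSERT INTO cites VALUES (?, ?)",
    [
        ("patent_A", "patent_B"),  # two constants: a ground fact
        ("patent_A", "_X"),        # a constant and a variable: a pattern
    ],
)


def is_variable(name: str) -> bool:
    """Variables are the names that start with an underscore."""
    return name.startswith("_")


rows = conn.execute("SELECT arg1, arg2 FROM cites ORDER BY rowid").fetchall()
ground_facts = [r for r in rows if not any(is_variable(c) for c in r)]
patterns = [r for r in rows if any(is_variable(c) for c in r)]
```

Here ground_facts holds the rows made only of constants, and patterns holds the rows that mention at least one variable.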

 

I keep one table locked in memory for the symbol
table, which binds a unique arrival ID (an integer
that grows with each new symbol definition) to a
unique text string.  
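
One way such a symbol table could look, as a sketch only: the class name SymbolTable and its methods are assumptions for illustration, not the structure used in the actual program.

```python
class SymbolTable:
    """In-memory two-way map between unique strings and arrival IDs."""

    def __init__(self):
        self._id_of = {}    # string -> arrival ID
        self._text_of = []  # arrival ID -> string

    def intern(self, text: str) -> int:
        """Return the existing ID for text, or assign the next integer."""
        existing = self._id_of.get(text)
        if existing is not None:
            return existing
        new_id = len(self._text_of)  # IDs grow with each new definition
        self._id_of[text] = new_id
        self._text_of.append(text)
        return new_id

    def text(self, arrival_id: int) -> str:
        """Look up the string bound to an arrival ID."""
        return self._text_of[arrival_id]


symbols = SymbolTable()
first = symbols.intern("claim")
second = symbols.intern("_X")
again = symbols.intern("claim")  # same string, same ID as `first`
```

Interning the same string twice returns the same integer, which is what lets the other tables store only the ID.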

 

The metadata tables relate to the symbol table by
storing just the indexed arrival ID for that
string, whether the string is a symbol or a phrase
extracted from a text source.  Unification is very
fast given that representation because the integer
indexes are adequate for calculating unifications.
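To illustrate why integer IDs are enough, here is a hedged sketch of unification over flat rows of arrival IDs. It assumes the set of IDs that denote variables (variable_ids) is known in advance, for example recorded when underscore-prefixed names are interned; the function names and the sample IDs are hypothetical.

```python
def walk(term, bindings):
    """Follow variable bindings until reaching an unbound ID."""
    while term in bindings:
        term = bindings[term]
    return term


def unify(row_a, row_b, variable_ids):
    """Unify two equal-length rows of arrival IDs.

    Returns a dict mapping variable IDs to the IDs they are bound to,
    or None if the rows cannot be made equal.
    """
    if len(row_a) != len(row_b):
        return None
    bindings = {}
    for x, y in zip(row_a, row_b):
        x, y = walk(x, bindings), walk(y, bindings)
        if x == y:
            continue  # plain integer comparison, no string handling
        if x in variable_ids:
            bindings[x] = y
        elif y in variable_ids:
            bindings[y] = x
        else:
            return None  # two different constants
    return bindings


# Assumed IDs: 0="claims", 1="patent_A", 2="widget", 3="_X"
VARIABLES = {3}
match = unify((0, 1, 3), (0, 1, 2), VARIABLES)
clash = unify((0, 1, 2), (0, 2, 2), VARIABLES)
```

In this sketch match binds the variable 3 to the constant 2, and clash is None because two different constants meet in the second position.
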


 

My particular NLP interest of the moment is in
examining patent specifications, which contain
unstructured text fields within a formulaic
overall outline that can be dissected
algorithmically.  Patent claims are phrases that
bind the sentence "I claim X" so that each claim
phrase can be substituted for X.  
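
The substitution can be sketched in a couple of lines. The template string follows the "I claim X" wording above; the sample claim phrases are invented for illustration.

```python
TEMPLATE = "I claim {claim}"


def expand_claims(claim_phrases):
    """Substitute each claim phrase for X in the sentence 'I claim X'."""
    return [TEMPLATE.format(claim=phrase) for phrase in claim_phrases]


sentences = expand_claims(["a widget with a handle", "the widget of claim 1"])
```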

 

I also use an inverted text method for separating
out phrases (mostly sentences) from texts.  Each
patent document is read in as text, inverted to
enumerate phrases (approximately sentences).  Each
indexed phrase from the inverted document is then
tokenized, with the tokens interned uniquely into
the symbol table.  
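
A rough sketch of that pipeline follows. It does not reproduce the inverted text method itself; as a stand-in it assumes a simple regular-expression sentence splitter and a plain dict as the symbol table, and the function names are hypothetical.

```python
import re


def split_phrases(text):
    """Rough splitter: break after '.', '!' or '?' followed by whitespace."""
    return [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]


def tokenize(phrase):
    """Lowercased word tokens; punctuation is dropped."""
    return re.findall(r"[A-Za-z0-9_]+", phrase.lower())


def index_document(text, id_of):
    """Return one list of arrival IDs per phrase, interning new tokens.

    id_of maps token -> arrival ID and is updated in place.
    """
    indexed = []
    for phrase in split_phrases(text):
        ids = [id_of.setdefault(tok, len(id_of)) for tok in tokenize(phrase)]
        indexed.append(ids)
    return indexed


symbol_ids = {}
indexed = index_document("A widget has a handle. The handle rotates.", symbol_ids)
```

Each phrase becomes a list of integers, and a token that recurs (here "a" and "handle") reuses the ID it received on first arrival.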

 

Quantifiers are transformed into sequences of
symbol table arrival IDs (integers), and the
sequences are stored as rows in the relationship
modeling table.  Since quantifiers can be either
constants or variables, rules can be generalized
from instance data in the claim phrase or the
specification phrases.  That means all stored
relations, other than metadata tables, have rows
containing cells populated by integers.  That is
why unification and search are so fast with this
representation.  
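
As an illustration of searching integer-only rows, the sketch below stores a hypothetical relation has_part(whole, part) in SQLite and matches a pattern against it, treating variable IDs as wildcards. The table, the column names, the IDs, and the match helper are assumptions; a repeated variable in one pattern is not constrained to a single value in this simplified version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Every cell is a symbol-table arrival ID, never a string.
conn.execute("CREATE TABLE has_part (whole INTEGER, part INTEGER)")
conn.executemany("INSERT INTO has_part VALUES (?, ?)", [(1, 3), (1, 5), (2, 3)])

COLUMNS = ("whole", "part")  # fixed column list, not user input


def match(pattern, variable_ids):
    """Return rows matching the pattern; variable IDs match anything."""
    clauses, params = [], []
    for col, value in zip(COLUMNS, pattern):
        if value not in variable_ids:
            clauses.append(f"{col} = ?")
            params.append(value)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    sql = f"SELECT whole, part FROM has_part{where} ORDER BY whole, part"
    return conn.execute(sql, params).fetchall()


VARIABLES = {9}                        # assume ID 9 was interned for "_X"
parts_of_1 = match((1, 9), VARIABLES)  # pattern: has_part(1, _X)
wholes_with_3 = match((9, 3), VARIABLES)
```

Only integer equality tests reach the database, so ordinary indexes on the integer columns can serve the search.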

 

There is an example program you can download,
though it doesn't work with Windows 7 yet.  You
can run it on a Vista or an XP box though.  It can
be downloaded from:

 

http://www.englishlogickernel.com/setup.exe

 

I promise it won't screw up your computer; I use
the program on a daily basis and it helps
enormously in my business of patent analysis.  

 

It isn't a general tool, but an application of NLP
analysis.  I am planning a more general analysis
tool, but that won't be ready for quite a while
yet.  Once I have solved all operational problems
for the patent analysis task, I will reorganize
the software components to provide the general
capability.  For now, this is as much as I can
handle with the available resources.  

 

Please feel free to ask questions if any of the
above isn't clear.  

 

-Rich

 

Sincerely,

Rich Cooper

EnglishLogicKernel.com

Rich AT EnglishLogicKernel DOT com

9 4 9 \ 5 2 5 - 5 7 1 2

  _____  

From: Siddhartha Jonnalagadda
[mailto:sid.kgp at gmail.com] 
Sent: Friday, December 09, 2011 11:03 AM
To: Rich Cooper
Cc: nlp2rdf at lists.informatik.uni-leipzig.de;
CORPORA List; Jens Lehmann
Subject: Re: [Corpora-List] [NLP2RDF]
Announcement: NLP Interchange Format(NIF)

 

Hey Rich,

RDBMS is an industry standard that works well for
some things such as storing the extracted
metadata, but might not be optimal for performing
reasoning over it. That might be one reason some
people use other representations such as
RDF/SPARQL for higher-level tasks. In general,
storing everything in the Common Analysis
Structure defined by UIMA's type system works for
me, and where needed I could write them into a
database. What is the optimal way to represent the
metadata for reasoning tasks? How could I transfer
my UIMA CAS into that "thing"?

Sincerely,
Siddhartha Jonnalagadda, Ph.D.
 <http://sjonnalagadda.wordpress.com>
sjonnalagadda.wordpress.com





On Fri, Dec 9, 2011 at 11:56 AM, Rich Cooper
<rich at englishlogickernel.com> wrote:

Dear Siddhartha,

 

Could you please provide more detail about what
you need in the way of "more
computer-interpretable than RDBMS"?  I use the
RDBMS columns with unstructured text, analyze the
text in software, and populate new columns to
store the analyzed NLP information.  By
iteratively aggregating RDBMS columns, I am able
to process NLP quite well using the RDBMS
capabilities plus software functionality for
interpretation.  
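
A small sketch of that column-by-column workflow, using Python's sqlite3 module: the table name doc, the derived column token_count, and the whitespace token count standing in for real NLP analysis are all illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO doc (body) VALUES (?)",
    [("A widget has a handle.",), ("The handle rotates.",)],
)

# Analysis pass: add a new column, then populate it from software.
conn.execute("ALTER TABLE doc ADD COLUMN token_count INTEGER")
for doc_id, body in conn.execute("SELECT id, body FROM doc").fetchall():
    conn.execute(
        "UPDATE doc SET token_count = ? WHERE id = ?",
        (len(body.split()), doc_id),
    )

counts = conn.execute("SELECT id, token_count FROM doc ORDER BY id").fetchall()
```

Further passes could add more derived columns in the same way, each one reading the columns filled in by earlier passes.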

 

More information would be useful,

-Rich

 

Sincerely,

Rich Cooper

EnglishLogicKernel.com

Rich AT EnglishLogicKernel DOT com

9 4 9 \ 5 2 5 - 5 7 1 2

  _____  

From: corpora-bounces at uib.no
[mailto:corpora-bounces at uib.no] On Behalf Of
Siddhartha Jonnalagadda
Sent: Friday, December 09, 2011 9:07 AM
To: nlp2rdf at lists.informatik.uni-leipzig.de;
CORPORA List
Cc: Jens Lehmann
Subject: Re: [Corpora-List] [NLP2RDF]
Announcement: NLP Interchange Format(NIF)

 

Somewhat related issue:
Since UIMA is seeing increasing use within the NLP
community (both Information Extraction and others
such as Question Answering), I wonder why we need
another standard, as opposed to an interface between
the UIMA type system and one of the many existing
standards. In other words, is there some work on
representing the information we extract in a way
more computer-interpretable than RDBMS?

Sincerely,
Siddhartha Jonnalagadda, Ph.D.
 <http://sjonnalagadda.wordpress.com>
sjonnalagadda.wordpress.com




On Fri, Dec 9, 2011 at 10:39 AM, John F. Sowa
<sowa at bestweb.net> wrote:

Before making a firm commitment to any notation as
a standard for NLP,
I suggest that you poll computational linguists
and ask them what they
would prefer for their work.  Among the questions
you could ask is to
look at those five serializations and check which
one(s) they prefer.

Corpora List is a good place to start such a poll.

 

 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.jpg
Type: image/jpeg
Size: 20875 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20111209/1f30bd8f/attachment-0001.jpg>