[Corpora-List] gold standard for IE from tables

Thomas L. Packer tpacker at byu.net
Fri Mar 4 13:36:43 UTC 2011


Hello Andrea,

I'm working on something similar: entity and relation
extraction from semi-structured lists, in particular printed lists (i.e. the
texts come from scanned and OCRed document images).  I'm not aware of many
such datasets, so I will be interested in seeing others' responses to your
question.

 

Can you give more details about what you are interested in?  HTML tables?
Text tables with tabs or some other delimiter character between columns?
Printed tables with spatial layout information?  Lists of records that do
not necessarily have column headers or delimiters between columns?
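
To make the distinction concrete, here is a minimal Python sketch.  The
sample row, the field names, and the helper parse_delimited are all made up
for illustration and are not drawn from any dataset.  It only shows that a
tab-delimited row maps onto column headers trivially, while the same record
without delimiters does not.

```python
# Hypothetical sample data: names and values are invented for illustration.
HEADER = ["name", "birth_year", "place"]
tab_row = "John Smith\t1842\tProvo"        # text table with tab delimiters
list_record = "John Smith 1842 Provo"      # list entry with no delimiters


def parse_delimited(row, header, sep="\t"):
    """Split a delimited table row and pair each cell with its column header."""
    cells = row.split(sep)
    if len(cells) != len(header):
        raise ValueError("row does not match header length")
    return dict(zip(header, cells))


record = parse_delimited(tab_row, HEADER)
print(record)

# Without delimiters, splitting on whitespace over-segments the name field,
# so the tokens no longer line up one-to-one with the fields.
tokens = list_record.split()
print(len(tokens), "tokens for", len(HEADER), "fields")
```

In the undelimited case the field boundaries have to be inferred, which is
where extraction rather than simple parsing comes in.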

 

I am preparing to create a dataset of different kinds of printed lists in
the family history domain, including some tables.  I may need to also
correct the OCR errors and delimit columns in the corresponding text along
with annotating the fields, so that might be close to what you are looking
for even if you are not targeting printed tables.  

 

One dataset I've been trying out in the meantime is the Cora research paper
citations dataset for IE, but this may not fall under your definition of
"table" because the fields are not in a consistent order, the list entries
do not have a single consistent schema, and the fields are not unambiguously
delimited.  

 

http://www.cs.umass.edu/~mccallum/data.html 

 

Good luck.

 

Thomas L. Packer

BYU CS

~~~~~~~~~~~~~~~~~~~~

 

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Andrea Varga
Sent: Friday, March 04, 2011 4:15 AM
To: corpora at uib.no
Cc: andrea.job06 at yahoo.com
Subject: [Corpora-List] gold standard for IE from tables

 

Dear corpora members,

 

I was wondering whether there are any publicly available corpora annotated
for Information Extraction from tables. I am particularly interested in
entity extraction and relation extraction from tables.

 

Many thanks,

Andrea

-- 

Ms Andrea Varga MSc
PhD Student
OAK Group
The University of Sheffield 
a.varga at dcs.shef.ac.uk

 

 

 
