[Corpora-List] help requested in planning a new corpus of scanned and transcribed document images
Thomas L. Packer
tpacker at byu.net
Wed Aug 25 23:26:38 UTC 2010
Corpora list
Ancestry.com, Inc. and the Department of Computer Science at Brigham
Young University are working together to create a publicly available,
hand-annotated collection of images and OCR transcriptions of scanned
documents. We would like your feedback before we get too far along. We
hope to have this complete early next year.
This corpus is intended to facilitate evaluating document image analysis,
OCR error correction and/or information extraction algorithms and related
research. The images come from the scanned books and newspapers in a large
collection at Ancestry.com and include the following kinds of documents:
* newspapers (typical newspapers from the 20th century)
* city directories (like old phone books)
* college yearbooks (including photos, names and majors)
* navy cruise books (like a yearbook for those who served on large
  US Navy ships)
* birth records books (recording birth events and family
  relationships among parents and children)
* local histories (histories of small geographical areas)
* family histories (multi-generational histories of particular
  families)
* church yearbooks (describing the organization and events of a local
  church congregation)
A few example images can be found here:
http://deg.byu.edu/ancestry-examples/
The particular selection of documents was motivated by genealogy and
family history research, but we believe the final corpus with annotations
(including manual transcriptions) will be of value outside this field of
research.
If you are likely to use this corpus in your research, we would like your
help in refining our priorities for it. Please reply to this email with any
ideas you may have, and help us prioritize an existing wish-list of
potential features by voting at this website:
http://www.allourideas.org/ancestry_corpus
This web page will iteratively present you with random pairs of features.
Each feature is described briefly by a short sentence. For each pair of
features, please click on the one that would be more useful to you.
After each click, you will be presented with the next pair. Feel free to
vote in multiple sessions, on multiple days. You can continue voting for as
long as you would like. The longer you vote, the better our ranking will
be.
You can also add your own ideas to this wish-list. Please add them
early so other people have a chance to vote for them, but only after you
familiarize yourself with the features already listed. A complete list can
be found by clicking on "View Results". We would also like to hear any
feedback you may have when you reply to this email.
If you know of someone who might be interested in this corpus, please
forward this email to them.
Thank you for your time.
Thomas L. Packer
Department of Computer Science
Brigham Young University