[Corpora-List] ACL Corpus with extracted and cleaned full-text
Christian Kirschner
kirschner at kdsl.informatik.tu-darmstadt.de
Mon Nov 25 14:09:46 UTC 2013
Dear all,
I am looking for an ACL Anthology corpus that contains the extracted
full texts of ACL papers (for example, as text files or XML files). I
know there is the "ACL Anthology Reference Corpus"
(http://acl-arc.comp.nus.edu.sg/). Unfortunately, the data there is not
very clean: the text files contain page numbers, footnotes, table data,
etc., and headings are not identified.
Does anybody know of a "cleaner" ACL corpus, or has anybody tried to
filter out page numbers, footnotes, figures, etc. and to identify
headings, paragraphs, etc.?
(Apart from ACL, other "clean" corpora of scientific literature would
also be of interest to me.)
Cheers,
Christian Kirschner
(Ubiquitous Knowledge Processing Lab, TU Darmstadt)
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora