[Corpora-List] ACL Corpus with extracted and cleaned full-text

Christian Kirschner kirschner at kdsl.informatik.tu-darmstadt.de
Mon Nov 25 14:09:46 UTC 2013


Dear all,

I am looking for an ACL Anthology corpus that contains the extracted
full texts of ACL papers (for example, as text files or XML files). I know
there is the "ACL Anthology Reference Corpus"
(http://acl-arc.comp.nus.edu.sg/). Unfortunately, the data there is not very
clean: the text files contain page numbers, footnotes, table data, etc.,
and headings are not identified.

Does anybody know of a "cleaner" ACL corpus, or has anybody
tried to filter out page numbers, footnotes, figures, etc. and to
identify headings, paragraphs, etc.?

(Apart from ACL, other "clean" corpora of scientific literature would
also be of interest to me.)

Cheers,
Christian Kirschner
(Ubiquitous Knowledge Processing Lab, TU Darmstadt)
