[Corpora-List] ACL-DCI and BLLIP corpora
David Brooks
D.J.Brooks at cs.bham.ac.uk
Tue Apr 11 13:06:06 UTC 2006
Dear All,
Until very recently, I was under the impression that the sole
distributions of Penn Treebank data were to be found in the Treebank
projects at the LDC. However, I've been made aware that certain subsets
of the data are also available through two other LDC projects: ACL-DCI
and BLLIP. I'm looking into obtaining one or both of these corpora, but
would like some advice as to their content, as the online descriptions
are not all that thorough.
Ideally, I'd like to get hold of the ATIS and Wall Street Journal
corpora in PTB parsed format, for the purpose of parser evaluation. Now,
ACL-DCI claims to have some Penn Treebank material (though I do not
know if that covers ATIS), and some WSJ material. Does anyone know if
the WSJ material is parsed in PTB format? Does that include the now
infamous Sections 1-23 used in parser evaluation? Otherwise, can anyone
tell me which PTB datasets it includes, relative to the Treebank projects?
If the ACL-DCI does not contain parsed WSJ material, does the BLLIP
corpus contain the data I am looking for?
Many thanks,
David
--
David Brooks
http://www.cs.bham.ac.uk/~djb