[Corpora-List] ACL-DCI and BLLIP corpora

David Brooks D.J.Brooks at cs.bham.ac.uk
Tue Apr 11 13:06:06 UTC 2006


Dear All,

Until very recently, I was under the impression that the sole 
distributions of Penn Treebank data were to be found in the Treebank 
projects at the LDC. However, I've been made aware that certain subsets 
of the data are also available through two other LDC projects: ACL-DCI 
and BLLIP. I'm looking into obtaining one or both of these corpora, but 
would like some advice as to their content, as the online descriptions 
are not all that thorough.

Ideally, I'd like to get hold of the ATIS and Wall Street Journal 
corpora in PTB parsed format, for the purpose of parser evaluation. Now, 
ACL-DCI claims to have some Penn Treebank material (though I do not 
know if that covers ATIS), and some WSJ material. Does anyone know if 
the WSJ material is parsed in PTB format? Does that include the now 
infamous Sections 1-23 used in parser evaluation? Otherwise, can anyone 
tell me how its PTB datasets relate to those in the Treebank projects?

If the ACL-DCI does not contain parsed WSJ material, does the BLLIP 
corpus contain the data I am looking for?

Many thanks,
David
-- 
David Brooks
http://www.cs.bham.ac.uk/~djb
