[Corpora-List] structured data (enu | csy) for IE needed

Yannick Versley versley at sfs.uni-tuebingen.de
Thu Jan 25 08:44:48 UTC 2007


Hello,

> for my graduation thesis, I need a set of structured data for some
> experiments. The data set should consist of XML files, HTML files, or any
> other hypertext-based files. The next requirement is "highly structured
> data". This means that I'm not interested in data structured like the
> following example:
> <p>Paragraph, many words in same tag</p>
> I'm looking for data that is more structured, like this example:
> <t> <tag2>Few words (up to 10)</tag2> <tag3>Few words (up to 10)</tag3>
> </t>
> The last requirement is: English or Czech domain.
My guess would be that Wikipedia fits your description: it contains many
tables and/or templates, and it is available in English and Czech. I
don't know if anyone has tried extracting specific information from that,
though.
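
To give an idea of what pulling short fields out of such tables could look
like, here is a minimal sketch using only Python's standard html.parser
module. The class name TableCellExtractor, the helper extract_rows, and the
SAMPLE markup are all made up for illustration; real Wikipedia pages are much
messier (nested tables, templates, inline markup), so take it as a starting
point rather than a working extractor.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; it ignores nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside of a cell (e.g. a plain <p> paragraph) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the pieces and normalise runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def extract_rows(html):
    """Return a list of rows, each a list of cell texts."""
    parser = TableCellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample in the spirit of an infobox-style table.
SAMPLE = """
<table>
  <tr><th>Country</th><td>Czech Republic</td></tr>
  <tr><th>Capital</th><td>Prague</td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Each row comes back as a list of short strings, which is roughly the "few
words per tag" shape asked for above; a plain <p> paragraph yields no rows.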

Best,
Yannick
-- 
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352


