[Corpora-List] Parsed corpus file format
Lou Burnard
lou.burnard at oucs.ox.ac.uk
Thu Jul 3 15:41:31 UTC 2008
> >
> > Second, answering your question about the file format:
> > the link you provided looks very much like the spurious XML you have to use
> > with the CQP/CWB, and I'm very used to seeing my corpora in that format.
> > However, if your goal is to have the best possible distribution format, then
> > why not stick to a recognised standard (XML, of course) and avoid
> > reinventing the wheel. The TEI XML guideline is probably the de facto
> > standard:
> >
> > http://www.tei-c.org/
>
> Do you have any experience with this? I can't figure out how to use it.
> Yes, I could add a teiHeader, not a bad idea. And I found WordHoard
> by following the links, it was very suggestive, and shared a number of
> good ideas with TigerXML, which I'm thinking of incorporating. But I
> didn't see a more direct way of making use of TEI. ... ?
>
The TEI is a good way of defining your XML schema in a way that leverages the
work done by lots of others before you in defining standards-based formats for
such data. It comes as a suite of modules defining XML elements and attributes
which can be used buffet-style to define a schema that is interoperable with
any other TEI-derived schema. I gave a paper at the last LREC highlighting this
aspect of it, but there's no shortage of other information about it -- it has
been around even longer than the wacky chaps!
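
To make the buffet idea a little more concrete, here is a rough sketch of
what a word-annotated corpus file might look like when it borrows TEI
elements rather than inventing its own. It is only an illustration: the helper
names (tei, build_tei), the sample sentence, the placeholder header text, and
the choice of lemma and pos attributes on <w> are assumptions of this sketch,
not something taken from the corpus under discussion, and the output has not
been validated against any particular TEI schema.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; registering it with an empty prefix makes it the
# default namespace when the tree is serialised.
TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return the namespace-qualified name for a TEI element (hypothetical helper)."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, sentences):
    """Build a small TEI tree: a teiHeader plus <s>/<w> tokens inside one <p>.

    `sentences` is a list of sentences, each a list of (form, lemma, pos) tuples.
    """
    root = ET.Element(tei("TEI"))

    # Minimal header: fileDesc with its title, publication and source statements.
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    pub_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(pub_stmt, tei("p")).text = "Unpublished sample."  # placeholder
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source_desc, tei("p")).text = "Born digital."  # placeholder

    # Text body: numbered sentences made of annotated word tokens.
    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    para = ET.SubElement(body, tei("p"))
    for n, tokens in enumerate(sentences, start=1):
        s = ET.SubElement(para, tei("s"), {"n": str(n)})
        for form, lemma, pos in tokens:
            w = ET.SubElement(s, tei("w"), {"lemma": lemma, "pos": pos})
            w.text = form
    return root


if __name__ == "__main__":
    sample = [[("Corpora", "corpus", "NNS"), ("are", "be", "VBP"), ("useful", "useful", "JJ")]]
    print(ET.tostring(build_tei("Sample parsed corpus", sample), encoding="unicode"))
```

Parse-tree information along the lines of TigerXML would need further
elements on top of this; the sketch covers only the header and token level.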