[Corpora-List] Parsed corpus file format

Lou Burnard lou.burnard at oucs.ox.ac.uk
Thu Jul 3 15:41:31 UTC 2008


> >
> > Second, answering your question about the file format:
> > the link you provided looks very much like the spurious XML you have to use
> > with the CQP/CWB, and I'm very used to seeing my corpora in that format.
> > However, if your goal is to have the best possible distribution format, then
> > why not stick to a recognised standard (XML, of course) and avoid
> > reinventing the wheel? The TEI XML guidelines are probably the de facto
> > standard:
> >
> > http://www.tei-c.org/
> 
> Do you have any experience with this?  I can't figure out how to use it.
> Yes, I could add a teiHeader, not a bad idea.  And I found WordHoard
> by following the links, it was very suggestive, and shared a number of
> good ideas with TigerXML, which I'm thinking of incorporating.  But I
> didn't see a more direct way of making use of TEI. ... ?
> 

The TEI is a good way of defining your XML schema, because it lets you build on
the work done by lots of others before you in defining standards-based formats
for such data. It comes as a suite of modules defining XML elements and
attributes which can be used buffet-style to define a schema that is
interoperable with any other TEI-derived schema. I gave a paper at the last LREC
highlighting this aspect of it, but there's no shortage of other information
about it -- it has been around even longer than the wacky chaps!
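
To make the buffet-style idea a little more concrete, here is a minimal sketch
of how a module selection might be written down. It uses Python's standard
library to emit a TEI customization fragment (a schemaSpec containing one
moduleRef per chosen module). The identifier "parsedCorpus" and the particular
choice of modules are assumptions for illustration only, not a recommendation
for any specific corpus; check the module names against the current Guidelines
before relying on them.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical selection: the four modules most TEI schemas start from,
# plus two that look relevant to linguistically annotated text.
MODULES = ["tei", "header", "core", "textstructure", "analysis", "linking"]


def build_schema_spec(ident, modules):
    """Return a <schemaSpec> element with one <moduleRef> per module key."""
    # Serialize TEI elements in the default namespace (no prefix).
    ET.register_namespace("", TEI_NS)
    spec = ET.Element(
        "{%s}schemaSpec" % TEI_NS, {"ident": ident, "start": "TEI"}
    )
    for key in modules:
        ET.SubElement(spec, "{%s}moduleRef" % TEI_NS, {"key": key})
    return spec


# "parsedCorpus" is an illustrative name, not taken from the thread.
spec = build_schema_spec("parsedCorpus", MODULES)

# ET.indent only exists in Python 3.9+; skip pretty-printing otherwise.
if hasattr(ET, "indent"):
    ET.indent(spec)

print(ET.tostring(spec, encoding="unicode"))
```

A fragment like this would normally sit inside a full TEI customization
document and be handed to a TEI schema-generation tool, which produces the
actual schema; adding or removing a moduleRef is what adds or removes a
"dish" from the buffet.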

 



