[Corpora-List] Is the TEI a waste of time?

Tue Jul 1 10:27:27 UTC 2003

Hi,

I find this discussion very interesting, but would like to learn more about
what those who are more familiar with the topic than I am have to say about
TEI's "competitors", e.g. CES/XCES (http://www.cs.vassar.edu/CES/ and
http://www.cs.vassar.edu/XCES/).

Cheers, Oli

> -----Original Message-----
> From: Sylvain Loiseau [mailto:sylvain at toucheraveclesyeux.com]
> Sent: Tuesday, July 01, 2003 10:59 AM
> To: Marco Baroni; corpora at uib.no
> Subject: Re: [Corpora-List] Is the TEI a waste of time?
>
>
> From: "Marco Baroni" <baroni at sslmit.unibo.it>
> > Obviously, this is not the current situation, and in the real world
> > the presence of TEI-encoding can be a (minor) hassle, since many tools
> > you may want to use (pos taggers, morphological analyzers, machine
> > learning packages, databases, command-line programs, your own scripts)
> > are not TEI-compatible, and TEI is not the easiest format to deal with
> > (as compared to, eg, tab-delimited text...)
> >
> > I suppose that the best way for people in favor of TEI to convince
> > others to adopt the standard would be to provide all sorts of cool
> > TEI-conformant tools: programs helping (manual and automated)
> > TEI-encoding, programs that perform all sorts of linguistic and
> > statistical analyses of TEI-encoded data, indexers and fast searching
> > engines for TEI-encoded corpora, TEI-db's, input/output conversion
> > tools...
>
> I agree with this idea. It is surprising to see how little
> software there is for TEI corpora. The TEI is a waste of time
> only if the encoding is under-exploited - which is a problem
> for the researcher, not for the TEI. As G. Williams said, a
> minimal encoding with a hastily pasted header and
> word-processor regex encoding of <p> takes only a few minutes.
> But there is no public framework or set of tools for
> processing TEI corpora that would make the encoding easy to
> exploit - such as a concordancer based on a SAX stream. Something
> like a set of classes for calling the parser, SAX rewriting,
> etc., that lets one simply insert SAX handlers or XSLT
> stylesheets into the pipeline could be very useful. While XML
> always gains ground when it normalizes both the standards and
> the software methodologies, the TEI remains a pure standard.
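
The kind of SAX-stream concordancer mentioned above could look roughly like the sketch below. It is only an illustration, not an existing TEI tool: the class KwicHandler, the function concordance, the SAMPLE document and the choice to index only text inside <p> are all assumptions made for the example, and it relies on nothing beyond Python's standard xml.sax module.

```python
import xml.sax


class KwicHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects keyword-in-context hits from <p> text."""

    def __init__(self, keyword, width=3):
        super().__init__()
        self.keyword = keyword.lower()
        self.width = width   # number of context tokens on each side
        self.depth = 0       # how many <p> elements are currently open
        self.buffer = []     # character chunks of the current paragraph
        self.hits = []       # (left context, matched token, right context)

    def startElement(self, name, attrs):
        if name == "p":
            self.depth += 1

    def characters(self, content):
        # SAX may deliver a paragraph's text in several chunks.
        if self.depth > 0:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "p":
            self.depth -= 1
            if self.depth == 0:
                self._scan("".join(self.buffer).split())
                self.buffer = []

    def _scan(self, tokens):
        # Naive whitespace tokenization; punctuation is stripped for matching only.
        for i, tok in enumerate(tokens):
            if tok.strip(".,;:!?\"'()").lower() == self.keyword:
                left = " ".join(tokens[max(0, i - self.width):i])
                right = " ".join(tokens[i + 1:i + 1 + self.width])
                self.hits.append((left, tok, right))


def concordance(xml_text, keyword, width=3):
    """Stream xml_text through the handler and return its list of hits."""
    handler = KwicHandler(keyword, width)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.hits


# Made-up sample document, loosely TEI-shaped (not validated against any DTD).
SAMPLE = """<TEI.2><text><body>
<p>A corpus is not a bag of words.</p>
<p>The corpus header describes the corpus itself.</p>
</body></text></TEI.2>"""

if __name__ == "__main__":
    for left, match, right in concordance(SAMPLE, "corpus", width=2):
        print(f"{left:>20} [{match}] {right}")
```

Because the handler only reacts to SAX events, the same class could be dropped into a longer pipeline (for instance behind a filter that rewrites or skips elements) without loading the whole corpus into memory.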
>
> I think the TEI is obviously necessary for the view G.
> Williams defends - a corpus is not a bag of words - and for
> interoperability, etc. But I agree that the TEI is perhaps
> "out of date" on some points: there is nothing for
> morphosyntactic or morphological encoding, text profiling, etc.
> The TEI is perhaps still not sufficiently adapted to
> linguistic corpora. This is quite obvious if we look at the
> projects listed on tei-c.org: they are mainly philological uses
> of the TEI.
>
> Sylvain Loiseau