[Corpora-List] Is the TEI a waste of time?

Tue Jul 1 08:58:39 UTC 2003

From: "Marco Baroni" <baroni at sslmit.unibo.it>
> Obviously, this is not the current situation, and in the real world the
> presence of TEI-encoding can be a (minor) hassle, since many tools you
> may want to use (pos taggers, morphological analyzers, machine learning
> packages, databases, command-line programs, your own scripts) are not
> TEI-compatible, and TEI is not the easiest format to deal with (as
> compared to, eg, tab-delimited text...)
>
> I suppose that the best way for people in favor of TEI to convince
> others to adopt the standard would be to provide all sorts of cool
> TEI-conformant tools: programs helping (manual and automated)
> TEI-encoding, programs that perform all sorts of linguistic and
> statistical analyses of TEI-encoded data, indexers and fast searching
> engines for TEI-encoded corpora, TEI-db's, input/output conversion
> tools...

I agree with this idea. It is surprising how little software there is for
TEI corpora. The TEI is a waste of time only if the encoding is
under-exploited - which is a problem for the researcher, not for the TEI.
As G. Williams said, a minimal encoding with a hastily pasted header and
word-processor regex encoding of <p> takes only a few minutes. But there is
no public framework or set of tools for processing TEI corpora that would
make the encoding easy to exploit - such as a concordancer based on a SAX
stream, etc. Something like a set of classes that handle calling the
parser, SAX rewriting, etc., so that one only has to insert SAX handlers or
XSLT stylesheets into the pipeline, could be very useful. XML always gains
ground when it normalizes both the standards and the software
methodologies, whereas the TEI remains a pure standard.
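
To make the idea of a SAX-based concordancer concrete, here is a minimal
sketch in Python using the standard xml.sax module. It is only an
illustration, not an existing TEI tool: the names TextCollector, kwic,
concordance, BLOCK_ELEMENTS and SAMPLE are all hypothetical, the tokenizer
is a naive regex, and the sample document is a toy that is not claimed to
be valid TEI. It assumes that the running text sits inside <text> and that
everything in <teiHeader> should be skipped.

```python
import re
import xml.sax

# Assumption: these elements end a block, so a word never spans their boundary.
BLOCK_ELEMENTS = {"p", "head", "l"}


class TextCollector(xml.sax.ContentHandler):
    """SAX handler that keeps character data found inside <text> only."""

    def __init__(self):
        super().__init__()
        self._in_text = 0   # nesting depth of <text> (it may be nested)
        self._chunks = []   # raw character chunks, in document order

    def startElement(self, name, attrs):
        if name == "text":
            self._in_text += 1

    def endElement(self, name):
        if self._in_text and name in BLOCK_ELEMENTS:
            self._chunks.append(" ")  # keep adjacent blocks from fusing
        if name == "text":
            self._in_text -= 1

    def characters(self, content):
        # The parser may deliver one word in several chunks, so we only
        # buffer here and tokenize once the whole document has been read.
        if self._in_text:
            self._chunks.append(content)

    def tokens(self):
        return re.findall(r"\w+", "".join(self._chunks))


def kwic(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each match of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def concordance(xml_bytes, keyword, width=3):
    """Parse an XML document given as bytes and return its KWIC lines."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return kwic(handler.tokens(), keyword, width)


# Toy document for illustration only.
SAMPLE = b"""<TEI.2>
  <teiHeader><fileDesc><titleStmt>
    <title>A corpus sample</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>A corpus is not a bag of words.</p>
    <p>The corpus keeps its structure.</p>
  </body></text>
</TEI.2>"""

if __name__ == "__main__":
    for left, hit, right in concordance(SAMPLE, "corpus", width=2):
        print(f"{left:>12} [{hit}] {right}")
```

The point of the sketch is that the corpus-specific part is just the
handler: the header is ignored because of the markup, which a plain-text
concordancer could not do. A shared framework would presumably supply the
parsing and plumbing so that only such a handler has to be written.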

I think the TEI is obviously necessary for the view G. Williams defends - a
corpus is not a bag of words - and for interoperability, etc. But I agree
that the TEI is perhaps "out of date" on some points: there is nothing for
morphosyntactic or morphological encoding, text profiling, etc. The TEI is
perhaps still not sufficiently adapted to linguistic corpora. This is quite
obvious if we look at the projects listed on tei-c.org: they are mainly
philological uses of the TEI.

Sylvain Loiseau