[Corpora-List] Is the TEI a waste of time?

Fri Jun 27 10:44:59 UTC 2003

geoffrey.williams at wanadoo.fr said:
> Easy access to vast amounts of downloadable data has meant that a
> number of "corpus linguists" neither know nor care about the niceties
> of corpus creation, and the whys and wherefores of selecting and
> marking up data. Ease of access has become the main criterion,
> potentially to the detriment of the discipline itself. Easy solutions
> do not necessarily answer the most pertinent questions.

I agree wholeheartedly with these points.  However, it is possible to
devote all due attention and care to the "niceties, whys and wherefores"
without strict adherence to the full details of TEI specifications. That
is, one can create a quite useful corpus with a relatively simple and
shallow markup structure, and with much of the information about the
corpus content provided as separate documentation, tables, or
stand-off annotations (rather than as in-line markup attached to the
data).
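
To make the stand-off idea concrete, here is a minimal sketch in Python.
It assumes annotations are kept as character offsets into an untouched
text; the Span type, the labels, and the annotated() helper are
hypothetical names made up for illustration, not part of TEI or of any
particular toolkit.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """One stand-off annotation (hypothetical layout, for illustration)."""
    start: int  # character offset where the span begins
    end: int    # character offset just past the end of the span
    label: str  # illustrative tag name


# The corpus text itself carries no markup at all.
text = "The quick brown fox jumps."

# Annotations live in a separate structure (in practice, a separate file
# or table) and point into the text by offset.
annotations = [
    Span(4, 9, "ADJ"),
    Span(16, 19, "NOUN"),
]


def annotated(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Pair each annotated substring with its label; the text is not modified."""
    return [(text[s.start:s.end], s.label) for s in spans]


print(annotated(text, annotations))
```

A task that has no use for the annotations can read the text directly
and ignore the offset table; the price is that the offsets have to be
kept in step with the text if the text is ever edited.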

I would differentiate between "ease of access" and "ease of use".  Yes,
easy access to downloadable data sets (e.g. pointing "wget -r ..." at
any number of web sites) can lead to some very messy collections that
won't answer any question very well (except "How quickly can you fill
your hard disk?"); and cleaning up this sort of mess to produce useful
language data is complicated, time-consuming work.

But when that complicated work is actually done, the end product is most
useful when it is easy to process, browse, search, summarize, etc.  In
this regard, I tend to prefer markup that supports and simplifies the
computational uses of the data, and doesn't impose a heavy burden of
parsing through complex headers or intrusive in-line tags, where much of
the detail provided by the markup will tend to be irrelevant to any
given task at hand.

I have seen and/or been party to both extremes -- heavy markup vs. no
markup.  Regardless of where one chooses to sit on that scale, creating
a corpus of good quality is still a lot of work.  But other things being
equal, a corpus with little markup can be just as useful as one with
lots, and will tend to be easier to use.

-----------
David Graff			Linguistic Data Consortium
graff at ldc.upenn.edu		3600 Market St., Suite 810
voice: (215) 898-0887		University of Pennsylvania
fax:   (215) 573-2175		Philadelphia, PA 19104
		http://www.ldc.upenn.edu