[Corpora-List] Is the TEI a waste of time?

Fri Jun 27 12:43:50 UTC 2003

A quick reply as I must rush off and bore some students.

My point was that one can be in conformity with TEI recommendations without going to great depths; it is very much a question of the needs of the user. For instance, I was not in favour of the Corpus Encoding Standard as it could not go as deeply as the TEI. Also, I did not see the need for another standard: if the CES was in conformity with the TEI, why not just use the TEI, especially as the TEI could go much deeper if the compiler saw the need? For an awful lot of my work on research papers I use a simple standard header and then only mark up as deeply as paragraphs. I do use divs to name the parts, which means that rather than looking at a mass of data, WordSmith will let me cut up my corpus into named chunks. For this I bless the TEI.
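
To make the point concrete, here is a minimal sketch of what such shallow markup and the cutting into named chunks might look like. It is an illustration only: the sample document, the div type values and the helper names split_by_div and word_counts are invented for this example, the markup has not been validated against any TEI DTD, and Python's standard library merely stands in for what a tool such as WordSmith does.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a simple header, then named divs holding paragraphs.
# Not validated against any TEI DTD; invented purely for illustration.
SAMPLE = """\
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A hypothetical research paper</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="introduction">
        <p>Corpora need careful design.</p>
        <p>Markup records that design.</p>
      </div>
      <div type="method">
        <p>We scanned forty papers.</p>
      </div>
    </body>
  </text>
</TEI.2>
"""


def split_by_div(xml_text):
    """Map each div's type attribute to the list of its paragraph texts."""
    root = ET.fromstring(xml_text)
    chunks = {}
    for div in root.iter("div"):
        name = div.get("type", "unnamed")
        # Collapse internal whitespace so each paragraph is one clean string.
        paragraphs = [" ".join("".join(p.itertext()).split())
                      for p in div.findall("p")]
        chunks.setdefault(name, []).extend(paragraphs)
    return chunks


def word_counts(chunks):
    """Count whitespace-separated tokens per named chunk."""
    return {name: sum(len(p.split()) for p in paras)
            for name, paras in chunks.items()}
```

Calling word_counts(split_by_div(SAMPLE)) gives one figure per named part rather than a single total for the whole file, which is all that "named chunks" amounts to here.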

My main bugbear is that part of the NLP community seems to have fallen for the how-much-and-how-quickly-can-I-download-it view of a corpus, which is very much a never-mind-the-quality-feel-the-width syndrome (BBC cultural reference that maybe shows my age). If this suits their needs, all well and good, and in such a case good markup may not be relevant. However, many students in corpus linguistics, as well as supervisors who have only recently leapt on the bandwagon, have also fallen for this syndrome, which is why I see so many "corpora" dealing with, for example, NLP, because the texts are easily available on the web. I still feel that there is a need to tackle "difficult" areas, where a good old scanner may well be necessary. In this case, simple TEI gives such a wonderful tool for making sense of a text rather than simply thinking in numbers of words.

Better dash off

best

Geoffrey-who-will-be-in-a-meeting-this-weekend-so-might-pick-up-the-thread-later.

----- Original Message ----- 
From: "David Graff" <graff at unagi.cis.upenn.edu>
To: <corpora at uib.no>
Sent: Friday, June 27, 2003 12:44 PM
Subject: Re: [Corpora-List] Is the TEI a waste of time? 

> 
> geoffrey.williams at wanadoo.fr said:
> > Easy access to vast amounts of downloadable data has meant that a
> > number of "corpus linguists" neither know nor care about the niceties
> > of corpus creation, and the whys and wherefores of selecting and
> > marking up data. Ease of access has become the main criterion,
> > potentially to the detriment of the discipline itself. Easy solutions
> > do not necessarily answer the most pertinent questions.
> 
> I agree wholeheartedly with these points.  However, it is possible to
> devote all due attention and care to the "niceties, whys and wherefores"
> without strict adherence to the full details of TEI specifications. That
> is, one can create a quite useful corpus with a relatively simple and
> shallow markup structure, and with much of the information about the
> corpus content provided as separate documentation, tables, or
> stand-off annotations (rather than as in-line markup attached to the
> data).
> 
> I would differentiate between "ease of access" and "ease of use".  Yes,
> easy access to downloadable data sets (e.g. pointing "wget -r ..." at
> any number of web sites) can lead to some very messy collections that
> won't answer any question very well (except "How quickly can you fill
> your hard disk?"); and cleaning up this sort of mess to produce useful
> language data is complicated, time-consuming work.
> 
> But when that complicated work is actually done, the end product is most
> useful when it is easy to process, browse, search, summarize, etc.  In
> this regard, I tend to prefer markup that supports and simplifies the
> computational uses of the data, and doesn't impose a heavy burden of
> parsing through complex headers or intrusive in-line tags, where much of
> the detail provided by the markup will tend to be irrelevant to any
> given task at hand.
> 
> I have seen and/or been party to both extremes -- heavy markup vs. no
> markup.  Regardless of where one chooses to sit on that scale, creating
> a corpus of good quality is still a lot of work.  But other things being
> equal, a corpus with little markup can be just as useful as one with
> lots, and will tend to be easier to use.
> 
> -----------
> David Graff Linguistic Data Consortium
> graff at ldc.upenn.edu 3600 Market St., Suite 810
> voice: (215) 898-0887 University of Pennsylvania
> fax:   (215) 573-2175 Philadelphia, PA 19104
> http://www.ldc.upenn.edu