[Corpora-List] txt to XML

Chris Brew cbrew at ling.ohio-state.edu
Mon Oct 27 13:45:48 UTC 2003


On Mon, Oct 27, 2003 at 12:14:04PM +0100, Belén Díez wrote:
>
>      Dear list members,
>
>      I have been following the discussion about how to convert XML into
>      TXT, but the problem I'm facing is just the opposite.
>
>      I have a scanned corpus and I would like to have it in XML format
>      to be able to work with it in this way and follow the TEI
>      guidelines.
>
>      Does anybody have the same problem? Any hints?
>
>      Thanks for any suggestions.
>
>      Belén Díez
>
>    Belen Diez
>    Department of English Philology
>    University of Jaen
>    Spain
>      _________________________________________________________________
>

Can you give an example of input and desired output? There are many
different ways to take a document and make it into valid XML. One of
the simplest is to declare a single XML element type (e.g. "docmnt"),
declare it to have arbitrary text content, and you are almost done.
Not quite done, because you need to ensure that the small number of
characters that cannot appear in an XML document are appropriately
escaped. This applies to '&' and '<', which need to be rewritten as
&amp; and &lt; respectively. This is a literal-minded solution to
your problem, but presumably not the one you want, because there
must be a task that you have in mind which tells you that you need
more elaborate markup than just one element. For example, you might
want to mark paragraph boundaries, sentence boundaries and/or
parts of speech. But reflecting on why you would not be happy with
the naive literal-minded solution may help you clarify your requirements.
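
As a rough illustration of the literal-minded approach, here is a minimal
Python sketch. It assumes the scanned text is already available as a plain
string; the function name wrap_as_xml is hypothetical, and "docmnt" is just
the example element name used above. It relies on escape() from the standard
library module xml.sax.saxutils, which rewrites '&' and '<' (and also '>').
It does not handle control characters that XML forbids outright.

```python
from xml.sax.saxutils import escape


def wrap_as_xml(text, tag="docmnt"):
    """Wrap plain text in a single element (hypothetical helper).

    escape() replaces '&', '<' and '>' with '&amp;', '&lt;' and '&gt;',
    so the text content cannot be mistaken for markup.
    """
    return "<{0}>{1}</{0}>".format(tag, escape(text))


if __name__ == "__main__":
    # Illustrative input only, not taken from any real corpus.
    print(wrap_as_xml("AT&T < IBM"))
    # prints: <docmnt>AT&amp;T &lt; IBM</docmnt>
```

Anything richer than this (paragraphs, sentences, parts of speech) means
deciding where the extra element boundaries go, which is where the real
design work starts.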

There are also many different ways of making a document into one that
meets TEI requirements, because the TEI gives many options. I strongly
recommend the TEI pizza chef (http://www.tei-c.org/pizza.html) as a
tool for thinking through the process of designing an appropriate
encoding for a corpus.

Best

Chris


