[Corpora-List] "Multi-encoded" corpora

Albretch Mueller lbrtchx at gmail.com
Mon Oct 6 01:49:55 UTC 2008


~
 I was browsing the BAWE corpus info previously posted here, and when
I noticed that all the texts are in PDF format (!), it made me wonder
how you treat multi-encoded text, say scientific texts containing
mathematical formulas, programming books containing actual code, ...
~
 I think communicative universes are mostly, if not always,
multi-encoded (I just don't know what to call that, but "multi-lingual"
it is not), and all these code-planes participate while communicating.
When you go out to eat somewhere, you:
~
 1) read a menu
 2) of food made after some recipe
 3) talk to the wait[er|ress]
 4) pay
. . .
~
 Or, which is what I have in mind, say you want to encode Euclid's
Elements, including all definitions, postulates (axioms), propositions
(theorems and constructions), mathematical proofs of the propositions,
charts, apocrypha sections, ... and then do the same with the articles
that tried to prove the 5th postulate, the still ongoing
philosophical/logical inquiries, ... even including Schopenhauer's
beef with the obsession we mathematicians had for more than 20
centuries with this issue ;-)
~
 http://en.wikipedia.org/wiki/Schopenhauer%27s_criticism_of_the_proofs_of_the_Parallel_Postulate
~
 When I say multi-encoded here I mean "code" in a general sense. For
example, there is a difference and an interplay between what is
written as law and what is said in court. These, to me, are two
different "codes" ... even though the same natural language (NL) is
being used
~
 By the way, I am looking at these issues more from a semiotic point
of view than a linguistic one
~
 ... and going back to the BAWE corpus, I know there are ways to get
PDF content (essentially a picture) as text into the preprocessing
format they use (was it lex?). What I don't know is how good this
textual preprocessing format is at describing drawings
~
 By the way, I know you can use pdf2txt, but you will be losing
everything that is not plain text
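~
 To make that loss concrete, here is a minimal Python sketch. It does
not show how pdf2txt or the BAWE preprocessing actually works; the
Segment class, the code labels ("prose", "math", "source_code",
"drawing") and the file name figure-1.svg are all made up for
illustration. It only shows that once each span of a document is
tagged with the "code" it is written in, a text-only export keeps one
code and silently drops the others.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    code: str     # hypothetical label for the "code" this span is written in
    content: str  # raw text, or a file reference for non-textual material


def flatten_to_plain_text(segments):
    """Mimic a text-only export: keep prose segments, drop everything else."""
    return "\n".join(s.content for s in segments if s.code == "prose")


def lost_codes(segments):
    """Return the sorted set of codes a text-only export would discard."""
    return sorted({s.code for s in segments if s.code != "prose"})


# A tiny made-up "multi-encoded" document: the same idea in four codes.
document = [
    Segment("prose", "In a right triangle, the squares on the legs "
                     "together equal the square on the hypotenuse."),
    Segment("math", r"a^2 + b^2 = c^2"),
    Segment("drawing", "figure-1.svg"),  # hypothetical file name
    Segment("source_code", "c = (a**2 + b**2) ** 0.5"),
]

print(flatten_to_plain_text(document))
print(lost_codes(document))
```

 The design choice here is simply to keep the code label next to each
span instead of flattening first, so that whatever is dropped at
export time is at least known and countable.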
~
 thanks
 lbrtchx

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


