[Corpora-List] Quotable Statistics on Unstructured Data on the WWW
maxwell
maxwell at umiacs.umd.edu
Fri Dec 6 21:04:14 UTC 2013
On 2013-12-06 15:47, Otto Lassen wrote:
> If texts are structured or unstructured data depends on their origin.
I think a cross-cutting problem, and perhaps a more easily quantified
(but maybe still useful) one, is that of ambiguity. Structured data is
often designed to avoid ambiguity. (Structured data may provide an
explicit representation of ambiguities, but the explicit representation
should not in itself be ambiguous.)
I'm sure someone will come up with counter-examples, but relational
databases and XML documents are both designed to be unambiguously
parseable (given a database schema or an XML schema). So were
blueprints, if anyone remembers those. Natural language, otoh, is
inherently (and often exceedingly) ambiguous. So are Necker cubes.
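A minimal sketch of this contrast, under loud assumptions: the XML record, the toy lexicon and the six grammar rules below are all made up for illustration, and the chart-based counter is just one simple way to count parse trees. The XML string yields exactly one tree, while the English sentence gets more than one analysis even under this tiny grammar.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Structured side: a well-formed XML string parses to exactly one tree.
# (The record itself is a made-up example.)
xml_doc = "<record><name>Ada</name><year>1843</year></record>"
root = ET.fromstring(xml_doc)
fields = [(child.tag, child.text) for child in root]

# Natural-language side: a hypothetical toy grammar in Chomsky normal form.
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("PP", "P", "NP"),
]

def count_parses(words, start="S"):
    """Count distinct parse trees for `words` with a CYK-style chart."""
    n = len(words)
    # chart[(i, j, symbol)] = number of trees for words[i:j] rooted in symbol
    chart = defaultdict(int)
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[(i, i + 1, symbol)] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in RULES:
                    chart[(i, j, parent)] += (
                        chart[(i, k, left)] * chart[(k, j, right)]
                    )
    return chart[(0, n, start)]

sentence = "I saw the man with the telescope".split()
print(fields)                  # one unambiguous reading of the record
print(count_parses(sentence))  # number of syntactic analyses
```

Under this toy grammar the prepositional phrase can attach either to the verb phrase or to the noun phrase, which is where the extra analysis comes from; a realistic grammar would typically license far more.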
So it might be helpful (if possible) to re-phrase the question to ask
how much data is potentially ambiguous, and at what level
(syntactically, morphologically, lexically, semantically,
pragmatically). By "potentially" ambiguous, I mean in principle; a
particular instance of a natural language sentence might be
syntactically unambiguous, but natural language in general is
syntactically ambiguous. I suppose anything is _pragmatically_
ambiguous.
Mike Maxwell