[Corpora-List] Quotable Statistics on Unstructured Data on the WWW

Fri Dec 6 21:04:14 UTC 2013

On 2013-12-06 15:47, Otto Lassen wrote:
> If texts are structured or unstructured data depends on their origin.

I think a cross-cutting problem, and perhaps a more easily quantified 
(but maybe still useful) one, is that of ambiguity.  Structured data is 
often designed to avoid ambiguity.  (Structured data may provide an 
explicit representation of ambiguities, but the explicit representation 
should not in itself be ambiguous.)

I'm sure someone will come up with counter-examples, but relational 
databases and XML documents are both designed to be unambiguously 
parseable (given a database schema or an XML schema).  So were 
blueprints, if anyone remembers those.  Natural language, otoh, is 
inherently (and often exceedingly) ambiguous.  So are Necker cubes.

So it might be helpful (if possible) to re-phrase the question to ask 
how much data is potentially ambiguous, and at what level 
(syntactically, morphologically, lexically, semantically, 
pragmatically).  By "potentially" ambiguous, I mean in principle; a 
particular instance of a natural language sentence might be 
syntactically unambiguous, but natural language in general is 
syntactically ambiguous.  I suppose anything is _pragmatically_ 
ambiguous.
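
As an illustration of the syntactic level only, here is a minimal 
sketch that counts the parse trees a toy grammar assigns to a sentence, 
using a CKY-style chart.  The grammar, the lexicon, and the function 
name count_parses are all made up for this example; they are 
assumptions, not a description of any real parser or corpus.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (left, right) -> parents.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # PP attaches to the verb phrase
    ("NP", "PP"): ["NP"],   # PP attaches to the noun phrase
    ("Det", "N"): ["NP"],
    ("P", "NP"): ["PP"],
}

# Hypothetical lexicon: word -> possible categories.
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}


def count_parses(words, start="S"):
    """Return how many distinct parse trees the toy grammar gives `words`."""
    n = len(words)
    # chart[(i, k)][symbol] = number of ways `symbol` can span words[i:k]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for tag in LEXICON[word]:
            chart[(i, i + 1)][tag] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):  # split point between the two children
                for left, left_count in list(chart[(i, j)].items()):
                    for right, right_count in list(chart[(j, k)].items()):
                        for parent in BINARY.get((left, right), []):
                            chart[(i, k)][parent] += left_count * right_count
    return chart[(0, n)][start]


if __name__ == "__main__":
    for sentence in ("I saw the man", "I saw the man with the telescope"):
        print(sentence, "->", count_parses(sentence.split()), "parse(s)")
```

Under this toy grammar the first sentence gets one parse and the second 
gets two (the prepositional phrase can attach to the verb or to the 
noun), so the same grammar yields both an unambiguous instance and an 
ambiguous one.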

    Mike Maxwell
