<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hi there,<br>
<br>
For me, the difference between structured and unstructured is
whether it is possible to use some kind of simple and precise
query system that is guaranteed to retrieve a particular piece of
information, if it is there. Doing that on a database is easy: use
SQL or any of the other database access systems.<br>
<br>
Doing that on information in text is not easy. Even using the
best full-text search engine, you are never guaranteed to
get the one piece of information you were looking for, and only that
information. For this purpose it is irrelevant whether the text is
semi-structured or "fully" unstructured. The "semi" refers to the
vague feeling that it might be easier to extract information from
tables in text, but as a matter of fact it is not (or not
significantly), since people tend to invent all kinds of tables
and information mashups. It is only easier in very specific
domains where you can make valid assumptions about what kind of
tabular information representations to expect.<br>
<br>
So I fully agree that there should be a distinction between
structured and unstructured, and I should probably be shot next now.
:)<br>
<br>
Best regards,<br>
Stefan<br>
<br>
On 06.12.2013 17:14, Eric Atwell wrote:<br>
</div>
<blockquote
cite="mid:alpine.LRH.2.02.1312061552560.26367@cslin-gps.csunix.comp.leeds.ac.uk"
type="cite">Sitting on the fence, I would say that text has
IMPLICIT structure at
<br>
many levels (morphological, phrase structure, dependency etc) but
this
<br>
is not (usually) explicitly labelled or "structured" (past tense
verb). For example, see <a class="moz-txt-link-freetext" href="http://corpus.quran.com/treebank.jsp">http://corpus.quran.com/treebank.jsp</a> - an
example
<br>
4-word verse from the Quran ("unstructured text") alongside
<br>
structure labelling of morphology, syntax, dependency, as well as
<br>
audio recitation and word-by-word English translation.
<br>
Linguists see this implicit structure in all language, whereas
<br>
(some) computer/information scientists only recognise structure if
explicit delimiters or tags are included in the character data
<br>
stream; hence the 4-word Quran verse is "unstructured" whereas the
Treebank annotated data is "structured".
<br>
<br>
Eric Atwell,
<br>
Language research group, School of Computing (hence on the fence
:-)
<br>
Leeds University
<br>
<br>
PS A 2nd, unrelated, comment: even "plain text" Web-pages contain
HTML structure marking headers, paragraphs, links etc so there is
virtually
<br>
no "unstructured data" on the web
<br>
<br>
PPS: congratulations to Kais Dukes, developer of corpus.quran.com
<br>
- who passed his PhD viva yesterday!
<br>
<br>
<br>
On Fri, 6 Dec 2013, Adam Kilgarriff wrote:
<br>
<br>
<blockquote type="cite">there's phrase structure and dependency
structure and morphological structure and text structure and
rhetorical structure
<br>
and semantic structure
<br>
<br>
<br>
On 6 December 2013 12:12, Daniel Gerber
<a class="moz-txt-link-rfc2396E" href="mailto:dgerber@informatik.uni-leipzig.de">&lt;dgerber@informatik.uni-leipzig.de&gt;</a> wrote:
<br>
Hello Adam,
<br>
<br>
On 06.12.2013, at 12:45, Adam Kilgarriff
<a class="moz-txt-link-rfc2396E" href="mailto:adam@lexmasterclass.com">&lt;adam@lexmasterclass.com&gt;</a> wrote:
<br>
<br>
> I always squirm when I hear text referred to as
unstructured data. (Daniel - I see you do too, from the
<br>
'(semi-)'.) It feels like a teenager declaring everyone
over 25 as old.
<br>
<br>
What do you see text as, then? Yes, I typically refer to text as
being unstructured, tables and so on as semi-
<br>
structured, and databases as structured.
<br>
I’m sorry that you feel greatly offended by my understanding.
But your reply does not answer my question nor does it
<br>
help me to understand a different point of view any better.
<br>
<br>
> Adam
<br>
>
<br>
> (PS - I first came across it in the IBM-promoted UIMA, the
U is unstructured, so the inventors of that acronym
<br>
should be shot. Not sure if the initiative is ongoing.)
<br>
<br>
I think you should apologize to the people you want to be shot.
I can’t believe that someone (especially with a
<br>
scientific background such as yours) would express himself in such a manner.
<br>
<br>
Daniel
<br>
<br>
> On 6 December 2013 08:48, Daniel Gerber
<a class="moz-txt-link-rfc2396E" href="mailto:dgerber@informatik.uni-leipzig.de">&lt;dgerber@informatik.uni-leipzig.de&gt;</a> wrote:
<br>
> Hi,
<br>
> I’m searching for any quotable statistics for the
distribution of structured vs. (semi-)unstructured data on the
<br>
web.
<br>
> So far I could only find some blog posts about Big Data
statistics or presentations which claim a 15%-85%
<br>
distribution but fail to cite the sources for this claim.
<br>
>
<br>
> Any help would be greatly appreciated,
<br>
> Daniel
<br>
> _______________________________________________
<br>
> UNSUBSCRIBE from this page:
<a class="moz-txt-link-freetext" href="http://mailman.uib.no/options/corpora">http://mailman.uib.no/options/corpora</a>
<br>
> Corpora mailing list
<br>
> <a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>
<br>
> <a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>
<br>
> --
<br>
> ========================================
<br>
> Adam Kilgarriff <a class="moz-txt-link-abbreviated" href="mailto:adam@lexmasterclass.com">adam@lexmasterclass.com</a>
<br>
> Director Lexical
Computing Ltd
<br>
> Visiting Research Fellow University of
Leeds
<br>
> Corpora for all with the Sketch Engine
<br>
> DANTE: a lexical database for
English
<br>
> ========================================
<br>
</blockquote>
<br>
<br>
</blockquote>
<br>
<br>
<pre class="moz-signature" cols="72">--
Dr. Stefan Bordag
Head of Research
ExB Research & Development GmbH
Seeburgstr. 100 | 04103 Leipzig | Germany
Phone +49.341.30854851 | Fax +49.89.550673.41
Mobile +49.176.70857605 | email: <a class="moz-txt-link-abbreviated" href="mailto:bordag@exb.de">bordag@exb.de</a>
HRB 184556, Registergericht München
Geschäftsführer: Nicola Pizzoni
UStd-ID Nr: DE-209346179
</pre>
</body>
</html>