[Corpora-List] Quotable Statistics on Unstructured Data on the WWW
Rich Cooper
rich at englishlogickernel.com
Fri Dec 6 17:48:29 UTC 2013
+1 (another shot to be fired (:-|)
Sincerely,
Rich Cooper
EnglishLogicKernel.com
Rich AT EnglishLogicKernel DOT com
9 4 9 \ 5 2 5 - 5 7 1 2
_____
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no]
On Behalf Of Stefan Bordag
Sent: Friday, December 06, 2013 9:10 AM
To: corpora at uib.no
Subject: Re: [Corpora-List] Quotable Statistics on Unstructured Data
on the WWW
Hi there,
For me, the difference between structured and unstructured is whether
it is possible to use some kind of simple and precise query system
that is guaranteed to retrieve a particular piece of information, if
it is there. Doing that on a database is easy: use SQL or any of the
other database access systems.
Doing that on information in text is not easy. Even with the best
full-text search engine you are never guaranteed to get the one piece
of information you were looking for, and only that information. For
this purpose it is irrelevant whether the text is semi-structured or
"fully" unstructured. The "semi" refers to the vague feeling that it
might be easier to extract information from tables in text, but as a
matter of fact it is not (or not significantly), since people tend to
invent all kinds of tables and information mashups. It is only easier
in very specific domains where you can make valid assumptions about
what kind of tabular information representations to expect.
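A minimal sketch of this contrast, assuming a toy two-row table and a
two-sentence text invented purely for illustration (the names
`tokens_from_db` and `sentences_mentioning` and all figures are
hypothetical, not from the thread). The SQL lookup returns exactly the
stored value or nothing, while a simple keyword search over the prose
also returns a sentence that only mentions the name in passing.

```python
import re
import sqlite3

# Hypothetical toy data: the same facts stored once as rows, once as prose.
ROWS = [("Corpus A", 1000), ("Corpus B", 2500)]
TEXT = (
    "Corpus A contains 1000 tokens. "
    "Corpus B, with 2500 tokens, is larger than Corpus A."
)


def tokens_from_db(name):
    """Exact lookup: the stored value if the row exists, otherwise None."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE corpus (name TEXT PRIMARY KEY, tokens INTEGER)")
    conn.executemany("INSERT INTO corpus VALUES (?, ?)", ROWS)
    row = conn.execute(
        "SELECT tokens FROM corpus WHERE name = ?", (name,)
    ).fetchone()
    conn.close()
    return row[0] if row else None


def sentences_mentioning(name):
    """Keyword search: every sentence containing the name, relevant or not."""
    sentences = re.split(r"(?<=\.)\s+", TEXT)
    return [s for s in sentences if name in s]


if __name__ == "__main__":
    print(tokens_from_db("Corpus A"))
    for hit in sentences_mentioning("Corpus A"):
        print(hit)
```

Here the keyword search for "Corpus A" returns both sentences, although
only the first one states its size; deciding which hit answers the
question is left to the reader or to further processing.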
So I fully agree that there should be a distinction between structured
and unstructured, and I should probably be shot next now. :)
Best regards,
Stefan
On 06.12.2013 at 17:14, Eric Atwell wrote:
Sitting on the fence, I would say that text has IMPLICIT structure at
many levels (morphological, phrase structure, dependency etc) but this
is not (usually) explicitly labelled or "structured" (past tense verb).
For example, see http://corpus.quran.com/treebank.jsp - an example
4-word verse from the Quran ("unstructured text") alongside structure
labelling of morphology, syntax, dependency, as well as audio
recitation and word-by-word English translation.
Linguists see this implicit structure in all language, whereas (some)
computer/information scientists only recognise structure if explicit
delimiters or tags are included in the character data stream; hence
the 4-word Quran verse is "unstructured" whereas the Treebank
annotated data is "structured".
Eric Atwell,
Language research group, School of Computing
(hence on the fence :-)
Leeds University
PS A 2nd, unrelated, comment: even "plain text" Web pages contain HTML
structure marking headers, paragraphs, links etc, so there is
virtually no "unstructured data" on the web.
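As a rough illustration of that point, the sketch below counts the
markup elements in a small page using Python's standard `html.parser`.
The page fragment, the `example.org` link, and the `TagCounter` and
`count_tags` names are assumptions made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of opening tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.tags


# Hypothetical "plain text" page: a header, two paragraphs and one link.
PAGE = (
    "<h1>Corpora</h1>"
    "<p>See the <a href='http://example.org'>list</a>.</p>"
    "<p>Bye.</p>"
)

if __name__ == "__main__":
    for tag, n in sorted(count_tags(PAGE).items()):
        print(tag, n)
```

Even this tiny fragment carries explicit header, paragraph and link
markers around the running text.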
PPS: congratulations to Kais Dukes, developer of corpus.quran.com
- who passed his PhD viva yesterday!
On Fri, 6 Dec 2013, Adam Kilgarriff wrote:
there's phrase structure and dependency structure
and morphological structure and text structure and
rhetorical structure
and semantic structure
On 6 December 2013 12:12, Daniel Gerber
<dgerber at informatik.uni-leipzig.de> wrote:
Hello Adam,
On 06.12.2013, at 12:45, Adam Kilgarriff
<adam at lexmasterclass.com> wrote:
> I always squirm when I hear text referred to as unstructured data.
> (Daniel - I see you do too, from the '(semi-)'.) It feels like a
> teenager declaring everyone over 25 as old.
What do you see text as, then? Yes, I typically refer to text as being
unstructured, tables and so on as semi-structured, and databases as
structured.
I'm sorry that you feel greatly offended by my understanding. But your
reply does not answer my question, nor does it help me to understand a
different point of view any better.
> Adam
>
> (PS - I first came across it in the IBM-promoted UIMA, the U is
> unstructured, so the inventors of that acronym should be shot. Not
> sure if the initiative is ongoing.)
I think you should apologize to the people you want to be shot. I
can't believe that someone (especially with a scientific background
such as yours) expresses himself in such a manner.
Daniel
>
>
>
> On 6 December 2013 08:48, Daniel Gerber
> <dgerber at informatik.uni-leipzig.de> wrote:
> Hi,
> I'm searching for any quotable statistics for the distribution of
> structured vs. (semi-)unstructured data on the web.
> So far I could only find some blog posts about Big Data statistics or
> presentations which claim a 15%-85% distribution but forget to quote
> the sources for this claim.
>
> Any help would be greatly appreciated,
> Daniel
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
>
> --
> ========================================
> Adam Kilgarriff                    adam at lexmasterclass.com
> Director                           Lexical Computing Ltd
> Visiting Research Fellow           University of Leeds
> Corpora for all with the Sketch Engine
> DANTE: a lexical database for English
> ========================================
--
Dr. Stefan Bordag
Head of Research
ExB Research & Development GmbH
Seeburgstr. 100 | 04103 Leipzig | Germany
Phone +49.341.30854851 | Fax +49.89.550673.41
Mobile +49.176.70857605 | email: bordag at exb.de
HRB 184556, Registergericht München
Geschäftsführer: Nicola Pizzoni
UStd-ID Nr: DE-209346179