[Corpora-List] Quotable Statistics on Unstructured Data on the WWW

Stefan Bordag bordag at exb.de
Fri Dec 6 17:10:09 UTC 2013


Hi there,

For me, the difference between structured and unstructured is whether it 
is possible to use some kind of simple and precise query system that is 
guaranteed to retrieve a particular piece of information, if it is there. 
Doing that on a database is easy: use SQL or any of the other database 
access systems.

Doing that on information in text is not easy. Even with the best 
full-text search engine you are never guaranteed to get the one piece of 
information you were looking for, and only that information. For this 
purpose it is irrelevant whether the text is semi-structured or "fully" 
unstructured. The "semi" refers to the vague feeling that it might be 
easier to extract information from tables in text, but as a matter of 
fact it is not (or not significantly), since people tend to invent all 
kinds of tables and information mashups. It is only easier in very 
specific domains where you can make valid assumptions about what kind of 
tabular information representations to expect.
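
To make the contrast concrete, here is a minimal Python sketch. Everything 
in it is an assumption for illustration: the table layout, the helper 
functions structured_lookup and keyword_search, and the population figures 
are invented and do not come from any real data set. The naive substring 
match only stands in for a full-text search engine; a real engine is far 
more sophisticated, but it faces the same precision and recall problem.

```python
import sqlite3

# Hypothetical toy data: the same kind of fact stored as rows and as prose.
# The numbers are invented for this example.
rows = [("Leipzig", 520000), ("Leeds", 750000)]
sentences = [
    "Leipzig has roughly 520000 inhabitants.",
    "The conference in Leipzig drew 300 visitors.",
    "About three quarters of a million people live in Leeds.",
]


def structured_lookup(city):
    """Exact lookup: return the stored population, or None if there is no row."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE city (name TEXT PRIMARY KEY, population INTEGER)")
    con.executemany("INSERT INTO city VALUES (?, ?)", rows)
    hit = con.execute(
        "SELECT population FROM city WHERE name = ?", (city,)
    ).fetchone()
    con.close()
    return hit[0] if hit else None


def keyword_search(*terms):
    """Naive text search: sentences containing every term, ignoring case."""
    return [s for s in sentences if all(t.lower() in s.lower() for t in terms)]
```

With this toy data, structured_lookup("Leipzig") returns exactly the stored 
figure, while keyword_search("Leipzig") also returns the unrelated 
conference sentence, and keyword_search("Leeds", "population") finds 
nothing because the text never uses the word "population".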

So, I fully agree that there should be a distinction between 
structured and unstructured, and I should probably be shot next now. :)

Best regards,
Stefan

On 06.12.2013 17:14, Eric Atwell wrote:
> Sitting on the fence, I would say that text has IMPLICIT structure at
> many levels (morphological, phrase structure, dependency etc) but this
> is not (usually) explicitly labelled or "structured" (past tense verb). 
> For example, see http://corpus.quran.com/treebank.jsp - an example
> 4-word verse from the Quran ("unstructured text") alongside
> structure labelling of morphology, syntax, dependency, as well as
> audio recitation and word-by-word English translation.
> Linguists see this implicit structure in all language, whereas
> (some) computer/information scientists only recognise structure if 
> explicit delimiters or tags are included in the character data
> stream; hence the 4-word Quran verse is "unstructured" whereas the 
> Treebank annotated data is "structured".
>
> Eric Atwell,
>  Language research group, School of Computing (hence on the fence :-)
>  Leeds University
>
> PS A 2nd, unrelated, comment: even "plain text" Web-pages contain HTML 
> structure marking headers, paragraphs, links etc so there is virtually
> no "unstructured data" on the web
>
> PPS: congratulations to Kais Dukes, developer of corpus.quran.com
>  - who passed his PhD viva yesterday!
>
>
> On Fri, 6 Dec 2013, Adam Kilgarriff wrote:
>
>> there's phrase structure and dependency structure and morphological 
>> structure and text structure and rhetorical structure
>> and semantic structure
>>
>>
>> On 6 December 2013 12:12, Daniel Gerber 
>> <dgerber at informatik.uni-leipzig.de> wrote:
>>       Hello Adam,
>>
>>       On 06.12.2013, at 12:45, Adam Kilgarriff 
>> <adam at lexmasterclass.com> wrote:
>>
>>       > I always squirm when I hear text referred to as unstructured 
>> data.   (Daniel - I see you do too, from the
>>       '(semi-)'.)    It feels like a teenager declaring everyone over 
>> 25 as old.
>>
>> What do you see text as, then? Yes, I typically refer to text as being 
>> unstructured, tables and so on as semi-structured
>> and databases as structured.
>> I'm sorry that you feel greatly offended by my understanding. But 
>> your reply does not answer my question nor does it
>> help me to understand a different point of view any better.
>>
>> > Adam
>> >
>> > (PS - I first came across it in the IBM-promoted UIMA, the U is 
>> unstructured, so the inventors of that acronym
>> should be shot. Not sure if the initiative is ongoing.)
>>
>> I think you should apologize to the people you want to be shot. I 
>> can't believe that someone (especially with a
>> scientific background like yours) expresses himself in such a manner.
>>
>> Daniel
>>
>> >
>> >
>> >
>> > On 6 December 2013 08:48, Daniel Gerber 
>> <dgerber at informatik.uni-leipzig.de> wrote:
>> > Hi,
>> > I'm searching for any quotable statistics for the distribution of 
>> structured vs.  (semi-)unstructured data on the
>> web.
>> > So far I could only find some blog posts about Big Data statistics 
>> or presentations which claim a 15%-85%
>> distribution but forget to quote the sources for this claim.
>> >
>> > Any help would be greatly appreciated,
>> > Daniel
>> > _______________________________________________
>> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> > Corpora mailing list
>> > Corpora at uib.no
>> > http://mailman.uib.no/listinfo/corpora
>> >
>> >
>> >
>> > --
>> > ========================================
>> > Adam Kilgarriff                  adam at lexmasterclass.com
>> > Director                                    Lexical Computing Ltd
>> > Visiting Research Fellow                 University of Leeds
>> > Corpora for all with the Sketch Engine
>> >                         DANTE: a lexical database for English
>> > ========================================
>>
>>
>>
>>
>>
>>
>
>
>


-- 
Dr. Stefan Bordag
Head of Research
ExB Research & Development GmbH
Seeburgstr. 100  |  04103 Leipzig |  Germany

Phone +49.341.30854851  |  Fax +49.89.550673.41
Mobile  +49.176.70857605  |  email: bordag at exb.de

HRB 184556, Registergericht München
Geschäftsführer: Nicola Pizzoni
UStd-ID Nr: DE-209346179

