[Corpora-List] Quotable Statistics on Unstructured Data on the WWW

Eric Atwell E.S.Atwell at leeds.ac.uk
Fri Dec 6 16:14:00 UTC 2013


Sitting on the fence, I would say that text has IMPLICIT structure at
many levels (morphological, phrase structure, dependency etc.) but this
is not (usually) explicitly labelled or "structured" (e.g. "past tense verb").
For example, see http://corpus.quran.com/treebank.jsp - an example
4-word verse from the Quran ("unstructured text") alongside
structure labelling of morphology, syntax, dependency, as well as
audio recitation and word-by-word English translation.
Linguists see this implicit structure in all language, whereas
(some) computer/information scientists only recognise structure 
if explicit delimiters or tags are included in the character data
stream; hence the 4-word Quran verse is "unstructured" whereas the 
Treebank annotated data is "structured".

Eric Atwell,
  Language research group, School of Computing (hence on the fence :-)
  Leeds University

PS A 2nd, unrelated, comment: even "plain text" Web pages contain HTML
structure marking headers, paragraphs, links etc., so there is virtually
no "unstructured data" on the web.

PPS: congratulations to Kais Dukes, developer of corpus.quran.com
  - who passed his PhD viva yesterday!


On Fri, 6 Dec 2013, Adam Kilgarriff wrote:

> there's phrase structure and dependency structure and morphological structure and text structure and rhetorical structure
> and semantic structure
> 
> 
> On 6 December 2013 12:12, Daniel Gerber <dgerber at informatik.uni-leipzig.de> wrote:
>       Hallo Adam,
>
>       On 06.12.2013, at 12:45, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
>
>       > I always squirm when I hear text referred to as unstructured data. (Daniel - I see you do too, from the
>       > '(semi-)'.) It feels like a teenager declaring everyone over 25 as old.
> 
> What do you see text as, then? Yes, I typically refer to text as being unstructured, tables and so on as
> semi-structured, and databases as structured.
> I’m sorry that you feel greatly offended by my understanding. But your reply does not answer my question nor does it
> help me to understand a different point of view any better.
> 
> > Adam
> >
> > (PS - I first came across it in the IBM-promoted UIMA, the U is unstructured, so the inventors of that acronym
> > should be shot. Not sure if the initiative is ongoing.)
> 
> I think you should apologize to the people you want to be shot. I can’t believe that someone (especially someone
> with a scientific background such as yours) expresses himself in such a manner.
> 
> Daniel
> 
> >
> >
> >
> > On 6 December 2013 08:48, Daniel Gerber <dgerber at informatik.uni-leipzig.de> wrote:
> > Hi,
> > I’m searching for any quotable statistics for the distribution of structured vs. (semi-)unstructured data on the
> > web.
> > So far I could only find some blog posts about Big Data statistics, or presentations which claim a 15%-85%
> > distribution but forget to cite the sources for this claim.
> >
> > Any help would be greatly appreciated,
> > Daniel
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
> >
> >
> >
> > --
> > ========================================
> > Adam Kilgarriff                  adam at lexmasterclass.com
> > Director                                    Lexical Computing Ltd
> > Visiting Research Fellow                 University of Leeds
> > Corpora for all with the Sketch Engine
> >                         DANTE: a lexical database for English
> > ========================================
> 
> 
> 
> 

-- 
Eric Atwell, Associate Professor, Language research group,
  I-AIBS Institute for Artificial Intelligence and Biological Systems
  School of Computing, Faculty of Engineering, UNIVERSITY OF LEEDS
  Leeds LS2 9JT, England.        TEL: 0113-3435430  FAX: 0113-3435468
  WWW: http://www.comp.leeds.ac.uk/eric
       http://www.comp.leeds.ac.uk/nlp
       http://www.comp.leeds.ac.uk/arabic