[Corpora-List] Quotable Statistics on Unstructured Data on the WWW

Seth Grimes grimes at altaplana.com
Fri Dec 6 12:29:41 UTC 2013


On Fri, 6 Dec 2013, Daniel Gerber wrote:

> On 06.12.2013, at 12:45, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
>
>> I always squirm when I hear text referred to as unstructured data. 
>> (Daniel - I see you do too, from the '(semi-)'.)  It feels like a 
>> teenager declaring everyone over 25 as old.
>
> How do you see text, then? Yes, I typically refer to text as being 
> unstructured, tables and so on as semi-structured, and databases as 
> structured.

Text is one form that "content" or "media" takes. Both words are 
overloaded, however, so neither will displace the very imprecise 
term "unstructured."

I took on this mislabeling of text back in 2005 in an article titled 
Structure, Models and Meaning: 
http://www.informationweek.com/software/information-management/structure-models-and-meaning/d/d-id/1030187?

"Most unstructured data is merely unmodeled. Take text, whether written or 
transcribed from speech. Within the unstructured category, text is of 
greatest interest to most enterprises. If text didn't have structure, 
however, documents like this column would be opaque. Text has linguistic 
structure, both syntactic (grammatical) and semantic (meaning), and texts 
almost always appear within an envelope of descriptive metainformation 
such as date, publication and author's name that are used to index 
documents for storage and retrieval."

As for the first question in the thread --

> On 6 December 2013 08:48, Daniel Gerber <dgerber at informatik.uni-leipzig.de> wrote:

> I'm searching for any quotable statistics for the distribution of 
> structured vs. (semi-)unstructured data on the web. So far I could only 
> find some blog posts about Big Data statistics or presentations which 
> claim a 15%-85% distribution but forget to quote the sources for this 
> claim.

-- I took that on too, back in 2008. See my article, Unstructured Data and 
the 80 Percent Rule: 
http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/

 						Seth


-- 
Seth Grimes    grimes at altaplana.com   +1 301-270-0795    @sethgrimes
Alta Plana Corp, analytics strategy consulting, http://altaplana.com
http://SentimentAnalysisSymposium.com organizer, March 5-6, 2014, NY



