[Corpora-List] Quotable Statistics on Unstructured Data on the WWW
Seth Grimes
grimes at altaplana.com
Fri Dec 6 12:29:41 UTC 2013
On Fri, 6 Dec 2013, Daniel Gerber wrote:
> On 06.12.2013, at 12:45, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
>
>> I always squirm when I hear text referred to as unstructured data.
>> (Daniel - I see you do too, from the '(semi-)'.) It feels like a
>> teenager declaring everyone over 25 as old.
>
> As what do you see text then? Yes, I typically refer to text as being
> unstructured, tables and so on as semi-structured and databases as
> structured.
Text is one form that "content" or "media" takes. Those words are
overloaded, however, so neither will overtake the very
imprecise term "unstructured."
I took on this mislabeling of text back in 2005 in an article titled
"Structure, Models and Meaning":
http://www.informationweek.com/software/information-management/structure-models-and-meaning/d/d-id/1030187?
"Most unstructured data is merely unmodeled. Take text, whether written or
transcribed from speech. Within the unstructured category, text is of
greatest interest to most enterprises. If text didn't have structure,
however, documents like this column would be opaque. Text has linguistic
structure, both syntactic (grammatical) and semantic (meaning), and texts
almost always appear within an envelope of descriptive metainformation
such as date, publication and author's name that are used to index
documents for storage and retrieval."
As for the first question in the thread --
> On 6 December 2013 08:48, Daniel Gerber <dgerber at informatik.uni-leipzig.de> wrote:
> I'm searching for any quotable statistics for the distribution of
> structured vs. (semi-)unstructured data on the web. So far I could only
> find some blog posts about Big Data statistics or presentations which
> claim a 15%-85% distribution but forget to quote the sources for this
> claim.
-- I took that on too, back in 2008. See my article, "Unstructured Data and
the 80 Percent Rule":
http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/
Seth
--
Seth Grimes grimes at altaplana.com +1 301-270-0795 @sethgrimes
Alta Plana Corp, analytics strategy consulting, http://altaplana.com
http://SentimentAnalysisSymposium.com organizer, March 5-6, 2014, NY