[Corpora-List] Quotable Statistics on Unstructured Data on the WWW

Leon Derczynski leon at dcs.shef.ac.uk
Fri Dec 6 12:31:52 UTC 2013


Dear Daniel,

There is an active "Data Extraction" community that, among other things,
works on extracting web data that is encoded not linguistically but in
other forms (e.g. in web tables, visually in document layout, in semantic
markup). They may have evidence that helps answer your query. I know of
two workshops on this work that could provide helpful starting points:

http://diadem.cs.ox.ac.uk/oxford13/
http://diadem.cs.ox.ac.uk/deos14/

Of course, one perhaps needs to define what data is and how to measure "a
data" before talking about the percentage of data in a given format -
but that's another issue!

All the best,


Leon


On 6 December 2013 09:48, Daniel Gerber
<dgerber at informatik.uni-leipzig.de> wrote:

> Hi,
> I’m searching for any quotable statistics on the distribution of
> structured vs. (semi-)unstructured data on the web.
> So far I could only find some blog posts about Big Data statistics and
> presentations which claim a 15%-85% distribution but fail to cite the
> sources for this claim.
>
> Any help would be greatly appreciated,
> Daniel
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
Leon R A Derczynski
Research Associate, NLP Group

Department of Computer Science
University of Sheffield, UK

http://www.dcs.shef.ac.uk/~leon/