[Corpora-List] The standard size of splitting the dataset

Thu Jun 27 13:35:20 UTC 2013

There isn't really a single standard.  It depends a great deal on the size and other characteristics of the dataset.

Mary Elaine Califf, PhD
Interim Director/Associate Professor
School of Information Technology
Illinois State University
mecalif at ilstu.edu


From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Jack Alan
Sent: Thursday, June 27, 2013 7:58 AM
To: corpora at uib.no
Subject: [Corpora-List] The standard size of splitting the dataset

Hi all,

Has anyone come across a standard way of splitting a dataset into training, development and test sets in supervised learning? I mean, what is the typical percentage for each subset, especially for sequence labelling tasks such as POS tagging and NER?

I wonder if it is something like 60% training, 20% development and 20% test?
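
To make the proportions concrete, here is a minimal sketch of a 60/20/20 split done at the sentence level, so that no tagged sentence is cut across subsets. The function name `split_dataset`, the fixed seed, and the toy corpus are assumptions made for illustration; they do not come from any particular toolkit, and the ratios are just the ones mentioned above, not a recommendation.

```python
import random


def split_dataset(sentences, train=0.6, dev=0.2, seed=0):
    """Shuffle whole sentences and split them into train/dev/test lists.

    Whatever is left after the train and dev portions becomes the test set.
    """
    items = list(sentences)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_dev = round(len(items) * dev)
    return (
        items[:n_train],
        items[n_train:n_train + n_dev],
        items[n_train + n_dev:],
    )


# Hypothetical toy corpus: ten one-token sentences of (word, tag) pairs.
corpus = [[("word%d" % i, "NN")] for i in range(10)]
train_set, dev_set, test_set = split_dataset(corpus)
print(len(train_set), len(dev_set), len(test_set))
```

Changing the `train` and `dev` arguments gives other ratios, e.g. 0.8 and 0.1 for an 80/10/10 split on a larger corpus.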

Many thanks