[Corpora-List] Corpus Development

Oliver Mason O.Mason at bham.ac.uk
Sun Apr 20 09:18:09 UTC 2008


> By fully functional, I mean something that can be rightly called a corpus.

That probably opens a can of worms, but one definition of a corpus
would be authentic data collected for answering a specific research
question.  Most corpora are general enough to answer many questions,
but 'fully functional' only makes sense in relation to a question.  If
you want to look at spoken Pashto, then your corpus of written data
would be useless.  And I don't think you can create a corpus to answer
all conceivable questions.

For example, the Bank of English was collected for the purpose of
creating a contemporary learners' dictionary.  Hence it contains no
historical data, but it does cover a variety of genres/text types and
data from various regions.  As it happens, it can be (and is) also
used for looking at other aspects of English apart from just lexis.

I'm not sure if there was a specific purpose for creating the BNC (Lou
would know, I guess), but it too is suitable for many different
research questions.  FLOB and Frown were mainly collected for
investigating language change, but they are also more versatile than
that.

As for software, a corpus is just data.  If it is stored in a common,
documented format, many different programs can be used to process it,
which is desirable, as you never know what the next person will want
to use it for.
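
To illustrate the point about data and software being separate, here
is a minimal sketch in Python.  Everything in it is an assumption made
for illustration: the three-sentence CORPUS string, the crude
tokenizer, and the function names are hypothetical and not taken from
any existing corpus tool.  It only shows two unrelated "programs" (a
frequency list and a keyword-in-context concordance) working on the
same plain-text data.

```python
import re
from collections import Counter

# Hypothetical toy corpus; real data would normally be read from files.
CORPUS = (
    "The corpus is just data. "
    "The data can be counted. "
    "The data can also be searched."
)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequencies(text):
    """One 'program': build a word frequency list."""
    return Counter(tokenize(text))


def concordance(text, node, width=2):
    """Another 'program': keyword-in-context lines for one word.

    Returns (left context, node word, right context) tuples, with up
    to `width` tokens of context on each side.
    """
    tokens = tokenize(text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    print(frequencies(CORPUS).most_common(3))
    for left, node, right in concordance(CORPUS, "data"):
        print(f"{left:>15} [{node}] {right}")
```

Neither function knows about the other; they only share the data,
which is why the storage format matters more than any one tool.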

Oliver
