[Corpora-List] Anything resembling TPC benchmarks for corpora?

Thu Jul 12 09:11:19 UTC 2012

Adam Kilgarriff <adam at lexmasterclass.com> wrote:

> I really can't swallow the analogy with DBMS.  That's technology, (corpus)
> linguistics is science.  There, the task is getting your house in order and
> singing from the same hymnsheet.  Here, the big picture is that we are
> trying to find out how language works.

Wholeheartedly agreed. But this doesn't mean that some of the
technological problems involved in storing, annotating and querying
corpora cannot be solved in a theory-agnostic manner.

We may not agree on what exactly our markup will contain, but at least
we should be able to settle on a common markup scheme, and so avoid
problems like, say, someone's concordancer choking on someone else's
POS tags. Inline XML on its own is unfortunately not good enough, as it
requires proper nesting, so annotation layers whose spans overlap
cannot be encoded directly. And at present things are a mess, with a
multitude of competing standards making it very hard to explore the
same corpus with a variety of tools.
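
To make the nesting problem concrete, here is a minimal sketch in
Python of the usual workaround, stand-off annotation: the text is left
untouched and each layer points into it by character offsets, so spans
are free to cross. The Span class, the layer names and the labels are
all made up for illustration; they are not taken from CLARIN or from
any existing standard or tool.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Span:
    """One stand-off annotation (hypothetical structure, not a standard)."""
    start: int   # inclusive character offset into the text
    end: int     # exclusive character offset into the text
    layer: str   # illustrative layer name
    label: str   # illustrative label


def crosses(a: Span, b: Span) -> bool:
    """True if the spans share text but neither contains the other."""
    shares_text = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return shares_text and not (a_in_b or b_in_a)


def crossing_pairs(spans):
    """All pairs of spans that could not be written as nested elements."""
    return [(a, b) for a, b in combinations(spans, 2) if crosses(a, b)]


def covered(text: str, span: Span) -> str:
    """The stretch of text a span points at."""
    return text[span.start:span.end]


text = "the old man the boats"
annotations = [
    Span(0, 11, "phrase", "X"),   # "the old man"
    Span(8, 21, "line", "Y"),     # "man the boats"
    Span(4, 7, "pos", "ADJ"),     # "old"
]

for a, b in crossing_pairs(annotations):
    print(f"{a.layer}:{covered(text, a)!r} crosses {b.layer}:{covered(text, b)!r}")
```

The first two spans cross, which inline XML would reject, while the
third nests cleanly inside the first. A tool that only cares about one
layer can simply ignore the others.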

At a recent workshop, Martin Wynne mentioned the CLARIN project
(http://www.clarin.eu/external/index.php?page=about-clarin), which
seems a great step in the right direction, but I'm not sure how complete
their set of standards is at present.

A.
