[Corpora-List] Anything resembling TPC benchmarks for corpora?

Anil Singh anil.phdcl at gmail.com
Sun Jul 8 13:13:35 UTC 2012


I am not an expert in this matter and I have not given enough thought to
the questions raised here, but they do interest me, so I suggest two
possible (not very technical and perhaps obvious) reasons why what has
happened with DBMS has not happened with corpora:

1. A DBMS (that is, a relational DBMS) has a very simple problem to solve.
In fact the simplicity of the DBMS structure is its 'genius', if you like.
But when we say corpora, we can mean a lot of very different things of very
different complexities and properties. It may not be fair to expect the
same of, say, the CMS (Corpora Management Systems) community as has been
achieved by the DBMS community. A closer analogue to corpora would be XDMS
(XML Data Management Systems) rather than DBMS, and similarly XPath and
XSLT rather than SQL.

2. The researcher and developer community for relational databases is much
larger than that for corpora, and the support for the former is also much
greater.

Another possible reason might be that as soon as you say 'corpora' and you
include annotated corpora among them, you bring in the whole battlefield of
linguistic theories. No wonder it is hard to achieve any degree of
consensus about most of the issues. And when you include phonological
information and speech (and maybe even visual data), the problems are all
compounded.

Having said that, this is still a worthy goal and some people have been
working towards it. And (paraphrasing an old popular phrase in Indian
English), contemporary Gods permitting, there might be something at the end
of it.

- Anil Kumar Singh


On Sat, Jul 7, 2012 at 2:01 AM, Albretch Mueller <lbrtchx at gmail.com> wrote:

>  Thank you Adam
> ~
>  I read your paper on "getting to know your corpus (content based on
> the currency of keyword lists)" and I also found very interesting some
> of the referenced papers
> ~
>  and BTW, I am not claiming that everything pertaining to DBMSs is
> implemented and documented "by 'the' book", however there is a certain
> degree of common understanding and what German people call
> "Sachlichkeit" in that culture that we sorely lack. I can't quite get
> why
> ~
>  I was more like (very wishfully ;-)) thinking of something like that:
> ~
>
> http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
> ~
>  http://en.wikipedia.org/wiki/Database_testing
> ~
>  http://en.wikipedia.org/wiki/Structured_Query_Language
> ~
>  lbrtchx
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
- Anil