[Corpora-List] Anything resembling TPC benchmarks for corpora?

Albretch Mueller lbrtchx at gmail.com
Thu Jul 5 17:01:04 UTC 2012


 John et al,
~
>> I think it is healthy for us to have our own research projects/pet
>> theories, still we will also benefit greatly from having something
>> similar to Transaction Processing Council Benchmarks for corpora
>
> There is a common saying:
>
>     "Lies, damn lies, and benchmarks"
~
 which tends to be very true, but at least the DBMS culture moved past
the state we are still in by designing a common interface (SQL) to
access data. At least they had some ideas about how DBMSs should be
queried and what to expect from them
~
 When we talk about corpora, all we say is something like "this is a
4,000,000 word corpus"; we don't talk about (to name just a few
issues):
~
 * how detailed the morpho-syntactic encoding (or annotation) is
~
 * if we can also query it phonologically (say, give all syllables of
words ending in voiced d, excluding past tenses of verbs and verbs used
as adjectives)
~
 * the depth level of n-grams
~
 * if the texts the corpus is based on are represented exhaustively
(were "stop words" deleted?)
~
 * what is the CQL (Corpus Query Language) that its internal engine
"understands"
~
 * does its CQL accept model or plane switching? (Say you first look
for all instances of a certain verb, but then want to access some
metadata like a correlation of the usage of certain adverbs and their
parse trees)
~
 * can you query its named entities as such? which properties are
used to define named entities?
~
 * can its engine give full statistical correlations as query results?
~
 * are parse trees included for every sentence?
~
 * is it publicly accessible and/or editable?
~
 ...
~
 I could imagine many of you would include other features you deem important
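~
 The checklist above could be written down as a small machine-readable
profile, so that two corpora can be compared on more than their word
count. What follows is only a minimal Python sketch: the class name
CorpusProfile, all of its field names and the helper
unstated_or_missing are hypothetical, made up here for illustration,
and not part of any existing corpus tool or standard.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class CorpusProfile:
    """Hypothetical descriptor; every field name is an illustrative assumption."""
    name: str
    word_count: int
    annotation_layers: List[str] = field(default_factory=list)  # e.g. ["pos", "lemma"]
    phonological_queries: bool = False
    max_ngram_order: Optional[int] = None      # None means "not stated"
    full_text_retained: Optional[bool] = None  # True if no stop words were deleted
    query_language: Optional[str] = None       # name of the CQL the engine understands
    supports_plane_switching: bool = False
    named_entities_queryable: bool = False
    statistical_query_results: bool = False
    parse_trees_for_every_sentence: bool = False
    publicly_accessible: bool = False
    editable: bool = False


def unstated_or_missing(profile):
    """Return the names of features that are unset, empty or False."""
    skipped = {"name", "word_count"}
    return [f.name for f in fields(profile)
            if f.name not in skipped and not getattr(profile, f.name)]


# A corpus described only by its size leaves every other feature open.
bare = CorpusProfile(name="example", word_count=4_000_000)
print(unstated_or_missing(bare))
```

 A profile that states nothing beyond the word count reports every
other feature as open, which is the situation described above.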
~
>> There is no magic formula for evaluating computer performance.
>> As with so many things, the answer depends upon your point of view.
~
 I think that (partial) truth is being stretched quite a bit. We can
still gauge response times and memory usage (including its dynamics)
quite accurately
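~
 As a minimal sketch of that kind of measurement, assuming the query
engine can be called as a Python function: the code below times a call
with time.perf_counter and records peak allocation with tracemalloc,
both from the standard library. The helper measure and the stand-in
toy_query are hypothetical names; tracemalloc only sees memory
allocated by Python itself and slows the run down, so the numbers are
indicative rather than exact.

```python
import time
import tracemalloc


def measure(query, *args, repeats=5):
    """Run query(*args) `repeats` times; return (best_seconds, peak_bytes)."""
    best = float("inf")
    peak = 0
    for _ in range(repeats):
        tracemalloc.start()
        t0 = time.perf_counter()
        query(*args)
        elapsed = time.perf_counter() - t0
        _, run_peak = tracemalloc.get_traced_memory()  # (current, peak)
        tracemalloc.stop()
        best = min(best, elapsed)
        peak = max(peak, run_peak)
    return best, peak


def toy_query(tokens):
    """Stand-in for a real corpus query: count tokens ending in 'd'."""
    return sum(1 for t in tokens if t.endswith("d"))


tokens = ["word", "corpus", "tagged", "text"] * 10_000
seconds, peak_bytes = measure(toy_query, tokens)
print(f"best time: {seconds:.6f} s, peak traced memory: {peak_bytes} bytes")
```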
~
> For example, look at the Wikipedia article cited in the previous note:
>
> http://en.wikipedia.org/wiki/Transaction_Processing_Performance_Council
>
> It cites 6 earlier benchmarks that became obsolete during the period
> from 1995 to 2005.  Many of those were artificial tests such as
> "transactions per second"
~
 Well, yes, but still: benchmarking is extensively used in engineering
and the sciences
~
> Questions:  What kinds of applications use corpora?  Could any of them
> be considered "typical"?  Could they be specified in a way that would
> help evaluate any corpora software that could be considered "typical"?
> Would performance on such a specification be meaningful for anybody
> who might be searching for software (or algorithms) for some purpose?
~
 Those are some other good questions, but some of them hinge on the
adjective "typical", and using adjectives in specs is almost always
the wrong thing to do
~
> If the focus is placed on applications, a benchmark could stimulate
> innovation by encouraging developers to search for new designs.
> But a focus on the features of a particular technique could lead
> to stagnation if the developers just learn how to "pass the test".
~
 exactly!
~
 lbrtchx
