<html>
<body>
On Thursday 19/06/2003 at 10:26, Sampo Nevalainen wrote:<br>
<blockquote type=cite class=cite cite>Hi,<br><br>
<blockquote type=cite class=cite cite>then we will face another problem
of comparing approaches and techniques, if each of us use different
corpora (without any possibility to share it with others because of the
legal aspects) then no comparison will be possible.</blockquote><br>
My comment is clearly out of topic, but I could not resist... This is one
thing I have not fully understood ever since I was irrevocably taken with
CL. Many text books on CL give an idea that a corpus should have a finite
size and be "a standard reference" (as McEnery and Wilson put
it in "Corpus Linguistics" 1996). In my humble opinion, this is
rather unnatural, as, after all, we are studying an open, ever-growing,
dynamic, lively organism (unless we are interested in "dead"
languages). From this viewpoint, if we are going to generalize anything
about a language, at least I would have more confidence in results that
are based on several different corpora rather than on a detailed
description of a certain corpus. Just as weather forecasts or climate
studies -- the more measurement points are available the more reliable
they are. (Clearly, one practical solution is a kind of "monitor
corpus" -- or the Internet. I understand that the cruciality of this
question depends a lot on the purpose(s) of the corpus and the aim(s) of
the researcher, which, I think, should be convergent to some extent.) Of
course, the other side of the coin is economy. It would be a huge waste
of money and resources if everybody should compile corpora of their own -
and preferably non-stop!<br><br>
sincerely<br>
Sampo<br>
</blockquote><br><br>
Dear Sampo<br><br>
Since you mentioned weather forecasts, I am sure you will understand when I say
that today it is 19° here in Paris, while CNN says that it is 67.
<br>
If we do not share the same scale, no comparison is possible. With corpora we
have seen this when evaluating taggers, for example: some people announce that
they are 85% accurate, while others claim that their algorithms outperform these
and achieve 90%. The only way to compare such figures is to share the corpora
and the metrics.<br><br>
But of course the corpora should grow and be updated regularly (and then
we face the economic and financial issues you pointed out).<br><br>
So I am sure there is a need for common corpora for as many languages
as possible. But "common" also means taking into consideration the
requirements of many researchers, so as to reach a consensus on their
needs.<br><br>
<br>
You may want to look at the description of the BLARK concept (Basic
Language Resources Kit) <br>
(see http://www.elda.fr) under
<b><font face="Times New Roman, Times" size=4>&gt; Projects</font>
<font face="Times New Roman, Times" size=4>&gt; European &amp; International</font> (<a href="http://www.elda.fr/article.php3?id_article=48" eudora="autourl">http://www.elda.fr/article.php3?id_article=48</a>)</b><br>
or at the report drafted within the framework of the European project ENABLER (European National Activities for Basic Language Resources, Thematic Network): <br>
Deliverable D5.1, "Report on a (minimal) set of LRs to be made available for as many languages as possible, and map of the actual gaps"<br><br>
(report D5.1 at http://www.enabler-network.org/reports.htm)<br><br>
<br>
All the best<br>
Khalid<br><br>
<br>
<br>
*************************************************************<br>
Khalid CHOUKRI <a href="mailto:choukri@elda.fr" eudora="autourl">mailto:choukri@elda.fr</a><br>
ELRA CEO<br>
Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30<br>
Postal Mail: 55 Rue Brillat-Savarin, 75013 Paris France<br>
Home page: <a href="http://www.elda.fr/" eudora="autourl">http://www.elda.fr/</a> or <a href="http://www.elra.info/" eudora="autourl">http://www.elra.info/</a><br>
LREC News: <a href="http://www.lrec-conf.org/" eudora="autourl"><font color="#FF00FF">http://www.lrec-conf.org/</font></a><br>
***************************************************************
</body>
</html>