[Corpora-List] Legal aspects of compiling corpora

Khalid CHOUKRI choukri at elda.fr
Thu Jun 19 12:10:18 UTC 2003


On Thursday 19/06/2003 at 10:26, Sampo Nevalainen wrote:
>Hi,
>
>>then we will face another problem of comparing approaches and techniques, 
>>if each of us use different corpora (without any possibility to share it 
>>with others because of the legal aspects) then no comparison will be possible.
>
>My comment is clearly off topic, but I could not resist... This is one 
>thing I have not fully understood ever since I was irrevocably taken with 
>CL. Many textbooks on CL give the idea that a corpus should have a finite 
>size and be "a standard reference" (as McEnery and Wilson put it in 
>"Corpus Linguistics" 1996). In my humble opinion, this is rather 
>unnatural, as, after all, we are studying an open, ever-growing, dynamic, 
>lively organism (unless we are interested in "dead" languages). From this 
>viewpoint, if we are going to generalize anything about a language, at 
>least I would have more confidence in results that are based on several 
>different corpora rather than on a detailed description of a certain 
>corpus. Just as weather forecasts or climate studies -- the more 
>measurement points are available the more reliable they are. (Clearly, one 
>practical solution is a kind of "monitor corpus" -- or the Internet. I 
>understand that the cruciality of this question depends a lot on the 
>purpose(s) of the corpus and the aim(s) of the researcher, which, I think, 
>should be convergent to some extent.) Of course, the other side of the 
>coin is economy. It would be a huge waste of money and resources if 
>everybody should compile corpora of their own - and preferably non-stop!
>
>sincerely
>Sampo


Dear Sampo

Since you mentioned weather forecasts, I am sure you will understand when I
say that today it is 19° here in Paris, while CNN says that it is 67.
If we do not share the same scale, no comparison is possible. For corpora,
we have seen that when evaluating taggers (as an example), some people may
announce that they are 85% accurate, while others may claim that their
algorithms outperform these and achieve 90%. The only way to compare such
figures is to share the corpora and the metrics.
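
To make the point concrete, here is a minimal sketch in Python of what
"sharing the corpus and the metric" could look like for taggers. Everything
in it is an assumption made for illustration: the function name
tagging_accuracy, the ten-token gold standard, and the two tagger outputs
are hypothetical, and token-level accuracy is only one possible metric.

```python
def tagging_accuracy(gold, predicted):
    """Token-level accuracy: fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty corpus")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical shared gold-standard tags for a ten-token test set.
gold = ["DET", "NOUN", "VERB", "DET", "ADJ",
        "NOUN", "ADP", "DET", "NOUN", "PUNCT"]

# Hypothetical outputs of two taggers run on the SAME ten tokens.
tagger_a = ["DET", "NOUN", "VERB", "DET", "NOUN",
            "NOUN", "ADP", "DET", "VERB", "PUNCT"]
tagger_b = ["DET", "NOUN", "VERB", "DET", "ADJ",
            "NOUN", "ADP", "DET", "VERB", "PUNCT"]

# Both scores use the same test data and the same metric, so they are comparable.
score_a = tagging_accuracy(gold, tagger_a)
score_b = tagging_accuracy(gold, tagger_b)
print(f"tagger A: {score_a:.0%}, tagger B: {score_b:.0%}")
```

Had the two taggers been scored on different test sets, or with different
tag sets or metrics, the two percentages would tell us as little as 19 and
67 do without a scale.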

But of course the corpora should grow and be updated regularly (and then we 
face the economic and financial issues you pointed out).

So I am sure there is a need for common corpora for as many languages as 
possible.  But "common" should also take into account the requirements of 
many researchers, so that a consensus on their needs can be reached.


You may want to look at the description of the BLARK concept (Basic 
Language Resources Kit) at http://www.elda.fr under Projects > European & 
International (http://www.elda.fr/article.php3?id_article=48),
or at the report drafted in the framework of the European project ENABLER 
(European National Activities for Basic Language Resources, Thematic 
Network), Deliverable D5.1: "Report on a (minimal) set of LRs to be made 
available for as many languages as possible, and map of the actual gaps"
(report D5.1 at http://www.enabler-network.org/reports.htm).


All the best
Khalid



*************************************************************
Khalid CHOUKRI  mailto:choukri at elda.fr
ELRA CEO
Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30
Postal Mail: 55 Rue Brillat-Savarin, 75013 Paris France
Home page:  http://www.elda.fr/ or http://www.elra.info/
LREC News: http://www.lrec-conf.org/
*************************************************************** 