[Corpora-List] Legal issue - privacy protection

Simon Krek kreks at siol.net
Wed Oct 3 19:47:27 UTC 2012


Dear all,

 

I would like to ask for your help with a legal problem that came up in
Slovenia this summer and may also be of interest to others.

 

In July, the Slovenian Information Commissioner
(https://www.ip-rs.si/?id=195) issued a decision on the "Nova beseda" corpus
(http://bos.zrc-sazu.si/a_beseda.html), which contains 318 million words from
newspapers, magazines, books etc. and is available through a web concordancer
accessible without authentication. The decision requires that all personal
names in the corpus be either anonymised or excluded from the results in the
online concordancer in order to protect personal data (mainly in newspaper
articles). After some negotiation, it is now possible to search for a single
name but not for a combination of names (and/or surnames). The list of
prohibited combinations is based on the first name and family name database
of the Statistical Office of the Republic of Slovenia. For instance, if you
search for the combination of my first name and surname, you get the
following result:

 

http://bos.zrc-sazu.si/c/ada.exe?hits_shown=100&clm=22&crm=22&expression=simon+krek&clm=22&crm=22&wth=0&hits_shown=100&sel=%28all%29&name=a
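
To illustrate the kind of restriction described above, here is a minimal
sketch in Python. It is not the concordancer's actual implementation, which I
have not seen; the function name is_blocked and the two small name sets are
hypothetical stand-ins for the Statistical Office database, and I assume for
simplicity that only a first name combined with a surname (in either order)
is blocked.

```python
# Hypothetical sample data; the real list is said to be derived from the
# first name and family name database of the Statistical Office.
FIRST_NAMES = {"simon", "ana", "marko"}
SURNAMES = {"krek", "novak", "horvat"}


def is_blocked(query: str) -> bool:
    """Return True if the query pairs a listed first name with a listed surname.

    Assumption: a single name on its own stays searchable, and the order of
    first name and surname does not matter.
    """
    tokens = query.lower().split()
    # Look for two different token positions, one holding a first name and
    # the other a surname.
    return any(
        a in FIRST_NAMES and b in SURNAMES
        for i, a in enumerate(tokens)
        for j, b in enumerate(tokens)
        if i != j
    )


if __name__ == "__main__":
    for q in ["krek", "simon", "simon krek", "Krek Simon", "nova beseda"]:
        print(q, "->", "blocked" if is_blocked(q) else "allowed")
```

Under these assumptions, a search for either name alone goes through, while
the two together are rejected, which matches the behaviour visible at the
link above.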

 

In our corpora community, we view this solution as unacceptable as it
severely limits the use of corpora on the web and, on the other hand, brings
no additional protection of privacy, as the same information is available
through search engines which are outside the jurisdiction of the Slovenian
Information Commissioner.

 

My question is whether anybody involved in corpus creation has encountered or
considered this kind of problem before us. I am interested in any experience
that involves **protecting personal privacy in corpus material that has
already been published** and that is simultaneously accessible in (digital)
libraries, most of it also elsewhere on the web in the archives of particular
newspapers etc. Perhaps it should be emphasized that this is NOT in any way
a question of copyright or of the status of web-crawled data in WaCs; it
concerns only the laws on the protection of personal data.

 

I Google-translated the decision and put it on my page:
http://www.simonkrek.si/blog/decision/index.html (the original is linked on
the same page).

 

The main ideas in the decision are the following:

- although all the material in the corpus had already been published before
and can be found in libraries and in the archives of particular
newspapers/magazines, the corpus represents a NEW STRUCTURED collection
which contains personal data, and as such it cannot be compared with the
original publication in a newspaper/magazine, which had a different PURPOSE.

- a very important issue in this decision is "EASE OF ACCESS", as it takes
only a few seconds to find personal data in the corpus, whereas more effort
is needed to access or collect the same data from newspaper articles in
libraries or other places.

 

I would be very grateful for hints about any comparable legal considerations
or decisions elsewhere, particularly in EU countries. 

 

Best regards,

Simon Krek

 

 

-----------------------
Amebis, d.o.o., Kamnik
Bakovnik 3
SI-1241 Kamnik
Slovenia

Jozef Stefan Institute
Artificial Intelligence Laboratory
Jamova 39
SI-1000 Ljubljana
Slovenia

skype: simon.krek.jsi

twitter: @SimonKrek
-----------------------
http://www.simonkrek.si/
http://www.slovenscina.eu/

 

 
