[Corpora-List] Legal issue - privacy protection

Mcenery, Tony a.mcenery at lancaster.ac.uk
Thu Oct 4 08:49:45 UTC 2012


Dear Simon,

I have not encountered that approach to privacy before and find it somewhat perverse as the privacy is clearly breached already.

The only analogies I can think of relate to non-corpus cases, notably the discussion in the UK over the summer of whether a picture of a member of royalty (naked) should be printed in the UK Press. It was noted at the time that the pictures were available elsewhere on the web, but this did not stop a discussion regarding whether a legislative bubble should isolate the .uk domain, so to speak. That case did not lead to a legal ruling, but there are similar examples which did, I guess - where something is not legal in country A, but is in country B, so cannot be viewed legally on the web in country A, but is visible from country A on sites in country B.

Copyright might be an area where legislative bubbles could give rise to an issue directly similar to that which you have encountered - there are different jurisdictions in operation which permit different behaviours. Might be worth looking at more closely.

Prior to the internet, cases like yours were more common - 'Spycatcher' was a book banned in the UK which was freely available elsewhere (or at least in Australia, from memory). That led to legislative fun and games. So legislative bubbles like this were known in the pre-internet age and do crop up in the internet age also. If you are in the bubble I guess there is little you can do but comply. I daresay there may be exciting and imaginative ways of trying to sidestep the bubble, but I would take very careful advice before you tried any of those ideas, if I were you. Sorry to be unhelpful (and at some length!).

Best wishes,

Tony

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Simon Krek
Sent: 03 October 2012 20:47
To: Corpora at uib.no
Subject: [Corpora-List] Legal issue - privacy protection

Dear all,

I would like to ask for your help with a legal problem that cropped up in Slovenia this summer and might also be of interest to others.

In July, the Slovenian Information Commissioner (https://www.ip-rs.si/?id=195) issued a decision on the "Nova beseda" corpus (http://bos.zrc-sazu.si/a_beseda.html), which contains 318 million words from newspapers, magazines, books etc. and is available in a web concordancer, accessible without authentication. The decision contains the obligation that all personal names in the corpus should be either anonymised or excluded from the results in the online concordancer because of the protection of personal data (mainly in newspaper articles). After some negotiation it is now possible to search for a name but not for a combination of names (and/or surnames). The list of prohibited combinations is based on the first name and family name database of the Statistical Office of the Republic of Slovenia. For instance, if you search for a combination of my name and surname, you get the following result:

http://bos.zrc-sazu.si/c/ada.exe?hits_shown=100&clm=22&crm=22&expression=simon+krek&clm=22&crm=22&wth=0&hits_shown=100&sel=%28all%29&name=a.
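
To make the restriction concrete, here is a minimal sketch of how such a query filter might work. It is only an illustration of one reading of the rule described above (a single name is allowed, two or more listed names or surnames in one query are refused); it is not the code used by the Nova beseda concordancer. The function name and the tiny name sets are hypothetical stand-ins for the Statistical Office database.

```python
def is_query_allowed(query, first_names, surnames):
    """Return True if the query may be run, False if it must be refused.

    Assumption: a query is refused when it contains two or more tokens
    found in the first-name or surname lists. The real rule may differ.
    """
    known_names = first_names | surnames
    tokens = query.casefold().split()
    name_hits = [token for token in tokens if token in known_names]
    return len(name_hits) < 2


# Hypothetical miniature lists standing in for the official database.
FIRST_NAMES = {"simon"}
SURNAMES = {"krek"}

for example in ["krek", "simon krek", "nova beseda"]:
    print(example, is_query_allowed(example, FIRST_NAMES, SURNAMES))
```

Under this reading, a search for a surname alone goes through, while the same surname combined with a first name is refused regardless of word order.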

In our corpora community, we view this solution as unacceptable as it severely limits the use of corpora on the web and, on the other hand, brings no additional protection of privacy, as the same information is available through search engines which are outside the jurisdiction of the Slovenian Information Commissioner.

My question is whether anybody involved in corpus creation has encountered or considered this kind of problem before us. I am interested in any experience that involves **protecting personal privacy in corpus material already published before** which is simultaneously accessible in (digital) libraries and most of it also elsewhere on the web in archives of particular newspapers etc. Perhaps it should be emphasized that this is NOT in any way a question of copyright or the status of web crawled data in WaCs; it concerns only the laws on protection of personal data.

I Google-translated the decision and put it on my page: http://www.simonkrek.si/blog/decision/index.html (the original is linked on the same page).

The main ideas in the decision are the following:
- although all the material in the corpus had already been published before and can be found in libraries and in archives of particular newspapers/magazines, the corpus represents a NEW STRUCTURED collection which contains personal data, and as such it cannot be compared with the original publication in newspaper/magazine, which had a different PURPOSE
- a very important issue in this decision is "EASE OF ACCESS", as it takes only a few seconds to find personal data in the corpus whereas more effort is needed to access or collect the same data in newspaper articles in libraries or other places.

I would be very grateful for hints about any comparable legal considerations or decisions elsewhere, particularly in EU countries.

Best regards,
Simon Krek


-----------------------
Amebis, d.o.o., Kamnik
Bakovnik 3
SI-1241 Kamnik
Slovenia

Jozef Stefan Institute
Artificial Intelligence Laboratory
Jamova 39
SI-1000 Ljubljana
Slovenia

skype: simon.krek.jsi
twitter: @SimonKrek
-----------------------
http://www.simonkrek.si/
http://www.slovenscina.eu/

