[Corpora-List] (no subject)

Sun Aug 17 17:30:39 UTC 2014

The British National Corpus Reference Guide 
http://www.natcorp.ox.ac.uk/docs/URG/index.html 
states that speaker identities were anonymized:
 " ... guarantee of confidentiality and complete anonymity (all references 
to full names and addresses have been removed from the corpus and the log)"

I assume name and address removal was done by hand-editing the text, 
but were any tests done to double-check anonymization was complete?
What instructions did the manual editors have on exactly how to identify 
and process names, addresses, etc.?

I am interested in the possibility of using the BNC as a training corpus 
for automated anonymization of other text sources, e.g. narrative text 
in medical patient records. Does this sound feasible? What pitfalls should I 
watch out for? 

Thanks for any expert advice.

Eric Atwell, School of Computing, Leeds University 
