[Corpora-List] Anonymisation tools for Latin languages? Named Entity automatic transformation tools

Christophe Reffay christophe.reffay at ens-cachan.fr
Tue Sep 20 09:20:35 UTC 2011


As a newcomer (and not a linguist), I hope I use the right words to ask my 
question:

Many researchers around me are (finally) willing to share their research 
data, BUT they cannot for privacy reasons.
The corpora I'm interested in are textual messages (forums, chats, 
SMS, or even blogs, wikis, etc.).

A very common problem can be described as follows:
Given a data set of millions of textual (human) interactions that include 
identifiers in a header (user name, e-mail address, etc.) and (more 
problematically) first names, last names (possibly misspelled), phone 
numbers, e-mail or postal addresses... in the message body, I want 
to transform each of them into a marked-up entity.

Example of target transformation:
"Christophe Reffay" could be transformed to <ActorRef actorcode="R007"> 
<Firstname modified="no">Christophe</Firstname> <Lastname 
modified="yes">Durand</Lastname></ActorRef>
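
To make the target format concrete, here is a minimal Python sketch that produces this kind of markup. The function name mark_actor and its parameters are my own assumptions for illustration; only the element and attribute names come from the example above.

```python
from xml.sax.saxutils import escape, quoteattr


def mark_actor(firstname, lastname, actorcode,
               new_firstname=None, new_lastname=None):
    """Build an <ActorRef> element for one person mention.

    A name part is flagged modified="yes" only when a replacement
    value is supplied for it (hypothetical convention).
    """
    def part(tag, original, replacement):
        modified = "yes" if replacement is not None else "no"
        text = replacement if replacement is not None else original
        # escape() protects &, < and > in the element text
        return f'<{tag} modified="{modified}">{escape(text)}</{tag}>'

    return (
        f"<ActorRef actorcode={quoteattr(actorcode)}>"
        f"{part('Firstname', firstname, new_firstname)} "
        f"{part('Lastname', lastname, new_lastname)}"
        "</ActorRef>"
    )


if __name__ == "__main__":
    print(mark_actor("Christophe", "Reffay", "R007", new_lastname="Durand"))
```

Keeping the first name and replacing only the last name is just one possible policy; whether that is enough for privacy depends on the corpus.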

Parts of the problem I have identified so far:
- Find, identify and extract identifiers and named entities (actors, 
locations, ...)
- Blur the critical (known) identifiers
- Find the (more or less local) textual context and extract possible new 
entities...
- Provide a process for systematic transformation
- Evaluate the quality of the "anonymization"
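
As a rough illustration of the "find" and "blur" steps, here is a small Python sketch using only the standard library. Everything in it is an assumption on my side: the Pseudonymizer class, the <Entity> tag, the code format (N001, P001, E001) and the two regular expressions, which are deliberately simplistic. It replaces known names, e-mail addresses and phone-like numbers with stable codes; it does not handle misspelled names or discover new entities, which would need a real named entity recognizer.

```python
import re

# Simplistic patterns, for illustration only.
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
PHONE = r"\+?\d[\d .()/-]{7,}\d"


class Pseudonymizer:
    """Replace known identifiers with stable codes (hypothetical design)."""

    def __init__(self, known_names=()):
        parts = [("EMAIL", EMAIL), ("PHONE", PHONE)]
        # Longest names first, so a full name wins over a first name alone.
        names = sorted(known_names, key=len, reverse=True)
        if names:
            alternatives = "|".join(re.escape(n) for n in names)
            parts.append(("NAME", r"\b(?:" + alternatives + r")\b"))
        self._pattern = re.compile(
            "|".join(f"(?P<{kind}>{pat})" for kind, pat in parts),
            re.IGNORECASE,
        )
        self._codes = {}  # kind -> {normalized surface form -> code}

    def _code(self, kind, surface):
        table = self._codes.setdefault(kind, {})
        key = surface.lower()
        if key not in table:
            table[key] = f"{kind[0]}{len(table) + 1:03d}"
        return table[key]

    def transform(self, text):
        """Single pass over the text, so inserted markup is never rescanned."""
        def replace(match):
            kind = match.lastgroup
            code = self._code(kind, match.group())
            return f'<Entity type="{kind}" code="{code}"/>'
        return self._pattern.sub(replace, text)


if __name__ == "__main__":
    p = Pseudonymizer(known_names=["Christophe Reffay"])
    print(p.transform(
        "Christophe Reffay: call +33 (0)1 47 40 76 15 "
        "or mail c.reffay@example.org. christophe reffay agreed."
    ))
```

The same surface form always gets the same code, so interactions between participants stay traceable after the transformation. Evaluating the result would still need a hand-annotated sample to count missed and wrongly replaced identifiers.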

Could some of you give 'us':
- A list of scientific references that study this problem?
- Some available (and free to try) tools that support such a job?
or...
Does every researcher transform their own data in an /ad hoc/ way?
Have researchers simply given up on such processing?

I'm interested in tools for messages that may mix several 
Romance languages (French, Spanish, Italian, ...).
I'm especially interested in tools able to handle any language, but 
French and English are the most frequently used languages...

I'm not a linguist (sorry) => Could I present this problem as 
"Named Entity Transformation" methods and tools?
What other keywords could help me find what I'm looking for?

Thanks in advance for your help.

-- 
*Christophe Reffay* - Computer Scientist
UMR Sciences Techniques Education Formation: IFE ENS-Lyon / ENS-Cachan
Tel: +33 (0)1 47 40 76 15
Surface mail: /UMR STEF Bat. Cournot - ENS de Cachan - 61, avenue du 
Président Wilson - 94235 Cachan Cedex/
Web: http://www.stef.ens-cachan.fr/annur/reffay.htm
------------------------------------------------------------------------
You want to share your research data? Visit the Mulce project 
<http://ubpweb.univ-bpclermont.fr/HEBERGES/mulce/?lang=en>!
You want to apply tools to your forums? Please consider the Calico 
Project <http://www.stef.ens-cachan.fr/calico/en/calico.htm>, Platform 
<http://woops.crashdump.net/calico/index.php?lang=en> and tools 
<http://www.stef.ens-cachan.fr/calico/en/tools.htm>!