<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
As a newcomer (not a linguist), I hope I will find the right words to
ask my question:<br>
<br>
Many researchers around me are (finally) willing to share their
research data, BUT they cannot do so for privacy reasons.<br>
The corpora I am interested in are textual messages (forums,
chat, SMS, or even blogs, wikis, etc.).<br>
<br>
A very common problem can be described as follows:<br>
Given a data set of millions of textual (human) interactions that include
identifiers in the header (user name, e-mail address, etc.) and,
more problematically, first names, last names (possibly misspelled),
phone numbers, and e-mail or postal addresses in the message
body, I want to transform each of these into a marked-up entity.<br>
<br>
Example of the target transformation:<br>
"Christophe Reffay" could be transformed into &lt;ActorRef
actorcode="R007"&gt; &lt;Firstname
modified="no"&gt;Christophe&lt;/Firstname&gt; &lt;Lastname
modified="yes"&gt;Durand&lt;/Lastname&gt;&lt;/ActorRef&gt;<br>
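<br>
To make the target format concrete, here is a minimal Python sketch
that builds such an element with the standard xml.etree.ElementTree
module. The function name make_actor_ref and its parameters are my own
assumptions for illustration, not part of any existing tool; only the
element and attribute names come from the example above.<br>
<pre>
```python
import xml.etree.ElementTree as ET


def make_actor_ref(first, last, actor_code, new_first=None, new_last=None):
    """Build an ActorRef element (hypothetical helper, illustration only).

    A name part that is replaced by a pseudonym is flagged modified="yes";
    a part that is kept as-is is flagged modified="no".
    """
    ref = ET.Element("ActorRef", {"actorcode": actor_code})
    parts = (("Firstname", first, new_first), ("Lastname", last, new_last))
    for tag, original, replacement in parts:
        flag = "yes" if replacement else "no"
        child = ET.SubElement(ref, tag, {"modified": flag})
        child.text = replacement or original
    return ref


# Keep the first name, replace the last name, as in the example above.
ref = make_actor_ref("Christophe", "Reffay", "R007", new_last="Durand")
print(ET.tostring(ref, encoding="unicode"))
```
</pre>
Building the element through an XML library rather than by string
concatenation should keep the markup well-formed even when a name
contains special characters.<br>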
<br>
Parts of the problem I have identified:<br>
- Find, identify and extract identifiers and named entities (actors,
locations, ...)<br>
- Blur the critical (known) identifiers<br>
- Examine the (more or less local) textual context to extract possible new
entities<br>
- Provide a process for systematic transformation<br>
- Evaluate the quality of the "anonymization"<br>
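<br>
As a rough illustration of the "find", "blur" and "evaluate" steps
listed above, here is a minimal Python sketch. It only handles e-mail
addresses and phone numbers with two simplistic regular expressions of
my own (assumptions, certainly not exhaustive) and would miss names and
misspellings entirely; the helpers blur and recall, the placeholder
format and the sample message are all made up for the example.<br>
<pre>
```python
import re

# Deliberately simple patterns (assumptions for illustration only).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def blur(text, codes):
    """Replace e-mail addresses and phone numbers with placeholder codes.

    The same identifier always receives the same code, so that links
    between messages survive the transformation.
    """
    def substitute(kind):
        def repl(match):
            key = (kind, match.group(0))
            codes.setdefault(key, "[%s_%03d]" % (kind, len(codes) + 1))
            return codes[key]
        return repl

    text = EMAIL.sub(substitute("EMAIL"), text)
    return PHONE.sub(substitute("PHONE"), text)


def recall(found, expected):
    """Share of hand-annotated identifiers that the process caught."""
    return len(set(found).intersection(expected)) / len(expected)


codes = {}
message = ("Write to alice@example.org or call +33 1 23 45 67 89. "
           "Again: alice@example.org")
blurred = blur(message, codes)
print(blurred)

# Compare what was caught with a (made-up) manual annotation.
caught = [value for (_, value) in codes]
print(recall(caught, ["alice@example.org", "Durand"]))
```
</pre>
Measuring recall against a small hand-annotated sample is only one
possible way to evaluate the quality of the result; the last name in
the manual annotation is missed here because no pattern looks for
names.<br>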
<br>
Could some of you give us:<br>
- A list of scientific references that study this problem?<br>
- Some available (and free to try) tools to support such a job?<br>
or...<br>
Does every researcher transform their own data in an <i>ad hoc</i>
way?<br>
Or have researchers simply abandoned such processing?<br>
<br>
I'm interested in tools for messages that may mix several
(Romance) languages (French, Spanish, Italian, ...).<br>
I'm especially interested in tools able to handle any language, but
French and English are the most frequently used languages...<br>
<br>
I'm not a linguist (sorry), so: could I describe this problem as <br>
"Named Entity Transformation" methods and tools? <br>
What other keywords could help me find what I'm looking for?<br>
<br>
Thanks in advance for your help.<br>
<br>
<div class="moz-signature">-- <br>
<b>Christophe Reffay</b> - Computer Scientist <br>
UMR Sciences Techniques Education Formation: IFE ENS-Lyon /
ENS-Cachan <br>
Tel: +33 (0)1 47 40 76 15 <br>
Surface mail: <em>UMR STEF Bat. Cournot - ENS de Cachan - 61,
avenue du Président Wilson - 94235 Cachan Cedex</em><br>
Web: <a href="http://www.stef.ens-cachan.fr/annur/reffay.htm">http://www.stef.ens-cachan.fr/annur/reffay.htm</a><br>
<hr>
You want to share your research data? Visit the <a
href="http://ubpweb.univ-bpclermont.fr/HEBERGES/mulce/?lang=en">Mulce
project</a>!<br>
You want to apply tools to your forums? Please consider the Calico
<a href="http://www.stef.ens-cachan.fr/calico/en/calico.htm">Project</a>,
<a href="http://woops.crashdump.net/calico/index.php?lang=en">Platform</a>
and <a href="http://www.stef.ens-cachan.fr/calico/en/tools.htm">tools</a>!
</div>
</body>
</html>