[Corpora-List] Anonymisation tools for latin languages? Named Entity automatic transformation tools
Christophe Reffay
christophe.reffay at ens-cachan.fr
Tue Sep 20 09:20:35 UTC 2011
As a newcomer (not a linguist), I hope I will find the right words to ask my
question:
Many researchers around me are (finally) willing to share their research
data BUT they can't, for privacy reasons.
The corpora I'm interested in are textual messages (forums, chats,
SMS, or even blogs, wikis, etc.).
A very recurrent problem could be described as follows:
Given a data set of millions of textual (human) interactions including
identifiers in a header part (user name, e-mail address, etc.) and (more
problematically) first names, last names (possibly misspelled), phone
numbers, e-mail or surface mail addresses... in the message body, I want
to transform each of them into a marked-up entity.
Example of target transformation:
"Christophe Reffay" could be transformed to <ActorRef actorcode="R007">
<Firstname modified="no">Christophe</Firstname> <Lastname
modified="yes">Durand</Lastname></ActorRef>
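To make the target concrete, here is a minimal sketch in Python of that
transformation. It assumes a hypothetical lookup table (KNOWN_ACTORS) that
maps each known real name to an actor code and a replacement last name; the
table, the function name mark_actor and the exact-spelling matching are my
own assumptions for illustration, not an existing tool.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lookup table (an assumption for this sketch):
# real (firstname, lastname) -> (actor code, replacement lastname).
KNOWN_ACTORS = {
    ("Christophe", "Reffay"): ("R007", "Durand"),
}

def mark_actor(text: str) -> str:
    """Replace each exactly spelled known full name with an <ActorRef> element."""
    for (first, last), (code, new_last) in KNOWN_ACTORS.items():
        # Match "Firstname Lastname" separated by any whitespace.
        pattern = re.compile(
            r"\b" + re.escape(first) + r"\s+" + re.escape(last) + r"\b"
        )
        replacement = (
            f"<ActorRef actorcode={quoteattr(code)}>"
            f'<Firstname modified="no">{escape(first)}</Firstname> '
            f'<Lastname modified="yes">{escape(new_last)}</Lastname>'
            "</ActorRef>"
        )
        # A function is used as the replacement so that backslashes in
        # names are never interpreted as regex group references.
        text = pattern.sub(lambda m: replacement, text)
    return text

print(mark_actor("Message from Christophe Reffay"))
```

This only covers names that are already known and correctly spelled; the
misspelled or previously unseen names are exactly the hard part of the
question.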
Parts of the problem I identified...:
- Find, identify and extract: identifiers, named entities (actors,
locations, ...)
- Blur the critical (known) identifiers
- Find the (more or less local) textual context and extract possible new
entities...
- Provide a process for systematic transformation
- Evaluate the quality of "anonymization"
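For the pattern-like identifiers (e-mail addresses, phone numbers), the
find / blur / evaluate steps above could be sketched as follows. This is
only a rough illustration under my own assumptions: the two regular
expressions are deliberately simple, the replacement codes such as
[email_1] are an invented convention, and the evaluation is plain recall
against a hand-annotated sample. The address and phone number in the
example are fictitious.

```python
import re

# Deliberately simple patterns (assumptions); real data needs broader ones.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")

def find_identifiers(text):
    """Return sorted (start, end, kind) spans for e-mails and phone numbers."""
    spans = [(m.start(), m.end(), "email") for m in EMAIL_RE.finditer(text)]
    spans += [(m.start(), m.end(), "phone") for m in PHONE_RE.finditer(text)]
    return sorted(spans)

def blur(text, codes):
    """Replace each detected identifier with a stable code.

    `codes` maps original value -> code and is shared across messages,
    so the same identifier always receives the same code.
    """
    out, last = [], 0
    for start, end, kind in find_identifiers(text):
        if start < last:
            continue  # skip a span overlapping one already replaced
        value = text[start:end]
        if value not in codes:
            codes[value] = f"[{kind}_{len(codes) + 1}]"
        out.append(text[last:start])
        out.append(codes[value])
        last = end
    out.append(text[last:])
    return "".join(out)

def recall(found_spans, gold_spans):
    """Share of hand-annotated (gold) spans that were found exactly."""
    gold = set(gold_spans)
    return len(gold & set(found_spans)) / len(gold) if gold else 1.0

codes = {}
print(blur("Write to jean.dupont@example.org or call +33 1 23 45 67 89.", codes))
```

Names of persons and places cannot be found this way; they need a
dictionary or a trained named entity recognizer, which is the kind of tool
I am asking about.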
Could some of you give 'us':
- A list of scientific references that study such a problem?
- Some available (and free to try) tools to support such a job?
or...
Does every researcher transform their own data in an /ad hoc/ way?
Have researchers simply abandoned such processing?
I'm interested in tools for messages possibly including multiple
(Romance) languages (French, Spanish, Italian, ...).
I'm especially interested in tools able to handle any language, but
French and English are the most frequently used languages...
I'm not a linguist (sorry) => Could I present this problem as
"Named Entity Transformation" methods and tools?
What other keywords could help me find what I'm looking for?
Thanks in advance for your help,
--
*Christophe Reffay* - Computer Scientist
UMR Sciences Techniques Education Formation: IFE ENS-Lyon / ENS-Cachan
Tel: +33 (0)1 47 40 76 15
Surface mail: /UMR STEF Bat. Cournot - ENS de Cachan - 61, avenue du
Président Wilson - 94235 Cachan Cedex/
Web: http://www.stef.ens-cachan.fr/annur/reffay.htm
------------------------------------------------------------------------
You want to share your research data? Visit the Mulce project
<http://ubpweb.univ-bpclermont.fr/HEBERGES/mulce/?lang=en>!
You want to apply tools to your forums? Please consider the Calico
Project <http://www.stef.ens-cachan.fr/calico/en/calico.htm>, Platform
<http://woops.crashdump.net/calico/index.php?lang=en> and tools
<http://www.stef.ens-cachan.fr/calico/en/tools.htm>!