[Corpora-List] Gross language detection
Jose Maria Gomez Hidalgo
jmgomez at dinar.esi.uem.es
Wed Jan 8 10:27:44 UTC 2003
Dear all
As part of a classified ads posting system, a group of natural language
processing students under my supervision has to develop a gross language
detection system for Spanish. I do not know whether there is any
work in this area (except maybe [1]).
Do you have any ideas on how to do this?
It seems rather heuristic, but my basic idea is:
1. To build a dictionary of forbidden words (f**k, etc.)
2. To develop a set of regular expressions that detect variations
of the forbidden words (e.g. if "xyzt" is a forbidden word, then we have to
detect "XyZt", "X_Y_Z_T" or small letter changes used in slang, such as a
"k" instead of a "c").
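
To make the two steps concrete, here is a minimal sketch in Python of what I have in mind. It is only an illustration under assumptions: the word list, the substitution table, the set of separator characters and the function names (build_pattern, find_forbidden) are all hypothetical, and a real system would need a much larger dictionary and table.

```python
import re

# Hypothetical word list; "xyzt" is the placeholder used in the message.
FORBIDDEN = ["xyzt", "cxyz"]

# Hypothetical slang substitutions: each letter maps to the characters
# that may stand in for it (here only "k" for "c" and vice versa).
SUBSTITUTIONS = {"c": "ck", "k": "kc"}

# Characters allowed between letters, as in "X_Y_Z_T" (an assumed set).
SEPARATORS = r"[\s_.*-]*"


def build_pattern(word):
    """Compile a regex matching `word` and its simple disguised variants."""
    letters = []
    for ch in word.lower():
        alternatives = SUBSTITUTIONS.get(ch, ch)
        letters.append("[" + re.escape(alternatives) + "]")
    body = SEPARATORS.join(letters)
    # Require a non-alphanumeric character (or the text boundary) on both
    # sides, so the word is not matched inside a longer word.
    return re.compile(r"(?<![^\W_])" + body + r"(?![^\W_])", re.IGNORECASE)


PATTERNS = {word: build_pattern(word) for word in FORBIDDEN}


def find_forbidden(text):
    """Return the forbidden words detected in `text`, in dictionary order."""
    return [word for word, pattern in PATTERNS.items() if pattern.search(text)]


print(find_forbidden("Vendo X_Y_Z_T barato"))  # ['xyzt']
print(find_forbidden("kxyz"))                  # ['cxyz']
```

Case differences are handled by re.IGNORECASE, inserted separators by the optional separator class between letters, and letter swaps by one character class per letter. Allowing whitespace as a separator also means that a forbidden word spelled across several short tokens is flagged, which may produce false positives on ordinary text.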
Thank you for your help
Jose Maria
_______________________________________________________________________________
Jose Maria Gomez Hidalgo
Departamento de Inteligencia Artificial
Universidad Europea de Madrid
28670 - Villaviciosa de Odon - MADRID
(+34) 912115670
jmgomez at dinar.esi.uem.es
http://www.esi.uem.es/~jmgomez/
_______________________________________________________________________________