[Corpora-List] Real language detection
Roman Klinger
roman.klinger at scai.fraunhofer.de
Tue Jul 17 14:43:37 UTC 2012
Hi,
we have huge text streams in which some parts are not natural language but
symbols, IDs, numbers, etc.
Does anyone know of an existing (and available) system that can
distinguish between 'garbage' and 'real sentences'?
This can probably be done with a simple dictionary lookup (e.g. using Google
n-grams), but maybe somebody has already put more effort into it.
Or do you know of any papers on this topic?
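To make the dictionary-lookup idea concrete, here is a minimal sketch in
Python. Everything in it is an assumption for illustration: the tiny VOCAB set
stands in for a real word list (for instance one derived from n-gram counts),
and the function names, the 0.5 threshold, and the minimum of three tokens are
arbitrary choices, not a tested system.

```python
# Hypothetical toy vocabulary; a real setup would load a large word list,
# for example the unigrams of an n-gram corpus.
VOCAB = {
    "the", "a", "is", "in", "of", "and", "to", "we", "have",
    "this", "text", "real", "sentence",
}

PUNCTUATION = ".,;:!?\"'()"


def known_word_ratio(line, vocab=VOCAB):
    """Return the fraction of whitespace-separated tokens found in vocab."""
    tokens = line.split()
    if not tokens:
        return 0.0
    # Strip surrounding punctuation and lowercase before the lookup.
    known = sum(1 for t in tokens if t.strip(PUNCTUATION).lower() in vocab)
    return known / len(tokens)


def looks_like_language(line, vocab=VOCAB, threshold=0.5, min_tokens=3):
    """Heuristic: enough tokens, and enough of them are known words."""
    if len(line.split()) < min_tokens:
        return False
    return known_word_ratio(line, vocab) >= threshold


for line in ["This is a real sentence in the text.",
             "ID-4711 0x3f 2012-07-17 ###"]:
    label = "sentence" if looks_like_language(line) else "garbage"
    print(label, "|", line)
```

A plain ratio like this would misjudge short lines and text full of proper
names or domain terms, so the threshold and the vocabulary would need tuning
on real data.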
Thanks,
Roman
--
Dr. Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klinger at scai.fraunhofer.de
http://www.scai.fraunhofer.de/klinger.html