[Corpora-List] Real language detection

Roman Klinger roman.klinger at scai.fraunhofer.de
Tue Jul 17 14:43:37 UTC 2012


Hi,

We have huge text streams in which some parts are not natural language 
but symbols, IDs, numbers, etc.

Does anyone know of an existing (and available) system that can 
distinguish 'garbage' from 'real sentences'?

This can probably be done easily with a dictionary lookup (e.g. using the 
Google n-grams), but maybe somebody has already put more effort into it.
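
To make the dictionary-lookup idea concrete, here is a minimal sketch in 
Python. Everything in it is an assumption for illustration: the tiny VOCAB 
set stands in for a real word list (for instance, unigrams taken from an 
n-gram corpus), and the function names and the 0.5 threshold are 
hypothetical, not taken from any existing system.

```python
import string

# Hypothetical stand-in vocabulary (assumption). A real setup would load a
# large word list, e.g. frequent unigrams from an n-gram corpus.
VOCAB = {
    "we", "have", "real", "text", "in", "this", "sentence",
    "the", "a", "of", "is", "and", "are", "not", "language",
}


def known_word_ratio(line, vocab=VOCAB):
    """Return the fraction of whitespace-separated tokens found in vocab."""
    tokens = line.split()
    if not tokens:
        return 0.0
    # Strip surrounding punctuation and lowercase before the lookup.
    hits = sum(
        1 for tok in tokens
        if tok.strip(string.punctuation).lower() in vocab
    )
    return hits / len(tokens)


def looks_like_language(line, vocab=VOCAB, threshold=0.5):
    """Label a line as 'real language' if enough tokens are known words.

    The threshold is an arbitrary illustrative value (assumption) and
    would need tuning on real data.
    """
    return known_word_ratio(line, vocab) >= threshold


if __name__ == "__main__":
    for sample in ["We have real text in this sentence.",
                   "ID-4711 0x3F ##### 12345"]:
        print(looks_like_language(sample), repr(sample))
```

A plain ratio like this ignores word order and will misjudge short lines, 
names, and out-of-vocabulary terms, so it is a baseline at best.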

Or do you know of any papers on this topic?

Thanks,
  Roman


-- 
Dr. Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klinger at scai.fraunhofer.de
http://www.scai.fraunhofer.de/klinger.html

