[Corpora-List] Real language detection

Craig Pfeifer craig.pfeifer at gmail.com
Tue Jul 17 14:49:55 UTC 2012


The most recent work in language ID I know of is:

http://aclweb.org/anthology-new/W/W12/W12-2108.pdf

You may be able to get code from the authors.
______________
craig.pfeifer at gmail.com


On Tue, Jul 17, 2012 at 10:43 AM, Roman Klinger
<roman.klinger at scai.fraunhofer.de> wrote:
> Hi,
>
> We have huge text streams in which some parts are not really language but
> symbols, IDs, numbers, etc.
>
> Does anybody know of an existing (and available) system that can
> distinguish 'garbage' from 'real sentences'?
>
> This can probably be done easily with a dictionary lookup (e.g. using the
> Google n-grams), but maybe somebody has already put more effort into it.
>
> Or do you know of any papers on this topic?
>
> Thanks,
>  Roman
>
>
> --
> Dr. Roman Klinger
> Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
> Schloss Birlinghoven
> D-53754 Sankt Augustin
> Tel.: +49-2241-14-2360
> Fax.: +49-2241-14-4-2360
> email: roman.klinger at scai.fraunhofer.de
> http://www.scai.fraunhofer.de/klinger.html
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
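
The dictionary-lookup idea mentioned in the quoted message could be sketched roughly as below. This is only an illustration, not a system anyone in the thread has described: the tiny VOCAB set, the function names, and the 0.5 threshold are all assumptions made for the example. A real setup would load a large word list (for instance unigrams taken from an n-gram corpus) and tune the threshold on sample data.

```python
import re

# Hypothetical toy vocabulary, for illustration only. A real system would
# load a large word list, e.g. unigrams derived from an n-gram corpus.
VOCAB = {
    "the", "a", "is", "of", "and", "to", "in", "we", "have",
    "huge", "text", "streams", "this", "real", "sentence",
}

# Tokens are simply runs of non-whitespace characters.
TOKEN_RE = re.compile(r"\S+")

# Punctuation stripped from both ends of a token before the lookup.
PUNCT = ".,;:!?\"'()"


def known_word_ratio(line, vocab=VOCAB):
    """Return the fraction of tokens in `line` that appear in `vocab`."""
    tokens = TOKEN_RE.findall(line)
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok.strip(PUNCT).lower() in vocab)
    return known / len(tokens)


def looks_like_language(line, vocab=VOCAB, threshold=0.5):
    """Treat a line as 'real language' if enough of its tokens are known words.

    The threshold is an assumed value and would need tuning on real data.
    """
    return known_word_ratio(line, vocab) >= threshold


if __name__ == "__main__":
    for sample in ["We have huge text streams.", "ID-4471 0x3F 99812 ###"]:
        print(sample, "->", looks_like_language(sample))
```

A ratio rather than an all-or-nothing lookup is used here so that a sentence containing a few unknown names or IDs is not rejected outright.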
