[Corpora-List] Real language detection

Adam Kilgarriff adam at lexmasterclass.com
Tue Jul 17 15:25:27 UTC 2012


Dear Roman,

Jan Pomikalek's thesis is substantially on this topic. The code, justext,
is on Google Code; there is a demo at http://nlp.fi.muni.cz/projekty/justext/

Adam

On 17 July 2012 15:43, Roman Klinger <roman.klinger at scai.fraunhofer.de> wrote:

> Hi,
>
> We have huge text streams in which some parts are not really language but
> symbols, IDs, numbers, etc.
>
> Does anybody know of an existing (and available) system that can
> distinguish between 'garbage' and 'real sentences'?
>
> This can probably be done easily with a dictionary lookup (e.g. using Google
> n-grams), but maybe somebody else has already put more effort in.
>
> Or do you know any papers in this context?
>
> Thanks,
>  Roman
>
>
> --
> Dr. Roman Klinger
> Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
> Schloss Birlinghoven
> D-53754 Sankt Augustin
> Tel.: +49-2241-14-2360
> Fax.: +49-2241-14-4-2360
> email: roman.klinger at scai.fraunhofer.de
> http://www.scai.fraunhofer.de/klinger.html
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
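
The dictionary-lookup idea in the quoted message can be sketched roughly as
follows. This is only a minimal illustration and not how justext works: the
word list, the 0.5 threshold, the three-token minimum, and the function names
known_word_ratio and looks_like_language are all assumptions made up for the
example. A real setup would load a large dictionary or n-gram vocabulary and
tune the threshold on sample data.

```python
import re

# Hypothetical toy vocabulary (an assumption for this sketch); a real system
# would load a large dictionary or an n-gram vocabulary instead.
KNOWN_WORDS = frozenset(
    "a an and are as at be but by can do for from have in is it not of on "
    "or that the this to we which with you text language parts streams huge".split()
)

TOKEN_RE = re.compile(r"\S+")                      # whitespace-separated tokens
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")  # purely alphabetic words


def known_word_ratio(line, vocabulary=KNOWN_WORDS):
    """Return the fraction of tokens in `line` that are known words."""
    tokens = TOKEN_RE.findall(line)
    if not tokens:
        return 0.0
    hits = 0
    for token in tokens:
        # Drop surrounding punctuation and ignore case before the lookup.
        word = token.strip(".,;:!?\"'()[]").lower()
        if WORD_RE.fullmatch(word) and word in vocabulary:
            hits += 1
    return hits / len(tokens)


def looks_like_language(line, threshold=0.5, min_tokens=3, vocabulary=KNOWN_WORDS):
    """Guess whether `line` is a real sentence rather than ids, numbers, etc.

    The threshold and min_tokens defaults are illustrative assumptions.
    """
    if len(TOKEN_RE.findall(line)) < min_tokens:
        return False
    return known_word_ratio(line, vocabulary) >= threshold


if __name__ == "__main__":
    for sample in (
        "we have huge text streams in which parts are not really language",
        "ID-4471 0x3fa9 ### 2012-07-17 15:25:27",
    ):
        print(looks_like_language(sample), sample)
```

A lookup like this treats every token independently, so it will misjudge
short lines, proper names, and text in languages the word list does not cover.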



-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>

*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================