Dear Roman,<div><br></div><div>Jan Pomikalek's thesis is substantially on this topic. The code, justext, is on Google Code; a demo is at
<a href="http://nlp.fi.muni.cz/projekty/justext/">http://nlp.fi.muni.cz/projekty/justext/</a> <br><br>Adam</div><div><br><div class="gmail_quote">On 17 July 2012 15:43, Roman Klinger <span dir="ltr">&lt;<a href="mailto:roman.klinger@scai.fraunhofer.de" target="_blank">roman.klinger@scai.fraunhofer.de</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
<br>
we have huge text streams in which some parts are not really language but symbols, IDs, numbers, etc.<br>
<br>
Does anybody know of an existing (and available) system that can distinguish 'garbage' from 'real sentences'?<br>
<br>
This can probably be done easily with a dictionary lookup (e.g. using Google n-grams), but maybe somebody has already put more effort into it.<br>
<br>
Or do you know of any papers on this topic?<br>
<br>
Thanks,<br>
Roman<br>
<br>
<br>
-- <br>
Dr. Roman Klinger<br>
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)<br>
Schloss Birlinghoven<br>
D-53754 Sankt Augustin<br>
Tel.: <a href="tel:%2B49-2241-14-2360" value="+492241142360" target="_blank">+49-2241-14-2360</a><br>
Fax.: <a href="tel:%2B49-2241-14-4-2360" value="+4922411442360" target="_blank">+49-2241-14-4-2360</a><br>
email: <a href="mailto:roman.klinger@scai.fraunhofer.de" target="_blank">roman.klinger@scai.fraunhofer.de</a><br>
<a href="http://www.scai.fraunhofer.de/klinger.html" target="_blank">http://www.scai.fraunhofer.de/klinger.html</a><br>
<br>
<br>
_______________________________________________<br>
UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>========================================<br><a href="http://www.kilgarriff.co.uk/" target="_blank">Adam Kilgarriff</a> <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a> <br>
Director <a href="http://www.sketchengine.co.uk/" target="_blank">Lexical Computing Ltd</a> <br>Visiting Research Fellow <a href="http://leeds.ac.uk" target="_blank">University of Leeds</a> <div>
<i><font color="#006600">Corpora for all</font></i> with <a href="http://www.sketchengine.co.uk" target="_blank">the Sketch Engine</a> </div><div> <i><a href="http://www.webdante.com" target="_blank">DANTE: <font color="#009900">a lexical database for English</font></a><font color="#009900"> </font> </i><div>
========================================</div></div><br>
</div>