<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Yes, it seems like script detection rather than language
detection. I also wonder whether the notion of Arabic here includes
languages that use scripts based on the Arabic one (say, Urdu, Persian and
some others).<br>
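<br>
To make the distinction concrete, here is a minimal Python sketch of what
pure script detection amounts to. It is only an illustration under my own
assumptions, not the method from the paper: the helper name
dominant_script is hypothetical, and taking the first word of each Unicode
character name is a rough stand-in for the real Unicode Script property.
The sample words are written as escapes so the message stays ASCII.<br>
<pre>
```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script label among the letters of text.

    Assumption: the first word of a Unicode character name (for example
    ARABIC or CYRILLIC) is used as an approximate script label.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


# Short sample words; the language labels are mine, for illustration only.
samples = {
    "Arabic": "\u0645\u0631\u062d\u0628\u0627",
    "Persian": "\u0633\u0644\u0627\u0645",
    "Urdu": "\u06c1\u06d2",
    "Russian": "\u043f\u0440\u0438\u0432\u0435\u0442",
    "Bulgarian": "\u0441\u0432\u044f\u0442",
}

for language, word in samples.items():
    print(language, dominant_script(word))
```
</pre>
Under this approximation Arabic, Persian and Urdu all come out with the
same label, as do Russian and Bulgarian, so a check of this kind cannot
tell the languages within one script apart.<br>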
<br>
Taras<br>
<br>
<br>
On 19/06/12 16:39, Hristo Tanev wrote:
<blockquote
cite="mid:1340120371.7651.YahooMailNeo@web28902.mail.ir2.yahoo.com"
type="cite">
<div style="color: rgb(0, 0, 0); background-color: rgb(255, 255,
255); font-family: times new roman,new york,times,serif;
font-size: 12pt;">
<div>....only that Cyrillic is not a language.</div>
<div><br>
</div>
<div>Hristo Tanev</div>
<div><br>
</div>
<div style="font-size: 12pt; font-family: 'times new roman','new
york',times,serif;">
<div style="font-size: 12pt; font-family: 'times new
roman','new york',times,serif;">
<div dir="ltr"> <font face="Arial" size="2">
<hr size="1"> <b><span style="font-weight: bold;">From:</span></b>
                Benjamin Van Durme <a class="moz-txt-link-rfc2396E" href="mailto:vandurme@cs.jhu.edu">&lt;vandurme@cs.jhu.edu&gt;</a><br>
<b><span style="font-weight: bold;">To:</span></b>
                Christine Amling <a class="moz-txt-link-rfc2396E" href="mailto:chamling@students.uni-mainz.de">&lt;chamling@students.uni-mainz.de&gt;</a>
<br>
<b><span style="font-weight: bold;">Cc:</span></b>
<a class="moz-txt-link-abbreviated" href="mailto:corpora@uib.no">corpora@uib.no</a> <br>
<b><span style="font-weight: bold;">Sent:</span></b>
Tuesday, 19 June 2012, 16:05<br>
<b><span style="font-weight: bold;">Subject:</span></b>
Re: [Corpora-List] Need help with Twitter Corpus<br>
</font> </div>
<br>
            The following presents a new language identification (LID)
            method and includes a comparison against a number of tools
            on Twitter data.<br>
<br>
Language Identification for Creating Language-Specific
Twitter Collections<br>
Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink,
Theresa Wilson<br>
<a moz-do-not-send="true"
href="http://aclweb.org/anthology-new/W/W12/W12-2108.pdf"
target="_blank">http://aclweb.org/anthology-new/W/W12/W12-2108.pdf</a><br>
<br>
            Accuracy numbers (most of the other systems were run
            black-box, without adaptation, so interpret these
            conservatively):<br>
<br>
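            <table border="1" cellpadding="3">
              <tr><th>System</th><th>Arabic</th><th>Devanagari</th><th>Cyrillic</th></tr>
              <tr><td>TextCat</td><td>96.3</td><td>89.1</td><td>90.3</td></tr>
              <tr><td>Google CLD</td><td>90.5</td><td>NA</td><td>91.4</td></tr>
              <tr><td>Lui/Baldwin</td><td>91.4</td><td>78.4</td><td>88.8</td></tr>
              <tr><td>PPM (new)</td><td>97.6</td><td>97.1</td><td>95.8</td></tr>
            </table>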
<br>
_______________________________________________<br>
UNSUBSCRIBE from this page: <a moz-do-not-send="true"
href="http://mailman.uib.no/options/corpora"
target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br>
<a moz-do-not-send="true" ymailto="mailto:Corpora@uib.no"
href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a moz-do-not-send="true"
href="http://mailman.uib.no/listinfo/corpora"
target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br>
<br>
</div>
</div>
</div>
<pre wrap="">
<fieldset class="mimeAttachmentHeader"></fieldset>
_______________________________________________
UNSUBSCRIBE from this page: <a class="moz-txt-link-freetext" href="http://mailman.uib.no/options/corpora">http://mailman.uib.no/options/corpora</a>
Corpora mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>
<a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>
</pre>
</blockquote>
<br>
</body>
</html>