[Corpora-List] Looglefight
Adam Kilgarriff
adam at lexmasterclass.com
Sat Oct 2 11:06:08 UTC 2010
John,
this is less a bug than a knotty tokenisation problem. For most linguistic
purposes it is appropriate to tokenise *cannot* as two words, so that's what
we have done. Can't please all the people all the time ...
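
To make the effect concrete, here is a minimal sketch, assuming a
corpus tokeniser that splits "cannot" into "can" + "not" while the query
string is split on whitespace only. The function names and the sample
text are hypothetical and are not the actual Looglefight code.

```python
import re

def tokenize(text):
    """Corpus-side tokeniser (assumed rule): 'cannot' becomes 'can' + 'not'."""
    tokens = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word == "cannot":
            tokens.extend(["can", "not"])
        else:
            tokens.append(word)
    return tokens

def count_phrase(tokens, query):
    """Count occurrences of the query, split on whitespace, in the token list."""
    target = query.lower().split()
    n = len(target)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target
    )

# Hypothetical three-sentence corpus
corpus = tokenize("I cannot go. You can not stay. We cannot wait.")
hits_cannot = count_phrase(corpus, "cannot")
hits_can_not = count_phrase(corpus, "can not")
```

Under these assumptions hits_cannot is 0, because the single token
"cannot" never survives tokenisation, and hits_can_not is 3, because
both spellings end up as the same two-token sequence.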
Adam
On 2 October 2010 03:26, John F. Sowa <sowa at bestweb.net> wrote:
> Another bug in Looglefight:
>
> I checked both Googlefight and Looglefight for the occurrences of
> 'cannot' vs. 'can not'. According to Googlefight, there are about
> 50 times more occurrences of 'can not' than 'cannot'.
>
> But Looglefight said there were 0 occurrences of 'cannot',
> but 13,832 occurrences of 'can not'.
>
> So I checked the concordance for 'can not' and found that
> Looglefight mixed all occurrences of 'cannot' and 'can not'
> in the column for 'can not'.
>
> John Sowa
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
--
================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd http://www.sketchengine.co.uk
Lexicography MasterClass Ltd http://www.lexmasterclass.com
Universities of Leeds and Sussex adam at lexmasterclass.com
================================================