[Corpora-List] Looglefight

Adam Kilgarriff adam at lexmasterclass.com
Sat Oct 2 11:06:08 UTC 2010


John,

this is less a bug than a knotty tokenisation problem.  For most linguistic
purposes it is appropriate to tokenise *cannot* as two words, so that's what
we have done.  Can't please all the people all the time ...
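
A minimal sketch of the effect, assuming a toy tokeniser rule that splits
*cannot* into *can* + *not* (the names `tokenise` and `count_phrase` and the
sample text are hypothetical illustrations, not Looglefight's actual code).
Once the split has happened at indexing time, a search for 'cannot' has
nothing to match, and 'can not' collects both spellings:

```python
import re

def tokenise(text):
    """Lowercase the text and split it into word tokens.

    Assumed rule for illustration: 'cannot' is emitted as two tokens,
    'can' and 'not'.
    """
    tokens = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word == "cannot":
            tokens.extend(["can", "not"])
        else:
            tokens.append(word)
    return tokens

def count_phrase(tokens, phrase):
    """Count matches of a phrase as a sequence of whole tokens."""
    target = phrase.lower().split()
    n = len(target)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)

text = "I cannot go. You can not stay. We cannot wait."
tokens = tokenise(text)

print(count_phrase(tokens, "cannot"))   # 0: no single token 'cannot' exists
print(count_phrase(tokens, "can not"))  # 3: both spellings land here
```

Note that the query 'cannot' is matched as a single token here; a system
could instead run the same tokeniser over the query, in which case both
queries would return the same merged count.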

Adam

On 2 October 2010 03:26, John F. Sowa <sowa at bestweb.net> wrote:

> Another bug in Looglefight:
>
> I checked both Googlefight and Looglefight for the occurrences of
> 'cannot' vs. 'can not'.  According to Googlefight, there are about
> 50 times more occurrences of 'can not' than 'cannot'.
>
> But Looglefight said there were 0 occurrences of 'cannot',
> but 13,832 occurrences of 'can not'.
>
> So I checked the concordance for 'can not' and found that
> Looglefight mixed all occurrences of 'cannot' and 'can not'
> in the column for 'can not'.
>
> John Sowa
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd                   http://www.sketchengine.co.uk
Lexicography MasterClass Ltd      http://www.lexmasterclass.com
Universities of Leeds and Sussex       adam at lexmasterclass.com
================================================