[Corpora-List] Deviations in language models on the web
Ivan Krišto
ivan.kristo at gmail.com
Wed Nov 17 15:28:51 UTC 2010
Hello!
On Tue, Nov 16, 2010 at 10:12 AM, Serge Sharoff <s.sharoff at leeds.ac.uk> wrote:
> in doing webcrawls for linguistic purposes, I recently came across an
> approach to link spamming or SEO that involves taking
> sentences from a large range of texts (mostly out-of-copyright fiction),
> mixing the sentences randomly, injecting the name of a product (or other
> keywords) and creating thousands of webpages.
Could you send a few URLs with the random text you mention to this
mailing list? I'm not sure whether this is a new SEO technique or an
old one.
Years ago, many web pages boosted their search rank by including an
invisible text segment stuffed with assorted words (standalone
keywords, not generated sentences). The text was hidden through
coloring (white text on a white background), CSS styles, etc. Search
engines eventually banned that technique.
If the SEO technique you mention works the same way, such pages can be
detected and filtered out (that would add a whole new level of
complexity to web page cleaning systems, but it can be done).
Also, as far as I know Google's PageRank algorithm (I assume other
engines use similar ones), it takes into account the rank of the pages
that link to yours. If someone creates a large number of low-rank dummy
sites and links them all to his page, that won't boost his page's rank
much. I suspect the boosting is done in some other way, so it would be
worth investigating further to see whether their "rank boosting
problem" can be exploited by our web page cleaning software.
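A back-of-the-envelope check of that claim, using PageRank's standard
teleportation argument (the damping factor, web size, and farm size
below are illustrative assumptions, not measured values):

# A page with no real inlinks holds roughly only the teleportation
# mass (1 - d) / N, and each outgoing link passes on d * rank / outdegree.
# So k dummy pages that link only to one target contribute about
# k * d * (1 - d) / N to it in total.
d = 0.85      # commonly cited damping factor
N = 10 ** 10  # assumed number of indexed pages
k = 10 ** 4   # dummy pages in the farm, each with one link to the target
farm_boost = k * d * (1 - d) / N
print(farm_boost)  # ~1.3e-07: far less than a single link from a
                   # genuinely high-ranked page would pass on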
Regards,
Ivan Krišto
> The intent is probably to fool search engines into thinking these are
> product reviews or descriptions, but the implication for linguistics is
> that we get polluted language models, in which mobile phones collocate
> with horse-drawn carriages.
>
> SEO-enhanced pages I came across in the past contained random word lists
> with keywords injected. It was possible to deal with such cases by
> n-gram filtering. However, this simple trick doesn't work any longer,
> as the sentences are to a very large extent entirely grammatical.
>
> Any experience from others and suggestions on how to deal with this
> phenomenon?
>
> Best,
> Serge
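P.S. For anyone who wants to try the n-gram filtering Serge mentions
(which, as he notes, no longer catches the grammatical variety), a
minimal sketch; the add-one smoothing and the threshold in the comment
are placeholders, and the background counts would come from a clean
reference corpus:

def bigram_plausibility(tokens, bigram_counts, unigram_counts):
    """Average smoothed probability of each word given the previous one.
    Random word lists score low; ordinary grammatical text scores higher."""
    if len(tokens) < 2:
        return 0.0
    vocab = max(len(unigram_counts), 1)
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # add-one smoothing over the background counts
        total += ((bigram_counts.get((prev, word), 0) + 1.0)
                  / (unigram_counts.get(prev, 0) + vocab))
    return total / (len(tokens) - 1)

# e.g. drop a page when its score falls below a threshold tuned on
# clean text (the 1e-4 here is a placeholder, not a recommended value):
#   if bigram_plausibility(tokens, bg_bigrams, bg_unigrams) < 1e-4:
#       reject(page)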
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora