[Corpora-List] Corpora Digest, Vol 47, Issue 14

Michal Ptaszynski ptaszynski at media.eng.hokudai.ac.jp
Thu May 12 17:57:36 UTC 2011


> In theory, though, all the books are available for free from  
> http://books.google.com/ .  In the Google ngram interface at  
> http://ngrams.googlelabs.com/

Luckily the n-grams are downloadable, since you do NOT want to use a
hundred-billion-word corpus through an interface that blocks you for a day
after 1000 queries. :D

BTW, I was wondering why so many people still stick to n-grams, when we
all know that frequent sentence patterns usually consist of separated
entities (words, POS tags, or however you define sentence patterns). I
remember Yorik tried something called "skip-grams". Apart from the huge
number of skip-grams generated (gigabytes of them), was this approach
actually useful in any way? (I mean, did it, e.g., produce an effective
tool or method, not just a conference paper?)
I am asking because at present I am developing a method based on a
combinatorial approach (extract all combinations of entities and check
which appear frequently).
It is somewhat similar to skip-grams, but it does not impose any
restrictions, either on the number of "skips" or on the number of "grams".
Working on this method reminds me of prehistoric experiments, when people
would launch a computer program and take two days off :)  However, the
results are interesting and seem promising - I was able to extract frequent
patterns from the Ainu language (a polysynthetic relict which I don't know
at all), and a speaker of Ainu confirmed they actually were patterns (!).
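For concreteness, the idea above can be sketched in a few lines of Python. This is only my minimal illustration of the general technique (not the author's actual implementation): it enumerates all ordered token combinations (subsequences) of each sentence, with no limit on the distance between tokens, and counts how often each recurs. The only restriction I add, purely for tractability, is a `max_len` cap on pattern length, which the described method does not assume.

```python
from itertools import combinations
from collections import Counter

def extract_patterns(sentences, max_len=4):
    """Count all ordered combinations (subsequences) of tokens in each
    sentence, with no limit on the "skips" between the chosen tokens.
    max_len caps pattern length only to keep the explosion manageable."""
    counts = Counter()
    for tokens in sentences:
        for k in range(2, min(max_len, len(tokens)) + 1):
            # combinations() preserves token order but allows arbitrary gaps
            for combo in combinations(tokens, k):
                counts[combo] += 1
    return counts

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
patterns = extract_patterns(sentences)
# patterns shared by both sentences, e.g. ("sat", "on", "the"),
# surface with higher counts than sentence-specific ones
print(patterns.most_common(5))
```

As the "two days off" remark suggests, the number of combinations grows exponentially in sentence length, so any practical run needs either short sentences or a frequency/length cutoff.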

The wink I am making says: why not give up the old "only n-grams"
approach and start dealing with something more sophisticated? After all,
Shannon proposed n-grams over 60 years ago. I would love to see something
like "Google patterns".

Michal


----------------
From: Angus Grieve-Smith <grvsmth at panix.com>
To: corpora at uib.no
Date: Thu, 12 May 2011 12:37:46 -0400
Subject: Re: [Corpora-List] 155 *billion* (155, 000, 000, 000) word corpus
of American English

On 5/12/2011 11:15 AM, Mark Davies wrote:
> > Is the corpus itself or part of it available for downloading? It would
> > be more useful if we could process the raw text for our own purposes
> > rather than accessing it from a web interface.
> As mentioned previously, the underlying n-grams data is freely available
> from Google at http://ngrams.googlelabs.com/datasets (see
> http://creativecommons.org/licenses/by/3.0/ re. licensing).

       When I try to use it, I get "Session expired. Click here to start a
new session."

       In theory, though, all the books are available for free from
http://books.google.com/ .  In the Google ngram interface at
http://ngrams.googlelabs.com/ there are links to date ranges.  If you
click on those you will see a date range result for the search term on the
Google Books website.  You can then click the "Plain text" link in the
upper right hand corner to see the OCRed text.  Then you can appreciate
how rough some of the OCR has been.

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


