[Corpora-List] 155 *billion* (155,000,000,000) word corpus of American English

Michal Ptaszynski ptaszynski at media.eng.hokudai.ac.jp
Fri May 13 15:54:14 UTC 2011


David: Thank you for the link to StringNet and the paper. I checked both.
A very good feature of it is the grammatical information. However, it is
still based on an n-gram approach, only with grammatical categories instead
of words. BTW, it is very similar to what, e.g., Satoshi Sekine has been
doing for a long time for English and Japanese. There are of course many
others who make use of the KWIC idea (Key Word In Context). What I would
like to see would be something like KPIC (with the P standing for Pattern).

Detmar: Thank you for the link to Saphre and to Dale Gerdemann's work. This
approach seems to be right on the button, especially the "gappy phrases"
idea.
However, from what I scanned, Dale looks for phrases that consist of two
(possibly separated) elements. That case is easily manageable in my method,
since generating all 2-element combinations and checking their frequencies
is not a problem. The real problem starts with longer combinations. As we
know, the longer a sentence pattern is, the more specific its context
becomes ("longer = better"; linguistics is sometimes really
chauvinistic...).
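
To make the 2-element case concrete, here is a rough Python sketch of what
I mean by generating all (possibly separated) 2-element combinations and
keeping the frequent ones. The function name and the frequency threshold
are mine, purely for illustration; a serious run would of course need a
proper tokeniser:

    # Count every ordered 2-element combination of tokens in each sentence,
    # regardless of how far apart the two tokens are, and keep the pairs
    # that recur across the corpus.
    from itertools import combinations
    from collections import Counter

    def frequent_pairs(sentences, min_freq=2):
        counts = Counter()
        for sent in sentences:
            # combinations() preserves word order, so ('what', 'it') and
            # ('it', 'what') are counted as different pairs.
            counts.update(combinations(sent.split(), 2))
        return {pair: n for pair, n in counts.items() if n >= min_freq}

    # frequent_pairs(["what a lovely day it is", "what a mess it is"])
    # -> {('what', 'a'): 2, ('what', 'it'): 2, ('it', 'is'): 2, ...}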

A word of explanation in case someone was wondering about the difference
between a "phrase" and a "pattern". I see a phrase as a pattern too, only a
shorter one. However, I would not draw a rigid threshold for the
distinction (like: phrases = 2 or 3 elements, patterns = more than 3),
since some phrases can be really long and you can find short patterns as
well. Perhaps a better distinction is that a phrase brings a bit of
semantics to the table by itself (or: it is comprehensible on its own, not
only within the sentence). The patterns I am looking for are not (or "not
always") something you could put into a dictionary. For example, in the
sentence "Oh, what a beautiful day it is today, isn't it!", my method finds
the pattern "Oh, what a * isn't it!", or "what a * isn't it!" (BTW, the
skip-grams mentioned earlier would probably fail here, since there are more
than 4 skips and more than 4 grams). I also allow for more than one
wildcard.
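
In case anyone is curious what the general case looks like, here is a rough
sketch of the combinatorial idea: take every combination of token positions
in a sentence (any length, any number of gaps), turn each combination into
a pattern with "*" standing for the skipped material, and count how often
each pattern recurs across the corpus. Everything here (the names, the
max_len cap, the naive whitespace tokenisation) is illustrative only;
without the cap and some serious pruning the search space explodes, which
is exactly the "launch the program and take two days off" situation I
described in my previous message (quoted below):

    # Enumerate every combination of token positions (up to max_len tokens),
    # insert '*' wherever intermediate tokens were skipped, and count how
    # often each resulting pattern appears across the corpus.
    from itertools import combinations
    from collections import Counter

    def sentence_patterns(tokens, max_len=6):
        for k in range(2, min(max_len, len(tokens)) + 1):
            for idx in combinations(range(len(tokens)), k):
                pattern, prev = [tokens[idx[0]]], idx[0]
                for i in idx[1:]:
                    if i > prev + 1:        # something was skipped here
                        pattern.append('*')
                    pattern.append(tokens[i])
                    prev = i
                yield tuple(pattern)

    def frequent_patterns(sentences, min_freq=2, max_len=6):
        counts = Counter()
        for sent in sentences:
            counts.update(sentence_patterns(sent.split(), max_len))
        return {p: n for p, n in counts.items() if n >= min_freq}

    # "Oh, what a beautiful day it is today, isn't it!" contributes, among
    # many other candidates, the pattern
    # ('Oh,', 'what', 'a', '*', "isn't", 'it!').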

I did some background searching but, although it would be reasonable for
such a method to exist already, I could not find one. If anyone knows of
such a method developed earlier, I would be much obliged.

Best,

Michal



On 13-05-2011 at 13:13:28, David Wible <wible at stringnet.org> wrote:

> Concerning Michal's 'btw' asking about alternatives to n-grams, Nai-Lung
> Tsao and I have pursued an alternative in recent years. We call them
> 'hybrid n-grams', and use them to build a navigable model of English that
> we call StringNet (http://nav.stringnet.org). Of course we aren't the only
> ones or the first to put POSs in n-grams, but we've tried to push it and
> exploit it to create a lang. model that is a (navigable) network that
> captures relations among patterns.
>
> Here's a paper with details and with a description of its use for
> exploring linguistic constructions:
> http://www.aclweb.org/anthology/W/W10/W10-0804.pdf
>
> If you do visit http://nav.stringnet.org, submit a word the way you would
> with a concordancer. When results show up, click on everything. Clicking
> on a POS tag in any of the results gives a pop-up showing all words
> attested in that slot and their frequencies (be patient with pop-ups).
> Click on parent or child links beside the patterns in the search results
> to navigate 'up' and 'down'.
>
> Best,
> David
>
> On Fri, May 13, 2011 at 1:57 AM, Michal Ptaszynski
> <ptaszynski at media.eng.hokudai.ac.jp> wrote:
>
>>> In theory, though, all the books are available for free from
>>> http://books.google.com/ .  In the Google ngram interface at
>>> http://ngrams.googlelabs.com/
>>>
>>
>> Luckily it's downloadable (the n-grams), since you do NOT want to use a
>> hundred-billion-word corpus through an interface which blocks you for a
>> day after 1000 queries. :D
>>
>> BTW, I was wondering why so many people still stick to n-grams, when we
>> all know that frequent sentence patterns usually consist of separated
>> entities (words, POS, or however you define sentence patterns). I
>> remember Yorik tried something called "skip grams". Apart from the huge
>> number of skip-grams they generated (in GBs), was this approach actually
>> useful in any way? (I mean, e.g., did it produce an effective tool or
>> method - not just a conference paper?)
>> I am asking since at present I am developing a method based on a
>> combinatorial approach (extract all combinations of entities and check
>> which appear frequently).
>> It is a bit similar to skip grams, but it does not assume any
>> restrictions, neither in the number of "skips" nor in "grams". Basically,
>> working on this method reminds me of prehistoric experiments, when people
>> would launch a computer program and take two days off :)  However, the
>> results are interesting and seem promising - I could extract frequent
>> patterns from the Ainu language (a polysynthetic relict which I don't
>> know at all) and a speaker of Ainu confirmed they actually were
>> patterns (!).
>>
>> The wink I am making says: why not give up the old "only n-grams"
>> approach and start dealing with something more sophisticated? After all,
>> Shannon proposed n-grams over 50 years ago. I would love to see something
>> like "Google patterns".
>>
>> Michal
>>
>>
>> ----------------
>> From: Angus Grieve-Smith <grvsmth at panix.com>
>> To: corpora at uib.no
>> Date: Thu, 12 May 2011 12:37:46 -0400
>> Subject: Re: [Corpora-List] 155 *billion* (155,000,000,000) word corpus
>> of American English
>>
>> On 5/12/2011 11:15 AM, Mark Davies wrote:
>> Is the corpus itself or part of it available for downloading? It would be
>> more useful if we could process the raw text for our own purposes rather
>> than accessing it from a web interface.
>> As mentioned previously, the underlying n-grams data is freely available
>> from Google at http://ngrams.googlelabs.com/datasets (see
>> http://creativecommons.org/licenses/by/3.0/ re. licensing).
>>
>>      When I try to use it, I get "Session expired. Click here to start
>> new session."
>>
>>      In theory, though, all the books are available for free from
>> http://books.google.com/ .  In the Google ngram interface at
>> http://ngrams.googlelabs.com/ there are links to date ranges.  If you
>> click on those, you will see a date-range result for the search term on
>> the Google Books website.  You can then click the "Plain text" link in
>> the upper right-hand corner to see the OCRed text.  Then you can
>> appreciate how rough some of the OCR has been.
>>


-- 
Michal PTASZYNSKI
ptaszynski at ieee.org

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


