[Corpora-List] Corpora Digest, Vol 47, Issue 14

Benjamin Allison ballison at staffmail.ed.ac.uk
Fri May 13 09:40:09 UTC 2011


Why one might use n-grams really depends on what one wishes to do with  
them. In truth, I think for many of the applications where they're  
primarily used (anything that uses a language model) there isn't  
really a feasible alternative. I agree they're not ideal for  
visualisation and inspection of patterns, but I don't think  
concordancers or similar use n-grams anyway.

Note, however, that "n-grams" does not just mean all three-word sequences -  
rather, n-gram models are a means of estimating the probability of observing  
a word given the words that have come before it. Language models based on  
sequences of words have improved beyond recognition since the early days of  
information theory -- if you're interested, see Yee Whye Teh's recent work on  
the sequence memoizer, which is an infinite-order n-gram model.
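
For concreteness, a minimal sketch of what such an estimate looks like: a plain
maximum-likelihood trigram model over a toy corpus, in Python. Real language
models add smoothing or back-off on top of these raw counts (the sequence
memoizer can be seen as pushing the same idea to unbounded context lengths).

from collections import Counter

# Toy corpus; any tokenised text would do.
tokens = "the cat sat on the mat and the cat slept".split()

# Count trigrams and the bigram histories they condition on.
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_mle(w3, w1, w2):
    """Maximum-likelihood estimate of P(w3 | w1 w2)."""
    history = bigram_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / history if history else 0.0

print(p_mle("sat", "the", "cat"))   # 0.5: "the cat" is followed by "sat" once and "slept" once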

B

Quoting David Wible <wible at stringnet.org> on Fri, 13 May 2011 12:13:28 +0800:

> Concerning Michal's 'btw' asking about alternatives to n-grams, Nai-Lung
> Tsao and I have pursued an alternative in recent years. We call them 'hybrid
> n-grams', and use them to build a navigable model of English that we call
> StringNet (http://nav.stringnet.org). Of course we aren't the only ones or
> the first to put POS tags in n-grams, but we've tried to push the idea and
> exploit it to create a language model that is a (navigable) network
> capturing relations among patterns.
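
The linked paper describes how StringNet's hybrid n-grams are actually
extracted; purely as an illustration of the basic idea of letting each slot be
either a word form or a POS category, here is a small Python sketch, with a
hand-tagged example sentence standing in for real tagger output.

from itertools import product

# Hand-tagged sentence; in practice the tags would come from a POS tagger.
tagged = [("she", "PRP"), ("made", "VBD"), ("a", "DT"), ("mistake", "NN")]

def hybrid_ngrams(tagged_sent, n=3):
    """Yield every n-gram in which each slot is either the word or its POS tag."""
    for i in range(len(tagged_sent) - n + 1):
        window = tagged_sent[i:i + n]
        # Each slot independently contributes its word form or its tag.
        for combo in product(*window):
            yield combo

for gram in hybrid_ngrams(tagged):
    print(" ".join(gram))
# ... "made a mistake", "made DT NN", "VBD a mistake", "VBD DT NN", etc.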
> Here's a paper with details and with a description of its use for exploring
> linguistic constructions: http://www.aclweb.org/anthology/W/W10/W10-0804.pdf
>
> If you do visit http://nav.stringnet.org, submit a word the way you would
> with a concordancer. When results show up, click on everything. Clicking on
> a POS tag in any of the results gives a pop-up showing all words attested in
> that slot and their frequencies (be patient with pop-ups). Click on parent
> or child links beside the patterns in the search results to navigate 'up'
> and 'down'.
>
> Best,
> David
> On Fri, May 13, 2011 at 1:57 AM, Michal Ptaszynski <
> ptaszynski at media.eng.hokudai.ac.jp> wrote:
>
>>> In theory, though, all the books are available for free from
>>> http://books.google.com/ .  In the Google ngram interface at
>>> http://ngrams.googlelabs.com/
>>>
>>
>> Luckily it's downloadable (the n-grams), since you do NOT want to use a
>> hundred-billion-word corpus through an interface that blocks you for a day
>> after 1000 queries. :D
>>
>> BTW, I was wondering why so many people still stick to n-grams, when we
>> all know that frequent sentence patterns usually consist of non-contiguous
>> entities (words, POS tags, or however you define sentence patterns). I
>> remember Yorik tried something called "skip-grams". Apart from the huge
>> volume of skip-grams they generated (gigabytes), was this approach actually
>> useful in any way? (I mean, did it, e.g., produce an effective tool or
>> method, not just a conference paper?)
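
As a point of reference, "skip-grams" here usually means n-grams that allow up
to k skipped tokens between the elements; a minimal sketch of 2-skip-bigram
extraction makes the combinatorial blow-up concrete.

def skip_bigrams(tokens, k=2):
    """All ordered pairs (w_i, w_j) with at most k tokens skipped between them."""
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            yield (w1, tokens[j])

tokens = "insurgents killed in ongoing fighting".split()
print(list(skip_bigrams(tokens, k=2)))
# Nine pairs from a five-word sentence already, which is why the full
# skip-gram sets for a large corpus run to gigabytes.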
>> I am asking since at present I am developing a method based on a
>> combinatorial approach (extract all combinations of entities and check
>> which appear frequently).
>> It is a bit similar to skip-grams, but it does not assume any restrictions,
>> either on the number of "skips" or on the "grams". Basically, working on
>> this method reminds me of prehistoric experiments, when people would launch
>> a computer program and take two days off :)  However, the results are
>> interesting and seem promising - I could extract frequent patterns from the
>> Ainu language (a polysynthetic relict which I don't know at all) and a
>> speaker of Ainu confirmed that they actually were patterns (!).
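
Read literally, "all combinations of entities" would be every order-preserving
subset of the tokens in a sentence, counted across the corpus. A rough sketch
of that idea follows (a guess at the spirit of the method, not its actual
implementation; without a length cap the number of combinations grows
exponentially with sentence length, hence the long runs).

from collections import Counter
from itertools import combinations

def ordered_subsets(tokens, max_len=3):
    """Every order-preserving combination of tokens, with no limit on skips."""
    for r in range(2, max_len + 1):
        for combo in combinations(tokens, r):
            yield combo

sentences = [
    "the results are interesting and promising".split(),
    "the results seem promising".split(),
]

counts = Counter()
for sent in sentences:
    counts.update(ordered_subsets(sent))

# Combinations that recur across sentences surface as candidate patterns.
for pattern, freq in counts.most_common(5):
    print(freq, " ".join(pattern))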
>>
>> The wink I am making is: why not give up the old "n-grams only" approach
>> and start dealing with something more sophisticated? After all, Shannon
>> proposed n-grams over 50 years ago. I would love to see something like
>> "Google patterns".
>>
>> Michal
>>
>>
>> ----------------
>> From: Angus Grieve-Smith <grvsmth at panix.com>
>> To: corpora at uib.no
>> Date: Thu, 12 May 2011 12:37:46 -0400
>> Subject: Re: [Corpora-List] 155 *billion* (155,000,000,000) word corpus
>> of American English
>>
>> On 5/12/2011 11:15 AM, Mark Davies wrote:
>>>> Is the corpus itself or part of it available for downloading? It would be
>>>> more useful if we could process the raw text for our own purposes rather
>>>> than accessing it from a web interface.
>>> As mentioned previously, the underlying n-grams data is freely available
>>> from Google at http://ngrams.googlelabs.com/datasets (see
>>> http://creativecommons.org/licenses/by/3.0/ re. licensing).
>>
>>      When I try to use it, I get "Session expired. Click here to start new
>> session."
>>
>>      In theory, though, all the books are available for free from
>> http://books.google.com/ .  In the Google ngram interface at
>> http://ngrams.googlelabs.com/ there are links to date ranges.  If you
>> click on those you will see a date range result for the search term on the
>> Google Books website.  You can then click the "Plain text" link in the
>> upper right hand corner to see the OCRed text.  Then you can appreciate
>> how rough some of the OCR has been.
>>
>>
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


