<div>Concerning Michal's 'btw' asking about alternatives to n-grams, Nai-Lung Tsao and I have pursued one in recent years. We call them 'hybrid n-grams', and use them to build a navigable model of English that we call StringNet (<a href="http://nav.stringnet.org/">http://nav.stringnet.org</a>). Of course we aren't the only ones or the first to put POS tags in n-grams, but we've tried to push the idea and exploit it to create a language model that is a (navigable) network capturing relations among patterns.</div>
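<div>To make the idea concrete, here is a minimal sketch, not StringNet's actual code. It assumes a hybrid n-gram is an n-gram in which each slot holds either the word itself or its POS tag. The function name hybrid_ngrams, the bracketed tag labels, and the decision to skip all-POS patterns are assumptions made only for illustration.</div>

```python
from itertools import product


def hybrid_ngrams(tagged, n):
    """Yield hybrid n-grams from a list of (word, pos) pairs.

    Each slot is filled by either the word or its POS tag. Patterns made
    only of POS tags are skipped so that every pattern keeps at least one
    lexical anchor (an assumption of this sketch).
    """
    for start in range(len(tagged) - n + 1):
        window = tagged[start:start + n]
        # choice[i] == 0 keeps the word in slot i, 1 substitutes its tag
        for choice in product((0, 1), repeat=n):
            if all(choice):
                continue
            yield tuple(pair[c] for pair, c in zip(window, choice))


# Hypothetical hand-tagged example; the tag labels are illustrative only.
sentence = [
    ("from", "[prep]"),
    ("my", "[pron]"),
    ("point", "[noun]"),
    ("of", "[prep]"),
    ("view", "[noun]"),
]

for pattern in hybrid_ngrams(sentence, 3):
    print(" ".join(pattern))
```

<div>Counting such tuples over a whole corpus (for example with collections.Counter) would give pattern frequencies, and a pattern with more POS slots can be treated as a 'parent' of the more lexically specific patterns it covers.</div>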


<div><em></em> </div>

<div>Here's a paper with details and with a description of its use for exploring linguistic constructions: <a href="http://www.aclweb.org/anthology/W/W10/W10-0804.pdf">http://www.aclweb.org/anthology/W/W10/W10-0804.pdf</a></div>


<div> </div>

<div>If you do visit <a href="http://nav.stringnet.org/">http://nav.stringnet.org</a>, submit a word the way you would with a concordancer. When results show up, click on everything. Clicking on a POS tag in any of the results gives a pop-up showing all words attested in that slot and their frequencies (be patient with pop-ups). Click on parent or child links beside the patterns in the search results to navigate 'up' and 'down'.</div>


<div> </div>

<div>Best,</div>

<div>David</div>

<div>&nbsp;</div>

<div class="gmail_quote">On Fri, May 13, 2011 at 1:57 AM, Michal Ptaszynski <span dir="ltr">&lt;<a href="mailto:ptaszynski@media.eng.hokudai.ac.jp">ptaszynski@media.eng.hokudai.ac.jp</a>&gt;</span> wrote:<br>

<blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">

<blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">In theory, though, all the books are available for free from <a href="http://books.google.com/" target="_blank">http://books.google.com/</a> .  In the Google ngram interface at <a href="http://ngrams.googlelabs.com/" target="_blank">http://ngrams.googlelabs.com/</a><br>

</blockquote><br>Luckily it's downloadable (the n-grams), since you do NOT want to use a<br>hundred-billion word corpus on an interface which blocks you for a day<br>after 1000 queries. :D<br><br>BTW, I was wondering why so many people still stick to n-grams, when we<br>

all know that frequent sentence patterns usually consist of separated<br>entities (words, pos, or however you define sentence patterns). I remember<br>Yorik tried something called "skip grams". Apart from a huge number of<br>

skip-grams they generated (in GBs), was this approach actually useful in<br>any way? (I mean, e.g., produced an effective tool or method - not just a<br>conference paper).<br>I am asking, since at present I am developing a method based on<br>

a combinatorial approach (extract all combinations of entities and check<br>which appear frequently).<br>It is a bit similar to skip-grams, but it does not assume any<br>restrictions, either on the number of "skips" or on the number of "grams". Basically,<br>

working on this method reminds me of prehistoric experiments, when people<br>were launching a computer program and taking two days off :)  However, the<br>results are interesting and seem promising - I could extract frequent<br>

patterns from the Ainu language (a polysynthetic relict which I don't know at<br>all) and the person speaking Ainu said they actually were patterns (!).<br><br>The wink I am making says - Why not give up the old "only n-gram" approach<br>

and start dealing with something more sophisticated? After all, Shannon<br>proposed n-grams over 50 years ago. I would love to see something like<br>"Google patterns".<br><br>Michal<br><br><br>----------------<br>

From: Angus Grieve-Smith &lt;<a href="mailto:grvsmth@panix.com" target="_blank">grvsmth@panix.com</a>&gt;<br>To: <a href="mailto:corpora@uib.no" target="_blank">corpora@uib.no</a><br>Date: Thu, 12 May 2011 12:37:46 -0400<br>

Subject: Re: [Corpora-List] 155 *billion* (155, 000, 000, 000) word corpus<br>of American English<br><br>On 5/12/2011 11:15 AM, Mark Davies wrote:<br>Is the corpus itself or part of it available for downloading? It would be<br>

more useful if we could process the raw text for our own purpose rather<br>than accessing it from a web interface.<br>As mentioned previously, the underlying n-grams data is freely available<br> from Google at <a href="http://ngrams.googlelabs.com/datasets" target="_blank">http://ngrams.googlelabs.com/datasets</a> (see<br>

<a href="http://creativecommons.org/licenses/by/3.0/" target="_blank">http://creativecommons.org/licenses/by/3.0/</a> re. licensing).<br><br>     When I try to use it, I get "Session expired. Click here to start new<br>

session."<br><br>     In theory, though, all the books are available for free from<br><a href="http://books.google.com/" target="_blank">http://books.google.com/</a> .  In the Google ngram interface at<br><a href="http://ngrams.googlelabs.com/" target="_blank">http://ngrams.googlelabs.com/</a> there are links to date ranges.  If you<br>

click on those you will see a date range result for the search term on the<br>Google Books website.  You can then click the "Plain text" link in the<br>upper right hand corner to see the OCRed text.  Then you can appreciate<br>

how rough some of the OCR has been.<br><br>_______________________________________________<br>UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>

Corpora mailing list<br><a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br><a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br></blockquote></div>

<br>