[Corpora-List] Corpora Digest, Vol 47, Issue 14

David Wible wible at stringnet.org
Fri May 13 04:13:28 UTC 2011


Concerning Michal's 'BTW' question about alternatives to n-grams: Nai-Lung
Tsao and I have pursued an alternative in recent years. We call them 'hybrid
n-grams', and we use them to build a navigable model of English that we call
StringNet (http://nav.stringnet.org). Of course we aren't the only ones or
the first to put POS tags in n-grams, but we've tried to push the idea and
exploit it to create a language model that is a (navigable) network
capturing relations among patterns.
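To make the idea concrete, here is a minimal sketch (my own illustration, not StringNet's actual code) of how hybrid n-grams can be enumerated: each slot of an n-token window is filled by either the surface word or its POS tag, so a single tagged trigram yields 2^3 hybrid trigrams. The tagged sentence is a made-up example.

```python
from itertools import product

def hybrid_ngrams(tagged, n=3):
    """Enumerate hybrid n-grams: for every n-token window, fill each
    slot with either the surface word or its POS tag."""
    grams = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        # each slot contributes two choices: the word or its POS tag
        for combo in product(*[(word, pos) for word, pos in window]):
            grams.append(combo)
    return grams

# hypothetical (word, POS) pairs for one short sentence
sent = [("consider", "VB"), ("the", "DT"), ("evidence", "NN")]
for g in hybrid_ngrams(sent):
    print(" ".join(g))
```

The parent/child navigation described below then corresponds to subsumption among these grams: a gram with a POS slot is a parent of any gram where that slot is lexically filled.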
Here's a paper with details and with a description of its use for exploring
linguistic constructions: http://www.aclweb.org/anthology/W/W10/W10-0804.pdf

If you do visit http://nav.stringnet.org, submit a word the way you would
with a concordancer. When results show up, click on everything. Clicking on
a POS tag in any of the results gives a pop-up showing all words attested in
that slot and their frequencies (be patient with pop-ups). Click on parent
or child links beside the patterns in the search results to navigate 'up'
and 'down'.

Best,
David
On Fri, May 13, 2011 at 1:57 AM, Michal Ptaszynski <
ptaszynski at media.eng.hokudai.ac.jp> wrote:

>> In theory, though, all the books are available for free from
>> http://books.google.com/ .  In the Google ngram interface at
>> http://ngrams.googlelabs.com/
>
> Luckily it's downloadable (the n-grams), since you do NOT want to use a
> hundred-billion-word corpus through an interface that blocks you for a day
> after 1000 queries. :D
>
> BTW, I was wondering why so many people still stick to n-grams, when we
> all know that frequent sentence patterns usually consist of separated
> entities (words, POS tags, or however you define sentence patterns). I
> remember Yorik tried something called "skip-grams". Apart from the huge
> volume of skip-grams they generated (in GBs), was this approach actually
> useful in any way? (I mean, did it, e.g., produce an effective tool or
> method - not just a conference paper?)
> I am asking since at present I am developing a method based on a
> combinatorial approach (extract all combinations of entities and check
> which appear frequently).
> It is a bit similar to skip-grams, but it does not assume any
> restrictions, neither on the number of "skips" nor on the "grams".
> Basically, working on this method reminds me of prehistoric experiments,
> when people would launch a computer program and take two days off :)
> However, the results are interesting and seem promising - I could extract
> frequent patterns from the Ainu language (a polysynthetic relic which I
> don't know at all), and a speaker of Ainu confirmed they actually were
> patterns (!).
>
> The wink I am making says: why not give up the old "only n-grams"
> approach and start dealing with something more sophisticated? After all,
> Shannon proposed n-grams over 50 years ago. I would love to see something
> like "Google patterns".
>
> Michal
>
>
> ----------------
> Od: Angus Grieve-Smith <grvsmth at panix.com>
> Do: corpora at uib.no
> Data: Thu, 12 May 2011 12:37:46 -0400
> Temat: Re: [Corpora-List] 155 *billion* (155, 000, 000, 000) word corpus
> of American English
>
> On 5/12/2011 11:15 AM, Mark Davies wrote:
> >> Is the corpus itself or part of it available for downloading? It would
> >> be more useful if we could process the raw text for our own purposes
> >> rather than accessing it from a web interface.
> > As mentioned previously, the underlying n-grams data is freely available
> > from Google at http://ngrams.googlelabs.com/datasets (see
> > http://creativecommons.org/licenses/by/3.0/ re. licensing).
>
>      When I try to use it, I get "Session expired. Click here to start new
> session."
>
>      In theory, though, all the books are available for free from
> http://books.google.com/ .  In the Google ngram interface at
> http://ngrams.googlelabs.com/ there are links to date ranges.  If you
> click on those you will see a date range result for the search term on the
> Google Books website.  You can then click the "Plain text" link in the
> upper right-hand corner to see the OCRed text.  Then you can appreciate
> how rough some of the OCR has been.
>
>
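Michal's combinatorial idea above - all order-preserving combinations of entities, i.e. skip-grams with no bound on the number of skips or grams - can be sketched roughly as follows. The length cap, the frequency threshold, and the toy corpus are my own additions, there only to keep the exponential blowup he alludes to in check; this is a naive illustration, not his actual method.

```python
from collections import Counter
from itertools import combinations

def all_ordered_patterns(tokens, max_len=4):
    """All order-preserving token combinations (skip-grams with
    unrestricted gaps), capped at max_len slots to limit the 2^n blowup."""
    pats = []
    for k in range(2, max_len + 1):
        pats.extend(combinations(tokens, k))
    return pats

def frequent_patterns(corpus, min_count=2, max_len=4):
    """Count every pattern across the corpus and keep the frequent ones."""
    counts = Counter()
    for sent in corpus:
        counts.update(all_ordered_patterns(sent, max_len))
    return {p: c for p, c in counts.items() if c >= min_count}

# toy corpus: the pattern ("as", ..., "as") survives any intervening word
corpus = [
    ["as", "big", "as", "a", "house"],
    ["as", "small", "as", "a", "mouse"],
]
print(frequent_patterns(corpus))
```

Even on this toy input the discontinuous pattern ("as", "as", "a") is recovered, while one-off pairs like ("as", "big") fall below the threshold - which is the behavior plain contiguous n-grams cannot give you.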
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


More information about the Corpora mailing list