[Corpora-List] Questions for Google syntactic N-grams corpus

Adam Kilgarriff adam at lexmasterclass.com
Wed Nov 13 09:44:49 UTC 2013


Dear Gang Tian

While N-grams is a fascinating resource, it does not contain full sentences
(and I'm not sure how much non-text and duplication it includes; this was a
problem with the first version), so what you can do with it is constrained.
We have found triplets for corpora of up to 70b words (for English; 11b
words for a range of other languages), so you may consider using those
resources instead, or as well. (A comparative evaluation using the two
resources, seeing how they compare, would be very interesting.)
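
For the S-V-O question, a minimal sketch of pulling subject-verb-object
triplets out of syntactic N-gram records might look like the following. It
assumes a tab-separated record layout (head word, N-gram, total count,
per-year counts) in which each token is written as
word/POS/dependency-label/head-index, and it assumes Stanford-style labels
(nsubj, dobj). The sample record and the function names are made up for
illustration, so check them against the dataset's own documentation.

```python
from collections import Counter


def parse_token(tok):
    # Assumed token layout: word/POS/dep-label/head-index.
    # Words may themselves contain "/", so split from the right.
    word, pos, dep, head = tok.rsplit("/", 3)
    return word, pos, dep, int(head)


def svo_from_record(line):
    """Yield (subject, verb, object, count) tuples from one record."""
    fields = line.rstrip("\n").split("\t")
    ngram, total = fields[1], int(fields[2])
    tokens = [parse_token(t) for t in ngram.split(" ")]
    # Head indices are assumed to be 1-based, with 0 marking the root.
    for i, (word, pos, dep, head) in enumerate(tokens, start=1):
        if not pos.startswith("VB"):
            continue
        subjects = [w for w, _, d, h in tokens if d == "nsubj" and h == i]
        objects = [w for w, _, d, h in tokens if d == "dobj" and h == i]
        for s in subjects:
            for o in objects:
                yield s, word, o, total


def count_svo(lines):
    """Sum counts per (subject, verb, object) over an iterable of records."""
    counts = Counter()
    for line in lines:
        for s, v, o, n in svo_from_record(line):
            counts[(s, v, o)] += n
    return counts


# Hypothetical record, not taken from the real data.
sample = [
    "chased\tlions/NNS/nsubj/2 chased/VBD/ROOT/0 zebras/NNS/dobj/2"
    "\t12\t2000,5\t2001,7"
]
svo_counts = count_svo(sample)
```

Because count_svo takes any iterable of lines, the files can be streamed
one at a time without first building an index. The words come out as
surface forms, so a lemmatiser would still be needed to arrive at
"lion - chase - zebra".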

The 70b corpus is from the ClueWeb09 crawl, which we cleaned,
deduplicated, lemmatised, POS-tagged, and parsed, and loaded into the
Sketch Engine, as reported here
<http://www.lrec-conf.org/proceedings/lrec2012/summaries/1047.html>.
To get access you need to sign a licence with Carnegie Mellon; see
http://lemurproject.org/clueweb09/. API access is also possible.

Regards

Adam




On 13 November 2013 08:44, tg <beijixingboy at hotmail.com> wrote:

> Hi, dear all,
>
>
> I am extremely interested in the new edition of the Google N-grams corpus.
> My research topic is using sentence dependency parsing to mine web-scale
> text corpora for semantic understanding.
>
>
> I would like to ask two questions:
>
>
> Q1: How can this large-scale data be used? Are there any existing tools,
> e.g. indexing and search tools like Lucene (which may not cope with data
> this big)? Are there any other indexing tools?
>
>
> Q2: I want to extract typical dependency triplets (S-V-O, e.g.
> "lion - chase - zebra"). Could you advise me on how to do this
> efficiently?
>
> Gang Tian | Phd Student
>
> School of Information Technologies | Faculty of Engineering & IT
>
> THE UNIVERSITY OF SYDNEY
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>


-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>

*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================