<div dir="ltr">Dear Gang Tian<div><br></div><div>While the Google N-grams are a fascinating resource, they are not full sentences, so what you can do with them is constrained. (I'm also not sure how much non-text and duplication they include; this was a problem with the first version.) We have found triplets for corpora of up to 70b words (for English; 11b words for a range of other languages), so you may consider using those resources instead, or as well. A comparative evaluation of the two resources, seeing how they compare, would be very interesting.</div>


<div><br></div><div>The 70b corpus is from the ClueWeb09 crawl, which we cleaned, deduplicated, lemmatised, POS-tagged and parsed, and loaded into the Sketch Engine, as reported <a href="http://www.lrec-conf.org/proceedings/lrec2012/summaries/1047.html">here</a>.  To get access you need to sign a licence with Carnegie Mellon; see <a href="http://lemurproject.org/clueweb09/">http://lemurproject.org/clueweb09/</a>.  API access is also possible.</div>
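<div><br></div><div>As a rough illustration of how S-V-O triplets can be counted once a corpus has been lemmatised and parsed, here is a minimal Python sketch. It is not the Sketch Engine pipeline: the tuple layout (token id, lemma, head id, relation) and the relation names "nsubj" and "dobj" are assumptions, and the function names are hypothetical, so adapt them to whatever your parser actually outputs.</div>
<pre>
```python
from collections import Counter, defaultdict


def extract_svo(sentence):
    """Return (subject, verb, object) lemma triples from one parsed sentence.

    `sentence` is a list of (token_id, lemma, head_id, relation) tuples.
    This layout and the relation names below are assumptions for illustration.
    """
    lemma_of = {tok_id: lemma for tok_id, lemma, _, _ in sentence}
    subjects = defaultdict(list)  # head id -- subject lemmas attached to it
    objects = defaultdict(list)   # head id -- object lemmas attached to it
    for tok_id, lemma, head_id, relation in sentence:
        if relation == "nsubj":
            subjects[head_id].append(lemma)
        elif relation == "dobj":
            objects[head_id].append(lemma)

    triples = []
    for verb_id, subj_lemmas in subjects.items():
        verb = lemma_of.get(verb_id)
        if verb is None:
            continue  # head token missing from the sentence; skip it
        for subj in subj_lemmas:
            for obj in objects.get(verb_id, []):
                triples.append((subj, verb, obj))
    return triples


def count_svo(sentences):
    """Accumulate triple frequencies over an iterable of parsed sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(extract_svo(sentence))
    return counts


# Hand-built parse of "the lion chases a zebra" (lemmatised).
example = [
    (1, "the", 2, "det"),
    (2, "lion", 3, "nsubj"),
    (3, "chase", 0, "root"),
    (4, "a", 5, "det"),
    (5, "zebra", 3, "dobj"),
]

if __name__ == "__main__":
    print(count_svo([example]).most_common(5))
```
</pre>
<div>Because the counting works one sentence at a time, the same function can be fed from a streamed corpus file rather than holding the whole corpus in memory.</div>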


<div><br></div><div>Regards</div><div><br></div><div>Adam</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 13 November 2013 08:44, tg <span dir="ltr">&lt;<a href="mailto:beijixingboy@hotmail.com" target="_blank">beijixingboy@hotmail.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div><div dir="ltr"><p class="MsoNormal"><font size="3"><a name="14250a743beb59aa_OLE_LINK63"></a><a name="14250a743beb59aa_OLE_LINK58"></a><a name="14250a743beb59aa_OLE_LINK57"><span lang="EN-US" style>Hi,

dear all,</span></a></font></p><p class="MsoNormal"><font size="3"><a name="14250a743beb59aa_OLE_LINK57"><span lang="EN-US" style><br></span></a></font></p>


<p class="MsoNormal" align="left"><font size="3"><a name="14250a743beb59aa_OLE_LINK59"><span lang="EN-US">I am extremely interested in the new edition of the Google N-grams corpus. My research topic is using sentence dependency parsing to mine web-scale text corpora for semantic understanding.<u></u><u></u></span></a></font></p><p class="MsoNormal" align="left"><br></p>


<p class="MsoNormal" align="left"><font size="3"><span lang="EN-US" style="color:rgb(68,68,68)">I would like to ask the following two questions:</span><span lang="EN-US"><u></u><u></u></span></font></p><p class="MsoNormal" align="left">


<font size="3"><span lang="EN-US" style="color:rgb(68,68,68)"><br></span></font></p>


<p class="MsoNormal" align="left"><font size="3"><span lang="EN-US" style="color:rgb(68,68,68)">Q1: How can this large-scale data be used? Are there any existing tools, e.g. indexing and search tools like Lucene (which may not be suitable for data this large)? Are there any other indexing tools?</span></font></p><p class="MsoNormal" align="left"><font size="3"><span lang="EN-US" style="color:rgb(68,68,68)"><br></span></font></p>


<span lang="EN-US" style="color:rgb(68,68,68)"><font size="3">Q2: I want to extract typical dependency-relation triplets (S-V-O, e.g. "lion - chase - zebra"). Could you advise me on how to do this efficiently?</font></span><br><br><p style="line-height:21.81818199157715px;color:rgb(68,68,68);font-family:'Microsoft YaHei UI','Microsoft YaHei',宋体,Calibri,sans-serif;font-size:15.454545021057129px">


Gang Tian | PhD Student</p><p style="line-height:21.81818199157715px;color:rgb(68,68,68);font-family:'Microsoft YaHei UI','Microsoft YaHei',宋体,Calibri,sans-serif;font-size:15.454545021057129px"></p><p style="line-height:21.81818199157715px;color:rgb(68,68,68);font-family:'Microsoft YaHei UI','Microsoft YaHei',宋体,Calibri,sans-serif;font-size:15.454545021057129px">


School of Information Technologies | Faculty of Engineering & IT</p><p style="line-height:21.81818199157715px;color:rgb(68,68,68);font-family:'Microsoft YaHei UI','Microsoft YaHei',宋体,Calibri,sans-serif;font-size:15.454545021057129px">


THE UNIVERSITY OF SYDNEY</p>                                        </div></div>

<br>_______________________________________________<br>

UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>

Corpora mailing list<br>

<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

<br></blockquote></div><br><br clear="all"><div><br></div>-- <br>========================================<br><a href="http://www.kilgarriff.co.uk/" target="_blank">Adam Kilgarriff</a>                  <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a>                                             <br>


Director                                    <a href="http://www.sketchengine.co.uk/" target="_blank">Lexical Computing Ltd</a>                <br>Visiting Research Fellow                 <a href="http://leeds.ac.uk" target="_blank">University of Leeds</a>     <div>


<i><font color="#006600">Corpora for all</font></i> with <a href="http://www.sketchengine.co.uk" target="_blank">the Sketch Engine</a>                 </div><div>                        <i><a href="http://www.webdante.com" target="_blank">DANTE: <font color="#009900">a lexical database for English</font></a><font color="#009900"> </font>                 </i><div>


========================================</div></div>

</div>