<div dir="ltr">Dear Gang Tian<div><br></div><div>While N-grams is a fascinating resource, it does not contain full sentences (and I'm not sure how much non-text and duplication it includes; this was a problem with the first version), so what you can do with it is constrained. We have found triplets for corpora of up to 70b words (for English; 11b words for a range of other languages), so you may consider using those resources instead or as well. (A comparative evaluation of the two resources would be very interesting.)</div>
<div><br></div><div>The 70b corpus is from the ClueWeb09 crawl, which we cleaned, deduplicated, lemmatised, POS-tagged, and parsed, and loaded into the Sketch Engine, as reported <a href="http://www.lrec-conf.org/proceedings/lrec2012/summaries/1047.html">here</a>. To get access, you need to sign a licence with Carnegie Mellon; see <a href="http://lemurproject.org/clueweb09/">http://lemurproject.org/clueweb09/</a>. API access is also possible.</div>
<div><br></div><div>Regards</div><div><br></div><div>Adam</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 13 November 2013 08:44, tg <span dir="ltr">&lt;<a href="mailto:beijixingboy@hotmail.com" target="_blank">beijixingboy@hotmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><div dir="ltr"><p class="MsoNormal"><font size="3"><a name="14250a743beb59aa_OLE_LINK63"></a><a name="14250a743beb59aa_OLE_LINK58"></a><a name="14250a743beb59aa_OLE_LINK57"><span lang="EN-US" style>Hi,
dear all,</span></a></font></p><p class="MsoNormal"><font size="3"><a name="14250a743beb59aa_OLE_LINK57"><span lang="EN-US" style><br></span></a></font></p>
<p class="MsoNormal" align="left"><font size="3"><a name="14250a743beb59aa_OLE_LINK59"><span lang="EN-US">I
am extremely interested in the new edition of the Google N-grams
corpus. My research topic is using sentence dependency parsing to
mine web-scale text corpora for semantic understanding.<u></u><u></u></span></a></font></p><p class="MsoNormal" align="left"><br></p>
<p class="MsoNormal" align="left"><font size="3"><span lang="EN-US" style="color:rgb(68,68,68)">I would like to ask the following two questions:</span><span lang="EN-US"><u></u><u></u></span></font></p><p class="MsoNormal" align="left">
<font size="3"><span lang="EN-US" style="color:rgb(68,68,68)"><br></span></font></p>
<p class="MsoNormal" align="left"><font size="3"><span lang="EN-US" style="color:rgb(68,68,68)">Q1: How can this large-scale data be used? Are there any existing
tools, e.g. indexing and search tools like Lucene (which may not be suitable for data
this big)? Are there any other indexing tools?</span></font></p><p class="MsoNormal" align="left"><font size="3"><span lang="EN-US" style="color:rgb(68,68,68)"><br></span></font></p>
<span lang="EN-US" style="color:rgb(68,68,68)"><font size="3">Q2: I want to extract typical dependency-relation triplets
(S-V-O, e.g. "lion - chase - zebra"). Could you advise me on
how to do this efficiently?</font></span><br><br><p style="line-height:21.81818199157715px;color:rgb(68,68,68);font-family:'Microsoft YaHei UI','Microsoft YaHei',宋体,Calibri,sans-serif;font-size:15.454545021057129px">
Gang Tian | PhD Student</p><p style="line-height:21.81818199157715px;color:rgb(68,68,68);font-family:'Microsoft YaHei UI','Microsoft YaHei',宋体,Calibri,sans-serif;font-size:15.454545021057129px"></p><p style="line-height:21.81818199157715px;color:rgb(68,68,68);font-family:'Microsoft YaHei UI','Microsoft YaHei',宋体,Calibri,sans-serif;font-size:15.454545021057129px">
School of Information Technologies | Faculty of Engineering &amp; IT</p><p style="line-height:21.81818199157715px;color:rgb(68,68,68);font-family:'Microsoft YaHei UI','Microsoft YaHei',宋体,Calibri,sans-serif;font-size:15.454545021057129px">
THE UNIVERSITY OF SYDNEY</p> </div></div>
<br>_______________________________________________<br>
UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br>========================================<br><a href="http://www.kilgarriff.co.uk/" target="_blank">Adam Kilgarriff</a> <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a> <br>
Director <a href="http://www.sketchengine.co.uk/" target="_blank">Lexical Computing Ltd</a> <br>Visiting Research Fellow <a href="http://leeds.ac.uk" target="_blank">University of Leeds</a> <div>
<i><font color="#006600">Corpora for all</font></i> with <a href="http://www.sketchengine.co.uk" target="_blank">the Sketch Engine</a> </div><div> <i><a href="http://www.webdante.com" target="_blank">DANTE: <font color="#009900">a lexical database for English</font></a><font color="#009900"> </font> </i><div>
========================================</div></div>
</div>