<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
  <title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
<br>
one may have a look into this paper if interested in suffix array stuff
... didn't expect so many papers on it in recent years, but there are
some smart algorithms for its implementation.<br>
<br>
best,<br>
kit<br>
<br>
<br>
<table border="0" cellpadding="0" cellspacing="0">
  <tbody>
    <tr>
      <td colspan="3"><a class="medium-text"
 href="citation.cfm?id=1242471.1242472&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920"
 target="_self">A taxonomy of suffix array construction algorithms</a>
      <div class="authors"><a
 href="author_page.cfm?id=81100272116&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920">Simon
J. Puglisi</a>, <a
 href="author_page.cfm?id=81338490986&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920">W.
F. Smyth</a>, <a
 href="author_page.cfm?id=81100229960&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920">Andrew
H. Turpin</a> </div>
      </td>
    </tr>
    <tr valign="top">
      <td class="small-text" nowrap="nowrap">July 2007 </td>
      <td><br>
      </td>
      <td>
      <div class="addinfo"><strong>Computing Surveys (CSUR)</strong> <font
 size="-2">, Volume 39 Issue 2</font> </div>
      </td>
    </tr>
    <tr valign="top">
      <td class="small-text" colspan="3" nowrap="nowrap"><strong>Publisher:</strong> ACM
      </td>
    </tr>
    <tr valign="top">
      <td class="smaller-text" colspan="3" valign="top">
      <table border="0" cellpadding="0" cellspacing="0" width="100%">
        <tbody>
          <tr valign="top">
          </tr>
        </tbody><colgroup><col width="35%"><col width="65%"></colgroup><tbody>
          <tr>
            <td class="smaller-text">
            <table border="0" cellpadding="0">
              <tbody>
                <tr valign="top">
                  <td class="smaller-text" nowrap="nowrap">Full text
available:</td>
                  <td class="smaller-text" nowrap="nowrap" valign="top"><a
 title="Pdf"
 href="ft_gateway.cfm?id=1242472&type=pdf&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920"
 target="_blank"><img style="margin-right: 2px;" alt="Pdf"
 src="cid:part1.08060502.05060707@cityu.edu.hk" align="middle"
 border="0">Pdf</a> (381.17 KB) </td>
                </tr>
              </tbody>
            </table>
            </td>
            <td class="smaller-text">
            <table border="0" cellpadding="0">
              <tbody>
                <tr valign="top">
                  <td class="smaller-text" nowrap="nowrap">Additional
Information:</td>
                  <td class="smaller-text"><img alt=""
 src="cid:part2.03040907.07080202@cityu.edu.hk" align="texttop"
 border="0" height="16" width="1"><a
 href="citation.cfm?id=1242471.1242472&jmp=cit&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#CIT"
 target="_self">full citation</a>, <a
 href="citation.cfm?id=1242471.1242472&jmp=abstract&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#abstract"
 target="_self">abstract</a>, <a
 href="citation.cfm?id=1242471.1242472&jmp=references&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#references"
 target="_self">references</a>, <a
 href="citation.cfm?id=1242471.1242472&jmp=indexterms&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#indexterms"
 target="_self">index terms</a></td>
                </tr>
              </tbody>
            </table>
            </td>
          </tr>
          <tr valign="top">
            <td class="small-text" style="padding-top: 5px;" colspan="3"
 nowrap="nowrap"><strong>Bibliometrics</strong>:  Downloads (6 Weeks):
53,   Downloads (12 Months): 649,   Citation Count: 3</td>
          </tr>
          <tr>
            <td height="5"><img alt=""
 src="cid:part3.06080604.04000801@cityu.edu.hk" border="0" height="1"
 width="1"></td>
          </tr>
        </tbody>
      </table>
      </td>
    </tr>
    <tr valign="top">
      <td colspan="3">
      <div class="abstract2"><br>
      <p>In 1990, Manber and Myers proposed suffix arrays as a
space-saving alternative to suffix trees and described the first
algorithms for suffix array construction and use. Since that time, and
especially in the last few years, suffix array construction ...</p>
      </div>
      </td>
    </tr>
  </tbody>
</table>
<br>
<br>
<br>
Alexandre Rafalovitch wrote:
<blockquote
 cite="mid9194dc590810311800i4128db6bpf482c67e0d46e959@mail.gmail.com"
 type="cite">
  <pre wrap="">One of the interesting papers around suffix arrays is by Chunyu KIT:
"The Virtual Corpus approach to deriving n-gram statistics from large
scale corpora"
<a class="moz-txt-link-freetext" href="http://personal.cityu.edu.hk/~ctckit/papers/vc.pdf">http://personal.cityu.edu.hk/~ctckit/papers/vc.pdf</a>

I have some (scary) java code based around those concepts that can do
statistical analysis on n-grams with n above 140 and does look at
(n-1)-grams and (n+1)-grams with the same length.

If that's of interest, I would be happy to discuss this further in a
direct email (to not bother the list).

Regards,
    Alex.

Personal blog: <a class="moz-txt-link-freetext" href="http://blog.outerthoughts.com/">http://blog.outerthoughts.com/</a>
Research group: <a class="moz-txt-link-freetext" href="http://www.clt.mq.edu.au/Research/">http://www.clt.mq.edu.au/Research/</a>



On Tue, Oct 28, 2008 at 11:21 AM, Adam Lopez <a class="moz-txt-link-rfc2396E" href="mailto:alopez@inf.ed.ac.uk"><alopez@inf.ed.ac.uk></a> wrote:
  </pre>
  <blockquote type="cite">
    <blockquote type="cite">
      <pre wrap="">I was wondering whether anybody is aware of ideas and/or automated
processes to reduce n-gram output by solving the common problem that
shorter n-grams can be fragments of larger structures (e.g. the 5-
gram 'at the end of the' as part of the 6-gram 'at the end of the
day')

I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
      </pre>
    </blockquote>
    <pre wrap="">Suffix trees, suffix arrays, and their relatives are compact data
structures following exactly the intuition that smaller strings are
substrings of larger ones. They represent all possible n-grams
(without limit on n) of a text in space proportional to the length of
the text and support efficient retrieval, counting, and other queries
on substrings of the text; there is a vast literature on their various
applications (and theory linking them to compressibility, etc.).
    </pre>
  </blockquote>
  <pre wrap=""><!---->
_______________________________________________
Corpora mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>
<a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>

  </pre>
</blockquote>
<br>
<br>
<pre class="moz-signature" cols="72">-- 
Chunyu Kit, PhD
Associate Professor, Computational Linguistics

Dept. of Chinese, Translation & Linguistics
City University of Hong Kong
83 Tat Chee Ave., Kowloon

<a class="moz-txt-link-freetext" href="http://personal.cityu.edu.hk/~ctckit/">http://personal.cityu.edu.hk/~ctckit/</a>
Tel: (+852)2788 9310 (O), 9380 1738 (M)
     (+86)135 3948 3937 (China Mobile)
Fax: (+852)2788 8706, 2788 7320</pre>
</body>
</html>