<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#000000">

<br>

one may have a look into this paper if interested in suffix array stuff

... didn't expect so many papers on it in recent years, but there are

some smart algorithms for its implementation.<br>

<br>

best,<br>

kit<br>

<br>

<br>

<table border="0" cellpadding="0" cellspacing="0">

  <tbody>

    <tr>

      <td colspan="3"><a class="medium-text"

 href="citation.cfm?id=1242471.1242472&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920"

 target="_self">A taxonomy of suffix array construction algorithms</a>

      <div class="authors"><a

 href="author_page.cfm?id=81100272116&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920">Simon

J. Puglisi</a>, <a

 href="author_page.cfm?id=81338490986&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920">W.

F. Smyth</a>, <a

 href="author_page.cfm?id=81100229960&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920">Andrew

H. Turpin</a> </div>

      </td>

    </tr>

    <tr valign="top">

      <td class="small-text" nowrap="nowrap">July 2007 </td>

      <td><br>

      </td>

      <td>

      <div class="addinfo"><strong>Computing Surveys (CSUR)</strong> <font

 size="-2">, Volume 39 Issue 2</font> </div>

      </td>

    </tr>

    <tr valign="top">

      <td class="small-text" colspan="3" nowrap="nowrap"><strong>Publisher:</strong> ACM

      </td>

    </tr>

    <tr valign="top">

      <td class="smaller-text" colspan="3" valign="top">

      <table border="0" cellpadding="0" cellspacing="0" width="100%">

        <tbody>

          <tr valign="top">

          </tr>

        </tbody><colgroup><col width="35%"><col width="65%"></colgroup><tbody>

          <tr>

            <td class="smaller-text">

            <table border="0" cellpadding="0">

              <tbody>

                <tr valign="top">

                  <td class="smaller-text" nowrap="nowrap">Full text

available:</td>

                  <td class="smaller-text" nowrap="nowrap" valign="top"><a

 title="Pdf"

 href="ft_gateway.cfm?id=1242472&type=pdf&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920"

 target="_blank"><img style="margin-right: 2px;" alt="Pdf"

 src="cid:part1.08060502.05060707@cityu.edu.hk" align="middle"

 border="0">Pdf</a> (381.17 KB) </td>

                </tr>

              </tbody>

            </table>

            </td>

            <td class="smaller-text">

            <table border="0" cellpadding="0">

              <tbody>

                <tr valign="top">

                  <td class="smaller-text" nowrap="nowrap">Additional

Information:</td>

                  <td class="smaller-text"><img alt=""

 src="cid:part2.03040907.07080202@cityu.edu.hk" align="texttop"

 border="0" height="16" width="1"><a

 href="citation.cfm?id=1242471.1242472&jmp=cit&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#CIT"

 target="_self">full citation</a>, <a

 href="citation.cfm?id=1242471.1242472&jmp=abstract&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#abstract"

 target="_self">abstract</a>, <a

 href="citation.cfm?id=1242471.1242472&jmp=references&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#references"

 target="_self">references</a>, <a

 href="citation.cfm?id=1242471.1242472&jmp=indexterms&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#indexterms"

 target="_self">index terms</a></td>

                </tr>

              </tbody>

            </table>

            </td>

          </tr>

          <tr valign="top">

            <td class="small-text" style="padding-top: 5px;" colspan="3"

 nowrap="nowrap"><strong>Bibliometrics</strong>:  Downloads (6 Weeks):

53,   Downloads (12 Months): 649,   Citation Count: 3</td>

          </tr>

          <tr>

            <td height="5"><img alt=""

 src="cid:part3.06080604.04000801@cityu.edu.hk" border="0" height="1"

 width="1"></td>

          </tr>

        </tbody>

      </table>

      </td>

    </tr>

    <tr valign="top">

      <td colspan="3">

      <div class="abstract2"><br>

      <p>In 1990, Manber and Myers proposed suffix arrays as a

space-saving alternative to suffix trees and described the first

algorithms for suffix array construction and use. Since that time, and

especially in the last few years, suffix array construction ...</p>

      </div>

      </td>

    </tr>

  </tbody>

</table>

<br>

<br>

<br>

Alexandre Rafalovitch wrote:

<blockquote

 cite="mid9194dc590810311800i4128db6bpf482c67e0d46e959@mail.gmail.com"

 type="cite">

  <pre wrap="">One of the interesting papers around suffix arrays is by Chunyu KIT:

"The Virtual Corpus approach to deriving n-gram statistics from large

scale corpora"

<a class="moz-txt-link-freetext" href="http://personal.cityu.edu.hk/~ctckit/papers/vc.pdf">http://personal.cityu.edu.hk/~ctckit/papers/vc.pdf</a>

I have some (scary) java code based around those concepts that can do

statistical analysis on n-grams with n above 140 and does look at

(n-1)-grams and (n+1)-grams with the same length.

If that's of interest, I would be happy to discuss this further in a

direct email (to not bother the list).

Regards,

    Alex.

Personal blog: <a class="moz-txt-link-freetext" href="http://blog.outerthoughts.com/">http://blog.outerthoughts.com/</a>

Research group: <a class="moz-txt-link-freetext" href="http://www.clt.mq.edu.au/Research/">http://www.clt.mq.edu.au/Research/</a>

On Tue, Oct 28, 2008 at 11:21 AM, Adam Lopez <a class="moz-txt-link-rfc2396E" href="mailto:alopez@inf.ed.ac.uk"><alopez@inf.ed.ac.uk></a> wrote:

  </pre>

  <blockquote type="cite">

    <blockquote type="cite">

      <pre wrap="">I was wondering whether anybody is aware of ideas and/or automated

processes to reduce n-gram output by solving the common problem that

shorter n-grams can be fragments of larger structures (e.g. the 5-

gram 'at the end of the' as part of the 6-gram 'at the end of the

day')

I am only aware of Paul Rayson's work on c-grams (collapsed-grams).

      </pre>

    </blockquote>

    <pre wrap="">Suffix trees, suffix arrays, and their relatives are compact data

structures following exactly the intuition that smaller strings are

substrings of larger ones. They represent all possible n-grams

(without limit on n) of a text in space proportional to the length of

the text and support efficient retrieval, counting, and other queries

on substrings of the text; there is a vast literature on their various

applications (and theory linking them to compressibility, etc.).

    </pre>

  </blockquote>

  <pre wrap=""><!---->

_______________________________________________

Corpora mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>

<a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>

  </pre>

</blockquote>

<br>

<br>

<pre class="moz-signature" cols="72">-- 

Chunyu Kit, PhD

Associate Professor, Computational Linguistics

Dept. of Chinese, Translation & Linguistics

City University of Hong Kong

83 Tat Chee Ave., Kowloon

<a class="moz-txt-link-freetext" href="http://personal.cityu.edu.hk/~ctckit/">http://personal.cityu.edu.hk/~ctckit/</a>

Tel: (+852)2788 9310 (O), 9380 1738 (M)

     (+86)135 3948 3937 (China Mobile)

Fax: (+852)2788 8706, 2788 7320</pre>

</body>

</html>