<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
<br>
Anyone interested in suffix arrays may want to have a look at this paper.
I didn't expect so many papers on the topic in recent years, but there are
some smart algorithms for implementing them.<br>
<br>
best,<br>
kit<br>
<br>
<br>
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td colspan="3"><a class="medium-text"
href="citation.cfm?id=1242471.1242472&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920"
target="_self">A taxonomy of suffix array construction algorithms</a>
<div class="authors"><a
href="author_page.cfm?id=81100272116&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920">Simon
J. Puglisi</a>, <a
href="author_page.cfm?id=81338490986&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920">W.
F. Smyth</a>, <a
href="author_page.cfm?id=81100229960&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920">Andrew
H. Turpin</a> </div>
</td>
</tr>
<tr valign="top">
<td class="small-text" nowrap="nowrap">July 2007 </td>
<td><br>
</td>
<td>
<div class="addinfo"><strong>Computing Surveys (CSUR)</strong> <font
size="-2">, Volume 39 Issue 2</font> </div>
</td>
</tr>
<tr valign="top">
<td class="small-text" colspan="3" nowrap="nowrap"><strong>Publisher:</strong> ACM
</td>
</tr>
<tr valign="top">
<td class="smaller-text" colspan="3" valign="top">
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr valign="top">
</tr>
</tbody><colgroup><col width="35%"><col width="65%"></colgroup><tbody>
<tr>
<td class="smaller-text">
<table border="0" cellpadding="0">
<tbody>
<tr valign="top">
<td class="smaller-text" nowrap="nowrap">Full text
available:</td>
<td class="smaller-text" nowrap="nowrap" valign="top"><a
title="Pdf"
href="ft_gateway.cfm?id=1242472&type=pdf&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920"
target="_blank">Pdf</a> (381.17 KB) </td>
</tr>
</tbody>
</table>
</td>
<td class="smaller-text">
<table border="0" cellpadding="0">
<tbody>
<tr valign="top">
<td class="smaller-text" nowrap="nowrap">Additional
Information:</td>
<td class="smaller-text"><a
href="citation.cfm?id=1242471.1242472&jmp=cit&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#CIT"
target="_self">full citation</a>, <a
href="citation.cfm?id=1242471.1242472&jmp=abstract&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#abstract"
target="_self">abstract</a>, <a
href="citation.cfm?id=1242471.1242472&jmp=references&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#references"
target="_self">references</a>, <a
href="citation.cfm?id=1242471.1242472&jmp=indexterms&coll=GUIDE&dl=ACM&CFID=8957818&CFTOKEN=40985920#indexterms"
target="_self">index terms</a></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr valign="top">
<td class="small-text" style="padding-top: 5px;" colspan="3"
nowrap="nowrap"><strong>Bibliometrics</strong>: Downloads (6 Weeks):
53, Downloads (12 Months): 649, Citation Count: 3</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr valign="top">
<td colspan="3">
<div class="abstract2"><br>
<p>In 1990, Manber and Myers proposed suffix arrays as a
space-saving alternative to suffix trees and described the first
algorithms for suffix array construction and use. Since that time, and
especially in the last few years, suffix array construction ...</p>
</div>
</td>
</tr>
</tbody>
</table>
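<br>
For readers new to the data structure, here is a minimal Python sketch of
what a suffix array is. It is not one of the algorithms from the survey: it
simply sorts the suffix start positions, which costs far more than the
specialised construction algorithms. The function name build_suffix_array is
made up for this example.<br>
<pre wrap="">
```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order."""
    # Naive construction: sort positions by the suffix that starts there.
    # Every comparison may scan many characters, so this is for illustration
    # only and would be far too slow on a large corpus.
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = build_suffix_array("banana")
print(sa)  # [5, 3, 1, 0, 4, 2]
```
</pre>
The output lists the suffixes "a", "ana", "anana", "banana", "na" and "nana"
by their start positions.<br>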
<br>
<br>
<br>
Alexandre Rafalovitch wrote:
<blockquote
cite="mid9194dc590810311800i4128db6bpf482c67e0d46e959@mail.gmail.com"
type="cite">
<pre wrap="">One of the interesting papers around suffix arrays is by Chunyu KIT:
"The Virtual Corpus approach to deriving n-gram statistics from large
scale corpora"
<a class="moz-txt-link-freetext" href="http://personal.cityu.edu.hk/~ctckit/papers/vc.pdf">http://personal.cityu.edu.hk/~ctckit/papers/vc.pdf</a>
I have some (scary) Java code based around those concepts that can do
statistical analysis on n-grams with n above 140 and does look at
(n-1)-grams and (n+1)-grams with the same length.
If that's of interest, I would be happy to discuss this further in a
direct email (to not bother the list).
Regards,
Alex.
Personal blog: <a class="moz-txt-link-freetext" href="http://blog.outerthoughts.com/">http://blog.outerthoughts.com/</a>
Research group: <a class="moz-txt-link-freetext" href="http://www.clt.mq.edu.au/Research/">http://www.clt.mq.edu.au/Research/</a>
On Tue, Oct 28, 2008 at 11:21 AM, Adam Lopez <a class="moz-txt-link-rfc2396E" href="mailto:alopez@inf.ed.ac.uk">&lt;alopez@inf.ed.ac.uk&gt;</a> wrote:
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">I was wondering whether anybody is aware of ideas and/or automated
processes to reduce n-gram output by solving the common problem that
shorter n-grams can be fragments of larger structures (e.g. the 5-gram
'at the end of the' as part of the 6-gram 'at the end of the day')
I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
</pre>
</blockquote>
<pre wrap="">Suffix trees, suffix arrays, and their relatives are compact data
structures following exactly the intuition that smaller strings are
substrings of larger ones. They represent all possible n-grams
(without limit on n) of a text in space proportional to the length of
the text and support efficient retrieval, counting, and other queries
on substrings of the text; there is a vast literature on their various
applications (and theory linking them to compressibility, etc.).
</pre>
</blockquote>
<pre wrap=""><!---->
_______________________________________________
Corpora mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>
<a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>
</pre>
</blockquote>
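<br>
To make the quoted point about counting concrete, here is a toy Python
sketch over word tokens. The helper names build_token_suffix_array and
count_ngram are hypothetical, the construction is a naive sort rather than
any published algorithm, and materialising the truncated suffixes is done
only to keep the binary search easy to read.<br>
<pre wrap="">
```python
from bisect import bisect_left, bisect_right


def build_token_suffix_array(tokens):
    """Return suffix start positions of a token list, in sorted order."""
    # Naive construction by sorting; fine for a toy example only.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, sa, ngram):
    """Count occurrences of ngram (a list of tokens) by binary search."""
    n = len(ngram)
    # Suffixes cut to length n stay in sorted order because the full
    # suffixes are sorted, so all matches form one contiguous run.
    prefixes = [tokens[i:i + n] for i in sa]
    return bisect_right(prefixes, ngram) - bisect_left(prefixes, ngram)


tokens = "at the end of the day we met at the end of the day".split()
sa = build_token_suffix_array(tokens)
print(count_ngram(tokens, sa, "at the end of the".split()))      # 2
print(count_ngram(tokens, sa, "at the end of the day".split()))  # 2
print(count_ngram(tokens, sa, ["the"]))                          # 4
```
</pre>
In this toy text the 5-gram and the 6-gram have the same count, so the
5-gram never occurs outside the 6-gram. Equal counts are one possible signal
for collapsing such fragments; that is an assumption made for this sketch,
not a method proposed in the thread.<br>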
<br>
<br>
<pre class="moz-signature" cols="72">--
Chunyu Kit, PhD
Associate Professor, Computational Linguistics
Dept. of Chinese, Translation &amp; Linguistics
City University of Hong Kong
83 Tat Chee Ave., Kowloon
<a class="moz-txt-link-freetext" href="http://personal.cityu.edu.hk/~ctckit/">http://personal.cityu.edu.hk/~ctckit/</a>
Tel: (+852)2788 9310 (O), 9380 1738 (M)
(+86)135 3948 3937 (China Mobile)
Fax: (+852)2788 8706, 2788 7320</pre>
</body>
</html>