[Corpora-List] Reducing n-gram output

Chunyu Kit ctckit at cityu.edu.hk
Sat Nov 1 03:29:06 UTC 2008


One may want to have a look at this paper if interested in suffix
arrays. I didn't expect so many papers on the topic in recent years,
but there are some smart algorithms for their implementation.

best,
kit


A taxonomy of suffix array construction algorithms
Simon J. Puglisi, W. F. Smyth, Andrew H. Turpin
ACM Computing Surveys (CSUR), Volume 39, Issue 2, July 2007
Publisher: ACM

In 1990, Manber and Myers proposed suffix arrays as a space-saving 
alternative to suffix trees and described the first algorithms for 
suffix array construction and use. Since that time, and especially in 
the last few years, suffix array construction ...
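
For list readers new to the idea, here is a minimal Python sketch of a
word-level suffix array and an n-gram count on top of it. It is only an
illustration: the function names (build_suffix_array, count_ngram) are made
up for this example, and the construction simply sorts the suffixes, which is
much slower than the algorithms the survey covers.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction by direct sorting (assumption: small inputs only);
    it is not one of the efficient algorithms from the survey.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, suffix_array, target, upper):
    """Binary search for the first suffix whose n-token prefix is
    >= target (upper=False) or > target (upper=True)."""
    n = len(target)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = tokens[start:start + n]
        if prefix < target or (upper and prefix == target):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_ngram(tokens, suffix_array, ngram):
    """Count occurrences of ngram (a list of tokens) in the text.

    All suffixes starting with the n-gram are adjacent in the suffix
    array, so the count is the width of that block.
    """
    target = list(ngram)
    first = _bound(tokens, suffix_array, target, upper=False)
    last = _bound(tokens, suffix_array, target, upper=True)
    return last - first


tokens = "at the end of the day at the end of the road".split()
sa = build_suffix_array(tokens)
print(count_ngram(tokens, sa, "at the end of the".split()))      # 2
print(count_ngram(tokens, sa, "at the end of the day".split()))  # 1
```

The same array answers queries for any n, since every n-gram is a prefix of
some suffix of the text.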




Alexandre Rafalovitch wrote:

>One of the interesting papers around suffix arrays is by Chunyu KIT:
>"The Virtual Corpus approach to deriving n-gram statistics from large
>scale corpora"
>http://personal.cityu.edu.hk/~ctckit/papers/vc.pdf
>
>I have some (scary) Java code based around those concepts that can do
>statistical analysis on n-grams with n above 140 and does look at
>(n-1)-grams and (n+1)-grams with the same length.
>
>If that's of interest, I would be happy to discuss this further in a
>direct email (to not bother the list).
>
>Regards,
>    Alex.
>
>Personal blog: http://blog.outerthoughts.com/
>Research group: http://www.clt.mq.edu.au/Research/
>
>
>
>On Tue, Oct 28, 2008 at 11:21 AM, Adam Lopez <alopez at inf.ed.ac.uk> wrote:
>  
>
>>>I was wondering whether anybody is aware of ideas and/or automated
>>>processes to reduce n-gram output by solving the common problem that
>>>shorter n-grams can be fragments of larger structures (e.g. the
>>>5-gram 'at the end of the' as part of the 6-gram 'at the end of the
>>>day')
>>>
>>>I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
>>>      
>>>
>>Suffix trees, suffix arrays, and their relatives are compact data
>>structures following exactly the intuition that smaller strings are
>>substrings of larger ones. They represent all possible n-grams
>>(without limit on n) of a text in space proportional to the length of
>>the text and support efficient retrieval, counting, and other queries
>>on substrings of the text; there is a vast literature on their various
>>applications (and theory linking them to compressibility, etc.).
>>    
>>
>
>_______________________________________________
>Corpora mailing list
>Corpora at uib.no
>http://mailman.uib.no/listinfo/corpora
>
>  
>
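
As a rough illustration of the original question, one simple way to cut down
n-gram output is to drop any n-gram that occurs exactly as often as a longer
n-gram containing it, since it then never appears outside that longer
structure. The sketch below is only one possible heuristic under that
assumption, not the method of the papers mentioned in this thread; the names
(ngram_counts, reduce_ngrams, min_count) are hypothetical, and it counts
n-grams by brute force rather than with a suffix array.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n (brute force)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def reduce_ngrams(tokens, max_n, min_count=2):
    """Keep n-grams up to max_n that are not mere fragments.

    An n-gram is treated as a fragment when an (n+1)-gram that extends
    it by one token on either side has the same frequency.
    """
    # Count one extra length so the longest n-grams can be checked too.
    counts = ngram_counts(tokens, max_n + 1)
    subsumed = set()
    for gram, count in counts.items():
        if len(gram) < 2:
            continue
        # The two n-grams obtained by dropping the last or first token.
        for shorter in (gram[:-1], gram[1:]):
            if counts[shorter] == count:
                subsumed.add(shorter)
    return {
        gram: count
        for gram, count in counts.items()
        if len(gram) <= max_n and count >= min_count and gram not in subsumed
    }


tokens = "at the end of the day at the end of the road".split()
for gram, count in reduce_ngrams(tokens, max_n=5).items():
    print(" ".join(gram), count)
```

On this toy text only 'the' (4 occurrences) and 'at the end of the'
(2 occurrences) survive; the shorter fragments such as 'end of the' are
dropped because they never occur outside the 5-gram.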


-- 
Chunyu Kit, PhD
Associate Professor, Computational Linguistics

Dept. of Chinese, Translation & Linguistics
City University of Hong Kong
83 Tat Chee Ave., Kowloon

http://personal.cityu.edu.hk/~ctckit/
Tel: (+852)2788 9310 (O), 9380 1738 (M)
     (+86)135 3948 3937 (China Mobile)
Fax: (+852)2788 8706, 2788 7320
