[Corpora-List] longest common subsequenceS algorithms in corpora research ...

Paul McNamee paul.mcnamee at jhuapl.edu
Tue Feb 19 16:38:42 UTC 2013


Here's one example using longest common prefixes:

   Yamamoto and Church, Using Suffix Arrays to Compute Term Frequency
   and Document Frequency for All Substrings in a Corpus,
   Comp. Linguistics, 2000.

   http://www.cs.jhu.edu/~kchurch/wwwfiles/CL_suffix_array.pdf
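To make the longest-common-prefix idea concrete, here is a minimal sketch of a suffix array with its LCP array. It is not the construction from the paper: it uses a naive sort and a direct character comparison, which is fine for toy input but far too slow for a real corpus. The function names are my own for illustration.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Real corpora need a faster construction algorithm than this.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[k] is the length of the longest common prefix of the suffixes
    # starting at sa[k - 1] and sa[k]; lcp[0] is 0 by convention.
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def longest_repeated_substring(text):
    # The largest LCP value marks the longest substring occurring at least twice.
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    if not lcp or max(lcp) == 0:
        return ""
    k = max(range(len(lcp)), key=lambda i: lcp[i])
    return text[sa[k]:sa[k] + lcp[k]]
```

Adjacent suffixes in the sorted order that share a long prefix correspond to repeated substrings, which is what makes frequency counts over all substrings tractable.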


Some years back I used this technique to help identify bilingual
phrasal equivalents:

   McNamee and Mayfield, Translation of Multiword Expressions Using Parallel
   Suffix Arrays, AMTA 2006.

   http://www.mt-archive.info/AMTA-2006-McNamee.pdf


A practical use of longest common substring is proper name variant matching
(e.g., is "Mikhail Sergeyevich Gorbachev" coreferent with "Michail Gorbatchev"?):
   http://www.cs.utah.edu/contest/2005/spellingErrors.pdf
   http://cs.anu.edu.au/~Peter.Christen/publications/tr-cs-06-02.pdf
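As a rough illustration of the name-matching use, the sketch below finds the longest common substring with the standard dynamic program and turns its length into a similarity score. The normalization (dividing by the shorter string's length) and the function names are assumptions of mine, not taken from the reports linked above.

```python
def longest_common_substring(a, b):
    # Dynamic program over prefix pairs: curr[j] holds the length of the
    # longest common suffix of a[:i] and b[:j]. Keeps only two rows.
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def lcsubstr_similarity(a, b):
    # Hypothetical score in [0, 1]: longest shared run relative to the
    # shorter name, compared case-insensitively.
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return len(longest_common_substring(a, b)) / min(len(a), len(b))
```

For "Gorbachev" versus "Gorbatchev" the longest shared run is "Gorba"; a practical matcher would likely combine several such runs or use other string metrics as well.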


LCS is also widely used to identify spans of text that are
duplicates or near duplicates; similar methods can also be applied to
the problems of plagiarism detection and authorship attribution.
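A minimal sketch of the near-duplicate idea, assuming LCS here means longest common subsequence computed over whitespace-separated tokens. The overlap ratio and any threshold applied to it are illustrative choices of mine, not an established standard.

```python
def lcs_length(xs, ys):
    # Classic longest-common-subsequence dynamic program, two rows only.
    prev = [0] * (len(ys) + 1)
    for x in xs:
        curr = [0]
        for j, y in enumerate(ys, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_overlap(doc_a, doc_b):
    # Hypothetical near-duplicate score in [0, 1]: shared token subsequence
    # length relative to the longer document.
    a, b = doc_a.lower().split(), doc_b.lower().split()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))
```

Two passages scoring close to 1 share most of their tokens in the same order, which is the signal that duplicate and plagiarism detectors look for. The quadratic cost means long documents are usually compared in chunks or pre-filtered first.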

- Paul


On Tue, 19 Feb 2013, Albretch Mueller wrote:

> LCS algorithms are heavily used in bioinformatics to analyze DNA sequences.
>
> How are they used in corpora research?
>
> thanks,
> lbrtchx
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



