[Corpora-List] longest common subsequenceS algorithms in corpora research ...

Jin-Dong Kim jdkim at dbcls.rois.ac.jp
Fri Feb 22 02:54:23 UTC 2013


We used an LCS algorithm to align corpora and annotations across
different versions whose texts differ only subtly.

@InProceedings{kim-wang:2012:BioNLP,
  author    = {Kim, Jin-Dong  and  Wang, Yue},
  title     = {PubAnnotation - a persistent and sharable corpus and
               annotation repository},
  booktitle = {{BioNLP}: Proceedings of the 2012 Workshop on
               Biomedical Natural Language Processing},
  month     = {June},
  year      = {2012},
  address   = {Montr{\'e}al, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {202--205},
  url       = {http://www.aclweb.org/anthology/W12-2425}
}
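
To illustrate the general idea (this is only a minimal sketch, not the
code used in PubAnnotation), a character-level LCS can map each
character of an old text version to its counterpart in a new version,
and annotation offsets can then be carried over through that map. The
function names lcs_alignment and transfer_span and the example
sentences are made up for illustration, and the quadratic table is
only practical for short texts.

```python
def lcs_alignment(old, new):
    """Map each index of `old` to its LCS-matched index in `new` (or None)."""
    n, m = len(old), len(new)
    # dp[i][j] holds the LCS length of the suffixes old[i:] and new[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to recover one optimal alignment.
    mapping = [None] * n
    i = j = 0
    while i < n and j < m:
        if old[i] == new[j]:
            mapping[i] = j
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # old[i] has no counterpart in new
        else:
            j += 1  # new[j] was inserted
    return mapping


def transfer_span(mapping, begin, end):
    """Carry a half-open span [begin, end) over to the new text.

    Returns None when no character inside the span could be aligned.
    """
    hits = [mapping[k] for k in range(begin, end) if mapping[k] is not None]
    if not hits:
        return None
    return hits[0], hits[-1] + 1


# Hypothetical example: the same sentence in two slightly different versions.
old_text = "p53 protein binds DNA."
new_text = "The p53 protein binds to DNA."
mapping = lcs_alignment(old_text, new_text)

# An annotation on "p53" at [0, 3) in the old text.
new_span = transfer_span(mapping, 0, 3)
```

Here new_span should point at "p53" in new_text, i.e.
new_text[new_span[0]:new_span[1]] == "p53". A span whose text was
rewritten entirely comes back as None and would need manual review.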

Hope it helps.

Jin-Dong


On Wed, Feb 20, 2013 at 1:38 AM, Paul McNamee <paul.mcnamee at jhuapl.edu> wrote:
> Here's one example using longest common prefixes:
>
>   Yamamoto and Church, Using Suffix Arrays to Compute Term Frequency
>   and Document Frequency for All Substrings in a Corpus,
>   Comp. Linguistics, 2000.
>
>   http://www.cs.jhu.edu/~kchurch/wwwfiles/CL_suffix_array.pdf
>
>
> Some years back I used this technique to help identify bilingual
> phrasal equivalents
>
>   McNamee and Mayfield, Translation of Multiword Expressions Using Parallel
>   Suffix Arrays, AMTA 2006.
>
>   http://www.mt-archive.info/AMTA-2006-McNamee.pdf
>
>
> An actual use of longest common substring is found in proper name
> variant matching (e.g., is "Mikhail Sergeyevich Gorbachev" coreferent
> with "Michail Gorbatchev"?):
>   http://www.cs.utah.edu/contest/2005/spellingErrors.pdf
>   http://cs.anu.edu.au/~Peter.Christen/publications/tr-cs-06-02.pdf
>
>
> LCS is also widely used as a means to identify spans of text that are
> duplicates or near duplicates; similar methods can also be applied to
> the problems of plagiarism detection and authorship attribution.
>
> - Paul
>
>
>
> On Tue, 19 Feb 2013, Albretch Mueller wrote:
>
>> LCS algorithms are heavily used in bioinformatics to analyze DNA sequences
>>
>> How are they used in corpora research?
>>
>> thanks,
>> lbrtchx
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>



-- 
Jin-Dong Kim, Ph.D,
Project Associate Professor,
Database Center for Life Science (DBCLS),
Research Organization of Information and Systems (ROIS)
home: http://dbcls.rois.ac.jp/~jdkim
e-mail: jdkim at dbcls.rois.ac.jp
