[Corpora-List] Keyness across Texts (fwd)

Sat Aug 4 18:05:17 UTC 2007

---------- Forwarded message ----------
Date: Thu, 12 Jul 2007 13:14:44 -0600
From: Eric Ringger <ringger at cs.byu.edu>
To: CORPORA at hd.uib.no
Subject: Re: [Corpora-List] Keyness across Texts

Good afternoon.

To follow up on my earlier post and to facilitate experimentation, here is a
list of publicly available implementations of LDA:

The Mallet toolkit implements LDA as well as extensions for n-grams (in
Java):

http://mallet.cs.umass.edu/mallet/javadoc/edu/umass/cs/mallet/base/topics/LDA.html

Mark Steyvers and Tom Griffiths have a Matlab implementation:

 	http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

David Blei has an implementation in C:

 	http://www.cs.princeton.edu/~blei/lda-c/

There are further links at the bottom of David's LDA-C page.

Regards,
--Eric

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Eric Ringger
Sent: Tuesday, July 10, 2007 8:30 PM
To: CORPORA at hd.uib.no
Subject: RE: [Corpora-List] Keyness across Texts

Good evening.

This thread has encouraged me to wonder to what degree corpus linguists have
explored techniques such as LSA (latent semantic analysis), pLSI
(probabilistic latent semantic indexing), and LDA (latent Dirichlet
allocation) for the discovery of topics and keywords in large collections of
text.  Although these techniques have their origins in information retrieval
and natural language processing, they seem, at least superficially, relevant
to a discussion of keyness in texts.  Of the three techniques I list above,
LDA is the most recent and has been demonstrated to be superior (in some
sense) to the others.  If you would like to learn more about this technique,
I encourage you to peruse the following article by Thomas Griffiths and Mark
Steyvers, published in the Proceedings of the National Academy of Sciences:

http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf

Much of this article is relatively accessible to a general technical
audience.  The gist of the idea is a method for automatic unsupervised
discovery of (1) topics shared by a collection of documents and (2) the words
that make those topics manifest.  "Unsupervised" is a slightly loaded term
here, since some of the parameters of the model may require some insight for
tuning, especially in the form reported by Griffiths and Steyvers above.
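
As a rough illustration of the idea, here is a minimal sketch of LDA fitted
by collapsed Gibbs sampling, the style of inference used in the Griffiths and
Steyvers article. It is not their code or any of the toolkits' code. The
function names (lda_gibbs, top_words), the default values of alpha and beta,
and the iteration count are assumptions chosen for the example; the number of
topics and the two smoothing parameters are exactly the settings that need
some insight to tune.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA to tokenized documents with a collapsed Gibbs sampler.

    docs is a list of token lists. alpha and beta are the Dirichlet
    smoothing parameters (illustrative defaults, not recommended values).
    Returns (topic_word, doc_topic): per-topic word counts and
    per-document topic counts taken from the final sample.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            topic_word[z][w] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # P(topic k | everything else) is proportional to
                # P(word | k) * P(k | document), both smoothed.
                weights = [
                    (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                topic_word[z][w] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word, doc_topic


def top_words(topic_word, n=5):
    """List the n most frequent words currently assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0]
            for counts in topic_word]
```

Calling top_words(topic_word) on the result gives, for each topic, the words
that make it manifest; on a realistic corpus one would also remove stop words
first and run many more iterations than a toy example needs.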

I would be interested in hearing from anyone who has explored the utility of
LDA (or its predecessors) for identifying the keyness of words in
comparison with some of the techniques that have grown directly out of the
community of corpus linguists.  If you are not aware of any such work, I
believe this could be an interesting line of inquiry for an NLP researcher
collaborating with a corpus linguist.

Kind regards,
--Eric
http://nlp.cs.byu.edu/

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of John F. Sowa
Sent: Tuesday, July 10, 2007 11:03 AM
To: Przemek Kaszubski
Cc: CORPORA at hd.uib.no
Subject: Re: [Corpora-List] Keyness across Texts

Przemek,

That is a "key" idea about "keyness":

  > One fantastic feature of KeyWords is of course the possibility
  > of extracting key clusters. One can, for example, group and
  > count those clusters in which specific key words repeat, and
  > this way additionally confirm and contextualize their status,
  > very nicely indeed.

The fundamental issue about any version of "keyness" is the
definition and the algorithms that implement the definition.

The simple definition in terms of frequency counts is the most
widely used because it can be implemented in simple algorithms.
But even then, questions arise about lemmata:  are the algorithms
counting words or lemmata?  And how do the algorithms deal with
lemmata that are lexicalized in different parts of speech?
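
A minimal sketch of a frequency-based definition follows. The message above
does not name a statistic, so the choice here is an assumption: the two-term
log-likelihood score that is commonly used in corpus linguistics to compare a
study corpus with a reference corpus. The function name and the normalize
parameter are hypothetical; normalize marks the point where the words versus
lemmata decision is made (str.lower is only a stand-in for a real
lemmatizer).

```python
import math
from collections import Counter


def log_likelihood_keyness(study_tokens, reference_tokens, normalize=None):
    """Score each word by log-likelihood keyness against a reference corpus.

    normalize is an optional function applied to every token before
    counting, e.g. a lemmatizer. Positive scores mark words that are
    relatively more frequent in the study corpus, negative scores words
    that are relatively more frequent in the reference corpus.
    """
    if normalize is not None:
        study_tokens = [normalize(t) for t in study_tokens]
        reference_tokens = [normalize(t) for t in reference_tokens]
    study = Counter(study_tokens)
    reference = Counter(reference_tokens)
    n1 = sum(study.values())
    n2 = sum(reference.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both corpora must contain at least one token")

    def term(observed, expected):
        # An observed count of zero contributes nothing to the sum.
        return observed * math.log(observed / expected) if observed else 0.0

    scores = {}
    for word in set(study) | set(reference):
        a, b = study[word], reference[word]
        # Expected counts if the word were equally likely in both corpora.
        e1 = n1 * (a + b) / (n1 + n2)
        e2 = n2 * (a + b) / (n1 + n2)
        ll = 2 * (term(a, e1) + term(b, e2))
        # Attach a sign so that over- and under-use can be told apart.
        scores[word] = ll if a / n1 >= b / n2 else -ll
    return scores
```

Passing normalize=str.lower merges "Corpus" and "corpus" into one count;
swapping in a lemmatizer would likewise merge inflected forms, which changes
the frequencies and therefore the ranking of key words.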

Clusters can provide a more precise way of defining keyness,
but the number of variations of clustering algorithms is
immense, and each one defines a different version of keyness.
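
The grouping and counting described in the quoted passage can be sketched as
follows, under the assumption that a "cluster" there means a contiguous word
sequence (an n-gram) containing at least one key word. This is an
illustration, not the KeyWords implementation; the function name and the
default cluster length are hypothetical.

```python
from collections import Counter


def key_clusters(tokens, key_words, n=3):
    """Count the n-word sequences that contain at least one key word."""
    key_words = set(key_words)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        # Keep only sequences in which some key word occurs.
        if key_words.intersection(gram):
            counts[gram] += 1
    return counts
```

Sequences that recur around the same key word then show up with counts above
one in counts.most_common(), which is one way to confirm and contextualize
that word's status.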

The next step is to apply syntactic and/or semantic techniques
to determine how the words/lemmata are related.  Then the
syntactic and/or semantic structures could be used as input
to the counting and/or clustering methods.

And of course, you could also apply an ontology to relate the
words and/or lemmata.  In fact, you might even use the technology
to extract a document-specific ontology from the texts, and then
use that ontology for analyzing other documents.

In short, there is no clear distinction between keyness and
any other issues of semantics.  There is nothing wrong with
using simple, special-purpose techniques for addressing a
particular problem, but it is important to recognize their
limitations and their relationships to broader semantic issues.

John Sowa
