[Corpora-List] Finding Temporality of Sentences with TempoWordNet

Md. Hasanuzzaman hasanuzzaman.im at gmail.com
Mon May 12 10:20:00 UTC 2014


I am forwarding this post at the request of Mr. Sujit Pal, Director, Search R&D
at Healthline Networks Inc., San Francisco, United States, who carried out
the following experiment. The post can be accessed directly from his blog:
http://sujitpal.blogspot.com/2014/04/find-temporalaty-of-sentences-with.html



In my previous post <http://sujitpal.blogspot.com/2014/04/scala-implementation-of-negex-algorithm.html>,
I described a Scala implementation of the Negex algorithm
<https://code.google.com/p/negex/>, which was initially developed to detect
negation of phrases in a sentence using manually built pre- and post-trigger
terms. It was later found that the same algorithm could also be used to
detect the temporality and experiencer characteristics of a phrase in a
sentence, using a different set of pre- and post-trigger terms.

I also recently became aware of the TempoWordNet project
<https://tempowordnet.greyc.fr/> from a group at Normandie University
<https://www.greyc.fr/node/35>. The project provides a free lexical
resource in which each synset of WordNet is marked up with its probability
of being past, present, future or atemporal. This paper
<http://www.aclweb.org/anthology/E/E14/E14-4002.pdf> describes the process
by which these probabilities were generated. There is another paper
<http://dl.acm.org/citation.cfm?id=2579042> which one of the authors
referenced on LinkedIn, but it is unfortunately behind an ACM paywall, and
since I am no longer a member as of this year, I could not read it.

When running the Negex algorithm against the annotated list that comes with
it, I found that the Historical and Hypothetical annotations had lower
accuracies (0.90 and 0.89 respectively) than the other two (Negation and
Experiencer). Thinking about it, I realized that the Historical and
Hypothetical annotators are a pair of binary annotators used to classify a
phrase into one of three classes: Historical, Recent and Not Particular.
With this understanding, some small changes to how I measured the accuracy
brought them up to 0.93 and 0.99 respectively. But I figured that it may be
possible to *also* compute temporality using the TempoWordNet file, similar
to how one does classical sentiment analysis. This post describes that work.

Each synset in the TempoWordNet file is written as a triple of word, part
of speech and synset ID. From this, I build one LingPipe
ExactDictionaryChunker
<http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html>
per temporality state, with an entry for each word/POS pair. I take care to
capture only the first synset ID for each word/POS combination, so hopefully
I capture the probabilities for the most dominant synset (the first one). I
have written earlier about the LingPipe ExactDictionaryChunker; it
implements the Aho-Corasick string-matching algorithm, which is very fast
and space-efficient.
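
To make the key construction concrete, here is a small self-contained
sketch (separate from the actual TemporalAnnotator code later in this post)
of how a synset name such as "early_days.n.01" could be turned into the
word/POS lookup keys while keeping only the first synset per key; the
synset names and the helper name `wordPosKey` are illustrative:

```scala
// Hypothetical sketch: derive "word/pos" lookup keys from TempoWordNet
// synset names, keeping only the first synset seen per key.
def wordPosKey(synset: String): String = {
  val cols = synset.split("\\.")
  val pos = cols(cols.size - 2)
  cols.slice(0, cols.size - 2).mkString("")
    .split("_")
    // WordNet's satellite-adjective tag "s" is folded into "a"
    .map(w => w + "/" + (if ("s".equals(pos)) "a" else pos))
    .mkString(" ")
}

val seen = scala.collection.mutable.Set[String]()
for (syn <- List("run.v.01", "run.v.02", "early_days.n.01")) {
  val key = wordPosKey(syn)
  if (seen.add(key)) println(key) // the second "run.v.*" synset is skipped
}
```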

Each sentence is tokenized into words and POS tagged (using OpenNLP's POS
tagger
<http://blog.dpdearing.com/2011/06/part-of-speech-pos-tagging-with-opennlp-1-5-0/>).
Each word/POS combination is matched against each of the
ExactDictionaryChunkers, and the probabilities of the matching words are
summed across the sentence for each tense. The class with the highest sum
of individual word probabilities is the class of the sentence. Since
OpenNLP uses the Penn Treebank tags
<https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html>,
they need to be translated to WordNet tags
<http://wordnet.princeton.edu/man/wndb.5WN.html> before matching.
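
Stripped of the LingPipe and OpenNLP machinery, the scoring scheme amounts
to summing per-class probabilities and taking the argmax. The following
self-contained sketch illustrates this with an invented three-entry
probability table (the real probabilities come from the TempoWordNet file,
and the values below are made up for illustration):

```scala
// Toy stand-in for TempoWordNet: word/POS -> (pPast, pPresent, pFuture).
// All probability values here are invented for illustration.
val targets = List("Past", "Present", "Future")
val probs: Map[String, List[Double]] = Map(
  "remembered/v" -> List(0.80, 0.15, 0.05),
  "will/v"       -> List(0.05, 0.10, 0.85),
  "meeting/n"    -> List(0.10, 0.60, 0.30))

def predictTense(taggedWords: Seq[String]): String = {
  // sum the per-class probabilities over all words found in the table
  val sums = taggedWords.flatMap(probs.get)
    .foldLeft(List(0.0, 0.0, 0.0)) { (acc, p) =>
      acc.zip(p).map { case (a, b) => a + b }
    }
  if (sums.forall(_ == 0.0)) "Present" // no evidence: fall back to Present
  else targets(sums.indexOf(sums.max)) // class with the highest summed score
}

println(predictTense(Seq("she/o", "remembered/v", "the/o", "meeting/n"))) // Past
```

The real annotator does the same thing, except that the lookups go through
one Aho-Corasick chunker per class and the word/POS pairs come from the
OpenNLP tagger.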

Here is the code for the TemporalAnnotator.


// Source: src/main/scala/com/mycompany/scalcium/transformers/TemporalAnnotator.scala
package com.mycompany.scalcium.transformers

import java.io.File
import java.util.regex.Pattern

import scala.Array.canBuildFrom
import scala.collection.JavaConversions.asScalaIterator
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import com.aliasi.chunk.Chunker
import com.aliasi.dict.DictionaryEntry
import com.aliasi.dict.ExactDictionaryChunker
import com.aliasi.dict.MapDictionary
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory

import com.mycompany.scalcium.utils.Tokenizer

class TemporalAnnotator(val tempoWNFile: File) {

  val targets = List("Past", "Present", "Future")
  val chunkers = buildChunkers(tempoWNFile)
  val tokenizer = Tokenizer.getTokenizer("opennlp")

  val ptPoss = List("JJ", "NN", "VB", "RB")
    .map(p => Pattern.compile(p + ".*"))
  val wnPoss = List("s", "n", "v", "r")

  def predict(sentence: String): String = {
    val scoreTargetPairs = chunkers.map(chunker => {
      val taggedSentence = tokenizer.posTag(sentence)
        .map(wtp => wtp._1.replaceAll("\\p{Punct}", "") +
          "/" + wordnetPos(wtp._2))
        .mkString(" ")
      val chunking = chunker.chunk(taggedSentence)
      chunking.chunkSet().iterator().toList
        .map(chunk => chunk.score())
        .foldLeft(0.0D)(_ + _)
      })
      .zipWithIndex
      .filter(stp => stp._1 > 0.0D)
    if (scoreTargetPairs.isEmpty) "Present"
    else {
      val bestTarget = scoreTargetPairs
        .sortWith((a,b) => a._1 > b._1)
        .head._2
      targets(bestTarget)
    }
  }

  def buildChunkers(datafile: File): List[Chunker] = {
    val dicts = ArrayBuffer[MapDictionary[String]]()
    Range(0, targets.size).foreach(i =>
      dicts += new MapDictionary[String]())
    val pwps = scala.collection.mutable.Set[String]()
    Source.fromFile(datafile).getLines()
      .filter(line => (!(line.isEmpty() || line.startsWith("#"))))
      .foreach(line => {
        val cols = line.split("\\s{2,}")
        val wordPos = getWordPos(cols(1))
        val probs = cols.slice(cols.size - 4, cols.size)
          .map(x => x.toDouble)
        if (! pwps.contains(wordPos)) {
          Range(0, targets.size).foreach(i =>
            dicts(i).addEntry(new DictionaryEntry[String](
              wordPos, targets(i), probs(i))))
        }
        pwps += wordPos
    })
    // build one chunker per temporality state, in the order of targets
    dicts.map(dict => new ExactDictionaryChunker(
        dict, IndoEuropeanTokenizerFactory.INSTANCE,
        false, false))
      .toList
  }

  def getWordPos(synset: String): String = {
    val sscols = synset.split("\\.")
    val words = sscols.slice(0, sscols.size - 2)
    val pos = sscols.slice(sscols.size - 2, sscols.size - 1).head
    words.mkString("")
      .split("_")
      .map(word => word + "/" + (if ("s".equals(pos)) "a" else pos))
      .mkString(" ")
  }

  def wordnetPos(ptPos: String): String = {
    val matchIdx = ptPoss.map(p => p.matcher(ptPos).matches())
      .zipWithIndex
      .filter(mip => mip._1)
      .map(mip => mip._2)
    if (matchIdx.isEmpty) "o" else wnPoss(matchIdx.head)
  }
}


As you can see, I just compute the best of Past, Present and Future and
ignore the Atemporal probabilities. I had initially included those as well,
but accuracy scores on the Negex annotated test data were coming out at
0.89. Changing the logic to only look at Past, and flag a sentence as
Historical if the sum of the past probabilities of its matching words was
greater than 0, got me an even worse accuracy of 0.5. Finally, after a bit
of trial and error, removing the Atemporal chunker resulted in an accuracy
of 0.904741, so that's what I stayed with.

Here is the JUnit test for evaluating the TemporalAnnotator using the
annotated list of sentences from Negex. Our default is "Recent", and only
when we can confidently say something about the temporality of the sentence
do we change it. Our annotator returns a score for each of Past, Present
and Future; if its prediction is "Past", it is converted to "Historical"
for comparison.


// Source: src/test/scala/com/mycompany/scalcium/transformers/TemporalAnnotatorTest.scala
package com.mycompany.scalcium.transformers

import java.io.File
import scala.io.Source
import org.junit.Test
class TemporalAnnotatorTest {

  val tann = new TemporalAnnotator(
    new File("src/main/resources/TempoWnL_1.0.txt"))
  val input = new File("src/main/resources/negex/test_input.txt")

  @Test
  def evaluate(): Unit = {
    var numTested = 0
    var numCorrect = 0
    Source.fromFile(input).getLines().foreach(line => {
      val cols = line.split("\t")
      val sentence = cols(3)
      val actual = cols(5)
      if ((! "Not particular".equals(actual))) {
        val predicted = tann.predict(sentence)
        val correct = actual.equals(translate(predicted))
        if (! correct) {
          Console.println("%s|[%s] %s|%s"
            .format(sentence, (if (correct) "+" else "-"),
              actual, predicted))
        }
        numCorrect += (if (correct) 1 else 0)
        numTested += 1
      }
    })
    Console.println("Accuracy=%8.6f"
      .format(numCorrect.toDouble / numTested.toDouble))
  }

  /**
   * Converts predictions made by TemporalAnnotator to
   * predictions that match annotations in our testcase.
   */
  def translate(pred: String): String = {
    pred match {
      case "Past" => "Historical"
      case _ => "Recent"
    }
  }
}


This approach gives us an accuracy of 0.904741, which is not as good as
Negex, but the lower accuracy is somewhat offset by its ease of use. You
can send the entire sentence to the annotator; there is no need (as in the
case of Negex) to concept-map the sentence beforehand so that it identifies
the "important" phrases.
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


More information about the Corpora mailing list