[Corpora-List] Broader linguistic resources

Tue Feb 12 16:05:57 UTC 2013

On 2/12/2013 8:55 AM, Dominic P Rout wrote:
> What are some useful, broad and accessible overview books
> about language (specifically in English) that might be useful
> to a student of NLP wishing to broaden their horizons?

That question reminds me of a thread from last week (Feb 5).
The subject line was "New techniques in text processing":

Amac Herdagdelen asked:
> Is there anything new/fun that jumps to mind that I should read up on?
> ... What new things do we have/know to offer other fields?

Phil Gooch replied:
> If you're interested in extracting narrative event chains, then this
> might be worth looking at
>
> http://malt.ml.cmu.edu/mw/index.php/Chambers_and_Jurafsky,_Unsupervised_Learning_of_Narrative_Event_Chains,_ACL_2008
>
> Also, application of deep learning techniques might be of interest
>
> http://deeplearning.net/

Adam Kilgarriff replied:
> as well as tools you can trust, you need data you can trust.
> Techniques I describe in Getting to know your corpus
> http://trac.sketchengine.co.uk/attachment/wiki/AK/Papers/Kilgarriff_TSD2012.pdf?format=raw
> are designed to help researchers find the characteristics,
> quirks and biases of their dataset
>
> (video version http://www.youtube.com/watch?v=0XvWh6YqgkU)

An excerpt from Adam's paper:
> We show, with examples, how keyword lists (of one corpus vs: another)
> are a direct, practical and fascinating way to explore the characteristics
> of corpora, and of text types. Our method is to classify the top one hundred
> keywords of corpus1 vs: corpus2, and corpus2 vs: corpus1. This promptly reveals
> a range of contrasts between all the pairs of corpora we apply it to. We also
> present improved maths for keywords, and briefly discuss quantitative comparisons
> between corpora. All the methods discussed (and almost all of the corpora)
> are available in the Sketch Engine, a leading corpus query tool.
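
To make the keyword idea concrete, here is a minimal sketch of how
such a list could be computed. It assumes the score is an add-n
smoothed ratio of per-million frequencies, which is my reading of the
"simple maths" approach; the paper is the authority on the exact
formula. The function names, the parameter n, and the toy corpora are
all hypothetical.

```python
from collections import Counter


def keyword_scores(focus_tokens, ref_tokens, n=1.0):
    """Rank words of a focus corpus against a reference corpus.

    Assumed score: (fpm_focus + n) / (fpm_ref + n), where fpm is the
    frequency per million tokens and n is a smoothing constant.
    """
    focus, ref = Counter(focus_tokens), Counter(ref_tokens)
    focus_total, ref_total = sum(focus.values()), sum(ref.values())
    if focus_total == 0 or ref_total == 0:
        raise ValueError("both corpora must contain at least one token")

    scores = {}
    for word, count in focus.items():
        fpm_focus = count * 1_000_000 / focus_total
        fpm_ref = ref[word] * 1_000_000 / ref_total  # 0 if word is absent
        scores[word] = (fpm_focus + n) / (fpm_ref + n)
    # Highest score first: the words most typical of the focus corpus.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def top_keywords(focus_tokens, ref_tokens, k=100, n=1.0):
    """Return the k highest-scoring keywords of focus vs. reference."""
    return [word for word, _ in keyword_scores(focus_tokens, ref_tokens, n)[:k]]


corpus1 = "the cat sat on the mat the cat".split()
corpus2 = "the dog sat on the log".split()
print(top_keywords(corpus1, corpus2, k=2))
print(top_keywords(corpus2, corpus1, k=2))
```

Running it in both directions (corpus1 vs. corpus2, then corpus2 vs.
corpus1) mirrors the two-way comparison in the excerpt. A larger n
pushes the list toward more frequent words.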

An excerpt from the Chambers & Jurafsky paper:
> Hand-coded scripts were used in the 1970-80s as knowledge backbones
> that enabled inference and other NLP tasks requiring deep semantic
> knowledge.  We propose unsupervised induction of similar schemata
> called narrative event chains from raw newswire text.
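
As a rough illustration of what "unsupervised induction" can mean
here, the sketch below scores pairs of events by pointwise mutual
information (PMI), counting two events as co-occurring when they share
a protagonist. It assumes the coreference and parsing steps are
already done, so each input chain is a list of (verb, dependency)
events for one entity in one document. The function name, the way the
marginals are estimated, and the toy data are my assumptions, not the
authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations


def event_pmi(chains):
    """Score event pairs by PMI over shared-protagonist co-occurrence.

    chains: iterable of lists of (verb, dependency) tuples, one list
    per entity per document (hypothetical input format).
    """
    pair_counts = Counter()
    for chain in chains:
        # Count each unordered pair of distinct events once per chain.
        for a, b in combinations(sorted(set(chain)), 2):
            pair_counts[(a, b)] += 1

    total = sum(pair_counts.values())

    # Marginal counts taken from the pair counts (an assumption).
    event_counts = Counter()
    for (a, b), c in pair_counts.items():
        event_counts[a] += c
        event_counts[b] += c

    pmi = {}
    for (a, b), c in pair_counts.items():
        p_ab = c / total
        p_a = event_counts[a] / (2 * total)
        p_b = event_counts[b] / (2 * total)
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi


chains = [
    [("arrest", "obj"), ("charge", "obj"), ("convict", "obj")],
    [("arrest", "obj"), ("charge", "obj")],
    [("eat", "subj"), ("drink", "subj")],
]
for pair, score in sorted(event_pmi(chains).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Pairs with high scores are candidates for belonging to the same chain.
Ordering the events in time and clustering them into chains are
separate steps that this sketch does not attempt.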

Adam's paper describes important methods for analyzing corpora.
They belong in the toolkit of anyone who processes large volumes
of natural-language text.

But the paper by Chambers & Jurafsky shows how issues that were
popular 30 years ago can be revived as "cutting edge" research today.
The important difference is that the old hand-coded scripts can now
be induced from raw text by new methods of unsupervised learning.

For an example of a narrative structure by Chambers & Jurafsky, see
Figure 6 of http://acl.eldoc.ub.rug.nl/mirror/P/P08/P08-1090.pdf

For structures in kidnap, bombing, attack, and arson, see
http://www.stanford.edu/~jurafsky/acl2011-chambers-templates.pdf

The moral of this story is that research directions are heavily
influenced by available technology. With new technology, research
questions from decades ago (or even millennia ago) can be revived
and addressed with new methods.

The implication for education is that research techniques can become
obsolete, but fundamental questions never become obsolete.  Sometimes
the most fruitful research can be inspired by old questions that were
abandoned because the available technology was inadequate.

John Sowa
