Fwd: Fwd: How hierarchical is language use?

T. Florian Jaeger tiflo at csli.stanford.edu
Thu Nov 22 21:59:35 UTC 2012


Hi Funknetters,

For anyone interested, I'm attaching below a follow-up conversation between
Stefan Frank and me about his PsychScience paper, which suggests that no
hierarchical structure is accessed during sentence processing. I removed
all parts that weren't crucial to the discussion.

Florian


-------------------------------------------------------------------

First, Stefan's reply to my post:

Hi Florian,

Thanks for cc-ing me (and for not being as harsh on us as some linguistic
bloggers). I just wanted to correct you on one detail: in the PsychSci
paper, we use echo-state networks (ESNs), not SRNs. The difference matters
because claims about hierarchical processing by SRNs have always relied on
the internal representations learned by the networks. ESNs, however, do not
learn any internal representations (their recurrent weights remain
untrained), so the results we found cannot be due to the ESNs learning to
deal with hierarchical structure.


All the best,

Stefan

[...]

ESNs and SRNs have the same architecture (at least, they do when I use
them) but are trained differently. Crucially, an ESN's input and recurrent
connection weights are not adapted to the training data. Simply put, each
input to an ESN is "randomly" mapped to a point in a very high-dimensional
space (the network's hidden units). The output connection weights are then
trained on these hidden vectors using linear regression.
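
Schematically, the whole recipe fits in a few lines of numpy -- a toy
sketch of the general idea only, not the code from the paper (the sizes,
the spectral-radius scaling, and the ridge penalty below are arbitrary):

    import numpy as np

    # Minimal echo-state network sketch. The input and recurrent weights
    # are random and never trained; only the linear readout is fitted.
    rng = np.random.default_rng(0)
    n_in, n_res, n_out = 10, 500, 10                # arbitrary toy sizes

    W_in  = rng.uniform(-0.5, 0.5, (n_res, n_in))   # fixed random input weights
    W_res = rng.uniform(-0.5, 0.5, (n_res, n_res))  # fixed random recurrent weights
    W_res *= 0.9 / max(abs(np.linalg.eigvals(W_res)))  # spectral radius < 1

    def run_reservoir(inputs):
        """Drive the reservoir with an input sequence; collect its states."""
        x = np.zeros(n_res)
        states = []
        for u in inputs:
            x = np.tanh(W_in @ u + W_res @ x)
            states.append(x)
        return np.array(states)

    # Toy data: the only training step is a linear (ridge) regression from
    # the randomly generated reservoir states to the target outputs.
    inputs  = rng.standard_normal((200, n_in))
    targets = rng.standard_normal((200, n_out))
    X = run_reservoir(inputs)
    W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_res), X.T @ targets)

    predictions = X @ W_out        # output is a fixed linear map of states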

This is not to say that there is no useful structure in the hidden-unit
space: Čerňanský et al. (Neural Networks, 2007) showed that untrained
recurrent neural nets have a Markovian bias (more specifically, they
correspond to Variable Length Markov Models).
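
A toy way to see that bias (just an illustration, not the analysis in that
paper): drive an untrained reservoir with two sequences that differ in
their early history but share the same recent inputs. The final states end
up close together, i.e., the state is dominated by the recent input
history, much as in a (variable-length) Markov model:

    import numpy as np

    rng = np.random.default_rng(1)
    n_in, n_res = 5, 300                            # arbitrary toy sizes
    W_in  = rng.uniform(-0.5, 0.5, (n_res, n_in))
    W_res = rng.uniform(-0.5, 0.5, (n_res, n_res))
    W_res *= 0.9 / max(abs(np.linalg.eigvals(W_res)))  # spectral radius < 1

    def final_state(inputs):
        """Return the reservoir state after the whole input sequence."""
        x = np.zeros(n_res)
        for u in inputs:
            x = np.tanh(W_in @ u + W_res @ x)
        return x

    suffix = rng.standard_normal((20, n_in))        # shared recent history
    seq_a = np.vstack([rng.standard_normal((50, n_in)), suffix])
    seq_b = np.vstack([rng.standard_normal((50, n_in)), suffix])
    seq_c = rng.standard_normal((70, n_in))         # nothing shared

    print(np.linalg.norm(final_state(seq_a) - final_state(seq_b)))  # small
    print(np.linalg.norm(final_state(seq_a) - final_state(seq_c)))  # larger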

[...]
---------- Forwarded message ----------
From: Stefan Frank <s.frank at ucl.ac.uk>
Date: Sun, Nov 18, 2012 at 10:57 AM
Subject: Re: Fwd: How hierarchical is language use?
To: "T. Florian Jaeger" <tiflo at csli.stanford.edu>

[...]


> From this article, it seems to me that ESNs still have latent
> structure. I think I see what you mean by saying that it's not 'useful'
> latent structure, though I'd say that it is; it's just not readily
> interpretable.
>

Actually, what I wrote was quite the opposite, but I admit that my
double-negation construction ("This is not to say that there is no useful
structure") was confusing. So, yes, there is structure, and it is useful.


> I assume that ESNs are great as long as a) the number N
> of units in the reservoir is large enough, or b) the number of instances
> of the ESN over which we marginalize is large (did you do that in your
> paper -- average across ESNs, each of them being a simulated
> 'comprehender'?), or c) the statistical process that underlies the random
> output variable is sufficiently simple in its structure. Is that a
> correct characterization?
>

Yes, I'd agree with that. In our paper, we did not marginalize over many
ESNs. Instead, we trained three ESNs of each size and presented results for
the one with median performance.


> I merely mean that ESNs
> can presumably do a good job at modeling random variables generated by a
> hierarchical generative process because, after all, they have a way to
> capture that latent structure by driving the states of a sufficiently
> large reservoir (which, if I get this right, is a set of computing units
> with all-to-all connections whose weights are initialized randomly and
> never changed by training?). I assume that if the reservoir weren't
> sufficiently large, the constraint that the input-to-reservoir and
> reservoir-to-output connections are linear would not allow the model to
> learn much. So this model essentially does by breadth what other models
> do by depth. Is that roughly correct?
>

Indeed, that is roughly correct. However, the structure imposed by the
random recurrent network has properties similar to a variable-length Markov
model (see Čerňanský et al., 2007); that is, it does not reflect the
input's hierarchical structure. A standard Elman network can (at least in
theory) adapt its input and recurrent connection weights to hierarchical
structure in the input, but in an ESN these weights remain random. So the
only way an ESN could make use of hierarchical structure is if that
structure gets encoded in the (learned) transformation from the hidden-unit
space to the output space, but since this is a linear mapping I don't see
how that would work (which, admittedly, does not mean it cannot work).
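
To spell that out (in standard ESN notation; a sketch of the general setup
rather than the exact equations from the paper): the state update is

    x_t = tanh(W_in u_t + W_res x_{t-1}),   with W_in, W_res fixed at random,

and the output is

    y_t = W_out x_t,   with only W_out fitted by linear regression.

So any hierarchical information the model exploited would have to be
linearly decodable from x_t; the trained part cannot construct new
nonlinear combinations of the state dimensions.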

[...]


Cheers,

Stefan


