How hierarchical is language use?

T. Florian Jaeger tiflo at csli.stanford.edu
Thu Nov 22 23:56:57 UTC 2012


Hi Brian,

I'll let Stefan speak for himself. My point was orthogonal to the issues
you're raising. All current models of syntax (including the one that you
seem to have in mind) assume that we are able to learn latent structure;
what type of latent structure we're learning is what many discussions seem
to center on. Stefan was reacting to my claim that the models used in his
article (echo state networks) still implicitly capture latent structure,
and that this would stand in conflict with the conclusion that his results
argue against latent structure. As you can see from his response, the
situation turns out to be somewhat more complicated.

I think that the points you're raising do not affect the point of my
question, but they might independently be of interest in interpreting
Stefan's findings.

Florian


On Thu, Nov 22, 2012 at 6:17 PM, Brian MacWhinney <macw at cmu.edu> wrote:

> Dear Florian and Stefan,
>
>     It is interesting to see you and Stefan discussing whether SRNs and/or
> ESNs can represent hierarchy internally.  As I see it, the attempt to
> capture hierarchy inside an exclusively syntactic network, be it an SRN,
> an ESN, or an ATN, is wide of the mark.  The discussions of structural
> dependency (and the logical problem) assume that embedded clauses are
> processed by the same algorithmic machine that processes the rest of
> syntax.  In practice, Elman, Chomsky, and others all make this same
> assumption.
>      In fact, relative clauses and other embedded structures often convey
> frozen old information that is then slotted on the fly into the new
> information of the main clause.  This was the point of Ross's work decades
> ago on freezing.  If one views syntactic constructions as methods for
> combining information from diverse neural processing circuits, then it
> makes sense to view chunks as slotted into a basically linear mechanism.
> It is the process of slotting in the chunks that produces the emergent
> hierarchical structure, because the material being slotted in has a
> structure created during earlier processing.  Slotting is not the only
> process that produces hierarchy.  Enumeration and pairing in mental model
> space can have the same effect.  I think that all of the phenomena at the
> heart of this debate -- crossed dependencies, raising constraints,
> deletions during coordination, "long-distance" phenomena, "respectively",
> and the like -- arise from the fact that syntax is unifying information
> from other neurocognitive resources.  Trying to analyze syntax as if it
> were doing all of this in a single syntactic network, without memory for
> previous strings, chunking, enumeration, anaphora, and deixis, is not
> going to come up with a veridical account of language processing.  But
> maybe this was somehow implicit in Stefan's article and I just missed it.
>
> --Brian MacWhinney
>
> On Nov 22, 2012, at 4:59 PM, T. Florian Jaeger <tiflo at csli.stanford.edu> wrote:
>
> > Hi Funknetters,
> >
> > For anyone interested, I'm attaching below a follow-up conversation
> > between Stefan Frank and me on his paper in PsychScience, suggesting
> > that there is no hierarchical structure that is accessed during sentence
> > processing. I removed all parts that weren't crucial to the discussion.
> >
> > Florian
> >
> >
> > -------------------------------------------------------------------
> >
> > First, Stefan's reply to my post:
> >
> > Hi Florian,
> >
> > Thanks for cc-ing me (and for not being as harsh on us as some
> > linguistic bloggers). I just wanted to correct you on one detail: in the
> > PsychSci paper, we use echo-state networks (ESNs), not SRNs. The
> > difference matters because claims about hierarchical processing by SRNs
> > have always relied on the internal representations learned by the
> > networks. But ESNs do not learn any internal representations (their
> > recurrent weights remain untrained), so the results we found cannot be
> > due to the ESNs learning to deal with hierarchical structure.
> >
> >
> > All the best,
> >
> > Stefan
> >
> > [...]
> >
> > ESNs and SRNs have the same architecture (at least, they do when I use
> > them) but are trained differently. Crucially, an ESN's input and
> > recurrent connection weights are not adapted to the training data.
> > Simply put, each input to an ESN is "randomly" mapped to a point in a
> > very high-dimensional space (the network's hidden units). The output
> > connection weights are then trained on these hidden vectors using linear
> > regression.
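> >
> > (To make this concrete, here is a minimal sketch of such a network in
> > NumPy. The sizes, spectral-radius scaling, and dummy data are my own
> > illustrative assumptions, not the settings from the paper.)
> >
> >     import numpy as np
> >
> >     rng = np.random.default_rng(0)
> >     n_in, n_res = 5, 200  # input and reservoir sizes (illustrative)
> >
> >     # Input and recurrent weights are drawn once and never trained.
> >     W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
> >     W_res = rng.uniform(-0.5, 0.5, (n_res, n_res))
> >     W_res *= 0.9 / max(abs(np.linalg.eigvals(W_res)))  # spectral radius < 1
> >
> >     def run_reservoir(inputs):
> >         """Map an input sequence to the reservoir's state trajectory."""
> >         x = np.zeros(n_res)
> >         states = []
> >         for u in inputs:
> >             x = np.tanh(W_in @ u + W_res @ x)  # fixed, untrained dynamics
> >             states.append(x)
> >         return np.array(states)
> >
> >     # Only the readout is learned: linear regression from hidden states
> >     # to targets (dummy data here, word probabilities in a language model).
> >     states = run_reservoir(rng.standard_normal((100, n_in)))
> >     targets = rng.standard_normal((100, 3))
> >     W_out, *_ = np.linalg.lstsq(states, targets, rcond=None)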
> >
> > This is not to say that there is no useful structure in the hidden-unit
> > space: Čerňanský et al. (Neural Networks, 2007) showed that untrained
> > recurrent neural nets have a Markovian bias (more specifically, they
> > correspond to Variable Length Markov Models).
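> >
> > (One way to see this bias, continuing the illustrative sketch above: the
> > final reservoir state is determined mostly by the recent suffix of the
> > input, with the more distant past fading away.)
> >
> >     # Two sequences with different pasts but a shared recent suffix,
> >     # plus one sequence that differs throughout.
> >     suffix = rng.standard_normal((20, n_in))
> >     seq_a = np.vstack([rng.standard_normal((50, n_in)), suffix])
> >     seq_b = np.vstack([rng.standard_normal((50, n_in)), suffix])
> >     seq_c = rng.standard_normal((70, n_in))
> >
> >     xa, xb, xc = (run_reservoir(s)[-1] for s in (seq_a, seq_b, seq_c))
> >     print(np.linalg.norm(xa - xb))  # typically small: shared suffix dominates
> >     print(np.linalg.norm(xa - xc))  # typically larger: suffix differs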
> >
> > [...]
> > ---------- Forwarded message ----------
> > From: Stefan Frank <s.frank at ucl.ac.uk>
> > Date: Sun, Nov 18, 2012 at 10:57 AM
> > Subject: Re: Fwd: How hierarchical is language use?
> > To: "T. Florian Jaeger" <tiflo at csli.stanford.edu>
> >
> > [...]
> >
> >
> >> From this article, it seems to me that ESNs still have latent
> >> structure. I think I see what you mean by saying that it's not 'useful'
> >> latent structure, though I'd say that it is; it's just not readily
> >> interpretable.
> >>
> >
> > Actually, what I wrote was quite the opposite, but I admit that my
> > double-negation construction ("This is not to say that there is no useful
> > structure") was confusing. So, yes, there is structure, and it is useful.
> >
> >
> >> I assume that ESNs are great as long as a) the number N of units in
> >> the reservoir is large enough, or b) the number of instances of the ESN
> >> over which we marginalize is large (did you do that in your paper --
> >> average across ESNs, each of them being a simulated 'comprehender'?),
> >> or c) the statistical process that underlies the random output variable
> >> is sufficiently simple in its structure. Is that a correct
> >> characterization?
> >>
> >
> > Yes, I'd agree with that. In our paper, we did not marginalize over many
> > ESNs. Instead, we trained three ESNs of each size and presented results
> > for the one with median performance.
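> >
> > (Procedurally, that selection step might look like the following sketch;
> > train_and_score is a hypothetical stand-in for training one ESN of a
> > given reservoir size and scoring it, e.g. on held-out text.)
> >
> >     import random
> >
> >     def train_and_score(size):
> >         """Hypothetical stand-in: train one ESN of this reservoir size
> >         and return (network, score), e.g. held-out perplexity."""
> >         return {"size": size}, random.random()
> >
> >     def median_run(size, n_runs=3):
> >         """Train n_runs networks of one size; keep the median performer."""
> >         runs = sorted((train_and_score(size) for _ in range(n_runs)),
> >                       key=lambda r: r[1])
> >         return runs[n_runs // 2]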
> >
> >
> >> I merely mean that ESNs can presumably do a good job at modeling random
> >> variables generated by a hierarchical generative process because they,
> >> after all, have a way to capture that latent structure by driving the
> >> states of a sufficiently large reservoir (which, if I get this right,
> >> is a set of computing units with all-to-all connections whose weights
> >> are initialized randomly and never changed by training?). I assume that
> >> if that reservoir weren't sufficiently large, the constraint that
> >> input-to-reservoir and reservoir-to-output connections are linear would
> >> not allow the model to learn much. So this model essentially does by
> >> breadth what other models do by depth. Is that roughly correct?
> >>
> >
> > Indeed, that is roughly correct. However, the structure imposed by the
> > random recurrent network has properties similar to a variable-length
> > Markov model (see Čerňanský et al., 2007); that is, it does not reflect
> > the input's hierarchical structure. A standard Elman network can (at
> > least in theory) adapt its input and recurrent connection weights to
> > hierarchical structure in the input, but in an ESN these weights remain
> > random. So the only way it could make use of hierarchical structure is
> > if it gets encoded in the (learned) transformation from the hidden-unit
> > space to the output space, but since this is a linear mapping I don't
> > see how that would work (which, admittedly, does not mean it cannot
> > work).
> >
> > [...]
> >
> >
> > Cheers,
> >
> > Stefan
> >
>
>


