How hierarchical is language use?

Thu Nov 22 23:17:26 UTC 2012

Dear Florian and Stefan,

    It is interesting to see you and Stefan discussing whether SRNs and/or ESNs can
represent hierarchy internally.  As I see it, the attempt to capture hierarchy inside an exclusively 
syntactic network, be it an SRN, ESN, or ATN, is wide of the mark.  The discussions of structural
dependency (and the logical problem) assume that embedded
clauses are processed by the same algorithmic machine that processes the rest of
syntax.  In practice, Elman, Chomsky, and others all make this same assumption.  
     In fact, relative clauses and other embedded structures often convey frozen old information 
that is then slotted on the fly into the new information
of the main clause.  This was the point of Ross's work decades ago on freezing.  If one views syntactic constructions as methods for combining information from diverse neural processing circuits, then it makes sense to view chunks
as slotted into a basically linear mechanism.  It is the process of slotting in the chunks that produces
the emergent hierarchical structure, because the material being slotted in has a structure
created during earlier processing.  Slotting is not the only process that produces hierarchy.
Enumeration and pairing in mental model space can have the same effect.  I think that all of the phenomena at the heart of this debate -- crossed dependencies, raising constraints, deletions during coordination, "long-distance" phenomena, 
"respectively", and the like -- all arise from the
fact that syntax is unifying information from other neurocognitive resources.  Trying to 
analyze syntax as if it is doing all of this in a single syntactic network
without memory for previous strings, chunking, enumeration, anaphora, and deixis is not
going to yield a veridical account of language processing.  But maybe this was somehow
implicit in Stefan's article and I just missed it.

--Brian MacWhinney

On Nov 22, 2012, at 4:59 PM, T. Florian Jaeger <tiflo at csli.stanford.edu> wrote:

> Hi Funknetters,
> 
> For anyone interested, I'm attaching below a follow-up conversation between
> Stefan Frank and me on his paper in PsychScience, suggesting that there is
> no hierarchical structure that is accessed during sentence processing. I
> removed all parts that weren't crucial to the discussion.
> 
> Florian
> 
> 
> -------------------------------------------------------------------
> 
> First, Stefan's reply to my post:
> 
> Hi Florian,
> 
> Thanks for cc-ing me (and for not being as harsh on us as some linguistic
> bloggers). I just wanted to correct you on one detail: In the PsychSci
> paper, we use echo-state networks (ESNs), not SRNs. The difference matters
> because claims about hierarchical processing by SRNs always relied on the
> internal representations learned by the networks. But ESNs do not learn any
> internal representations (their recurrent weights remain untrained) so the
> results we found cannot be due to the ESNs learning to deal with
> hierarchical structure.
> 
> 
> All the best,
> 
> Stefan
> 
> [...]
> 
> ESNs and SRNs have the same architecture (at least, they do when I use
> them) but are trained differently. Crucially, an ESN's input and recurrent
> connection weights are not adapted to the training data. Simply put, each
> input to an ESN is "randomly" mapped to a point in a very high-dimensional
> space (the network's hidden units). The output connection weights are then
> trained on these hidden vectors using linear regression.
> 
> This is not to say that there is no useful structure in the hidden-unit
> space: Čerňanský et al. (Neural Networks, 2007) showed that untrained
> recurrent neural nets have a Markovian bias (more specifically, they
> correspond to Variable Length Markov Models).
> 
> [...]
> ---------- Forwarded message ----------
> From: Stefan Frank <s.frank at ucl.ac.uk>
> Date: Sun, Nov 18, 2012 at 10:57 AM
> Subject: Re: Fwd: How hierarchical is language use?
> To: "T. Florian Jaeger" <tiflo at csli.stanford.edu>
> 
> [...]
> 
> 
>> From this article, it seems to me that ESNs still have latent
>> structure. I think I see what you mean by saying that it's not 'useful'
>> latent structure, though I'd say that it is, it's just not readily
>> interpretable.
>> 
> 
> Actually, what I wrote was quite the opposite, but I admit that my
> double-negation construction ("This is not to say that there is no useful
> structure") was confusing. So, yes, there is structure, and it is useful.
> 
> 
>> I assume that ESNs are great as long as a) the number N
>> of units in the reservoir is great enough or b) the number of instances
>> of the ESN over which we marginalize is large (did you do that in your
>> paper -- average across ESNs, each of them being a simulated
>> 'comprehender'?) or c) the statistical process that underlies the random
>> output variable is sufficiently simple in its structure. Is that a correct
>> characterization?
>> 
> 
> Yes, I'd agree with that. In our paper, we did not marginalize over many
> ESNs. Instead, we trained three ESNs of each size and presented results for
> the one with median performance.
> 
> 
>> I merely mean that ESNs
>> can presumably do a good job at modeling random variables generated by a
>> hierarchical generative process because they after all have a way to
>> capture that latent structure by driving the states of a sufficiently
>> large reservoir (which, if I get this right, is a set of computing units
>> with all-to-all connections that have weights that are initialized
>> randomly and never changed by training?). I assume that if that
>> reservoir weren't sufficiently large, the constraint that input to
>> reservoir and reservoir to output connections are linear would not allow
>> the model to learn much. So this model essentially does by breadth what
>> other models do by depth. Is that roughly correct?
>> 
> 
> Indeed, that is roughly correct. However, the structure imposed by the
> random recurrent network has properties similar to a Variable Length Markov
> Model (see Čerňanský et al., 2007), that is, it does not reflect the
> input's hierarchical structure. A standard Elman network can (at least in
> theory) adapt its input and recurrent connection weights to hierarchical
> structure in the input but in an ESN these weights remain random. So the
> only way it could make use of hierarchical structure is if it gets encoded
> in the (learned) transformation from the hidden-unit space to the output
> space, but since this is a linear mapping I don't see how that would work
> (which, admittedly, does not mean it cannot work).
> 
> [...]
> 
> 
> Cheers,
> 
> Stefan
>
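
For readers who want to see the ESN setup from the quoted exchange in concrete form, here is a minimal sketch in Python with NumPy. It is not the model from the PsychScience paper. It only illustrates the two points Stefan makes: the input and recurrent weights are drawn at random and never trained, and only the output weights are fit, by linear regression on the hidden states. Everything else is an assumption made for illustration: the function names, the two-symbol alphabet, the toy sequence "aab...", the reservoir size of 50, and in particular the rescaling of the recurrent weights to a spectral norm of 0.9, which I chose because it makes the state update a contraction. The last few lines illustrate the Markovian bias in that contractive setting: two different histories followed by the same suffix end up in nearby states.

```python
import numpy as np

SYMBOLS = "ab"  # hypothetical toy alphabet


def one_hot(sequence):
    """Encode a string over SYMBOLS as one 0/1 row vector per symbol."""
    return np.eye(len(SYMBOLS))[[SYMBOLS.index(ch) for ch in sequence]]


def make_reservoir(n_inputs, n_hidden, scale=0.9, seed=0):
    """Draw random input and recurrent weights; they are never trained."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-1.0, 1.0, size=(n_hidden, n_inputs))
    w_rec = rng.uniform(-1.0, 1.0, size=(n_hidden, n_hidden))
    # Assumption (not from the paper): rescale to spectral norm `scale` < 1,
    # so that each tanh state update is a contraction.
    w_rec *= scale / np.linalg.norm(w_rec, 2)
    return w_in, w_rec


def run_reservoir(w_in, w_rec, inputs):
    """Return the hidden state after each input, starting from all zeros."""
    state = np.zeros(w_rec.shape[0])
    states = []
    for x in inputs:
        state = np.tanh(w_in @ x + w_rec @ state)
        states.append(state)
    return np.array(states)


def add_bias(states):
    """Append a constant 1 to every state vector."""
    return np.hstack([states, np.ones((len(states), 1))])


def train_readout(states, targets):
    """Fit the output weights by ordinary least squares."""
    w_out, *_ = np.linalg.lstsq(add_bias(states), targets, rcond=None)
    return w_out


# Toy next-symbol task: in "aabaab...", what follows an "a" depends on
# the symbol before it, so the readout needs the reservoir's memory.
text = "aab" * 200
inputs, targets = one_hot(text[:-1]), one_hot(text[1:])
w_in, w_rec = make_reservoir(n_inputs=len(SYMBOLS), n_hidden=50)
w_rec_before = w_rec.copy()

states = run_reservoir(w_in, w_rec, inputs)
washout = 100  # drop early states that still reflect the zero initial state
w_out = train_readout(states[washout:], targets[washout:])

# Accuracy is measured on the same states the readout was fit to.
predicted = (add_bias(states[washout:]) @ w_out).argmax(axis=1)
accuracy = (predicted == targets[washout:].argmax(axis=1)).mean()

# Markovian bias: different histories, same suffix, nearby final states.
suffix = one_hot("abaabbab")
start_1 = run_reservoir(w_in, w_rec, one_hot("aaaa"))[-1]
start_2 = run_reservoir(w_in, w_rec, one_hot("bbbb"))[-1]
end_1 = run_reservoir(w_in, w_rec, np.vstack([one_hot("aaaa"), suffix]))[-1]
end_2 = run_reservoir(w_in, w_rec, np.vstack([one_hot("bbbb"), suffix]))[-1]
gap_before = np.linalg.norm(start_1 - start_2)
gap_after = np.linalg.norm(end_1 - end_2)

print(f"next-symbol accuracy after washout: {accuracy:.2f}")
print(f"state gap before suffix: {gap_before:.3f}, after: {gap_after:.3f}")
```

Because the readout is the only trained part, anything the network "knows" about the sequence has to be linearly recoverable from states that were shaped by random weights alone. With the contraction assumed here, the influence of older input shrinks by at least a factor of 0.9 per step, which is one simple way to see why such states mostly reflect the recent past.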