More on language and genome

Sat May 8 05:56:25 UTC 2010

Some time ago I posted here and elsewhere about apparent parallels between morphosyntactic typology and the structures of genomes in living organisms. Much has happened since then, and the number of these parallels has been growing rapidly.

I have also suggested that the number of ideophones in a language correlates (barring cultural or educational biases against them) with morphosyntactic type. High synthesis and/or fusion appears to militate against large numbers of ideophones. Numbers of ideophones in low-synthesis, low-fusion languages can be in the thousands, and over dialect areas, tens to hundreds of thousands, it appears. Yet individual knowledge and use vary wildly, and ideophones can associate a speaker with a particular village, perhaps even a family, so they are a mark of identity. Individual ideophones are very mobile, areally.
Ideophones also provide a source of fresh root material for the lexicon, as they are relatively refractory to historical change, although they will adapt to a language's phonology.

Over the last several decades the role of viruses has changed radically in the eyes of researchers. They were first thought to be rare, annoying, and occasionally fatal castoffs from the genomes of cellular lifeforms, but it is now apparent that they are everywhere: a sample of water from a lake in Germany had 254 MILLION per milliliter, i.e. per roughly one gram of water.

Other interesting factoids: unlike eukaryotic cells, which generally exhibit 'vertical transmission' of genetic information (family trees), bacteria and archaea (the two non-nucleated cellular types) make primary use of 'horizontal transmission', trading genes, or groups of related genes, like baseball cards, between like and unlike 'species' indiscriminately. This is why drug resistance, or virulence, spreads so quickly. Though some of this is done through special interconnecting tubes, the bulk comes from viruses.

Other viruses are able to incorporate themselves (with or without these fresh genes) into the host genome. Such integrated material may account for perhaps 10 to 20 percent of the bacterial genome, according to my readings. Genes like these are regularly deleted from these genomes, probably when their usefulness ceases, though how a bacterial cell could know that eludes me. New ones refresh the system regularly.

The big surprise, though, is what happens in OUR cells. It looks like maybe most of our total amount of DNA (the so-called 'junk') is of viral origin. Some is relatively fresh, requiring suppression or excision and deletion; the rest is of variable age, some of it from very ancient times in the history of life. And it has regulatory function. Much of the management of the genome comes from virally-transmitted genes.

Viruses can even infect other viruses! Surprises all around. But the most interesting thing is that the relative numbers of viral genes within any type of life (if you consider viruses part of this) vary typologically in exactly the same way as the numbers of ideophones vary in human languages considered against their morphosyntactic type.

There are other parallels: people have compared the genetic code to a phonological system, and proteins translated from genes are thought of as 'words'. But if one considers these letter by letter, as it were, then this isn't quite right. Instead, entire proteins are more like entire clauses, or multiclause formations. Consider the bacterial operon, a physically unbroken chain of genes all transcribed together into an unbroken messenger RNA, controlled by a single activating signal. When the RNA is translated into enzymes, each of these is connected as well, and each one in sequence takes the reaction product of the one before and passes its own new product off to the next. Serial verbs!!! In nucleated (eukaryotic) cells, gene clusters are broken up, not only from each other, but also internally, allowing variable editing of the messenger RNA, and translation of multiple protein products, all from the same underlying gene. This is why you and I only need around 23,000 genes in stored form, yet have hundreds of thousands, if not millions, of finished products (and this doesn't even start to include post-translational modifications (derivations!), which multiply things still more).
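To make the arithmetic behind "few stored genes, many finished products" concrete, here is a minimal sketch in Python. The gene, its exon names, and the rule that each optional exon is simply kept or skipped are all assumptions made for illustration; real splicing is constrained in ways this toy ignores.

```python
from itertools import product

# Hypothetical gene (not a real one): each exon is either always kept
# ("constitutive", True) or optional (False).
EXONS = [("E1", True), ("E2", False), ("E3", True),
         ("E4", False), ("E5", False), ("E6", True)]

def isoforms(exons):
    """Return every exon combination, preserving exon order.

    Constitutive exons are always included; each optional exon is
    independently kept or skipped (a simplifying assumption).
    """
    optional = [name for name, constitutive in exons if not constitutive]
    results = []
    for choice in product([False, True], repeat=len(optional)):
        keep = dict(zip(optional, choice))
        results.append(tuple(name for name, constitutive in exons
                             if constitutive or keep[name]))
    return results

variants = isoforms(EXONS)
print(len(variants), "possible messages from one stored gene")
for v in variants:
    print("-".join(v))
```

With k optional pieces the count is 2 to the power k, so even a modest number of optional segments per gene multiplies the stored inventory considerably, before any post-translational modification is counted.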

Gene products from such genes can come together (quaternary structure) to create larger units (hemoglobin with its four elements is a good example), but this is often hierarchical, not serial. Bacteria can do this too- I should point that out.

There is also something akin to Saussurean arbitrarization going on. In simple proteins all the information to fold up and become functional resides in the primary structure consisting of the ordering of the constituent amino acids defined by the translation of the gene's codons. The structure of the genetic code isn't close to being arbitrary, though there are minor variations found in some organisms as well as in cellular organelles, such as the mitochondrion, which were once free-living. All the variations revolve around a symmetrical underlying consensus form of the code. The amino acids specified by the codons have side chains that define their physical and chemical properties, as well as preferred functions within protein folds, internal bonds, docking sites, and catalytic sites. The code is arranged in such a way that any mutation will more likely than not give either the same amino acid (the code's degeneracy) or a new one with very similar properties. 'Sound symbolism' in ideophones is defined by diagrammatic iconicity utilizing the feature geometry of the language's phonology.

In many cases small changes in features will result in ideophones with similar meaning, for example in Japanese, where voiced stops connote the same idea as unvoiced ones, only a larger version (the periodic table arranges elements in similar fashion, but I won't here go into other parallels there as well). So the genetic code has much in common with phonology as used in ideophones, which possess the most iconic form/meaning mappings of segmental strings in language.
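The degeneracy of the code mentioned above can be checked directly. The following Python sketch builds the standard codon table and asks, for each codon position, what fraction of single-letter changes leave the amino acid unchanged. The function and variable names are mine, and only the standard table is used, not the organelle variants.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# ('*' marks a stop codon).
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_fraction(position):
    """Fraction of single-base changes at `position` (0, 1 or 2) that
    leave the encoded amino acid unchanged. Stop codons are skipped as
    starting points; a change into a stop counts as non-synonymous."""
    same = total = 0
    for codon, aa in CODE.items():
        if aa == "*":
            continue
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODE[mutant] == aa
    return same / total

for pos in range(3):
    print("codon position", pos + 1, round(synonymous_fraction(pos), 3))
```

This counts only changes that give exactly the same amino acid, so it understates the effect described above, which also includes swaps between chemically similar amino acids. Scoring "similar" would need a property grouping of the amino acids, which I have left out rather than invent one.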

As I've mentioned above, besides chemical and physical properties amino acids also have a role in certain cases in other functions of protein folding, etc. Yet it is nearly impossible to predict the three dimensional conformation of a protein from its primary structural sequence of amino acids. The best one can do is secondary structure, the alpha helices and beta sheets and similar structures, and then only approximately. Something else is going on. Similarly, though you might expect these structures, and higher level ones, to fold up spontaneously, in many cases this doesn't work.

There is a very important class of proteins in cells, called chaperonins or heat-shock proteins, that play the role of mandrel, or shoehorn, depending on your preference. Often partially folded new proteins come up against energetically ambivalent choices: go this way or that, or prefer a nonfunctional pattern that could be dangerous (various brain-wasting diseases caused by 'prions' are spread by misfolded proteins that then recruit other, 'normally' folded ones to change over to the dark side, as it were). The chaperonins also can re-fold many proteins that have been misfolded due to temperature or other issues. A very high percentage of proteins in cells make use of this mechanism.

So we have a sort of disconnect between iconic coding, as found in the mapping of the genetic code to many simpler protein products, and more arbitrarized coding, where conformation is mediated not by the code, but by functions and structures higher up in the system, as well as post-translational modifications that can add, subtract, or move elements of the protein.

I'll leave off here, and see whether anyone bites (Tom, are you out there? Perhaps not as 'out there' as me, no?). 

There are other parallels, and hopefully I'll be able to throw together a paper on this before long that might get some attention. 

Best to all,
Jess Tauber
phonosemantics at earthlink.net