22 October 2021

A philologist at Immanuel Kant Baltic Federal University has proposed treating the genetic code as a language with its own alphabet, grammar, and lexicon. His model rests on the importance of context: nucleotides carry no meaning on their own, but when they join into a triplet that corresponds to a particular amino acid in a protein, they can transmit information. As in any language, the genetic code develops according to certain laws, but in this work those laws are described in terms of linguistics rather than biochemistry. This approach offers a different perspective on the study of hereditary material and the principles of its evolution. The work was supported by the Russian Science Foundation (RSF), and the article was published in the journal Biosystems.

The genetic code is the form in which genetic information is recorded in a living cell. It is encoded as a sequence of nucleotides: messenger RNA is transcribed from a DNA template, and the cell's specialized “machines”, the ribosomes, use it to synthesize proteins. A protein is a chain of amino acids whose order is precisely determined by the original nucleotide sequence: triplets of nucleotides, called codons, specify which amino acid should be added to the end of the chain growing on the ribosome.
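As a minimal sketch of this codon-by-codon reading, the following Python snippet translates a short messenger RNA string into a chain of amino acids. The table is deliberately partial (five of the 64 codons of the standard genetic code), and the names `CODON_TABLE` and `translate` are illustrative assumptions, not part of the published model.

```python
# Partial codon table taken from the standard genetic code.
# Only a handful of the 64 codons are listed, for illustration.
CODON_TABLE = {
    "AUG": "Met",  # methionine, also the usual start codon
    "UUU": "Phe",  # phenylalanine
    "GGC": "Gly",  # glycine
    "GCU": "Ala",  # alanine
    "UAA": None,   # stop codon: no amino acid is added
}


def translate(mrna):
    """Read an mRNA string three nucleotides at a time.

    Stops at the first stop codon or at the end of the string.
    Raises KeyError for codons missing from this partial table.
    """
    chain = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:
            break
        chain.append(amino_acid)
    return chain


print(translate("AUGUUUGGCUAA"))  # ['Met', 'Phe', 'Gly']
```

A real translation routine would use the full 64-codon table and handle reading frames; this sketch only shows that meaning is attached to triplets, not to single nucleotides.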


“The biochemical representation and ‘grammar’ of the genetic code are known to every high school student. Yet there are still no models that take into account the context, that is, the environment of each nucleotide. External factors may influence the genetic code in a certain way and act as information phenomena that change its ‘meaning’. Earlier, I presented the possibility of a semiotic description. From this point of view, the genetic code becomes a system of signs to which a value is assigned in a certain way, just as in language”, says Suren Zolyan, the author of the work, a Ph.D. in Philology, Professor at the Institute of Humanities, Immanuel Kant Baltic Federal University.

He proposed a linguistic model of the genetic code based on a context-dependent grammar. If the grammar were context-free, each triplet could encode many amino acids and the position of a nucleotide within it would play no particular role, but the actual situation is different. Like any grammar, this one has rules and a certain linear ordering. The approach makes it possible to study the evolution of amino acid coding from the standpoint of linguistics rather than biochemistry. Since no languages are closely related to the nucleotide “language”, the method of internal reconstruction has to be applied: by comparing different fragments within the genetic code, one can try to find repetitive sequences characteristic of certain proteins.

Alongside the description of the biochemical substance of the genetic code, its processes can be presented as information phenomena and treated as semiotic systems. The notion of semiopoesis was proposed to describe a way of processing information in which the bio-world became organized through concepts of meaning and purpose. The heterogeneity and irregularity of the current standard genetic code make it possible to reconstruct its possible previous states and the various ways in which mechanisms of coding and textualization were formed. From this point of view, the external environment “forces” the cell to synthesize certain proteins for certain purposes. Because proteins are encoded by nucleic acids, DNA sequences change accordingly.

The scientist concluded that a linear context-free linguistic model was not suitable for such a problem. Instead, it is possible to use a grammar in which elements act both as context-dependent variables and as contextual operators (functors). Naturally, the alphabet includes only four elements, the nucleotides A (adenine), U (uracil), G (guanine), and C (cytosine). Since any nucleotide can occupy any position in a triplet, the guiding principle should be the functional characteristic that the nucleotide acquires when a triplet (codon) is formed, rather than its biochemical properties: the same elements, depending on the context, perform different functions and encode different amino acids.

The nucleotide becomes only one of the categories used in the grammar of the genetic code. The grammar also includes the triplet, a converter of a nucleotide into a duplet, and a converter of a duplet into a triplet. The “sections” of the grammar developed by the scientist can be called, for example, rules relating codon positions to nucleotides and rules of correspondence between codons and amino acids.
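The role of position and context can be illustrated with the standard genetic code itself. The sketch below is not the grammar from the article; it only builds the standard codon table and checks how the outcome depends on where a nucleotide sits and on which duplet precedes the third position. The names `CODE` and `duplet_outcomes` are illustrative assumptions. Amino acids are given in one-letter notation, with `*` for a stop codon.

```python
from itertools import product

# Standard genetic code in compact form: codons are enumerated with the
# first, second, and third base each running over U, C, A, G.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(t): aa for t, aa in zip(product(BASES, repeat=3), AMINO)}


def duplet_outcomes(duplet):
    """Return the set of amino acids reachable from a two-base prefix."""
    return {CODE[duplet + third] for third in BASES}


# The same nucleotide U contributes to different amino acids
# depending on its position in the triplet.
print(CODE["UGG"], CODE["GUG"], CODE["GGU"])  # W V G

# For some duplets the third nucleotide does not change the result ...
print(duplet_outcomes("GG"))  # {'G'}: glycine whatever comes third

# ... while for others it does.
print(sorted(duplet_outcomes("AU")))  # ['I', 'M']

# Duplets that fully determine the amino acid on their own.
all_duplets = ("".join(p) for p in product(BASES, repeat=2))
fixed = sorted(d for d in all_duplets if len(duplet_outcomes(d)) == 1)
print(fixed)
```

In this reading, a duplet behaves like a context that decides whether the third nucleotide carries any distinguishing function, which is the kind of dependence a context-free description would miss.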

“By studying the language of the genetic code, it will be possible to understand how structures appear, from short fragments such as signal tails, which show where in the cell the protein should go, to complex protein forms such as beta sheets, which consist of several linked amino acid chains. The latter play a key role in a number of diseases whose causes and course are still unclear, among them Alzheimer’s disease. Once the principles of the language are understood, it will be possible to process information without tautology, that is to say, without repetitions. This will save resources and help to solve problems in bioinformatics”, explains Suren Zolyan.

Suren Zolyan