The Genealogical World of Phylogenetic Networks

Biology, anthropology, computational science, and networks in phylogenetic analysis



July 5, 2015


In an earlier blog post, I discussed some of the evocative Metaphors for evolutionary relationships, particularly reticulating ones.

In that post I listed the concept of a "braided river", and mentioned a 1994 paper by John Moore as my earliest source for the image. However, the metaphor actually goes back more than 100 years earlier. It occurs as the central metaphor in this quite remarkable book on comparative religion:
Forlong, J.G.R. (1883) Rivers of Life: or Sources and Streams of the Faiths of Man in All Lands, Showing the Evolution of Faiths from the Rudest Symbolisms to the Latest Spiritual Developments. 2 vols. Bernard Quaritch: London.

James George Roche Forlong was a Scottish engineer serving in the British army that occupied India during the 19th century. He apparently had a life-long interest in comparative religion, and his book arose from his personal experience of non-Christian religions (facilitated by his knowledge of several languages). The book involves a serious re-interpretation of the evolutionary history of world religions, as a series of six inter-connecting rivers running from ancient times into the modern world, each river representing a different type of worship.

The illustrative chart that accompanies the book can be viewed here. A low-resolution copy is shown below.

June 30, 2015

There are several processes that create reticulate phylogenetic topologies, including hybridization, introgression (or admixture) and horizontal gene transfer (HGT). Biologically, introgression operates via the same mechanism as does hybridization (ie. during sexual reproduction), but it results in only a small amount of genetic material entering the recipient genome, making an admixed genome that is similar to the end result of HGT.

In practice, phylogenetic networks have been constructed somewhat differently for introgression and HGT than for hybridization. Hybridization has usually been tackled by merging incongruent tree topologies, based on the idea that the different topologies represent the phylogenetic history of the different genomes of the hybrid taxon. Introgression and HGT have usually been tackled by adding reticulation edges to a phylogenetic tree, on the basis that the tree represents the phylogenetic history of the main part of the genome.

So, the study of introgression (and HGT) involves (a) constructing a phylogenetic tree from some genomic sample, and (b) detecting the introgressed (or HGT) parts of the genome. This is potentially a problematic procedure, because how do we construct a phylogenetic tree from data that already contain non-tree components? Apparently, the expectation is that a single tree will be supported by the majority of the data, and the remainder will represent the introgressed (or HGT) pathway(s), plus whatever other components have created the observed genomic variability (such as incomplete lineage sorting, gene duplication-loss, and stochastic mutations).

Recently, there have been quite a few studies published that have adopted a specific protocol for this procedure, usually under the rubric of admixture. Most of these have involved the study of ancient human DNA, but there have also been studies of contemporary humans, as well as ancient non-humans. An example of the latter is shown in the next two figures, which represent parts (a) and (b), respectively. They are taken from this study of the relatives of horses: Hákon Jónsson, et alia (2014) Speciation with gene flow in equids despite extensive chromosomal plasticity. Proceedings of the National Academy of Sciences of the USA 111: 18655-18660.

The phylogenetic tree (step a) was constructed using "maximum likelihood inference and 20,374 protein-coding genes ... based on a relaxed molecular clock." So, only stochastic mutations were accounted for when constructing the tree, and not incomplete lineage sorting or gene duplication-loss.

The detection of introgression (step b) used "the D statistics approach, which tests for an excess of shared polymorphisms between one of two closely related lineages (E1 or E2) and a third lineage (E3)". The reticulations representing the detected gene flow were then added to the tree manually.

The D-statistic is also known as the ABBA-BABA test (see: Patterson NJ et alia. 2012. Ancient admixture in human history. Genetics 192: 1065-1093). It operates as follows for sets of four taxa, applied to character data.

Let the species tree be this, where E1–E3 are the three taxa being compared, and O is the outgroup:

There are three possible allele trees for each binary character (ie. single nucleotide polymorphism) in which states are shared pairwise:

In the first tree, E3 shares the ancestral character state with the outgroup, which is expected to be the most common pattern in the absence of gene flow. E1 and E2 share the ancestral state with the outgroup in the second and third trees, respectively.

The admixture test compares the ABBA tree to the BABA tree. The expectation is that if there has been no introgression then the data support for these two trees should be equal. That is, under the null hypothesis that there is no gene flow between the species (and the underlying species tree is correct), the difference in the expected number of occurrences of the ABBA and BABA patterns should be zero. Deviation from this expectation is statistically evaluated using a jackknife procedure.
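The arithmetic behind the test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the cited studies: the function names are hypothetical, the species tree is taken to be (((E1,E2),E3),O), the outgroup allele is treated as the ancestral state A, and the jackknife is a plain delete-one-block version (published analyses generally use a weighted block jackknife over large genomic blocks).

```python
from math import sqrt

def classify_site(e1, e2, e3, o):
    """Classify one biallelic site; the outgroup allele is taken as ancestral (A)."""
    if len({e1, e2, e3, o}) != 2:
        return None  # not a biallelic site
    derived = (e1 != o, e2 != o, e3 != o)
    patterns = {
        (True, True, False): "BBAA",   # E1 and E2 share the derived state
        (False, True, True): "ABBA",   # E2 and E3 share the derived state
        (True, False, True): "BABA",   # E1 and E3 share the derived state
    }
    return patterns.get(derived)  # None for uninformative sites

def d_statistic(abba, baba):
    """D = (ABBA - BABA) / (ABBA + BABA); expected to be 0 without gene flow."""
    total = abba + baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites")
    return (abba - baba) / total

def block_jackknife(blocks):
    """blocks: list of (abba, baba) counts, one pair per genomic block.

    Returns (D, standard error, Z). Simplified delete-one jackknife that
    assumes blocks of roughly equal size; Z is None if the error is zero.
    """
    n = len(blocks)
    tot_abba = sum(a for a, _ in blocks)
    tot_baba = sum(b for _, b in blocks)
    d_all = d_statistic(tot_abba, tot_baba)
    # Recompute D with each block left out in turn
    leave_one_out = [d_statistic(tot_abba - a, tot_baba - b) for a, b in blocks]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((d - mean_loo) ** 2 for d in leave_one_out)
    se = sqrt(variance)
    z = d_all / se if se > 0 else None
    return d_all, se, z
```

A call such as block_jackknife([(12, 8), (9, 11), (15, 5), (10, 10)]) uses made-up counts. A large absolute Z is what would be read as evidence against the null hypothesis of no gene flow.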

When there are more than three ingroup taxa, they are tested in groups of three (plus the outgroup). No correction for multiple hypothesis testing seems ever to be applied. Recently, the test has been extended to five taxa (Pease JB, Hahn MW. 2015. Detection and polarization of introgression in a five-taxon phylogeny. Systematic Biology 64: 651-662).
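To see why multiple testing matters here, it helps to count the tests. The following sketch is purely illustrative: the helper names are hypothetical, it assumes one test per unordered triplet of ingroup taxa, and the Bonferroni adjustment is just one simple (and conservative) correction, not something taken from the cited studies. The tests also share taxa, so they are not independent.

```python
from itertools import combinations

def triplet_tests(ingroup):
    """All unordered triplets of ingroup taxa (each is tested with the outgroup)."""
    return list(combinations(ingroup, 3))

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("need at least one test")
    return alpha / n_tests

# Six hypothetical ingroup taxa already produce twenty separate tests
taxa = ["T1", "T2", "T3", "T4", "T5", "T6"]
tests = triplet_tests(taxa)
threshold = bonferroni_alpha(0.05, len(tests))
```

With six ingroup taxa there are 20 triplets, so a nominal 5% threshold would shrink to 0.25% per test under this correction.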

Note that this test assumes that:
  • the "excess of shared polymorphisms" arises solely from gene flow, with or without incomplete lineage sorting, rather than from any other tree-like processes such as gene duplication-loss or ancestral population structure
  • there are no other sources of co-ordinated polymorphisms, such as character-state reversals due to adaptation / selection
  • any gene flow that does exist is due to introgression, rather than to hybridization or HGT.
How realistic these assumptions are is not immediately obvious.

June 28, 2015


I have noted before that common usage of expressions like "family tree" often extends far beyond actual pedigrees. This particular expression is often used to describe any sort of historical relationship, not just genealogical ones. It is also sometimes used simply to describe any sort of personal inter-connection. All of these usages occurred in a short-lived magazine from 25 years ago called Wigwag.

Wigwag magazine formally debuted in October 1989 (after a test issue in 1988), and published its last issue in February 1991, for a total of 15 issues. It was a sort of cozy version of the New Yorker magazine. Similarly, it had a number of regular features, such as the Road Trip, the Map, and Letters From Home. The one that is of interest to us was called The Family Tree.

This feature mapped cultural relationships, having been described as "a field guide to the genealogy of influence in American life". It included human relationships, but it also included things like cars (the tree of which is reproduced in the book by Nobuhiro Minaka & Kunihiko Sugiyama. 2012. Phylogeny Mandala: Chain, Tree, and Network) and comic-book superheroes.

I have been unable to locate any decent copies, but four of the "trees" are included below.

As you can see, sometimes The Family Tree was actually a genealogical tree, but just as often it was simply a network of pairwise cultural connections. The latter, of course, usually formed a complex network that did not really map historical relationships.

This last Family Tree is from the original trial issue, and shows the inter-relationships of the writers and producers of American TV sitcoms.

You can read a bit more about the magazine, and its history, here:

June 23, 2015


One of the perennially most popular posts in this blog has been the one about the domestication of dogs: Why do we still use trees for the dog genealogy?

In that post I noted that, up to 2012, there were three distinct trends in the presentation of the genealogy of dog breeds:
  1. the study of whole-genome data, in which the results are presented solely as a neighbor-joining tree
  2. the study of mtDNA sequence data, in which the results are presented both as a tree and as a haplotype network
  3. the study of combined Y-chromosome and mtDNA sequence data, in which the results are presented solely as a haplotype network.
This pattern has continued. For example, the following diagram is taken from:
Skoglund P, Ersmark E, Palkopoulou E, Dalén L (2015) Ancient wolf genome reveals an early divergence of domestic dog ancestors and admixture into high-latitude breeds. Current Biology 25:1515-1519.
The tree is based on mitochondrial genome data for the highlighted fossil, compared to the mitochondrial sequences of modern-day dogs and wolves, as well as ancient canids. The use of a phylogenetic tree seems to be based on the idea that mitochondria consist of tightly linked genes that are uniparentally inherited. However, neither of these characteristics is universal, and so a network might be more appropriate.

The dog genealogy is recognized as being characterized by introgression with wolves, as the authors themselves note. Also, dogs did not originate directly from modern wolves; rather, both modern wolves and modern dogs are derived from a common ancestor. For example, this next diagram is from:
Freedman AH, et alia (2014) Genome sequencing highlights the dynamic early history of dogs. PLoS Genetics 10:e1004016.
The width of each population branch is proportional to inferred population size. Note that wolves and dogs originated at roughly the same time, as the result of bottlenecks in the ancestral population size. Wolves diversified slightly earlier than dogs. Also, Skoglund et al. dispute the dating of the splits, suggesting that the dog-wolf divergence was "at least 27,000 years ago".

As a final note, there is a tendency to credit Charles Darwin with originating just about everything in the study of genealogy, although he was a synthesizer as much as an innovator. For example, David Grimm suggests (Dawn of the dog. Science 348: 274-279):
Charles Darwin fired the first shot in the dog wars. Writing in 1868 in The Variation of Animals and Plants under Domestication, he wondered whether dogs had evolved from a single species or from an unusual mating, perhaps between a wolf and a jackal.

However, the first hypothesized genealogy was actually published more than a century earlier, by Georges-Louis Leclerc, comte de Buffon (see the blog post on The first phylogenetic network), who suggested a common origin with wolves.

June 21, 2015


We all know how painful it is to deal with computer login passwords. Computer administrators keep telling us to have "secure" passwords, and to not reuse them, but of course we ignore this advice. Who can remember all of these passwords anyway? So, we keep them simple, and we reuse them.

The SplashData group, which markets what they call a "secure password and record management solution", provide an annual list of the 25 most common passwords found on the Internet. These are compiled from leaked passwords posted online by hackers. I have looked at the lists for 2011, 2012, 2013 and 2014.

As usual, I have used a phylogenetic network as a form of exploratory data analysis. I first used the Steinhaus similarity to calculate the pairwise similarity of the 43 passwords that appear. This similarity ignores what are called "negative matches" (which is important because most of the passwords do not appear in the lists for all four years). This was followed by a Neighbor-net analysis to display the between-word similarities as a phylogenetic network. So, passwords that are closely connected in the network are similar to each other based on their popularity across the four years, and those that are further apart are progressively more different from each other. Those passwords that are in the top 25 for all four years are marked in red.
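For readers unfamiliar with the Steinhaus similarity, here is a minimal sketch of the calculation. The scoring of the passwords is an assumption made for illustration only (a popularity score per year, with 0 when the password is not in that year's list), and the function name is hypothetical. The Neighbor-net step is not shown.

```python
def steinhaus_similarity(x, y):
    """Steinhaus similarity: 2 * sum(min(x_i, y_i)) / (sum(x) + sum(y)).

    Years where both scores are 0 ("negative matches") add nothing to
    either the numerator or the denominator, so they are ignored.
    """
    if len(x) != len(y):
        raise ValueError("score vectors must have the same length")
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("both vectors are empty of scores")
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 2 * shared / total

# Hypothetical popularity scores for 2011-2014 (0 = not in the top 25)
scores = {
    "pw_one":   [25, 25, 24, 25],
    "pw_two":   [24, 24, 25, 24],
    "pw_three": [0, 0, 5, 10],
}
sim_close = steinhaus_similarity(scores["pw_one"], scores["pw_two"])
sim_far = steinhaus_similarity(scores["pw_one"], scores["pw_three"])
```

The pairwise values would then be converted to distances (for example, one minus the similarity) before being passed to a network program.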

You will note the similarity among many of these passwords. They are mostly simple combinations of numbers, words, or a row of keys on the standard English keyboard. Obviously, these are not secure passwords.

The number one and number two passwords for all four years were "password" and "123456", with "12345678" right behind. Oddly, there has been a distinct increase in "1234", "12345" and "123456789" during the years; they are grouped at the bottom right of the network. The passwords grouped at the bottom left have decreased in popularity through time.

Clearly, many people do not take login security very seriously. However, the problem also comes from the fact that system administrators fob the job of security off on the users. There have been many discussions of the lunacy of asking users to use unique "secure" passwords for each and every system (eg. Robert McMillan, of Wired magazine: Do you really need a password you can barely remember?). Indeed, Mat Honan, also writing at Wired magazine, has pointed out that even secure passwords are out of place in the Internet world (Kill the password: why a string of characters can’t protect us anymore). It will be interesting to see what happens next.

June 16, 2015


The standard references for the conceptual history of phylogenetic trees and networks are:
  • Ragan, Mark A. (2009) Trees and networks before and after Darwin. Biology Direct 4: 43.
  • Tassy, Pascal (2011) Trees before and after Darwin. Journal of Zoological Systematics and Evolutionary Research 49: 89-101.

Over the past 20 years, a number of books have appeared that expand upon this topic, and it is worthwhile to list them here.

• Barsanti, Giulio (1992) La Scala, la Mappa, l'Albero: Immagini e Classificazioni Della Natura Fra Sei e Ottocento. Firenze: Sansoni Editore.
In Italian. A conceptual history of classification up to the time of Haeckel, and the beginning of phylogenetics. Covers the development of ideas and images equally. Networks are treated as "maps".
• Stevens, Peter F. (1994) The Development of Biological Systematics: Antoine-Laurent de Jussieu, Nature, and the Natural System. New York: Columbia University Press.
A conceptual history of classification and phylogenetics, mainly as related to plants. Focuses on the early development of ideas within biosystematics, with accompanying illustrations. Networks are effectively treated as variants of trees.
• Pietsch, Theodore W. (2012) Trees of Life: a Visual History of Evolution. Baltimore: Johns Hopkins University Press.
A richly illustrated history of trees, with a few networks. Focuses on the illustrations, with some accompanying text. The best source to see what people have drawn in the way of trees, but weaker on networks.
• Minaka, Nobuhiro, and Sugiyama, Kunihiko (2012) Phylogeny Mandala: Chain, Tree, and Network. Tokyo: NTT Publishing.
In Japanese. Covers the development of the tree metaphor (with a few networks), as related to pedigrees, phylogenies, and knowledge representation in general. The breadth of the topic is indicated in the "mandala" of the title, which is "a generic term for any diagram, chart or geometric pattern that represents the cosmos metaphysically or symbolically".
• Archibald, J. David (2014) Aristotle's Ladder, Darwin’s Tree: The Evolution of Visual Metaphors for Biological Order. New York: Columbia University Press.
A conceptual history of trees, with a few networks. Focuses on the development of the ideas, with accompanying illustrations. Starts with pedigrees, and proceeds from there to phylogenies. The best coverage of phylogeny concepts, but explicitly treats networks as "trees with reticulations".

June 14, 2015


I have noted several times in this blog that it is not just biological organisms that can be considered to have a phylogenetic history. Many human artifacts also do, provided that their history results from diversification from a common ancestor. For example, there are blog posts about the following topics:
All of these can be considered to have a phylogenetic history of shared common ancestors. For instance, manuscript copies do share ancestors — the source manuscripts that have been copied.

However, while all human artifacts have a history, not everything has a phylogenetic history. There can be transformational history, for example, where concepts simply change through time without diversifying. This can be represented by a timeline rather than a phylogeny, as discussed in these blog posts:

There are also situations where artifacts simply cluster together, based on their similarity. This can be represented as a tree-like diagram or a network, but such a tree/network is not a phylogeny, because the clustering does not necessarily have anything to do with common ancestry. Examples discussed in this blog include:

The problem with this latter situation is that we can always mathematically measure the similarity between concepts or objects, and therefore we can always cluster them based on this similarity, even if the clusters have little meaning. I have previously discussed this issue in this blog, noting that if the similarity measure used does not model evolutionary patterns then it cannot be expected to produce a phylogeny (Non-model distances in phylogenetics).

Another case in point is the work of William Shakespeare. Can the plays, for example, be considered to have a phylogeny? Each play certainly has a phylogeny on its own, because the Shakespearean author is well known for having taken the ideas for the plays from previous sources. So, each play has a phylogeny (a reticulate history) based on the historical connections among its sources. However, the plays as a group do not have a phylogeny (not unless they have been plagiarized from each other, anyway). Does Othello really share a common ancestor with King Lear? It certainly has similarities, if only on the basis that it is one of the Tragedies (along with Macbeth, etc). But they are not phylogenetic similarities, and there is no common ancestral Shakespearean play.

As shown by the picture above, this point is not always appreciated. The alleged phylogeny is taken from a press release from the Lawrence Berkeley National Laboratory. The textual similarities among the plays are based on what are called "feature frequency profiles", which have nothing to do with evolutionary history. So, while the data analysis may or may not be helpful for identifying the author(s) of the so-called Shakespearean plays, it is not much help for constructing a phylogeny.
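To make the point concrete, here is a rough sketch of a feature-frequency comparison between two texts. It is not the analysis from the press release: the feature length k, the restriction to letters, the use of Jensen-Shannon divergence as the distance, and the function names are all assumptions chosen for illustration.

```python
from collections import Counter
from math import log2

def feature_frequency_profile(text, k=3):
    """Relative frequencies of all overlapping k-letter strings in a text."""
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    if len(letters) < k:
        raise ValueError("text is shorter than the feature length")
    counts = Counter(letters[i:i + k] for i in range(len(letters) - k + 1))
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two profiles; 0 = identical."""
    divergence = 0.0
    for feature in set(p) | set(q):
        pf = p.get(feature, 0.0)
        qf = q.get(feature, 0.0)
        mid = (pf + qf) / 2
        if pf > 0:
            divergence += 0.5 * pf * log2(pf / mid)
        if qf > 0:
            divergence += 0.5 * qf * log2(qf / mid)
    return divergence

# Any two strings yield a distance, whether or not they share any history
a = feature_frequency_profile("To be, or not to be, that is the question")
b = feature_frequency_profile("Now is the winter of our discontent")
distance = jensen_shannon(a, b)
```

Nothing in this calculation refers to copying or descent. It returns a number for any pair of texts, which is exactly why a tree built from such numbers is a clustering and not a phylogeny.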

The data analysis is discussed in more detail by:

June 9, 2015


What is a language?

It is not easy to define exactly what a language is. We find one reason for this in the daily use of the word “language” in non-linguistic contexts. What we call a language does not depend on purely linguistic criteria. The criteria we normally use are social and cultural.

If we were to define languages with the help of linguistic criteria, we would use the degree to which speakers understand each other; and in most cases, we could draw some line around areas of what linguists would call “mutual intelligibility” (similar to the criterion of “interbreedability” in biology). But mutual intelligibility does not usually serve as the criterion by which we define languages in everyday situations. For example, we tend to say that the people from Shanghai, Beijing, and Meixian (all cities in China) all speak “Chinese”. On the other hand, we think that people from Scandinavia speak “Norwegian”, “Swedish”, and “Danish”, although these three are no more different from each other than are the former three.

The table above (taken from List 2014: 11f, with adaptations) gives phonetic transcriptions of translations of the sentence “The North Wind and the Sun were disputing which was the stronger” in three Chinese “dialects” (Beijing Chinese, which is also called Mandarin or Standard Chinese, spoken in Beijing and all over the country as a second language; Shanghainese, spoken in Shanghai; and Hakka Chinese, spoken in Meixian), and three Scandinavian “languages” (Norwegian, Swedish, and Danish). In the table, I have put all words that have the same meaning in one column (ie. I have aligned them semantically). Furthermore, I have highlighted the words which share a common etymological origin (call them “homologs” or “cognates”) with a gray background. In red, I have added a more or less literal translation of the respective column.

As the phonetic transcriptions of the sentences show, the Chinese varieties differ from each other to a similar degree as the Scandinavian ones, if not to a greater one. And we find this variation both in the way the meaning of the sentence is expressed by the choice of words, and in the degree of etymological similarity between the words. Note, further, that none of the three Chinese dialects is mutually intelligible with any other of the dialects, while we know from famous TV series like Broen/Bron that Danish and Swedish people can often understand each other quite well (with some effort); and Norwegian and Swedish are mutually intelligible most of the time. Nevertheless, we address the latter three speech traditions as the three languages “Norwegian”, “Swedish”, and “Danish”, while we say that the speech varieties of the people in Shanghai, Beijing, and Meixian are merely specific variants of one and the same “Chinese” language.

Languages as Diasystems

One could say that what we are facing here is just a cultural problem, not a linguistic one. We could then say that there are two different ways of distinguishing languages from dialects. One would be the linguistic one, which uses mutual intelligibility as the sole criterion to tell languages from dialects. The other one would be the cultural definition of languages as, say, “dialects with an army” (a definition usually attributed to Max Weinreich, the father of the dialectologist Uriel Weinreich).

But this is, unfortunately, only part of the real story, since the cultural definition of the boundaries of a language has a direct impact on the way languages evolve. In societies such as China, for example, a very large proportion of all speakers is bilingual. Apart from their home dialect, speakers are also able to speak Standard Chinese (also called Mandarin Chinese), and they use it to talk to people from different regions, or to read and to write. So, from a purely linguistic viewpoint, it is not necessarily useful to break up the Chinese dialects into distinct languages, since these dialects are located within a larger speech society that is united by a common language for written and interdialectal communication.

In order to describe this complex structure of our modern languages, linguists have proposed the model of the “diasystem”, which is very common in the discipline of sociolinguistics. This model goes back to the aforementioned dialectologist Uriel Weinreich (1926–1967) who originally thought of some linguistic construct which would make it possible to describe different dialects in a uniform way (Weinreich 1954). According to the modern form of the model, a language is a complex aggregate of different linguistic systems, “which coexist and mutually influence each other” (Coseriu 1973: 40, my translation from the German).

An important aspect for determining a linguistic diasystem is the presence of a “Dachsprache” (“roof language”). This is a linguistic variety that serves as a standard for interdialectal communication (Goossens 1973: 11). The different linguistic varieties (dialects, but also sociolects) that are connected by such a standard constitute the “variety space” of a language (Oesterreicher 2001). I have tried to illustrate this in the following figure (taken from List 2014: 13).

As you can see from the figure, there are different “dimensions” according to which the varieties of a language can differ. The figure shows three of them. First, there are “diatopic varieties” which point to the division of a language into different dialects (varying regarding the place where they are spoken).

Second, there are “diastratic varieties”, pointing to different social layers in which the varieties are used. Compare, for example, the language of a football player with that of a politician, which are similar in their tendency to say nothing in many words (especially after hard defeats or before unpopular decisions that have to be announced to the public), but which differ a lot regarding their choice of words. Third, there are “diaphasic varieties”, which are varieties depending on the situation in which people speak. Compare, for example, the way our politician speaks when giving a speech to the public with the way the same politician speaks when discussing big politics behind closed doors.

But these three dimensions of language variation are not all that a diasystem of a language has to offer! We can further identify different speech habits when looking at the medium that is used to produce language; and there are significant differences in many respects between writing and reading on the one hand, and speaking and listening on the other. This dimension is commonly called “diamesic” (varying according to the “medium”).

Last, but not least, we should also note that we do not necessarily speak and understand the language from only one time. Think of modern German kids in school who are forced to read Goethe's Faust, bitterly lamenting the old-fashioned style of the language, but think also about different generations of speakers living in the same speech society. This last dimension of language variety is usually called the “diachronic dimension”. The following image tries to summarize the different dimensions in which the diasystem of a language can vary.

Diasystematic aspects of language change

Given all of these fancy terms starting with “dia” and ending in “ic”, one may think that they are a mere intellectual game developed by a bunch of linguist geeks who are interested in sociology. Why can't we just forget about all these different kinds of “variation” and keep on modeling our languages as bags of words? Applying computational methods from biology will be much easier, and as long as we use networks once in a while, we are not completely giving ourselves over to the dark side of the Force, which knows only trees. Unfortunately, this is not possible, since the diasystematic structure actually has an impact on the way in which languages change!

As an example from practice, let me tell you how I tried to buy cigarettes when I was in China for the first time. At the time, I had just started to learn Mandarin Chinese, and was really suffering from the difficulty of the language. But I had searched my dictionary several times, and looked up all the important words I needed to tell the man at the kiosk which cigarettes I wanted to have. My choice was “Marlboro”, since it was the only brand I recognized.

Although I had only a complete beginner's knowledge of Chinese, I knew, as a linguist, that the language is peculiar in one specific respect: it has a very, very restricted structure of possible syllables. So one can't say “Saint Petersburg” in Chinese, since syllables in Chinese are not allowed to end in a “t” (as in “Saint”), an “s” (as in the syllable “ters”), or a “g” (as in the syllable “burg”). Instead, Chinese speakers will say Shèngbǐdébǎo. I also knew that there is no sound for “r”, and that this sound is often rendered by using an “l” instead.

So, based on this background knowledge, I “translated” the pronunciation of the word “Marlboro” into what I thought by then was perfectly understandable Mandarin, and told the man at the shop that I wanted to have a packet of mābóluō cigarettes. Unfortunately, he didn't understand at all what I wanted, and only when I pointed with my finger to the packets of Marlboro cigarettes did he finally understand, and say, “Ah, wànbǎolù!”.

So, I learned that “Marlboro” in Mandarin Chinese is called wànbǎolù, not mābóluō, written 万宝路, literally meaning 10 000-treasure-road, which can be translated as “road of 10 000 treasures”. (Good brand name, actually, especially for cigarettes.) It was only some months later that I understood why my prediction for the Mandarin Chinese pronunciation of “Marlboro” failed so dramatically, when I heard people from Hong Kong pronouncing the word wànbǎolù 万宝路 in Cantonese, the Chinese dialect they speak in Hong Kong. There, wànbǎolù 万宝路 becomes something like [maːn²²-pow³⁵-low³²] (the numbers are tone marks), which sounds very, very similar to the mābóluō I had falsely predicted for Mandarin Chinese.

In the image above, I have tried to depict the process by which “Marlboro” becomes the “road of 10 000 treasures”. What we are dealing with here is a complex pattern of change: both phrases, Mandarin Chinese wànbǎolù and Cantonese [maːn²²-pow³⁵-low³²], are homologous. This applies to their three parts (10 000 + treasure + road), since the phrase itself was presumably not present in earlier stages of Chinese. In the ancestor language of Cantonese and Mandarin Chinese, a variety we usually call “Middle Chinese” (spoken around 600 AD), the phrase “road of 10 000 treasures” would have sounded approximately like [mjon³-paw²-lu³]. In Mandarin Chinese, the pronunciation changed greatly, while it changed only slightly in Cantonese.

When Marlboro entered China, it was probably only sold in Hong Kong in the beginning. So, in order to trigger the interest of Hong Kong consumers, the marketing strategists did a good job in choosing a translation that sounded very similar to the original product name while at the same time having a nice and promising meaning. They would use Chinese characters to write down the product name. When Marlboro, or the “road of 10 000 treasures”, then entered the rest of China, people would read the phrase, but pronounce it in their own way: reading the Chinese characters in Mandarin Chinese just yields wànbǎolù, and not mābóluō.

The transfer of the word from one dialect to another was thus made via the diamesic dimension, via the writing system, not via the spoken language. And this is the way that many, many words (also very basic terms) are exchanged between the Chinese dialect varieties — via their “roof language”, which is the common writing system. And since this change doesn't involve the direct borrowing of a spoken word, it is barely perceivable, since it leaves no direct traces in the pronunciation of the words. While normal borrowings in other languages usually sound outlandish, borrowings in Chinese dialects which make their way from one variety to another via the writing system just sound like any other possible word in the recipient dialect.
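The mechanism can be mimicked with a tiny lookup table. This is only a toy sketch: the table and function names are hypothetical, and the readings are simply the three transcriptions quoted in this post, split syllable by syllable. The written form stays constant, and each variety supplies its own pronunciation.

```python
# Readings of the three characters, taken from the transcriptions in this post
READINGS = {
    "Middle Chinese": {"万": "mjon³", "宝": "paw²", "路": "lu³"},
    "Mandarin": {"万": "wàn", "宝": "bǎo", "路": "lù"},
    "Cantonese": {"万": "maːn²²", "宝": "pow³⁵", "路": "low³²"},
}

def read_aloud(written, variety, sep="-"):
    """Pronounce a written phrase character by character in one variety."""
    table = READINGS[variety]
    return sep.join(table[character] for character in written)

brand = "万宝路"  # chosen in Hong Kong because its Cantonese reading resembles "Marlboro"
in_cantonese = read_aloud(brand, "Cantonese")
in_mandarin = read_aloud(brand, "Mandarin", sep="")
```

The borrowing travels as the key of the table (the characters), not as a value (the sounds), which is why the Mandarin result carries no trace of the Cantonese pronunciation.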


In the same way in which languages may change via the interaction between their written and spoken varieties, the interaction between varieties from the other dimensions may also trigger change. Words originating in one social layer may be transferred to other layers; dialect words of one dialect may become popular and henceforth be used in all dialects; and even those varieties of our languages which are only accessible via stories or books may be revived, at least in part, and find a new steady place in our regular speech, up to the moment where we again cease to use them. The diasystematic structure of languages plays a crucial role in their development. Due to the diasystematic character of languages, language change involves complex network-like structures within one and the same (dia)system. If we really aim to depict language evolution in all its complexity, then it is definitely not a good thing to ignore the diasystematic aspect of languages.


Coseriu, E. (1973) Probleme der strukturellen Semantik. Vorlesung gehalten im Wintersemester 1965/66 an der Universität Tübingen. Tübingen: Narr.

Goossens, J. (1973) “Sprache”. In: Niederdeutsch. Sprache und Literatur. Eine Einführung. Vol. 1. Ed. by J. Goossens. Neumünster: Karl Wachholtz.

List, J.-M. (2014) Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.

Oesterreicher, W. (2001) “Historizität, Sprachvariation, Sprachverschiedenheit, Sprachwandel”. In: Language Typology and Language Universals. An International Handbook. Ed. by M. Haspelmath. Berlin and New York: Walter de Gruyter, 1554–1595.

Weinreich, U. (1954) “Is a structural dialectology possible?” Word 10.2/3, 388–400.

June 7, 2015


A few years ago I wrote a few guest posts on the Guest Blogge at the Scientopia site (as described in the post Moonlighting at Scientopia). That blog has been inactive since 2013. Unfortunately, gremlins have since crept into the system, and none of my posts is credited to me any longer. The by-line for all of the posts is now "Christina Pikas" rather than "David Morrison". She is the author of a different blog at Scientopia, Christina's LIS Rant.

I hope that this sort of thing does not happen with my online research publications as well!

June 2, 2015


There are at least two misleading expressions that one very commonly encounters in the professional phylogenetics literature: "basal branch of the tree", and "derived species".

The first expression is used to refer to an unbranched lineage arising near the common ancestor, when compared to a more-branched lineage. For example, in the first diagram below we might say that taxon A is on a "basal branch", whereas taxon B is not. The taxa associated with taxon B are then referred to as the "crown" of the tree. But, how can one lineage be more basal than another? After all, both lineages connect to the "base" of the tree at the same point. To claim that one is basal and the other not is like saying that one brother is more basal than another in a family tree just because he has fewer children!

The second expression refers to a species that has more "derived" characters than another. For example, in the diagram we might say that taxon B is more derived than taxon A. Characters change from ancestral to derived through time (eg. scaly skin covering is ancestral while fur is derived, because the latter arose later in time). However, this does not make any species more derived. It is the characters that are derived not the species — each species has a combination of ancestral characters and derived ones (including humans).

These issues seem to arise from the tree iconography. Some people seem to conceptualize this as a pine tree rather than a bush (as drawn by Charles Darwin in the Origin). A pine tree, indeed, does have basal branches and a crown. Here is an example from a sign in my local botanical garden, which tries to explain plant phylogenetic relationships to the general public. This tree does, indeed, have basal branches and a distinct crown.

This issue seems to have started with Ernst Haeckel in the late 1800s. Haeckel's first phylogenies (see Who published the first phylogenetic tree?) were drawn as multi-branched bushes, rather similar to the diagram that Darwin himself had published. However, Haeckel then veered away from this approach when explicitly discussing the evolution of humans. Here, he drew a tree with a distinct central trunk and much smaller side-branches (presumably modeled on an oak tree, rather than a bush). This image emphasizes one particular lineage at the expense of the others, because there is one taxon obviously sitting at the crown of the tree while the others are relegated to side-branches.

E. Haeckel (1874) Anthropogenie oder Entwickelungsgeschichte des Menschen. Engelmann, Leipzig.
This approach to drawing a phylogeny can be used to put any chosen organism at the crown of the tree, not just human beings, as illustrated by the following diagram from James Scott (which looks like it is actually modeled on a pine tree). This is a fundamental characteristic of a phylogeny — it can be drawn so that any part of the diagram is at the crown. However, to be accurate it should always be drawn so that no one lineage is emphasized over any other one — there should be no taxa sitting at the crown.

J.A. Scott (1986) The Butterflies of North America: a Natural History and Field Guide. Stanford University Press, Stanford.
Distorted images occur in several ways in modern evolutionary biology. This topic has received considerable attention in the literature, and there are a number of very readable expositions of various parts of it. Here is a brief list.

Gregory T.R. (2008) Understanding evolutionary trees. Evolution: Education and Outreach 1: 121-137.

O'Hara R.J. (1992) Telling the tree: narrative representation and the study of evolutionary history. Biology and Philosophy 7: 135-160.

Crisp M.D., Cook L.G. (2005) Do early branching lineages signify ancestral traits? Trends in Ecology and Evolution 20: 122-128.

Krell F.-T., Cranston P.S. (2004) Which side of a tree is more basal? Systematic Entomology 29: 279-281.

Omland K.E., Cook L.G., Crisp M.D. (2008) Tree thinking for all biology: the problem with reading phylogenies as ladders of progress. BioEssays 30: 854-867.

Sandvik H. (2009) Anthropocentrisms in cladograms. Biology and Philosophy 24: 425-440.

May 31, 2015


On this blog we occasionally draw your attention to the overlap between the scientific world and the artistic world. The language tree shown below is from the Stand Still Stay Silent site, which describes itself as "a post apocalyptic webcomic with elements from Nordic mythology". The tree data apparently come from the Ethnologue language database.

The detail about the Nordic languages derives from the fact that the author, Minna Sundberg, is Finnish-Swedish, and the Scandinavian languages have next to nothing in common with the Finno-Ugric languages.

Posters and prints of the tree are available for purchase.

May 26, 2015


Charles Darwin's most poetic published words concern his image of the Tree of Life. However, he did not claim to have originated the image. For example, Alfred Russel Wallace had already used it. Recently, the Natural History Apostilles blog has mentioned another important predecessor of both Englishmen, the Frenchman Charles Naudin, who deserves wider recognition.

Darwin's well-known words from On the Origin of Species (1859) are:
The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth. The green and budding twigs may represent existing species; and those produced during each former year may represent the long succession of extinct species ... As buds give rise by growth to fresh buds, and these, if vigorous, branch out and overtop on all sides many a feebler branch, so by generation I believe it has been with the great Tree of Life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications.

Wallace seems to have developed the Tree of Life metaphor quite independently (1855. On the law which has regulated the introduction of new species. Annals and Magazine of Natural History, 2nd series 16: 184-196):
"the analogy of a branching tree [is] the best mode of representing the natural arrangement of species ... a complicated branching of the lines of affinity, as intricate as the twigs of a gnarled oak ... we have only fragments of this vast system, the stem and main branches being represented by extinct species of which we have no knowledge, while a vast mass of limbs and boughs and minute twigs and scattered leaves is what we have to place in order, and determine the true position each originally occupied with regard to the others."

Darwin freely admitted having read Wallace's work. Moreover, he was well aware of the other of his predecessors, Charles Naudin, because on p.167 of his 'Books Read' and 'Books to be Read' notebook of 1852-1860 (see Darwin Online CUL-DAR128) he recorded:
"Revue Horticol Imp. 1852. p. 102. Naudin Consid. Phil, sur l'espèce"

Charles Naudin's words are these, roughly translated from the original French (1852. Considérations philosophiques sur l'espèce et la variété. Revue Horticole, 4th series 1: 102-109) [NB. the long convoluted sentences are in the original]:
This doctrine of inbreeding among organic beings of same family, the same class, and perhaps of the same kingdom, is not new; men of talent, both in France as well as abroad, among them our learned Lamarck, have supported it with all of the authority of their names. We do not deny that, on more than one occasion, they have reasoned upon assumptions which were not adequately supported by observation, that they did sometimes apply to the facts forced interpretations, that finally resulted in exaggerations that have mainly helped to push their ideas. But these defects in details do not diminish the greatness and perfect rationality of the whole system that, alone, reflects, by the community of origin, the great fact of the organizational community of the other living beings of the same kingdom, the primary basis of our rankings of species into genera, families, orders and phyla. In the opposing system now in vogue, in this system which involves many partial and independent creations we recognize or think we recognize as distinct species, one is forced to be logical, to admit the similarities exhibited by these species are only fortuitous coincidence, that is to say an effect without a cause, concluding that the reason is not acceptable. 
In our own [system], on the contrary, these similarities are both the consequence and proof of a relationship, not metaphorical, but real, that they hold a common ancestor, which they left at times more or less remote and through a series of intermediaries greater or fewer in number; so they express the true relationships between species by saying that the sum of their mutual similarities is the expression of their degree of relationship, as the sum of the differences is that of the distance they are from the common stock from which they derive their origin.
Considered from this point of view, the plant kingdom would present, not as a linear series whose terms would increase or decrease in organizational complexity, according as we consider starting with one end or the other; it would not be more of a disordered tangle of intersecting lines, like a geographical map, whose regions, different in shape and size, would touch by a greater or lesser number of points; it would be a tree the roots of which, mysteriously hidden in the depths of cosmological time, would have given birth to a limited number of successively divided and subdivided stems. These first stems would represent the primordial types of the kingdom; their last ramifications would be the current species.
It follows from there that a perfect and rigorous classification of the other organized beings of the same kingdom, of the same order, of the same family, if something other than the family tree even of the species, indicates the relative age of each, its degree of speciation and the line of ancestors from which it descended. Thereby would be represented, in a manner of some sort so palpable and material, the different degrees of relationship of the species, such as that of groups of varying degrees, dating back to the primordial kinds.
Such a classification, summarized in a graphical table, would be seized with much facility by the mind through the eyes, and present the most beautiful application of this principle generally accepted by naturalists: that nature is avaricious [stingy?] of causes and prodigal of effects.

This is quite clearly a description of a modern phylogenetic tree, and the taxonomic consequences of adopting that conception.

It is, however, rather a pity that he explicitly rejects a network ("a disordered tangle of intersecting lines") as a suitable model, along with the chain ("a linear series").

May 24, 2015


We are often told that flying is the safest way to travel, at least as far as the use of commercial airlines is concerned. In an early stand-up comedy routine, Shelley Berman noted: "Statistics prove that flying is the safest way to travel. I don't know how much consideration they've given to walking!" Well, actually, they have included walking.

Governments like to keep track of these things, and the Department for Transport in Great Britain has released statistics on "Passenger casualty rates for different modes of travel" for 2003-2012. These modes include:
  • Air (passenger casualties in accidents involving UK registered airline aircraft)
  • Rail (passenger casualties involved in train accidents and accidents occurring through movement of railway vehicles)
  • Water (passenger casualties on UK registered merchant vessels)
  • Bus or coach (passenger casualties)
  • Car (driver and passenger casualties)
  • Van (driver and passenger casualties)
  • Motorcycle (driver and passenger casualties)
  • Pedal cycle
  • Pedestrian
The data are yearly averages for Great Britain from 2003-2012 inclusive, standardized as persons per billion passenger kilometres. The data are provided separately for the number of people killed, seriously injured, or slightly injured.

As usual, we can employ a phylogenetic network as a form of exploratory data analysis for these data. I first used the Manhattan distance to calculate the dissimilarity between the seven transportation modes for which there are complete data, followed by a Neighbor-net analysis to display the between-mode similarities as a phylogenetic network. So, modes that are closely connected in the network are similar to each other based on their accident figures across the ten years, and those that are further apart are progressively more different from each other.
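For readers who want to see what the distance step amounts to, here is a minimal sketch in Python. The mode names and numbers are hypothetical placeholders, not the Department for Transport figures, and I am assuming that each mode is described by three rates (killed, seriously injured, slightly injured) with no further transformation. The Neighbor-net step itself is not reproduced here.

```python
from itertools import combinations

# Hypothetical placeholder rates, NOT the published figures.
# Each mode maps to (killed, seriously injured, slightly injured).
rates = {
    "mode_a": (1.0, 10.0, 100.0),
    "mode_b": (2.0, 14.0, 90.0),
    "mode_c": (30.0, 400.0, 1300.0),
}

def manhattan(u, v):
    """Sum of absolute differences between two equal-length profiles."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Pairwise distances between all modes, keyed by the (sorted) pair of names.
distances = {
    (m1, m2): manhattan(rates[m1], rates[m2])
    for m1, m2 in combinations(sorted(rates), 2)
}

for pair, dist in distances.items():
    print(pair, dist)
```

A distance table of this kind is what would then be handed to a network program. Note that with raw rates the largest category tends to dominate the sum, so in practice one might want to standardize the columns first.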

The probability of incidents increases from right to left in the graph.

Some notable conclusions from the data are:
  • The probabilities of being killed, seriously injured or even slightly injured are all minuscule for air travel compared to anything else. This is a topic explored more thoroughly in an earlier blog post (A network analysis of airplane disasters).
  • You are much more likely to be injured in a bus than in a van, but more likely to be killed in the van than in the bus.
  • You are slightly more likely to be killed walking than cycling, but much more likely to be injured cycling.
  • A motorbike is the most effective way to get killed or seriously injured in Britain.

The walking versus cycling data are likely to surprise many people, but the average data across the 10 years are clear:

[Table only partly preserved: rows Killed, Seriously injured, Slightly injured; columns include Pedal cycle and Motorcycle; the only surviving values are 92 and 1,043.]
Danny Yee (Walking and cycling: relative risks) provides one explanation:
People who wouldn't even contemplate wearing special high-visibility clothing or a helmet for a walk to the shops do so when cycling the same route.

May 19, 2015


Splits graphs are a useful way of displaying contradictory information within evolutionary datasets, either incompatible characters (ie. those that cannot fit onto a single tree) or incompatible trees. Since the graphs are unrooted, they are usually treated as a form of multivariate data display, rather than interpreted as depicting evolutionary history.
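As a small illustration of what "incompatible" means here, the following Python sketch applies the standard pairwise test: two splits of the same taxon set can sit on one tree only if at least one of the four intersections between their sides is empty. The taxon labels A to E and the function name are hypothetical and used only for illustration; this is not taken from any particular program.

```python
def compatible(side1, side2, taxa):
    """Return True if two splits of `taxa` can be shown on a single tree.

    Each split is given by one of its sides; the other side is the complement.
    The splits are compatible if at least one of the four pairwise
    intersections between their sides is empty.
    """
    a = set(side1)
    b = taxa - a
    c = set(side2)
    d = taxa - c
    return any(not (x & y) for x in (a, b) for y in (c, d))

taxa = {"A", "B", "C", "D", "E"}

# {A,B} | {C,D,E} and {A,B,C} | {D,E} are nested, so they fit on one tree.
print(compatible({"A", "B"}, {"A", "B", "C"}, taxa))

# {A,B} | {C,D,E} and {B,C} | {A,D,E} conflict, so a splits graph
# would need a box (a pair of parallel edge sets) to show both.
print(compatible({"A", "B"}, {"B", "C"}, taxa))
```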

However, it is possible to turn a splits graph into an evolutionary network (sometimes called a reticulation network) once a root is specified (Huson and Klöpper 2007). This is true irrespective of whether the splits are derived from character data (Huson and Kloepper 2005), in which case it is usually called a recombination network, or whether they come from a set of trees (Huson et al. 2005), in which case it is usually called a hybridization network.

The SplitsTree4 program (Huson and Bryant 2006) carries out the relevant calculations under algorithms entitled Reticulation Network, Recombination Network or Hybridization Network, although these all produce the same outcome once the set of splits has been determined. These options are no longer available from the menu system (in the current release of the program), but they can still be effected via the Configure Pipeline menu option.

The purpose of this post is to point out that the calculations are affected by the same limitation that has been noted before under other circumstances (see the post A fundamental limitation of hybridization networks?). That is, reticulation cycles with three or fewer outgoing arcs are not uniquely defined with respect to rooted splits: there are three equally optimal mathematical solutions. In practice, this means that in a situation where two taxa are involved in producing a third taxon we cannot decide from the splits alone which is the reticulate taxon and which are the two "parents" (eg. which one is the hybrid).

An example

I will illustrate this point with a simple example. The data are taken from Wendel et al. (1991). The data consist of the presence-absence of 76 nuclear allozyme loci and 13 nuclear restriction sites, for five plant taxa, one of which is the outgroup. The first graph shows the splits graph using the default options in SplitsTree4 — both the NeighborNet and the ParsimonySplits analyses produce the same graph, which identifies a single reticulation.

In SplitsTree4, the outgroup for rooting the splits graph must be the first taxon in the datafile, which in this case is Gossypium robinsonii. The following three graphs are the result of then choosing the ReticulateNetwork analysis. They differ by having, respectively, Gossypium bickii as the final taxon in the dataset, Gossypium sturtianum as the final taxon, and Gossypium australe + Gossypium nelsonii as the final two taxa. Note that the ReticulateNetwork algorithm always identifies the dataset's final taxon as the reticulate one.

So, the hybrid taxon is indeterminable from the data given, and the algorithm simply makes a (consistent) choice from among the three possibilities. [That is, the algorithm chooses as the reticulate arc whichever of the three outgoing arcs is latest in the dataset.]

The original authors suggest that the nuclear and other data "indicate a biphyletic ancestry of G. bickii. Our preferred hypothesis involves an ancient hybridization, in which G. sturtianum, or a similar species, served as the maternal parent with a paternal donor from the lineage leading to G. australe and G. nelsoni." This doesn't quite match any of the three rooted networks shown above.


Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23: 254-267.

Huson DH, Kloepper TH (2005) Computing recombination networks from binary sequences. Bioinformatics 21: ii159-ii165.

Huson DH, Klöpper TH (2007) Beyond galled trees – decomposition and computation of galled networks. Lecture Notes in Bioinformatics 4453: 211-225.

Huson DH, Klöpper T, Lockhart PJ, Steel MA (2005) Reconstruction of reticulate networks from gene trees. Lecture Notes in Bioinformatics 3500: 233-249.

Wendel JF, Stewart JM, Rettig JH (1991) Molecular evidence for homoploid reticulate evolution among Australian species of Gossypium. Evolution 45: 694-711.

May 17, 2015


"Genealogies" produced on the web are frequently no such thing, they are merely timelines. However, the following alleged Genealogy of Automobile Companies seems to really be one, and it has a number of odd characteristics. These characteristics are quite common among manufactured products.

It is described as "A flowing history of more than 100 automobile companies across the complete time span of the automobile industry." Actually, it focuses on companies in the USA, up to 2012. You can zoom in on the details by visiting the original image at HistoryShots InfoArt.

First, note that the genealogy has multiple roots. Second, lineages coalesce forwards through time rather than diverging, so that the lineages become clustered. Moreover, some lineages do not connect to any others. Finally, there is horizontal transfer, because parts of companies get sold to other companies.

There is also a similar Genealogy of US Airlines, and a Genealogy of International Airlines.

May 12, 2015


This is a guest blog post, following on from his previous post, by:
Johann-Mattis List, Centre des Recherches Linguistiques sur l'Asie Orientale, Paris, France


All languages constantly change. Words are lost when speakers cease to use them, new words are gained when new concepts evolve, and even the pronunciation of the words changes slightly over time. Slight modifications that can barely be noticed during a person's lifetime add up to great changes in the system of a language over centuries. When the speakers of a language diverge, their speech keeps on changing independently in the two communities, and at a certain point in time the independent changes are so great that the two groups can no longer communicate with each other; what was one language has become two.

Demonstrating that two languages once were one is one of the major tasks of historical linguistics. If no written documents of the ancestral language exist, one has to rely on specific techniques for linguistic reconstruction (see the examples in this previous post). These techniques require us to first identify those words in the descendant languages that presumably go back to a common word form in the ancestral language. In identifying these words, we infer historical relations between them. The most fundamental historical relation between words is the relation of common descent. However, similarly to evolutionary biology, where homology can be further subdivided into the more specific relations of orthology, paralogy, and xenology, more specific fundamental historical relations between words can be defined for historical linguistics, depending on the underlying evolutionary scenario.

Homology and Cognacy in Linguistics and Biology

In evolutionary biology there is a rather rich terminological framework describing fundamental historical relations between genes and morphological characters. Discussions regarding the epistemological and ontological aspects of these relations are still ongoing (see the overview in Koonin 2005, but also this recent post by David). Linguists, in contrast, have rarely addressed these questions directly. They rather assumed that the fundamental historical relations between words are more or less self-evident, with only a few counter-examples, which were largely ignored in the literature (Arapov and Xerc 1974; Holzer 1996; Katičić 1966). As a result, our traditional terminology to describe the fundamental historical relations between words is very imprecise and often leads to confusion, especially when it comes to computational applications that are based on software originally developed for applications in evolutionary biology.

As an example, consider the fundamental concept of homology in evolutionary biology. According to Koonin (2005: 311), it "designates a relationship of common descent between any entities, without further specification of the evolutionary scenario". The terms orthology, paralogy, and xenology are used to address more specific relations. Orthology refers to "genes related via speciation" (Koonin 2005: 311); that is, genes related via direct descent. Paralogy refers to "genes related via duplication" (ibid.); that is, genes related via indirect descent. Xenology, a notion which was introduced by Gray and Fitch (1983), refers to genes "whose history, since their common ancestor, involves an interspecies (horizontal) transfer of the genetic material for at least one of those characters" (Fitch 2000: 229); i.e. to genes related via descent involving lateral transfer.

In historical linguistics, the only relation that is explicitly defined is cognacy (also called cognation). Cognacy usually refers to words related via “descent from a common ancestor” (Trask 2000: 63), and it is strictly distinguished from descent involving lateral transfer (borrowing). The term cognacy itself, however, covers both direct and indirect descent. Hence, traditionally, German Zahn 'tooth' is cognate with English tooth, and German selig 'blessed' with English silly, and German Geburt 'birth' with English birth, although the historical processes that shaped the present appearance of these three word pairs are quite different. Apart from the sound shape, Zahn and tooth have regularly developed from Proto-Germanic *tanθ-; selig and silly both go back to Proto-Germanic *sæli- 'happy', but the meaning of the English word has changed greatly; Geburt and birth stem from Proto-Germanic *ga-burdi-, but the English word has lost the prefix as a result of specific morphological processes during the development of the English language (all examples follow Kluge and Seebold 2002, with modifications for the pronunciation of Proto-Germanic). Thus, of the three examples of cognate words given, only the first would qualify as having evolved by direct inheritance, while the inheritance of the latter two could be labelled as indirect, involving processes which are largely language-specific and irregular, such as meaning shift and morpheme loss. Trask (2000: 234) suggests the term oblique cognacy to label these cases of indirect inheritance, but this term seems to be rarely used in historical linguistics; and at least in the mainstream literature of historical linguistics I could not find even a single instance where the term was employed (apart from the passage by Trask).

In the table above (with modifications taken from List 2014: 39), I have tried to contrast the terminology used in evolutionary biology and historical linguistics by comparing the degree to which the terms reflect fundamental historical relations between words or genes. Here, common descent is treated as a basic relation which can be further subdivided into relations of direct common descent, indirect common descent, and common descent involving lateral transfer. As one can easily see, historical linguistics lacks proper terms for at least half of the relations, offering no exact counterparts for homology, orthology, and xenology in evolutionary biology.

Cognacy in historical linguistics is often deemed to be identical with homology in evolutionary biology, but this is only true if one ignores common descent involving lateral transfer. One may argue that the notion of xenology is not unknown to linguists, since the borrowing of words is a very common phenomenon in language history. However, the specific relation which is termed xenology in biology has no direct counterpart in historical linguistics: the term borrowing refers to a distinct process, not a relation resulting from the process. There is no common term in historical linguistics which addresses the specific relation between such words as German kurz 'short' and English short. These words are not cognate, since the German word has been borrowed from Latin cŭrtus 'mutilated' (Kluge and Seebold 2002). They share, however, a common history, since Latin cŭrtus and English short both (may) go back to Proto-Indo-European *(s)ker- 'cut off' (Vaan 2008: 158). The specific history behind these relations is illustrated in the following figure.

A specific advantage of the biological notion of homology as a basic relation covering any kind of historical relatedness, compared to the linguistic notion of cognacy as a basic relation covering direct and indirect common descent, is that the former is much more realistic regarding the epistemological limits of historical research. Up to a certain point, it can be fairly reliably demonstrated that the basic entities in the respective disciplines (words, genes, or morphological characters) share a common history. Demonstrating that more detailed relations hold, however, is often much harder. The strict notion of cognacy has forced linguists to set goals for their discipline which may often be far too ambitious to achieve. We need to adjust our terminology accordingly and bring our goals into balance with the epistemological limits of our discipline. In order to do so, I have proposed to refine our current terminology in historical linguistics to the schema shown in the table below (with modifications taken from List 2014: 44):

Fifty Shades of Cognacy

In a recent blog post, David pointed to the relative character of homology in evolutionary biology in emphasizing that it "only applies locally, to any one level of the hierarchy of character generalization". Recalling his example of bat wings compared to bird wings, which are homologous when compared as forelimbs but analogous when compared as wings, we can find similar examples in historical linguistics.

If we consider words for 'to give' in the four Romance languages Portuguese, Spanish, Provencal and French, then we can state that both Portuguese dar and Spanish dar are homologous, as are Provencal douna and French donner. The former pair go back to the Latin word dare 'to give', and the latter pair go back to the Latin word donare 'to gift (give as a present)'. In those times when Latin was commonly spoken, both dare and donare were clearly separated words denoting clearly separated concepts and being used in clearly separated contexts. The verb donare itself was derived from Latin donum 'present, gift'. Similarly to English where nouns can be easily used as verbs, Latin allowed for specific morphological processes. In contrast to English, however, these processes required that the form of the noun was modified (compare English gift vs. to gift with Latin donum vs. donare).

What the ancient Romans (who spoke Latin as their native tongue) were not aware of is that Latin donum 'gift' and Latin dare 'to give' themselves go back to a common word form. This was no longer evident in Latin, but it was in Proto-Indo-European, the ancestor of the Latin language. Thus, Latin dare goes back to Proto-Indo-European *deh3- 'to give', and Latin donum goes back to Proto-Indo-European *deh3-no- 'that which is given (the gift)' (Meiser 1999; what is written as *h3 in this context was probably pronounced as [x] or [h]). The word form *deh3-no- is a regular derivation from *deh3-, so at the Indo-European level both forms are homologous, since one is derived from the other. That means, in turn, that Latin dare and donum are also homologs, since they are the residual forms of the two homologous words in Proto-Indo-European. And since Latin donare is a regular derivation of donum, this means, again, that Latin dare and donare are also homologous, as are the words in the four descendant languages, Portuguese dar, Spanish dar, Provencal douna, and French donner. Depending on the time depth we apply, we will arrive at different homology decisions. I have tried to depict the complex history of the words in the following figure:

Judging from the treatment in linguistic databases, many scholars do not regard these different "shades of homology" as a real problem. In most cases, scholars use a "lumping approach" and label as cognates all words that go back to a common root, no matter how far that root goes back in time (compare, for example, the cognate labeling for reflexes of Proto-Indo-European *deh3- in the IELex).

Importantly, however, this labeling practice may be contrary to the models that are used to analyze the data afterwards. All computational analyses model language evolution as a process of word gain and word loss. The words for the analyses are sampled from an initial set of concepts (such as 'give', 'hand', 'foot', 'stone', etc.) which are translated into the languages under investigation. If we did not know about the deeper history of Latin dare and donare, we would assume a regular process of language evolution here: at some point, the speakers of Gallo-Romance would cease to use the word dare to express the meaning 'to give' and use the word donare instead, while the speakers of Ibero-Romance would keep on using the word dare. This well-known process of lexical replacement (illustrated in the graphic below), which may provide strong phylogenetic signals, is lost in the current encoding practice where all four words are treated as homologs. Our current practice of cognate coding masks vital processes of language change.
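The effect of the two coding practices can be made concrete with a minimal Python sketch. It assumes the common presence/absence encoding, in which each cognate class for a concept becomes one binary character; the class labels ("deh3", "dare", "donare") and the function names are hypothetical and chosen only for illustration, not taken from any particular database or software.

```python
# Words for 'to give' in the four languages discussed above.
words = {
    "Portuguese": "dar",
    "Spanish": "dar",
    "Provencal": "douna",
    "French": "donner",
}

# "Lumping": every word is assigned to the same deep root (hypothetical label).
lumped = {lang: "deh3" for lang in words}

# "Splitting": words are grouped by their immediate Latin source.
split = {
    "Portuguese": "dare",
    "Spanish": "dare",
    "Provencal": "donare",
    "French": "donare",
}

def binary_characters(coding):
    """Turn a cognate coding into one presence/absence character per class."""
    classes = sorted(set(coding.values()))
    return {
        cls: {lang: int(coding[lang] == cls) for lang in coding}
        for cls in classes
    }

def variable_characters(coding):
    """List the characters that differ between at least two languages."""
    chars = binary_characters(coding)
    return [cls for cls, states in chars.items() if len(set(states.values())) > 1]

print(variable_characters(lumped))  # no character varies between languages
print(variable_characters(split))   # both characters separate Ibero- from Gallo-Romance
```

Under the lumped coding the single character is present in all four languages and so cannot group any of them, whereas the split coding yields two characters that both separate Portuguese and Spanish from Provencal and French.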


Historical linguistics needs a more serious analysis of the fundamental processes of language change and the fundamental historical relations resulting from these processes. In the last two decades a large arsenal of quantitative methods has been introduced in historical linguistics. The majority of these methods come from evolutionary biology. While we have quickly learned to adapt and apply these methods to address questions of language classification and language evolution, we have forgotten to ask whether the processes these methods are supposed to model actually coincide with the fundamental processes of language evolution. Rather than adopting only the methods from evolutionary biology, we should also consider adopting its habit of having deeper discussions regarding the very basics of our methodology.


Arapov MV, Xerc MM (1974) Математические методы в исторической лингвистике [Mathematical methods in historical linguistics]. Moscow: Nauka. German translation: Arapov, M. V. and M. M. Cherc (1983). Mathematische Methoden in der historischen Linguistik. Trans. by R. Köhler and P. Schmidt. Bochum: Brockmeyer.

Fitch WM (2000) Homology: a personal view on some of the problems. Trends in Genetics 16.5, 227-231.

Gray GS, Fitch WM (1983) Evolution of antibiotic resistance genes: the DNA sequence of a kanamycin resistance gene from Staphylococcus aureus. Molecular Biology and Evolution 1.1, 57-66.

Holzer G (1996) Das Erschließen unbelegter Sprachen. Zu den theoretischen Grundlagen der genetischen Linguistik. Frankfurt am Main: Lang.

Katičić R (1966) Modellbegriffe in der vergleichenden Sprachwissenschaft. Kratylos 11, 49-67.

Kluge F, Seebold E (2002) Etymologisches Wörterbuch der deutschen Sprache. 24th ed. Berlin: de Gruyter.

List J-M (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

Meiser G (1999) Historische Laut- und Formenlehre der lateinischen Sprache. Darmstadt: Wissenschaftliche Buchgesellschaft.

de Vaan M (2008) Etymological Dictionary of Latin and the Other Italic Languages. Leiden and Boston: Brill.

May 10, 2015


Actually, if you do a search you will find that there are lots of non-humorous papers on the evolution of humor, in the variational sense not the transformational one, as used here.

May 5, 2015


It is obvious that there is a big cultural difference between biologists and computationalists, irrespective of whether we think it's a good idea or not. This follows simply from the nature of the activities in the two professions: the activities are different and therefore different personalities are attracted to those professions.

Some of these differences are well known. For example, computations require algorithmic repeatability, along with proof that the algorithms achieve the explicitly stated goal. This means that computationalists have to be pedants in order to succeed. On the other hand, no-one can be pedantic and succeed in biology. Biodiversity is a concept that makes it clear that there are no rules to biological phenomena — any generalization that you can think of will turn out to have numerous exceptions. In the biological sciences we do not look for universal "laws" (as in the physical sciences), because there are none; and if you can't handle that fact then you should not try to become a biologist.

This leads to a further difference between the two professions that I think is sometimes poorly appreciated. In general, computationalists focus on patterns, whereas biologists focus on processes. Many processes can produce the same patterns, and therefore the same computations can be used to detect those patterns; and this is of interest to people who are developing algorithms. On the other hand, in biology processes can produce many different patterns, so that patterns are often unpredictable. Biologists are aware that patterns and processes can be poorly connected, and the biological interest is primarily on understanding the processes, because these are frequently more generalizable than are the patterns.

As a simple example of this dichotomy, consider the following diagram (from Loren H. Rieseberg and Richard D. Noyes. 1998. Genetic map-based studies of reticulate evolution in plants. Trends in Plant Science 3: 254-259). It shows the eight haploid chromosomes of a particular plant species.

Perusal of the figure will lead you to identify the pattern, and this is straightforward to detect computationally. Each chromosomal segment is triplicated, but the triplicates are arranged arbitrarily and are sometimes segmented.

On its own this is of little biological interest. The interest lies in the processes that led to the pattern. These processes could produce an infinite number of similar patterns, and so predicting the exact pattern in this species is impossible. We use abduction to proceed from the pattern to the processes (see What we know, what we know we can know, and what we know we cannot know).

We appear to be looking at a case of allopolyploidy (the nuclear genome is hexaploid) followed by recombination. Neither of these processes necessarily produces patterns that can be predicted in detail.

So, the computation focuses on the pattern and the biology on the process. Sometimes biologists forget this, and naively interpret patterns as inevitably implying a particular process. And sometimes computationalists naively expect patterns to be predictable when they are not.

May 3, 2015


I have noted before that many of the diagrams on the web purporting to show "evolution" actually show transformational evolution rather than variational evolution, as is done in biology and the historical social sciences (eg. Non-phylogenetic trees; Evolution and timelines; The evolutionary March of Progress in popular culture).

This diagram seems to be an improvement, however. Perhaps its geekiness is responsible for this?

This is an evolutionary network because it is rooted, at "Geekus Prime". You will note that it is a population network rather than strictly a phylogenetic network. That is, many of the internal nodes are labeled with extant taxa, so that both ancestors and their descendants appear. It is a network rather than a tree, because the "World of Warcraft Geek" is a hybrid between the "Dungeons and Dragons Geek" and the ancestor of the "Video Game Geek".

April 28, 2015


I have noted before that Pedigrees and phylogenies are networks not trees. For example, a human family "tree" is a tree only if it includes one sex alone. Otherwise, it must be a network when traced backwards from any single individual through both parents, because the lineages must eventually coalesce in a pair of shared common ancestors.

This potentially creates a problem for maintaining genetic diversity within species. If a pedigree is tree-like, then each person would, for example, have 32 great-great-great grand-parents. These 32 people's genes are mixed more-or-less randomly (depending on recombination and assortment) to produce the great-great-great grand-child. This heterozygosity is a good thing, evolutionarily, because there is then genetic diversity within that person.

However, inbreeding turns a tree into a network. This increases the probability that identical alleles will be paired in any one individual. If deleterious recessive alleles are thereby expressed, then genetic problems can ensue, which is called inbreeding depression. However, this situation is not inevitable, but depends on the probability of alleles becoming paired. Indeed, for domesticated organisms, inbreeding is the norm (see Thoroughbred horses and reticulate pedigrees).

I have discussed examples of well-known historical figures who have encountered the unfortunate effects of inbreeding, including Charles Darwin (Charles Darwin's family pedigree network) and Henri Toulouse-Lautrec (Toulouse-Lautrec: family trees and networks). In both cases the problems arose because of consanguineous relationships, which involve people who are first cousins or more closely related.

I have also discussed the extreme case of consanguinity, incest. In particular, royalty have often been exempt from taboos against sibling and parent-child couplings, as noted in Tutankhamun and extreme consanguinity and also in Cleopatra, ambition and family networks. At least for Tutankhamun there is evidence of genetic problems (an accumulation of malformations is evident), but apparently not in Cleopatra's case (there is no convincing evidence of infertility, infant mortality or genetic defects, for example). Royalty have not been the only exceptions to the incest taboo (see Evolutionary fitness and incest).

In Tutankhamun's case it has been suggested that his mother was a sister (name not known) of his father, Akhenaten. This is surprising, because only two wives of Akhenaten, Nefertiti and Kiya, are known to have had the title of Great Royal Wife, which the mother of the royal heir should bear. As a way out of this dilemma, Marc Gabolde has suggested that the apparent genetic closeness of Tutankhamun's parents arises because his mother was his father's first cousin, Nefertiti. The closeness is then not the result of a single brother-sister mating but is instead due to three successive instances of marriage between first cousins.

To explain this idea we can look at an actual historical example. The Spanish branch of the Habsburg dynasty, which ended in 1700, shows how consanguinity can produce the same genetic effects as incest, as discussed in Family trees, pedigrees and hybridization networks.

This example can be explained using inbreeding F values. For any specified offspring, these indicate the probability of paired alleles being identical by descent (ie. due to the close relationship of the parents). For close family relationships between the parents, the F values are:
uncle-niece = aunt-nephew 0.125
double first cousins 0.125
first cousins 0.063
first cousins once removed 0.031
second cousins 0.016

Note that incest produces F values of 0.250 while consanguinity values are 0.063 or greater.
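For readers who want to see where such numbers come from, here is a small sketch using the standard path-counting formula, in which each ancestor shared by the two parents contributes (1/2)^(n1 + n2 + 1), with n1 and n2 being the number of generations from that ancestor down to each parent. The function name and the way the relationships are encoded are my own choices for illustration, and the sketch assumes that the shared ancestors are not themselves inbred.

```python
from fractions import Fraction


def inbreeding_from_paths(paths):
    """Sum (1/2)**(n1 + n2 + 1) over the ancestors shared by the two parents.

    Each entry of `paths` is (n1, n2): the generations from one shared
    ancestor down to each parent. Assumes the shared ancestors are not inbred.
    """
    return sum(Fraction(1, 2) ** (n1 + n2 + 1) for n1, n2 in paths)


# Relationship between the parents -> one (n1, n2) entry per shared ancestor.
RELATIONSHIPS = {
    "parent-child": [(0, 1)],                   # the parent is the shared ancestor
    "brother-sister": [(1, 1)] * 2,             # two shared parents
    "uncle-niece": [(1, 2)] * 2,                # uncle's parents = niece's grandparents
    "double first cousins": [(2, 2)] * 4,       # all four grandparents shared
    "first cousins": [(2, 2)] * 2,              # two shared grandparents
    "first cousins once removed": [(2, 3)] * 2,
    "second cousins": [(3, 3)] * 2,             # two shared great-grandparents
}

for name, paths in RELATIONSHIPS.items():
    f = inbreeding_from_paths(paths)
    print(f"{name}: F = {f} = {float(f)}")
```

The exact fractions are 1/4, 1/8, 1/16, 1/32 and 1/64; the three-decimal figures in the text are simply these values rounded.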

If we consider the case of King Charles II of Spain (1661-1700), then his inbreeding F = 0.254, which was achieved entirely without incestuous relationships. His pedigree is shown in the post Family trees, pedigrees and hybridization networks.

This pedigree shows that the parents of each person had the following relationships:

himself = uncle-niece [ie. his parents were uncle and niece]

father = first cousins once removed [ie. his father's parents were first cousins once removed]
mother = first cousins

father's father = (a) = uncle-niece
father's mother = (b) = uncle-niece
mother's father = first cousins
mother's mother = first cousins once removed

father's father's father = not closely related
father's father's mother = first cousins
father's mother's father = not closely related
father's mother's mother = not closely related
mother's father's father = uncle-niece
mother's father's mother = second cousins
mother's mother's father = see person (a)
mother's mother's mother = see person (b)

Thus, on his father's side he was the third generation of consecutive consanguinity, and on his mother's side he was the fourth generation of consecutive consanguinity. This is simply an accumulating series of probabilities — consanguinity potentially produces problems and consecutive consanguinity simply increases the probability.
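The way that consecutive consanguinity accumulates can be illustrated with a recursive kinship calculation on a pedigree. The sketch below uses a made-up toy pedigree with single-letter names (it is not the Habsburg pedigree), and the functions `make_kinship` and `inbreeding` are hypothetical helpers written for this post. The only assumptions are that founders are unrelated and non-inbred, and that parents are listed before their children.

```python
from functools import lru_cache


def make_kinship(pedigree):
    """Build a kinship function for a pedigree given as {person: (father, mother) or None}.

    Founders map to None and are assumed unrelated and non-inbred.
    Parents must be listed before their children.
    """
    order = {name: i for i, name in enumerate(pedigree)}

    @lru_cache(maxsize=None)
    def kinship(a, b):
        if order[a] < order[b]:
            a, b = b, a  # always recurse on the later-listed person
        parents = pedigree[a]
        if a == b:
            # Kinship with oneself is (1 + own inbreeding) / 2.
            return 0.5 if parents is None else 0.5 * (1.0 + kinship(*parents))
        if parents is None:
            return 0.0  # a founder is unrelated to anyone listed earlier
        father, mother = parents
        return 0.5 * (kinship(father, b) + kinship(mother, b))

    return kinship


def inbreeding(pedigree, person):
    """F of a person is the kinship coefficient of their two parents."""
    parents = pedigree[person]
    return 0.0 if parents is None else make_kinship(pedigree)(*parents)


# Hypothetical toy pedigree: two consecutive first-cousin marriages.
TOY = {
    "A": None, "B": None, "E": None, "F": None, "K": None, "M": None,  # founders
    "C": ("A", "B"), "D": ("A", "B"),  # siblings
    "G": ("C", "E"), "H": ("D", "F"),  # first cousins
    "I": ("G", "H"), "J": ("G", "H"),  # children of the first cousin marriage
    "L": ("I", "K"), "N": ("J", "M"),  # first cousins whose shared grandparents are cousins
    "O": ("L", "N"),                   # child of the second cousin marriage
}

print("I:", inbreeding(TOY, "I"))
print("O:", inbreeding(TOY, "O"))
```

In this toy case the child of the first cousin marriage has F = 1/16, while the child of the second consecutive cousin marriage has F = 9/128, which is a little higher even though the parents are "only" first cousins in both cases. Each further generation of consanguinity adds to the total in the same way.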

It is not surprising, then, that Charles suffered genetic problems (he was disfigured, physically disabled and mentally retarded) to such an extent that his royal lineage came to an end, and the Spanish branch of the Habsburg dynasty ceased to rule.

Incidentally, the scientist who devised the quantity F, Sewall Wright, himself had a rather high amount of inbreeding — his parents were first cousins.