The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis


May 19, 2015


Splits graphs are a useful way of displaying contradictory information within evolutionary datasets, either incompatible characters (ie. those that cannot fit onto a single tree) or incompatible trees. Since the graphs are unrooted, they are usually treated as a form of multivariate data display, rather than interpreted as depicting evolutionary history.

However, it is possible to turn a splits graph into an evolutionary network (sometimes called a reticulation network) once a root is specified (Huson and Klöpper 2007). This is true irrespective of whether the splits are derived from character data (Huson and Kloepper 2005), in which case it is usually called a recombination network, or whether they come from a set of trees (Huson et al. 2005), in which case it is usually called a hybridization network.

The SplitsTree4 program (Huson and Bryant 2006) carries out the relevant calculations under algorithms entitled Reticulation Network, Recombination Network or Hybridization Network, although these all produce the same outcome once the set of splits has been determined. These options are no longer available from the menu system (in the current release of the program), but they can still be effected via the Configure Pipeline menu option.

The purpose of this post is to note that these calculations are affected by the same limitation that has been discussed before under other circumstances (see the post A fundamental limitation of hybridization networks?). That is, reticulation cycles with three or fewer outgoing arcs are not uniquely defined with respect to rooted splits: there are three equally optimal mathematical solutions. In practice, this means that in a situation where two taxa are involved in producing a third taxon we cannot decide from the splits alone which is the reticulate taxon and which are the two "parents" (eg. which one is the hybrid).

An example

I will illustrate this point with a simple example. The data are taken from Wendel et al. (1991). The data consist of the presence-absence of 76 nuclear allozyme loci and 13 nuclear restriction sites, for five plant taxa, one of which is the outgroup. The first graph shows the splits graph using the default options in SplitsTree4 — both the NeighborNet and the ParsimonySplits analyses produce the same graph, which identifies a single reticulation.

In SplitsTree4, the outgroup for rooting the splits graph must be the first taxon in the datafile, which in this case is Gossypium robinsonii. The following three graphs are the result of then choosing the ReticulateNetwork analysis. They differ by having, respectively, Gossypium bickii as the final taxon in the dataset, Gossypium sturtianum as the final taxon, and Gossypium australe + Gossypium nelsonii as the final two taxa. Note that the ReticulateNetwork algorithm always identifies the dataset's final taxon as the reticulate one.

So, the hybrid taxon is indeterminable from the data given, and the algorithm simply makes a (consistent) choice from among the three possibilities. [That is, the algorithm chooses as the reticulate arc whichever of the three outgoing arcs is latest in the dataset.]

The original authors suggest that the nuclear and other data "indicate a biphyletic ancestry of G. bickii. Our preferred hypothesis involves an ancient hybridization, in which G. sturtianum, or a similar species, served as the maternal parent with a paternal donor from the lineage leading to G. australe and G. nelsoni." This doesn't quite match any of the three rooted networks shown above.


Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23: 254-267.

Huson DH, Kloepper TH (2005) Computing recombination networks from binary sequences. Bioinformatics 21: ii159-ii165.

Huson DH, Klöpper TH (2007) Beyond galled trees – decomposition and computation of galled networks. Lecture Notes in Bioinformatics 4453: 211-225.

Huson DH, Klöpper T, Lockhart PJ, Steel MA (2005) Reconstruction of reticulate networks from gene trees. Lecture Notes in Bioinformatics 3500: 233-249.

Wendel JF, Stewart JM, Rettig JH (1991) Molecular evidence for homoploid reticulate evolution among Australian species of Gossypium. Evolution 45: 694-711.

May 17, 2015


"Genealogies" produced on the web are frequently no such thing, they are merely timelines. However, the following alleged Genealogy of Automobile Companies seems to really be one, and it has a number of odd characteristics. These characteristics are quite common among manufactured products.

It is described as "A flowing history of more than 100 automobile companies across the complete time span of the automobile industry." Actually, it focuses on companies in the USA, up to 2012. You can zoom in on the details by visiting the original image at HistoryShots InfoArt.

First, note that the genealogy has multiple roots. Second, lineages coalesce forwards through time rather than diverging, so that the lineages become clustered. Moreover, some lineages do not connect to any others. Finally, there is horizontal transfer, because parts of companies get sold to other companies.

There is also a similar Genealogy of US Airlines, and a Genealogy of International Airlines.

May 12, 2015


This is a guest blog post, following on from his previous post, by:
Johann-Mattis List, Centre des Recherches Linguistiques sur l'Asie Orientale, Paris, France


All languages constantly change. Words are lost when speakers cease to use them, new words are gained when new concepts evolve, and even the pronunciation of the words changes slightly over time. Slight modifications that can barely be noticed during a person's lifetime sum up to great changes in the system of a language over centuries. When the speakers of a language diverge, their speech keeps on changing independently in the two communities, and at a certain point of time the independent changes are so great that they can no longer communicate with each other — what was one language has become two.

Demonstrating that two languages once were one is one of the major tasks of historical linguistics. If no written documents of the ancestral language exist, one has to rely on specific techniques for linguistic reconstruction (see the examples in this previous post). These techniques require us to first identify those words in the descendant languages that presumably go back to a common word form in the ancestral language. In identifying these words, we infer historical relations between them. The most fundamental historical relation between words is the relation of common descent. However, similarly to evolutionary biology, where homology can be further subdivided into the more specific relations of orthology, paralogy, and xenology, more specific fundamental historical relations between words can be defined for historical linguistics, depending on the underlying evolutionary scenario.

Homology and Cognacy in Linguistics and Biology

In evolutionary biology there is a rather rich terminological framework describing fundamental historical relations between genes and morphological characters. Discussions regarding the epistemological and ontological aspects of these relations are still ongoing (see the overview in Koonin 2005, but also this recent post by David). Linguists, in contrast, have rarely addressed these questions directly. They rather assumed that the fundamental historical relations between words are more or less self-evident, with only few counter-examples, which were largely ignored in the literature (Arapov and Xerc 1974; Holzer 1996; Katičić 1966). As a result, our traditional terminology to describe the fundamental historical relations between words is very imprecise and often leads to confusion, especially when it comes to computational applications that are based on software originally developed for applications in evolutionary biology.

As an example, consider the fundamental concept of homology in evolutionary biology. According to Koonin (2005: 311), it "designates a relationship of common descent between any entities, without further specification of the evolutionary scenario". The terms orthology, paralogy, and xenology are used to address more specific relations. Orthology refers to "genes related via speciation" (Koonin 2005: 311); that is, genes related via direct descent. Paralogy refers to "genes related via duplication" (ibid.); that is, genes related via indirect descent. Xenology, a notion which was introduced by Gray and Fitch (1983), refers to genes "whose history, since their common ancestor, involves an interspecies (horizontal) transfer of the genetic material for at least one of those characters" (Fitch 2000: 229); i.e. to genes related via descent involving lateral transfer.

In historical linguistics, the only relation that is explicitly defined is cognacy (also called cognation). Cognacy usually refers to words related via “descent from a common ancestor” (Trask 2000: 63), and it is strictly distinguished from descent involving lateral transfer (borrowing). The term cognacy itself, however, covers both direct and indirect descent. Hence, traditionally, German Zahn 'tooth' is cognate with English tooth, and German selig 'blessed' with English silly, and German Geburt 'birth' with English birth, although the historical processes that shaped the present appearance of these three word pairs are quite different. Apart from the sound shape, Zahn and tooth have regularly developed from Proto-Germanic *tanθ-; selig and silly both go back to Proto-Germanic *sæli- 'happy', but the meaning of the English word has changed greatly; Geburt and birth stem from Proto-Germanic *ga-burdi-, but the English word has lost the prefix as a result of specific morphological processes during the development of the English language (all examples follow Kluge and Seebold 2002, with modifications for the pronunciation of Proto-Germanic). Thus, of the three examples of cognate words given, only the first would qualify as having evolved by direct inheritance, while the inheritance of the latter two could be labelled as indirect, involving processes which are largely language-specific and irregular, such as meaning shift and morpheme loss. Trask (2000: 234) suggests the term oblique cognacy to label these cases of indirect inheritance, but this term seems to be rarely used in historical linguistics; and at least in the mainstream literature of historical linguistics I could not find even a single instance where the term was employed (apart from the passage by Trask).

In the table above (with modifications taken from List 2014: 39), I have tried to contrast the terminology used in evolutionary biology and historical linguistics by comparing to which degree they reflect fundamental historical relations between words or genes. Here, common descent is treated as a basic relation which can be further subdivided into relations of direct common descent, indirect common descent, and common descent involving lateral transfer. As one can easily see, historical linguistics lacks proper terms for at least half of the relations, offering no exact counterparts for homology, orthology, and xenology in evolutionary biology.

Cognacy in historical linguistics is often deemed to be identical with homology in evolutionary biology, but this is only true if one ignores common descent involving lateral transfer. One may argue that the notion of xenology is not unknown to linguists, since the borrowing of words is a very common phenomenon in language history. However, the specific relation which is termed xenology in biology has no direct counterpart in historical linguistics: the term borrowing refers to a distinct process, not a relation resulting from the process. There is no common term in historical linguistics which addresses the specific relation between such words as German kurz 'short' and English short. These words are not cognate, since the German word has been borrowed from Latin cŭrtus 'mutilated' (Kluge and Seebold 2002). They share, however, a common history, since Latin cŭrtus and English short both (may) go back to Proto-Indo-European *(s)ker- 'cut off' (Vaan 2008: 158). The specific history behind these relations is illustrated in the following figure.

A specific advantage of the biological notion of homology as a basic relation covering any kind of historical relatedness, compared to the linguistic notion of cognacy as a basic relation covering direct and indirect common descent, is that the former is much more realistic regarding the epistemological limits of historical research. Up to a certain point, it can be fairly reliably demonstrated that the basic entities in the respective disciplines (words, genes, or morphological characters) share a common history. Demonstrating that more detailed relations hold, however, is often much harder. The strict notion of cognacy has forced linguists to set goals for their discipline which may often be far too ambitious to achieve. We need to adjust our terminology accordingly and bring our goals into balance with the epistemological limits of our discipline. In order to do so, I have proposed to refine our current terminology in historical linguistics to the schema shown in the table below (with modifications taken from List 2014: 44):

Fifty Shades of Cognacy

In a recent blog post, David pointed to the relative character of homology in evolutionary biology in emphasizing that it "only applies locally, to any one level of the hierarchy of character generalization". Recalling his example of bat wings compared to bird wings, which are homologous when comparing them as forelimbs but which are analogous when comparing them as wings, we can find similar examples in historical linguistics.

If we consider words for 'to give' in the four Romance languages Portuguese, Spanish, Provencal and French, then we can state that both Portuguese dar and Spanish dar are homologous, as are Provencal douna and French donner. The former pair go back to the Latin word dare 'to give', and the latter pair go back to the Latin word donare 'to gift (give as a present)'. In those times when Latin was commonly spoken, both dare and donare were clearly separated words denoting clearly separated concepts and being used in clearly separated contexts. The verb donare itself was derived from Latin donum 'present, gift'. Similarly to English where nouns can be easily used as verbs, Latin allowed for specific morphological processes. In contrast to English, however, these processes required that the form of the noun was modified (compare English gift vs. to gift with Latin donum vs. donare).

What the ancient Romans (who spoke Latin as their native tongue) were not aware of is that Latin donum 'gift' and Latin dare 'to give' themselves go back to a common word form. This was no longer evident in Latin, but it was in Proto-Indo-European, the ancestor of the Latin language. Thus, Latin dare goes back to Proto-Indo-European *deh3- 'to give', and Latin donum goes back to Proto-Indo-European *deh3-no- 'that which is given (the gift)' (Meiser 1999; what is written as *h3 in this context was probably pronounced as [x] or [h]). The word form *deh3-no- is a regular derivation from *deh3-, so at the Indo-European level both forms are homologous, since one is derived from the other. That means, in turn, that Latin dare and donum are also homologs, since they are the residual forms of the two homologous words in Proto-Indo-European. And since Latin donare is a regular derivation of donum, this means, again, that Latin dare and donare are also homologous, as are the words in the four descendant languages, Portuguese dar, Spanish dar, Provencal douna, and French donner. Depending on the time depth we apply, we will arrive at different homology decisions. I have tried to depict the complex history of the words in the following figure:

Judging from the treatment in linguistic databases, many scholars do not regard these different "shades of homology" as a real problem. In most cases, scholars use a "lumping approach" and label as cognates all words that go back to a common root, no matter how far that root goes back in time (compare, for example, the cognate labeling for reflexes of Proto-Indo-European *deh3- in the IELex).

This labeling practice, however, may be contrary to the models that are used to analyze the data afterwards. All computational analyses model language evolution as a process of word gain and word loss. The words for the analyses are sampled from an initial set of concepts (such as 'give', 'hand', 'foot', 'stone', etc.) which are translated into the languages under investigation. If we did not know about the deeper history of Latin dare and donare, we would assume a regular process of language evolution here: at some point, the speakers of Gallo-Romance would cease to use the word dare to express the meaning 'to give' and use the word donare instead, while the speakers of Ibero-Romance would keep on using the word dare. This well-known process of lexical replacement (illustrated in the graphic below), which may provide strong phylogenetic signals, is lost in the current encoding practice where all four words are treated as homologs. Our current practice of cognate coding masks vital processes of language change.
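The effect of the two coding practices can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class labels ("deh3", "dare", "donare") and the helper binary_matrix are hypothetical names, and real databases use numeric cognate identifiers and many concepts rather than a single one.

```python
# Words for 'to give' in the four Romance languages discussed above
words = {
    "Portuguese": "dar",
    "Spanish": "dar",
    "Provencal": "douna",
    "French": "donner",
}

# "Lumping": every reflex of Proto-Indo-European *deh3- gets the same class
lumped = {lang: "deh3" for lang in words}

# "Splitting": classes are assigned at the Latin level (dare vs. donare)
split = {
    "Portuguese": "dare",
    "Spanish": "dare",
    "Provencal": "donare",
    "French": "donare",
}


def binary_matrix(coding):
    """Turn cognate class labels into presence/absence characters."""
    classes = sorted(set(coding.values()))
    return {lang: [int(coding[lang] == c) for c in classes] for lang in coding}


print(binary_matrix(lumped))  # one constant column: no signal for a gain/loss model
print(binary_matrix(split))   # two columns separating Ibero- from Gallo-Romance
```

Under the lumped coding the single character is present in all four languages, so the lexical replacement leaves no trace in the data; under the split coding the same replacement appears as one word lost and one word gained.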


Historical linguistics needs a more serious analysis of the fundamental processes of language change and the fundamental historical relations resulting from these processes. In the last two decades a large arsenal of quantitative methods has been introduced in historical linguistics. The majority of these methods come from evolutionary biology. While we have quickly learned to adapt and apply these methods to address questions of language classification and language evolution, we have forgotten to ask whether the processes these methods are supposed to model actually coincide with the fundamental processes of language evolution. Apart from adapting only the methods from evolutionary biology, we should consider also adapting the habit of having deeper discussions regarding the very basics of our methodology.


Arapov MV, Xerc MM (1974) Математические методы в исторической лингвистике [Mathematical methods in historical linguistics]. Moscow: Nauka. German translation: Arapov, M. V. and M. M. Cherc (1983). Mathematische Methoden in der historischen Linguistik. Trans. by R. Köhler and P. Schmidt. Bochum: Brockmeyer.

Fitch WM (2000) Homology: a personal view on some of the problems. Trends in Genetics 16.5, 227-231.

Gray GS, Fitch WM (1983) Evolution of antibiotic resistance genes: the DNA sequence of a kanamycin resistance gene from Staphylococcus aureus. Molecular Biology and Evolution 1.1, 57-66.

Holzer G (1996) Das Erschließen unbelegter Sprachen. Zu den theoretischen Grundlagen der genetischen Linguistik. Frankfurt am Main: Lang.

Katičić R (1966) Modellbegriffe in der vergleichenden Sprachwissenschaft. Kratylos 11, 49-67.

Kluge F, Seebold E (2002) Etymologisches Wörterbuch der deutschen Sprache. 24th ed. Berlin: de Gruyter.

List J-M (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

Meiser G (1999) Historische Laut- und Formenlehre der lateinischen Sprache. Darmstadt: Wissenschaftliche Buchgesellschaft.

Vaan M (2008) Etymological Dictionary of Latin and the Other Italic Languages. Leiden and Boston: Brill.

May 10, 2015


Actually, if you do a search you will find that there are lots of non-humorous papers on the evolution of humor, in the variational sense not the transformational one, as used here.

May 5, 2015


It is obvious that there is a big cultural difference between biologists and computationalists, irrespective of whether we think it's a good idea or not. This follows simply from the nature of the activities in the two professions: the activities are different and therefore different personalities are attracted to those professions.

Some of these differences are well known. For example, computations require algorithmic repeatability, along with proof that the algorithms achieve the explicitly stated goal. This means that computationalists have to be pedants in order to succeed. On the other hand, no-one can be pedantic and succeed in biology. Biodiversity is a concept that makes it clear that there are no rules to biological phenomena — any generalization that you can think of will turn out to have numerous exceptions. In the biological sciences we do not look for universal "laws" (as in the physical sciences), because there are none; and if you can't handle that fact then you should not try to become a biologist.

This leads to a further difference between the two professions that I think is sometimes poorly appreciated. In general, computationalists focus on patterns, whereas biologists focus on processes. Many processes can produce the same patterns, and therefore the same computations can be used to detect those patterns; and this is of interest to people who are developing algorithms. On the other hand, in biology processes can produce many different patterns, so that patterns are often unpredictable. Biologists are aware that patterns and processes can be poorly connected, and the biological interest is primarily on understanding the processes, because these are frequently more generalizable than are the patterns.

As a simple example of this dichotomy, consider the following diagram (from Loren H. Rieseberg and Richard D. Noyes. 1998. Genetic map-based studies of reticulate evolution in plants. Trends in Plant Science 3: 254-259). It shows the eight haploid chromosomes of a particular plant species.

Perusal of the figure will lead you to identify the pattern, and this is straightforward to detect computationally. Each chromosomal segment is triplicated, but the triplicates are arranged arbitrarily and are sometimes segmented.

On its own this is of little biological interest. The interest lies in the processes that led to the pattern. These processes could produce an infinite number of similar patterns, and so predicting the exact pattern in this species is impossible. We use abduction to proceed from the pattern to the processes (see What we know, what we know we can know, and what we know we cannot know).

We appear to be looking at a case of allopolyploidy (the nuclear genome is hexaploid) followed by recombination. Neither of these processes necessarily produces patterns that can be predicted in detail.

So, the computation focuses on the pattern and the biology on the process. Sometimes biologists forget this, and naively interpret patterns as inevitably implying a particular process. And sometimes computationalists naively expect patterns to be predictable when they are not.

May 3, 2015


I have noted before that many of the diagrams on the web purporting to show "evolution" actually show transformational evolution rather than variational evolution, as is done in biology and the historical social sciences (eg. Non-phylogenetic trees; Evolution and timelines; The evolutionary March of Progress in popular culture).

This diagram seems to be an improvement, however. Perhaps its geekiness is responsible for this?

This is an evolutionary network because it is rooted, at "Geekus Prime". You will note that it is a population network rather than strictly a phylogenetic network. That is, many of the internal nodes are labeled with extant taxa, so that both ancestors and their descendants appear. It is a network rather than a tree, because the "World of Warcraft Geek" is a hybrid between the "Dungeons and Dragons Geek" and the ancestor of the "Video Game Geek".

April 28, 2015


I have noted before that Pedigrees and phylogenies are networks not trees. For example, a human family "tree" is a tree only if it includes one sex alone. Otherwise, it must be a network when traced backwards from any single individual through both parents, because the lineages must eventually coalesce in a pair of shared common ancestors.

This potentially creates a problem for maintaining genetic diversity within species. If a pedigree is tree-like, then each person would, for example, have 32 great-great-great grand-parents. These 32 people's genes are mixed more-or-less randomly (depending on recombination and assortment) to produce the great-great-great grand-child. This heterozygosity is a good thing, evolutionarily, because there is then genetic diversity within that person.

However, inbreeding turns a tree into a network. This increases the probability that identical alleles will be paired in any one individual. If deleterious recessive alleles are thereby expressed, then genetic problems can ensue, which is called inbreeding depression. However, this situation is not inevitable, but depends on the probability of alleles becoming paired. Indeed, for domesticated organisms, inbreeding is the norm (see Thoroughbred horses and reticulate pedigrees).

I have discussed examples of well-known historical figures who have encountered the unfortunate effects of inbreeding, including Charles Darwin (Charles Darwin's family pedigree network) and Henri Toulouse-Lautrec (Toulouse-Lautrec: family trees and networks). In both cases the problems arose because of consanguineous relationships, which involve people who are first cousins or more closely related.

I have also discussed the extreme case of consanguinity, incest. In particular, royalty have often been exempt from taboos against sibling and parent-child couplings, as noted in Tutankhamun and extreme consanguinity and also in Cleopatra, ambition and family networks. At least for Tutankhamun there is evidence of genetic problems (an accumulation of malformations is evident), but apparently not in Cleopatra's case (there is no convincing evidence of infertility, infant mortality or genetic defects, for example). Royalty have not been the only exceptions to the incest taboo (see Evolutionary fitness and incest).

In Tutankhamun's case it has been suggested that his mother was a sister (name not known) of his father, Akhenaten. This is surprising, because only two wives of Akhenaten, Nefertiti and Kiya, are known to have had the title of Great Royal Wife, which the mother of the royal heir should bear. As a way out of this dilemma, Marc Gabolde has suggested that the apparent genetic closeness of Tutankhamun's parents is because his mother was his father's first cousin, Nefertiti. The apparent genetic closeness is then not the result of a single brother-sister mating but instead is due to three successive instances of marriage between first cousins.

To explain this idea we can look at an actual example. An historical example of how consanguinity can produce the same genetic effects as incest is provided by the Spanish branch of the Habsburg dynasty in 1700, as discussed in Family trees, pedigrees and hybridization networks.

This example can be explained using inbreeding F values. For any specified offspring, these indicate the probability of paired alleles being identical by descent (ie. due to the close relationship of the parents). For close family relationships the F values are:
uncle-niece = aunt-nephew  0.125
double first cousins  0.125
first cousins  0.063
first cousins once removed  0.031
second cousins  0.016

Note that incest produces F values of 0.250 while consanguinity values are 0.063 or greater.
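These values can be checked with a short Python sketch. It uses the standard recursive definition of the kinship coefficient, in which an individual's F is the kinship between its two parents. The pedigree below is a hypothetical toy example (not the Habsburg pedigree), and all names in it are invented for illustration; founders are assumed to be unrelated and non-inbred.

```python
from functools import lru_cache

# Hypothetical toy pedigree: individual -> (father, mother); founders have no parents
PEDIGREE = {
    "GF": (None, None), "GM": (None, None),   # founding couple
    "A": ("GF", "GM"), "B": ("GF", "GM"),     # full siblings
    "SA": (None, None), "SB": (None, None),   # unrelated spouses
    "C1": ("A", "SA"), "C2": ("SB", "B"),     # first cousins
    "sib_child": ("A", "B"),                  # parents are siblings
    "uncle_niece_child": ("A", "C2"),         # parents are uncle and niece
    "cousin_child": ("C1", "C2"),             # parents are first cousins
}


@lru_cache(maxsize=None)
def depth(ind):
    """Longest number of generations from `ind` back to a founder."""
    parents = [p for p in PEDIGREE[ind] if p is not None]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


@lru_cache(maxsize=None)
def kinship(a, b):
    """Probability that one allele drawn from a and one from b are identical by descent."""
    if a is None or b is None:
        return 0.0
    if a == b:
        father, mother = PEDIGREE[a]
        return 0.5 * (1.0 + kinship(father, mother))
    # Expand the individual with the greater depth: it cannot be an ancestor of the other
    if depth(a) < depth(b):
        a, b = b, a
    father, mother = PEDIGREE[a]
    return 0.5 * (kinship(father, b) + kinship(mother, b))


def inbreeding(ind):
    """Inbreeding coefficient F = kinship between the two parents."""
    father, mother = PEDIGREE[ind]
    return kinship(father, mother)


print(inbreeding("sib_child"))          # 0.25
print(inbreeding("uncle_niece_child"))  # 0.125
print(inbreeding("cousin_child"))       # 0.0625
```

Because the recursion sums over every path connecting the parents, the same function accumulates the contributions of repeated consanguineous marriages in a deeper pedigree, which is how a value above 0.250 can arise without any incestuous mating.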

If we consider the case of King Charles II of Spain (1661-1700), then his inbreeding F = 0.254, which was achieved entirely without incestuous relationships. His pedigree is shown in the post Family trees, pedigrees and hybridization networks.

This pedigree shows that the parents of each person had the following relationships:

himself = uncle-niece [ie. his parents were uncle and niece]

father = first cousins once removed [ie. his father's parents were first cousins once removed]
mother = first cousins

father's father = (a) = uncle-niece
father's mother = (b) = uncle-niece
mother's father = first cousins
mother's mother = first cousins once removed

father's father's father = not closely related
father's father's mother = first cousins
father's mother's father = not closely related
father's mother's mother = not closely related
mother's father's father = uncle-niece
mother's father's mother = second cousins
mother's mother's father = see person (a)
mother's mother's mother = see person (b)

Thus, on his father's side he was the third generation of consecutive consanguinity, and on his mother's side he was the fourth generation of consecutive consanguinity. This is simply an accumulating series of probabilities — consanguinity potentially produces problems and consecutive consanguinity simply increases the probability.

It is not surprising, then, that Charles suffered genetic problems (he was disfigured, physically disabled and mentally retarded) to such an extent that his royal lineage came to an end, and the Spanish branch of the Habsburg dynasty ceased to rule.

Incidentally, the scientist who devised the quantity F, Sewall Wright, himself had a rather high amount of inbreeding — his parents were first cousins.

April 26, 2015


"Late night" broadcasting on United States network / cable TV starts at about 11:00 or 11:30 pm, and goes for a couple of hours. Many networks broadcast similar shows during this time, which directly compete against each other for the available audience (which is currently estimated to be slightly in excess of 10 million people per night at 11:30 pm). Many of these shows have been on for a long time. Most of them are recorded on several weekday nights in front of a live audience, and they are usually associated with only a very few presenters over time (almost always men!).

For example, since the early 1990s we have had:
NBC Tonight Show (11:35-12:35): Jay Leno 1992-2009; Conan O'Brien 2009-2010; Jay Leno 2010-2014; Jimmy Fallon 2014-
NBC Late Night: David Letterman 1982-1993; Conan O'Brien 1993-2009; Jimmy Fallon 2009-2014; Seth Meyers 2014-
CBS Late Show: David Letterman 1993-2015
CBS Late Late Show: Tom Snyder 1995-1999; Craig Kilborn 1999-2004; Craig Ferguson 2005-2014; James Corden 2015-
ABC Kimmel Live: Jimmy Kimmel 2003-
ABC Nightline: Ted Koppel 1980-2005; Three-anchor team 2005-
ComedyCentral Daily Show: Craig Kilborn 1996-1998; Jon Stewart 1999-
ComedyCentral Colbert Report: Stephen Colbert 2005-2014
TBS Conan (11:00-12:00): Conan O'Brien 2010-

Eventually, the presenters retire or move elsewhere, and the other presenters then move around among the shows. This has led to the so-called "Late night wars", in which the NBC studio executives in charge repeatedly show that their personnel management skills are often lacking. For example, David Letterman was expected to replace Johnny Carson when Carson retired as the host of the NBC Tonight Show in 1992, but the job was given to Jay Leno, instead. So, Letterman moved to a directly competing show on CBS. When Leno subsequently moved to another show, Conan O'Brien took over. However, Leno then moved back again, and so O'Brien moved to a directly competing show on TBS. The media interest in these shenanigans exceeded their interest in the shows themselves.

Another substantial decision was that by ABC, at the end of 2012, to swap the timeslots of Nightline (which used to run 11:35-12:00) and Kimmel Live (which ran 12:00-13:00). This had a notable effect on the audience numbers, because Nightline was one of the top two shows in its original timeslot whereas Kimmel Live currently gets about 1 million viewers fewer per night in that same slot. On the other hand, Nightline in its new timeslot gets about the same audience as Kimmel Live did when it occupied the slot. That seems to be a net loss of audience for ABC.

The Nielsen Media Research viewing data are available online at the TV by the Numbers site. They provide the weekly averages for each show in millions of viewers, based on what is known as "live plus same day" viewing (ie. the audience at the time of broadcast plus same-day viewing of video recordings). The data I have looked at run from early December 2011 to the end of December 2014 (161 weeks). Unfortunately, these data rely on NBC press releases (rather than direct access to Nielsen), so there are some missing data.

The comparison of these shows can be visualized using a phylogenetic network, as a tool for exploratory data analysis. To create the network, I first calculated the similarity of the nine shows using the Manhattan distance; and a Neighbor-net analysis was then used to display the between-show similarities as a phylogenetic network. So, shows that are closely connected in the network are similar to each other based on their audience figures across the three years, and those that are further apart are progressively more different from each other.
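
As a rough illustration of the distance step, here is a minimal Python sketch of a Manhattan distance between weekly audience series. The numbers are hypothetical (not the Nielsen data), and the treatment of missing weeks (skipping them and rescaling to the full number of weeks) is an assumption made for this sketch, not necessarily what was done for the network shown here.

```python
from typing import Dict, List, Optional

# Weekly average audience in millions; None marks a missing week.
Series = List[Optional[float]]


def manhattan(a: Series, b: Series) -> float:
    """Manhattan distance over the weeks where both series have data,
    rescaled to the full number of weeks (an assumed way to handle gaps)."""
    if len(a) != len(b):
        raise ValueError("series must cover the same weeks")
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        raise ValueError("no weeks in common")
    total = sum(abs(x - y) for x, y in shared)
    return total * len(a) / len(shared)


def distance_matrix(data: Dict[str, Series]) -> Dict[str, Dict[str, float]]:
    """Pairwise distances between all shows, as a nested dictionary."""
    names = list(data)
    return {p: {q: manhattan(data[p], data[q]) for q in names} for p in names}


# Hypothetical numbers, for illustration only.
audience = {
    "Tonight Show": [3.5, 3.6, None, 3.4],
    "Late Show": [3.0, 3.1, 2.9, 3.0],
    "Conan": [0.8, 0.9, 0.8, None],
}

if __name__ == "__main__":
    for name, row in distance_matrix(audience).items():
        print(name, {k: round(v, 2) for k, v in row.items()})
```

A matrix of this kind is the sort of input that a Neighbor-net analysis takes; the network construction itself is not sketched here.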

The network shows a gradient of increasing audience size, from bottom-left to top-right. So, the Tonight Show consistently got an average nightly audience of c. 3.5 million people, while Conan had c. 0.8 million. The two CBS shows both consistently did somewhat worse than their NBC timeslot competitors.

The two ABC shows apparently did well, but this is confounded by the timeslot swap noted above. Nightline did well for the first year (before it was moved) but not for the second two years, while Kimmel Live did the opposite. This is what creates the big reticulation in the middle of the network, as all of the other shows had fairly consistent audiences throughout the three years.

However, there was a steady decrease in the total audience size across the three years, from c. 12 million per night (at 11:30 pm) at the end of 2011 to c. 10 million at the end of 2014. The only major exception to this was at the time when Jimmy Fallon took over from Jay Leno (early 2014). For several weeks the Tonight Show audience increased to >8 million per night, so that the total audience was c. 15.5 million (a 50% increase). This shows just how many people are available to be added to the late-night viewing, compared to how many watch regularly. So, why are they not watching in the other weeks? It seems that Late Night Television is not reaching its full potential.

April 21, 2015


Homology is a concept that is fundamental to biological studies, and yet it is difficult to define. Generally, characters are considered to be homologous among organisms if they have been inherited from a common ancestral character.

Homology is thus at the heart of phylogenetics, as it expresses the historical relationships among characters, whereas a phylogeny expresses the historical relationships among taxa (including individuals). Since the relationships among the taxa are based on pre-existing information about the relationships among the characters, homology must be established first. It is for this reason that multiple sequence alignments, for example, are so valuable.

However, homology is a relative concept; that is, it is context sensitive. It only applies locally, to any one level of the hierarchy of character generalization. The classic example of this idea is bird wings versus bat wings. These structures are homologous as forelimbs but not as wings – birds and bats independently modified their forelimbs into wings. So, homology exists at the more general level (forelimbs) but not at the less general level (wings). Forelimbs developed first in evolutionary history (the common ancestor of animals with four legs is ancient), and later these forelimbs were modified in different descendants, with some developing wings, some flippers, and some arms. Wings, flippers and arms are more recent, and are thus less general.

So, we can conceptualize characters as existing at many hierarchical levels of generality, depending on when they developed. We might have (going from specific to general) nucleotides, amino acids, protein domains, proteins, biosynthetic pathways, developmental origins, and anatomy, among many possible conceptual levels. Lower levels in the hierarchy "control" the upper levels, so that nucleotides code for amino acids, domains consist of strings of amino acids, proteins function as enzymes in biosynthesis, and development is controlled by biosynthetic pathways.

Figure: A nucleotide insertion and compensatory deletion result in two amino acid substitutions, so that simultaneously aligning homologous nucleotides and homologous amino acids is no longer possible.

The issue is that homology among characters can only be determined within any one hierarchical level. As noted by Fitch (2000): "Life would have been simple if phylogenetic homology necessarily implied structural homology or either of them had necessarily implied functional homology. However, they map onto each other imperfectly".

For example, homology of amino acids among a group of organisms does not necessarily imply that all of their coding nucleotides are homologous (see the figure above) — originally the nucleotides would also have been homologous, but insertions and deletions through time can break the original relationship between the amino acids and their coding nucleotides. So, one cannot always simultaneously align homologous amino acids and homologous nucleotides.
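
The conflict between the two levels can be made concrete with a small Python sketch. The sequence and the positions of the insertion and deletion are invented for illustration; only the standard genetic code is taken as given.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate complete codons, reading from the first nucleotide."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical ancestral sequence: ATG CTG AAA GAT TTC
original = "ATGCTGAAAGATTTC"

# Insert a G after the first codon, then delete one A further along,
# which restores the reading frame two codons later.
inserted = original[:3] + "G" + original[3:]
mutant = inserted[:9] + inserted[10:]

# Alignment by nucleotide homology: each event is shown as a gap.
nucleotide_alignment = ("ATG-CTGAAAGATTTC", "ATGGCTGAA-GATTTC")

if __name__ == "__main__":
    print(translate(original))  # protein before the two events
    print(translate(mutant))    # protein after the two events
    print(*nucleotide_alignment, sep="\n")
```

In the gapped nucleotide alignment every aligned pair of nucleotides is identical, but the codon boundaries no longer line up. Aligning the two proteins position by position instead pairs codons whose nucleotides are not all homologous.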

Similarly, homology of two anatomical features does not necessarily imply that their developmental sequences are homologous. This is an issue that the study of evo-devo has made increasingly obvious. That is, sometimes identity of morphological characters is not the result of identity of the sets of genes that control their development (Meyer 1999; Mindell and Meyer 2001; Wagner 2014) — non-homologous genes and gene networks can produce morphological structures that are usually considered to be homologs, and non-homologous structures can express homologous genes.

Developmental biologists therefore often prefer a process-oriented concept of homology, which they call 'biological homology', where homologous features are those sharing a set of developmental constraints (Wagner 1989). Indeed, the terms 'syngeny' (Butler and Saidel 2000) and 'homocracy' (Nielsen and Martinez 2003) have been coined to describe morphological features that are organized through the expression of homologous gene networks, irrespective of whether those features are evolutionarily homologous or convergent.

Reticulation and homology

This idea can be extended to other evolutionary scenarios. The one I am particularly interested in here is the consequence of reticulation. In the situations discussed above the character modifications (ancestral to derived) come from "within" the lineage (traditional ancestor-descendant gene inheritance), but the modifications can also come from "outside", by gene flow.

For example, Andam and Gogarten (2013) have noted that horizontal gene transfer (HGT) can in fact be used to provide information for the concept of a Tree of Life, because a transferred gene can also be regarded as a shared derived character. That is, HGT of a gene into an ancestor forms a synapomorphy for its descendants. This gene may subsequently diversify among those descendants, even following a simple tree-like pattern of descent.

This creates a terminological issue. If diversification occurs, then these genes are homologous in the traditional sense (they are modified descendants of a common ancestral character). However, how do they compare to genes in the descendants of species that did not receive the HGT, and to the genes from which the transfer occurred? In the first case they are not applicable (just as the concept of wings is not applicable to animals with flippers). In the second case our current concept of homology does not apply in any simple sense.

The hierarchical concept of homology is tied to a tree model of evolution. The hierarchical nature of characters results from the nested hierarchy of taxon relationships. If there is no nested hierarchy of taxon relationships then our current concepts of homology are inadequate. We need terms that describe possible reticulate relationships among the characters, not just hierarchical ones.

Thus, along with modifications to the concept of monophyly (see Monophyletic groups in networks), networks imply that we need modifications to the concept of homology, as well.


It is worth noting that a similar issue applies in other fields that are based on a concept of evolutionary history. For example, in historical linguistics words are considered to descend from ancestral languages and diversify among multiple daughter languages. These words are considered to be cognate (cf. homologous). However, words are also borrowed from unrelated languages, and these are loan words (cf. HGT). Loan words may also diversify among the daughter languages, both in the original language and in the borrowing language.

For example, the Germanic word *rīks (ruler) was borrowed from Celtic *rīxs (king), and it has come down to modern times as German 'Reich', English 'rich' (West Germanic), Swedish 'rike' (North Germanic), and Gothic 'reiks' (East Germanic) (see Wikipedia). This diversification has followed Grimm's Law, a regular phonological change that defines the Germanic family — so, the subsequent development of the loan word allows reconstruction of the evolutionary history, and the descendants are cognate. But are they cognate to the words descended from *rīxs within Celtic?


Andam CP, Gogarten JP (2013) Biased gene transfer contributes to maintaining the Tree of Life. In: Lateral Gene Transfer in Evolution (U Gophna, ed.), pp. 263-274. Springer: New York.

Butler AB, Saidel WM (2000) Defining sameness: historical, biological, and generative homology. Bioessays 22: 846-853.

Fitch WM (2000) Homology: a personal view on some of the problems. Trends in Genetics 16: 227-231.

Meyer A (1999) Homology and homoplasy: the retention of genetic programmes. In: Homology (GR Bock, G Cardew, eds), pp. 141-157. Wiley: Chichester.

Mindell DP, Meyer A (2001) Homology evolving. Trends in Ecology and Evolution 16: 434-440.

Nielsen C, Martinez P (2003) Patterns of gene expression: homology or homocracy? Development Genes and Evolution 213: 149-154.

Wagner GP (1989) The biological homology concept. Annual Review of Ecology and Systematics 20: 51-69.

Wagner GP (2014) Homology, Genes, and Evolutionary Innovation. Princeton University Press: Princeton NJ.

April 19, 2015


Phylogenetic networks were developed as a professional tool for displaying complicated evolutionary histories. However, this does not mean that such networks cannot be used elsewhere.

As an example, Pete Buchholz produces drawings of dinosaurs as the artist Ornithischophilia at the DeviantArt web site. Among these drawings are some phylogenies, and two of them are networks.

The first one is labelled Citrus is complicated, and refers to the origin of citrus cultivars.

The phylogenetic tree at the left is sourced from the American Journal of Botany, while the network at the right is from information in Wikipedia. The combination of the two appears to be original to the artist. The network is read from left to right; for example, the Limequat is a hybrid of the Key Lime and the Kumquat. Compared to the original Wikipedia text, the picture speaks a thousand words.

The second network is labelled Apples are complicated, and refers to the origin of some of the apple cultivars.

No source is given for the information, but I assume that it also comes from Wikipedia. Note that, as before, the network is read from left to right, but this time there is a time scale at the top. The artist refers to it as a "spaghetti diagram", and notes that:
Colors are based on the major parent that the "story" revolves around; purple for Honeycrisp, Yellow for Golden Delicious, Red for Jonathan, Maroon for Red Delicious, Orange for Cox's Orange Pippin, Teal for McIntosh, Green for Granny Smith, and Blue for Topaz.

April 14, 2015


This is a guest blog post by:
Johann-Mattis List, Centre des Recherches Linguistiques sur l'Asie Orientale, Paris, France

What we know, what we know we can know, and what we know we cannot know: Ontological facts and epistemological reality in historical linguistics and evolutionary biology
In a recent blog post (Multiple sequence alignment), David wrote about some theoretical issues regarding the concept of homology in evolutionary biology, and specifically its impact on the design of sequence alignment programs. In that post, he mentioned a recently published paper, where he discusses algorithms for sequence alignment and notes that "there is no known objective function for identifying homology" (Morrison 2015: 14).

This statement triggered my interest, since I was immediately reminded of problems that have been occupying historical linguists for a long time now. These problems arise from the fact that in historical disciplines, such as evolutionary biology or historical linguistics (but also in general history or some parts of geology), scholars are not trying to infer general laws of nature, but rather use knowledge of general laws to infer unique events.

The task of scholars working in these disciplines is similar to that of a crime investigator or a doctor: detectives use the evidence from a crime scene to infer the individual events that led to the crime (and arrest the culprit), and doctors use the symptoms of patients to identify their individual diseases (and then look for a way to cure them). Similarly, evolutionary biologists and historical linguists try to identify the evolutionary events that led to the observed diversity of life and languages, respectively.

What unites all these disciplines is the specific mode of reasoning that they employ. Charles Sanders Peirce (1839-1914) was among the first to investigate this reasoning mode in detail (Peirce 1931/1958: 7.202). He called it abduction, and contrasted it with induction and deduction, the traditional modes of logical reasoning. Induction is used to infer a currently unknown general rule from an initial state and its result state, while deduction infers the result state of an initial state and a general rule. On the other hand, abduction seeks to infer initial states from result states by employing a general rule.
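
The three modes can be contrasted in a toy Python sketch. The "general rule" here is an invented sound change, loosely modelled on the kind of shift described by Grimm's Law; the word forms, the alphabet and the function names are all hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# Toy general rule: a sound change that turns p into f and k into h.
RULE: Dict[str, str] = {"p": "f", "k": "h"}


def deduce(initial: str, rule: Dict[str, str]) -> str:
    """Deduction: initial state + general rule -> result state."""
    return "".join(rule.get(sound, sound) for sound in initial)


def induce(pairs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Induction: observed (initial, result) pairs -> general rule."""
    rule: Dict[str, str] = {}
    for initial, result in pairs:
        for before, after in zip(initial, result):
            if before != after:
                rule[before] = after
    return rule


def abduce(result: str, rule: Dict[str, str], alphabet: str) -> List[str]:
    """Abduction: result state + general rule -> candidate initial states."""
    candidates = ("".join(p) for p in product(alphabet, repeat=len(result)))
    return sorted(c for c in candidates if deduce(c, rule) == result)


if __name__ == "__main__":
    print(deduce("pak", RULE))
    print(induce([("pak", "fah")]))
    print(abduce("fah", RULE, "pfkha"))
```

Deduction and induction each return a single answer in this toy setting, whereas abduction returns several candidate initial states, because an f in the result may continue either an original p or an original f. Additional evidence is needed to choose among the candidates.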

What further complicates the task of evolutionary biologists and historical linguists is that we have only limited means to verify or falsify a given hypothesis, since, in contrast to detectives and doctors, our research objects usually do not confess, nor do they give positive feedback when we propose the right hypothesis. We never know whether we found the true murderer or whether we proposed the right cure.

Historical linguistics and the limits of knowledge

In historical linguistics, discussions regarding the limits of our knowledge have been centered around the question of the "nature of the proto-language". Using comparative techniques, in the second half of the 19th century linguists started to reconstruct ancestral words of languages that are not attested in any written source. Thus, linguists would first try to identify cognate (homologous) words in Indo-European languages, and then infer how these words were pronounced in the Indo-European language which was spoken some 8,000 years ago. This technique, which was originally introduced by August Schleicher (1821-1868) in 1861, became very popular, and has remained the standard way of knowledge representation in historical linguistics. Whenever linguists propose such a reconstructed form, based on various pieces of evidence, they use an asterisk symbol * to indicate that the word has been inferred, and that there is no written source that would confirm its existence.

As an example, consider some of the words for "sun" in Indo-European languages (discussed in detail in List 2014: 136):
According to modern historical linguistics theory, these words are all assumed to go back to the same ancestral word in Indo-European. The reconstructed pronunciation of the ancestral form is traditionally represented as *séh₂u̯el- "sun" and an approximate pronunciation of the nominative singular would be [soxwl] (with [x] indicating the same sound as the ch in German Rauch "smoke").

These techniques are generally thought to be quite reliable, and they provided concrete help in the decipherment of many ancient languages (including the Egyptian hieroglyphs, Linear B, and Hittite). The status of the reconstructions that scholars produced was, however, a matter of controversy. While some scholars claimed that there was a high probability that the proposed reconstructions would come close to the original pronunciation, others would classify them as pure fiction (Schmidt 1872).

Linear B
While it is obvious that reconstructions represent hypotheses and not indisputable truths, it is less clear how they relate to the actual historical facts. First of all, we know for sure that our hypotheses are not stable over time. As our knowledge of the evidence increases, as we include more languages in our comparison, or get deeper insights into the major processes underlying language history, our hypotheses will also constantly be changed and refined. This is nicely reflected in August Schleicher's Fable (a short parable called "The Sheep and the Horses"), a text that he wrote in his reconstructed version of Proto-Indo-European, in order to illustrate what was by then known about the origin of the Indo-European language. When looking at the many later versions, written by scholars in order to illustrate how our knowledge of Indo-European had changed since then, the differences in the pronunciations are really striking (see this summary in Wikipedia), but so are the similarities.

Judging from the degree to which these reconstruction hypotheses evolved over about 150 years, we can reach an important, apparently paradoxical, conclusion: While our reconstructions in historical linguistics are far from being realistic (in the sense of representing actual pronunciations of an Indo-European people), they are by no means fictions, as Johannes Schmidt claimed long ago. The reconstructions are not (and never will be) realistic, since they will always be preliminary, depending on our currently available data and the theoretical development in our field. On the other hand, the reconstructions are also not necessarily unrealistic, since they reflect scientific hypotheses that have been constantly refined and independently developed using the best knowledge we have at that moment. So, although we know that our hypotheses do not truly reflect what really happened, we have good reasons to assume that they come much closer to the real story than any random hypothesis.

As reflected in David's aforementioned statement regarding the lack of an objective function for homology identification in evolutionary biology, the problem of assessing the realism of our hypotheses is not unique to historical linguistics. In a similar way to that with which we discuss the realism of our reconstructed forms in historical linguistics, one may discuss the realism behind any multiple sequence alignment in evolutionary biology. The objects of investigation in historical linguistics and evolutionary biology are not directly accessible to the researchers, but can only be inferred by tests and theories.

Interestingly, this problem also occurs in the social sciences. In psychology, for example, such attributes of people as "intelligence" cannot be directly observed, but have to be inferred by measuring what they provoke or how they are "reflected in test performance" (Cronbach and Meehl 1955: 178). What is inferred by psychological tests is usually called a construct, and is strictly separated from the underlying quality that scholars originally wanted to measure. The construct is thereby understood as the "fiction or story put forward by a theorist to make sense of a phenomenon" (Statt 1981 [1998]: 67). As in the case of reconstruction in linguistics or homology assessment in biology, it is not the "real" object or process.


What can we conclude from this? Or, to put it differently, why should we care about constructs or the degree of fiction behind our claims in historical linguistics and evolutionary biology? I see two important reasons to do so.

First, we can avoid confusion in our fields by strictly separating ontological facts and epistemological reality. In evolutionary biology, this would help to avoid the confusion that often arises when scholars talk about homologous genes, when in practice what they mean is that they applied some similarity threshold and some cluster procedure to cluster genes in sets of presumed homologs. In historical linguistics, on the other hand, it would help us to get rid of the tiresome debate between formalists (who emphasize that reconstructed forms are simple formulas) and realists (who take reconstructed forms as realistic representations) in reconstruction.

Second, from a broader viewpoint, as scientists, we should always try to be explicit in our claims, and we should also always try to be honest about what we know, what we know we can know, and what we know we cannot know.


Cronbach LJ, Meehl PE (1955) Construct validity in psychological tests. Psychological Bulletin 52: 281-302.

List J-M (2014) Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.

Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.

Peirce CS (1931/1958) Collected papers of Charles Sanders Peirce. Ed. by C Hartshorne and P Weiss. Cont. by AW Burks. 8 vols. Cambridge MA: Harvard University Press.

Schleicher A (1861) Compendium der vergleichenden Grammatik der indogermanischen Sprache. Vol. 1: Kurzer Abriss einer Lautlehre der indogermanischen Ursprache. Weimar: Böhlau.

Schmidt J (1872) Die Verwantschaftsverhältnisse der indogermanischen Sprachen. Weimar: Hermann Böhlau.

Statt DA, comp. (1981 [1998]) Concise Dictionary of Psychology, 3rd ed. London and New York: Routledge.

April 12, 2015


I have noted before (Evolution and timelines) that any history can be represented as a timeline, but a timeline diagram does not necessarily show an evolutionary history. Unfortunately, this does not stop people from putting the word "evolution" on their timeline diagrams.

One ambitious example is The Evolution of the Web. Two images are shown below, which illustrate some of the transformational history of web browsers and technology, depicted as complex timelines. This represents complex transformational evolution (see The evolutionary March of Progress in popular culture), rather than variational evolution.

The full majesty, and complexity, of the timeline can be seen at the interactive version linked above.

April 7, 2015


Phylogenetic networks are intended to display reticulate evolutionary histories, rather than strictly divergent or transformational histories. This idea applies both to species and higher taxa (where the ancestors might be inferred), and to individuals and populations (where some of the ancestors might be sampled). However, the literature is still replete with studies that use one or more phylogenetic trees for displaying reticulate phylogenies.

A recent example is shown by: Umer Chaudhry, Elizabeth M. Redman, Muhammad Abbas, Raman Muthusamy, Kamran Ashraf, John S. Gilleard (2015) Genetic evidence for hybridisation between Haemonchus contortus and Haemonchus placei in natural field populations and its implications for interspecies transmission of anthelmintic resistance. International Journal for Parasitology 45: 149-159.

These authors sampled nematode parasites from sheep, goats, cattle and buffaloes at abattoirs in Pakistan and southern India. These parasites were morphologically characterized as being predominantly either Haemonchus contortus or Haemonchus placei. The worms were then genotyped in several ways, including: SNPs of rDNA ITS-2, microsatellite markers, sequences of nuclear isotype-1 of β-tubulin, and sequences of mitochondrial NADH dehydrogenase subunit 4. The genotyping revealed several individual worms that were considered to be inter-species F1 hybrids.

The phylogenetic tree from the β-tubulin sequences is shown in the first figure. There were 25 haplotypes identified among the worms. Most of the worms were homozygous, with haplotypes that were identified as either H. contortus or H. placei. However, five worms were discovered to be heterozygous, with one haplotype considered to have come from each of the species.

The hybrid status of the worms is shown in the phylogenetic tree by having the hybrids appear twice, once for each of their haplotypes, with the other worms appearing only once. Thus, the actual reticulate history is not made visually obvious.

A better approach would be to use a phylogenetic network. This is straightforward in this case. From the perspective of the worms (rather than the haplotypes), the phylogenetic tree is a so-called MUL-tree, in which some of the taxon labels appear multiple times (and some appear only once). The labels that appear once represent homozygous worms, which can be seen as being "monoploid" for this locus. The labels that appear twice represent heterozygous worms, which can be seen as being "diploid".
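
The first step of this reading, spotting which labels occur more than once, can be sketched in a few lines of Python. The tree below is a hypothetical MUL-tree written as nested tuples, and the worm labels are invented; this is not the Padre algorithm, only the label count that such a conversion starts from.

```python
from collections import Counter
from typing import Iterator, Tuple, Union

# A tree is either a leaf label or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]

# Hypothetical MUL-tree: worm "h1" carries one haplotype in each clade.
mul_tree: Tree = ((("w1", "w2"), "h1"), (("w3", "h1"), "w4"))


def leaves(node: Tree) -> Iterator[str]:
    """Yield every leaf label, including repeats."""
    if isinstance(node, str):
        yield node
    else:
        for child in node:
            yield from leaves(child)


def label_multiplicity(tree: Tree) -> Counter:
    """Count how many times each label appears among the leaves."""
    return Counter(leaves(tree))


counts = label_multiplicity(mul_tree)
# Labels seen more than once are the candidate hybrids (heterozygous worms).
hybrids = sorted(label for label, n in counts.items() if n > 1)

if __name__ == "__main__":
    print(dict(counts))
    print(hybrids)
```

In a network, each repeated label would be drawn once, with reticulation edges joining it to the places where its copies sat in the MUL-tree.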

MUL-trees where the labels represent different ploidy levels can easily be turned into a network using the Padre program. The result is shown in the next figure, which is therefore a hybridization network.

The actual history of the worms is now clear. Interestingly, one of the hybridization events seems to be older than the other four.

As an aside, it is also worth pointing out a mis-interpretation of the phylogenetic tree produced from the mitochondrial ND4 sequences. This tree is shown in the next figure — I have added the annotations at the right.

The phylogeny shows 12 haplotypes considered to be H. contortus and 14 haplotypes considered to be H. placei. One of the hybrids clearly has a H. contortus haplotype, indicating that its maternal parent came from this species. However, the other four hybrids cannot be unequivocally identified as having H. placei mothers (as claimed by the authors), as their haplotypes are all sisters to the H. placei haplotypes — all of the H. placei haplotypes share a common ancestor that is not shared with the hybrids. Given the root of the tree, H. placei is a more likely identification than is H. contortus, but the tree does not provide unequivocal evidence.

April 5, 2015


The cost of renting or leasing office space differs dramatically around the world. This is obviously of great importance to businesses, as their profitability depends on the balance between income and costs. Their expenditure on office space can thus determine whether or not it is profitable for them to do business in certain cities.

The CBRE Group Inc. is an American commercial real estate company, and they provide an annual Global Prime Office Occupancy Costs report that addresses this business cost. It is a survey of office occupancy costs for prime office space in a large number of cities worldwide. Occupancy costs for business premises represent rent, plus local taxes and service charges. The report notes that: "The occupation cost figures have also been adjusted to reflect different measurement practices from market to market."

Each report lists the top 50 most expensive office locations in the world during the previous year, along with the average occupancy cost (in US$ / sq ft / annum). The locations examined may be the central business district of each city or several parts of some cities, depending on how much office space is available. The list of locations continues to expand every year, but only the top 50 are ever listed in each report.

The CBRE web site currently contains the data for the years 2008-2010 and 2012-2014. There are 71 locations that have appeared in these six top-50 lists, although only 30 of them have appeared in the top 50 in all six years (and seven have appeared only once).

Of course, a phylogenetic network could be used to visualize the data for each location across the six reports, as a tool for exploratory data analysis. To create the network, I first calculated the similarity of the 30 locations using the Gower similarity; and a Neighbor-net analysis was then used to display the between-location similarities as a phylogenetic network. So, locations that are closely connected in the network are similar to each other based on their office costs across the six years, and those that are further apart are progressively more different from each other.
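
For purely quantitative data such as these, the Gower similarity between two locations is one minus the average, across years, of the absolute cost difference divided by that year's range across all locations. A minimal Python sketch follows; the location names and costs are hypothetical, and complete data are assumed (as for the 30 locations present in all six reports).

```python
from typing import Dict, List

# Hypothetical occupancy costs (US$ / sq ft / annum), one value per report.
costs: Dict[str, List[float]] = {
    "Location A": [200.0, 220.0, 210.0],
    "Location B": [100.0, 110.0, 120.0],
    "Location C": [50.0, 60.0, 70.0],
}


def yearly_ranges(data: Dict[str, List[float]]) -> List[float]:
    """Range (max - min) of each year's costs across all locations."""
    columns = list(zip(*data.values()))
    return [max(col) - min(col) for col in columns]


def gower_similarity(a: List[float], b: List[float], ranges: List[float]) -> float:
    """Gower similarity for quantitative variables with complete data."""
    terms = [
        0.0 if r == 0 else abs(x - y) / r  # a zero range means no variation
        for x, y, r in zip(a, b, ranges)
    ]
    return 1.0 - sum(terms) / len(terms)


ranges = yearly_ranges(costs)

if __name__ == "__main__":
    names = list(costs)
    for p in names:
        row = [round(gower_similarity(costs[p], costs[q], ranges), 3) for q in names]
        print(p, row)
```

Subtracting each similarity from one gives a distance, which is the form in which such values would typically be passed to a Neighbor-net analysis.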

The network shows a gradient of decreasing office costs, from bottom-left to top-right. So, the consistently most expensive locations have been the West End of London and central Hong Kong, followed by Moscow and central Tokyo. London City and Kowloon, in Hong Kong, are not far behind, showing that you cannot avoid high costs for prime office space in these two cities.

Across the locations, the most expensive ones cost on average 3.4 times as much as the cheapest locations. Note that Midtown Manhattan is not nearly as expensive for office rental as it is for living accommodation. Switzerland has only two cities, and both of them are in the middle of the network; so it is not cheap, either.

In the network, Dubai and central Mumbai are isolated from the other locations because their office rents have decreased over the six reports. In the case of Mumbai, the most expensive offices recently have been in the Bandra Kurla complex, instead of Nariman Point.

March 31, 2015


It is tolerably well known that Alfred Russel Wallace developed the idea of evolution via natural selection quite independently of Charles Darwin, and that, indeed, it was Wallace's revelation of this fact that prompted Darwin to finally publish his ideas (Bannister et al. 2014).

Some people are even aware that Wallace developed the Tree of Life metaphor independently, as well (Wallace 1855), a fact of which Darwin himself was perfectly well aware (eg. Bradman and Bartlett 1998):
"the analogy of a branching tree [is] the best mode of representing the natural arrangement of species ... a complicated branching of the lines of affinity, as intricate as the twigs of a gnarled oak ... we have only fragments of this vast system, the stem and main branches being represented by extinct species of which we have no knowledge, while a vast mass of limbs and boughs and minute twigs and scattered leaves is what we have to place in order, and determine the true position each originally occupied with regard to the others."What is less well known is Wallace's contribution to phylogenetic imagery.

The Darwinian version of a phylogenetic tree is, of course, something usually considered to post-date 1859, when Darwin published his best-known book. However, producing such a tree was apparently a rather slow process. For example, in 1863, Franz Hilgendorf wrote a PhD thesis for which he produced a hand-drawn phylogeny, but he did not actually include this in the thesis; and he significantly modified it for its publication in 1866. In 1864 Fritz Müller published a couple of three-taxon trees. Also in 1864, Ernst Haeckel claimed to have started work on his series of phylogenetic trees, but the resulting book was not published until 1866. This means that the first substantial tree to appear in print was that of Mivart (1865).

However, long before this, Wallace was already moving ahead. In 1856 Wallace took the tree imagery from his 1855 publication and applied it to the relationships among bird groups. This publication was his first clearly evolutionary empirical contribution. He adapted the unrooted diagram of Strickland (1841), which represented "the natural system" of bird relationships, and gave it a clearly evolutionary interpretation. So, while Strickland's work was strictly atemporal and non-evolutionary, Wallace produced an evolutionary view of the world, with his two trees representing the end-product of change through time.

Wallace was in South-East Asia at the time of this work, collecting specimens among the islands of what is now Indonesia. He returned to England in 1862, thus having been absent during Darwin's rise to fame. However, he did return before anyone else had tackled Darwin's ideas empirically, and he was in an ideal position to do so himself (Beckenbauer et al. 2010). It would therefore be surprising if he had not done so.

Recently, it has become clear, as a result of the work done for the Wallace Correspondence Project, that Wallace did, indeed, produce a post-Darwinian phylogenetic diagram before any of his contemporaries, although it remained unpublished (Becker and Borg 2014). Not unexpectedly, it also refers to the relationships among birds. What is most interesting for us, however, is that it was a phylogenetic network, not a tree.

You will note that it is an unrooted network, in the same manner as his unrooted bird trees from 1856. In this, his presentation differed from that of Müller, Hilgendorf, Mivart and Haeckel, who all indicated a common ancestor. On the other hand, the branch lengths represent the "relative amount of affinity" between the named taxa, unlike the diagrams of his contemporaries. This means that the diagram can, indeed, be interpreted (in modern terms) as an unrooted phylogenetic network.

In his bird paper, Wallace (1856) had noted that producing the tree diagrams is not easy, as "you will most likely find that you have set down some conflicting affinities, or that you have mistaken some mere analogies for affinities". This seems to be the origin of his interest in the alternative model of a network, rather than a tree (Brabham and Berger 2014), thus making him the first person to use a data-display network to represent conflicting character data.

This post was inspired by the work of Torvill and Dean (1996). Happy April 1.


Bannister RG, Ballesteros-Sota S, Bjørndalen OE (2014) Running, swinging and skiing — the private life of Alfred Russel Wallace. Studia Wallaceana 6: 82-96.

Becker BF, Borg BR (2014) The phylogenetics of A.R. Wallace, and its relation to the science of tennis. Journal of Phylogenetic Inference 13: 101-110.

Beckenbauer FA, Best G, Bruyneel J (2010) Association football as a metaphor for phylogenetics. Is it a sport or a science? Phyloinformatics 7:1.

Brabham JA, Berger G (2014) The speed required to achieve the publication rate of A.R. Wallace. Philosophy and History of Biology 102: 89-92.

Bradman DG, Bartlett KC (1998) Wallace Down Under: the work of Alfred Russel Wallace in the southern hemisphere. Systematic Zoology 47: 767-780.

Haeckel E (1866) Generelle Morphologie der Organismen. Verlag von Georg Reimer, Berlin.

Hilgendorf F (1866) Planorbis multiformis im Steinheimer Süßwasserkalk: ein beispiel von gestaltveränderung im laufe der zeit. Buchhandlung von W. Weber, Berlin.

Mivart StG (1865) Contributions towards a more complete knowledge of the axial skeleton in the primates. Proceedings of the Zoological Society of London 33: 545-592.

Müller F (1864) Für Darwin. Verlag von Wilhelm Engelman, Leipzig.

Strickland HE (1841) On the true method of discovering the natural system in zoology and botany. Annals and Magazine of Natural History 6: 184-194.

Torvill J, Dean CC (1996) Skating on thin ice. Systematic Biology 45: 641-650.

Wallace AR (1855) On the law which has regulated the introduction of new species. Annals and Magazine of Natural History 16 (2nd series): 184-196.

Wallace AR (1856) Attempts at a natural arrangement of birds. Annals and Magazine of Natural History 18 (2nd series): 193-216.

March 29, 2015


NeighborNet produces splits graphs based on distances between the taxa, rather than using the original character data. This approach can produce what we might call inconsequential splits in the graph — that is, splits that are not explicitly supported by the character data. Here, I present a simple example to illustrate the extent to which this can occur.

The data are taken from: Nanette Thomas, Jeremy J. Bruhl, Andrew Ford, Peter H. Weston (2014) Molecular dating of Winteraceae reveals a complex biogeographical history involving both ancient Gondwanan vicariance and long-distance dispersal. Journal of Biogeography 41: 894-904.

This dataset consists of a set of eight morphological features of the pollen from 31 extant plant taxa plus two fossil samples, as shown in this data matrix:

T_lanceolata        00111011
T_stipitata         00111011
T_purpurescens      00111011
T_xerophila_x       00111011
T_xerophila_r       00111011
T_vickeriana        00111011
T_glaucifolia       00111011
T_membranea         00111011
T_insipida          00111011
----------------------------
T_perrieri          00111010
D_winteri           00111010
D_grenadensis       00111010
----------------------------
B_comptonii         00011010
B_howeana           00011010
B_semicarpoides     00011010
B_whiteana          00011010
B_queenslandiana_q  00011010
B_queenslandiana_1  00011010
----------------------------
P_axillaris         00011011
P_colorata          00011011
Pseudowinterapollis 00011011
----------------------------
B_pancheri          01001011
----------------------------
Harrisipollenites   01001100
----------------------------
Z_acsmithii         01001101
E_stipitatum        01001101
Z_bicolor           01001101
----------------------------
Z_balansae          11001101
----------------------------
C_dinisii           1-111101
C_madagascariensis  1-111101
W_salutaris         1-111101
P_macranthum        1-111101
C_ekmanii           1-111101
C_winterana         1-111101

Note that there are only nine groups of taxa (separated by the dashed lines) — within each group the data are identical. Each character has two states: present / absent.
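
The count of nine groups is easy to check by machine. The following is a minimal Python sketch, not part of the original analysis, that collects the taxa sharing an identical character string. The function name `group_identical` is hypothetical, and the matrix is simply copied from above.

```python
from collections import defaultdict

# Pollen character matrix as listed in the post: taxon name, then eight
# binary characters ("-" marks a state that is not scored).
MATRIX = """
T_lanceolata        00111011
T_stipitata         00111011
T_purpurescens      00111011
T_xerophila_x       00111011
T_xerophila_r       00111011
T_vickeriana        00111011
T_glaucifolia       00111011
T_membranea         00111011
T_insipida          00111011
T_perrieri          00111010
D_winteri           00111010
D_grenadensis       00111010
B_comptonii         00011010
B_howeana           00011010
B_semicarpoides     00011010
B_whiteana          00011010
B_queenslandiana_q  00011010
B_queenslandiana_1  00011010
P_axillaris         00011011
P_colorata          00011011
Pseudowinterapollis 00011011
B_pancheri          01001011
Harrisipollenites   01001100
Z_acsmithii         01001101
E_stipitatum        01001101
Z_bicolor           01001101
Z_balansae          11001101
C_dinisii           1-111101
C_madagascariensis  1-111101
W_salutaris         1-111101
P_macranthum        1-111101
C_ekmanii           1-111101
C_winterana         1-111101
"""


def group_identical(matrix_text):
    """Map each distinct character string to the list of taxa that carry it."""
    groups = defaultdict(list)
    for line in matrix_text.strip().splitlines():
        taxon, states = line.split()
        groups[states].append(taxon)
    return dict(groups)


groups = group_identical(MATRIX)
for states, taxa in groups.items():
    print(states, len(taxa), ", ".join(taxa))
```

Each printed line is one group: the shared character string, the number of taxa in the group, and their names.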

The resulting NeighborNet, as produced by default using the SplitsTree4 program, is shown in the first graph.

As expected, the taxa form nine groups. There are a number of apparently well-supported splits (ie. with long edges) separating these groups. There are also a number of smaller splits, and a whole series of very tiny splits. Neither of these latter two kinds of split is explicitly present in the dataset; the only splits supported by the characters are plotted onto the graph using the character numbers. (Note that character 5 is uninformative.)
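
To see which splits the characters themselves support, one can list, for each character, the groups scored 0 and the groups scored 1. This is a rough sketch under my own assumptions: the labels G1 to G9 are hypothetical shorthand for the nine groups in matrix order, and groups scored "-" are simply left out of that character's split.

```python
# The nine distinct character patterns from the data matrix, in matrix order.
# The labels G1..G9 are hypothetical shorthand, not names used in the post.
PATTERNS = {
    "G1": "00111011",
    "G2": "00111010",
    "G3": "00011010",
    "G4": "00011011",
    "G5": "01001011",
    "G6": "01001100",
    "G7": "01001101",
    "G8": "11001101",
    "G9": "1-111101",
}


def character_splits(patterns):
    """For each character (numbered from 1), return the pair
    (groups scored 0, groups scored 1). Unscored groups are omitted."""
    n_chars = len(next(iter(patterns.values())))
    splits = {}
    for i in range(n_chars):
        zeros = frozenset(g for g, s in patterns.items() if s[i] == "0")
        ones = frozenset(g for g, s in patterns.items() if s[i] == "1")
        splits[i + 1] = (zeros, ones)
    return splits


splits = character_splits(PATTERNS)

# A character with an empty side separates nothing, so it supports no split.
constant = [c for c, (zeros, ones) in splits.items() if not zeros or not ones]
print("characters supporting no split:", constant)

for c, (zeros, ones) in splits.items():
    if c not in constant:
        print(c, sorted(zeros), "|", sorted(ones))
```

Only character 5 ends up in `constant`, and characters 6 and 7 separate the same two sets of groups, so the eight characters support fewer than eight distinct splits.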

The series of very tiny splits are present throughout the graph as extremely short edges. For example, a detailed view of the bottom left-hand corner of the graph is shown in the next figure.

Note that these six taxa have identical character data, and therefore their separation into four groups is entirely an artifact of the NeighborNet algorithm.
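
The point can be checked from the distances themselves. The sketch below assumes a simple proportion-of-differences distance that skips unscored characters; the post does not say exactly which distance or missing-data treatment SplitsTree4 applied, so treat `p_distance` as a hypothetical stand-in. The rows are copied from the data matrix.

```python
def p_distance(a, b, missing="-"):
    """Proportion of differing characters, skipping positions that are
    unscored in either row."""
    pairs = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not pairs:
        raise ValueError("no comparable characters")
    return sum(x != y for x, y in pairs) / len(pairs)


# A few rows taken from the data matrix.
rows = {
    "B_comptonii": "00011010",
    "B_howeana": "00011010",
    "T_lanceolata": "00111011",
    "C_dinisii": "1-111101",
    "Z_balansae": "11001101",
}

same_group = p_distance(rows["B_comptonii"], rows["B_howeana"])
between_groups = p_distance(rows["B_comptonii"], rows["T_lanceolata"])
with_missing = p_distance(rows["C_dinisii"], rows["Z_balansae"])
print(same_group, between_groups, with_missing)
```

Taxa with identical rows are at distance zero from each other, so under this assumption the input distances contain no signal that could separate them.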

So, one needs to be careful when interpreting small splits in such a graph: they may have biological support and they may not.

March 24, 2015


In the literature, phylogenetic trees often appear even when the paper is discussing non-tree evolutionary histories.

A case in point is the paper by: Susanne Gallus, Axel Janke, Vikas Kumar, Maria A. Nilsson (2015) Disentangling the relationship of the Australian marsupial orders using retrotransposon and evolutionary network analyses. Genome Biology and Evolution, in press.

The authors discuss the relationship between the four Australian marsupial orders, and use data from transposable element (retrotransposon) insertions for resolving the inter- and intra-ordinal relationships of the Australian and South American orders. They plot the retrotransposon presence/absence onto a tree derived from alignments of 28 nuclear gene fragments. This is shown in the first figure, with the retrotransposons indicated as dots on the internal branches.

For comparison, the next figure is a Median-Joining network based on the presence/absence of the retrotransposons.

With the exception of the Monito del monte, Shrew opossum and Western quoll, the network matches the basic tree structure. However, it emphasizes more strongly the fact that the retrotransposons do not resolve the relationships among the marsupial orders. As the authors note:
The retrotransposon insertions support three conflicting topologies regarding Peramelemorphia, Dasyuromorphia and Notoryctemorphia, indicating that the split between the three orders may be best understood as a network ... The rapid divergences left conflicting phylogenetic information in the genome possibly generated by incomplete lineage sorting or introgressive hybridisation, leaving the relationship among Australian marsupial orders unresolvable as a bifurcating process million years later.

March 22, 2015


Phylogenetic networks can be used to illustrate the history of any set of objects or concepts, provided that this history is a divergent one (ie. the history is not simply the transformation of objects through time).

Since I have recently been writing about sequence alignments, it is worthwhile to show an example of applying a network to sequence alignment programs. This comes from the paper by Chaisson MJ, Tesler G (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13: 238.

The authors discuss programs that map reads from a sample genome onto a reference sequence. They note: "the relationship between many existing alignment methods is qualitatively illustrated in the figure."

Their legend reads:
The applications / corresponding computational restrictions shown are: (green) short pairwise alignment / detailed edit model; (yellow) database search / divergent homology detection; (red) whole genome alignment / alignment of long sequences with structural rearrangements; and (blue) short read mapping / rapid alignment of massive numbers of short sequences. Although solely illustrative, methods with more similar data structures or algorithmic approaches are on closer branches. The BLASR method combines data structures from short read alignment with optimization methods from whole genome alignment.

The reticulation refers to their new program, which "maps reads using coarse alignment methods developed during WGA [whole genome alignment] studies, while speeding up these methods by using the advanced data structures employed in many NGS [next generation sequencing] mapping studies."

March 17, 2015


Multiple sequence alignment programs have not yet met their primary aim for evolutionary biologists: maximizing homology of characters. If our goal is to develop an automated procedure for homology assessment, then we need someone to produce a program that explicitly implements this aim.

Alignment is just as much a part of phylogenetics as is tree or network building. It is the procedure that expresses the homology relationships among the characters, rather than the historical relationships among the taxa. Therefore, we need a computer program that accurately expresses homology relationships, as well as one that accurately expresses the historical relationships. We have some programs for the latter but currently nothing for the former.

Unfortunately, homology is a rather nebulous concept. It has to do with inheriting characters from a shared ancestor, which is not something that we can directly observe. Therefore we have to infer it. Somehow.

Homology criteria

Systematists have developed criteria for making decisions about potential homologies in an objective and (hopefully) repeatable manner, and these are directly applicable to nucleotide sequences, which these days are the most common form of data used in phylogenetics. These criteria are:

• Similarity
  1. Compositional = apparent likeness or resemblance between sequences (% similarity)
  2. Topographical = apparent likeness or resemblance between sequences (second- and third-order structure of proteins or RNA)
  3. Functional = functional relationship to other characters in the same sequence (annotated function of the sequence in protein or RNA)
• Conjunction = possible within-genome copies of the same sequence (i.e. paralogy)
• Congruence = agreement with other postulated homologies elsewhere in the same sequences (synapomorphy).
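
Of these criteria, compositional similarity is the easiest to make concrete. As a minimal sketch, here is one way to compute the % similarity of two already-aligned sequences in Python. The function name `percent_identity` is hypothetical, and the choice to ignore columns with a gap in either sequence is an assumption, since several definitions of the denominator are in use.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percentage of aligned columns carrying the same residue.
    Columns with a gap in either sequence are not counted."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != gap and b != gap
    ]
    if not compared:
        return 0.0
    matches = sum(a == b for a, b in compared)
    return 100.0 * matches / len(compared)


# Eight gap-free columns, seven of which match.
identity = percent_identity("ACGT-ACGT", "ACGTTACCT")
print(identity)
```

Note that this number says nothing about why the residues are alike, which is exactly why similarity on its own is a weak criterion for homology.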

Traditionally, characters have been first proposed as homologous using the criteria of similarity and conjunction (together called primary homology), and then tested with the criterion of congruence (secondary homology).

It is important to note that these criteria do not always agree with each other in their inferences of homology. Changes that occur during evolutionary history can weaken the connection between the criteria so that, for example, nucleotide homology inferred from structural similarity is no longer the same as nucleotide homology inferred from compositional similarity. It is for this reason that compositional similarity of the sequences is insufficient to establish gene orthology, for example. The same limitation applies to nucleotides.

Current computer programs

It is clear that these criteria have been incorporated singly into current computerized procedures for producing multiple sequence alignments, but rarely in combination. For example, compositional similarity is the criterion used by the most popular computer programs, such as CLUSTAL, MAFFT and Muscle. Topographical similarity is being invoked whenever structure-based alignments are produced, such as for RNA-coding sequences (eg. PicXAA-R; PMFastR), or when nucleotide sequences are translated to amino acids before alignment (eg. PROMALS). Functional similarity is used for specialist studies of conserved motifs and binding sites, for instance. Ontogenetic similarity of nucleotide sequences is based on inferring the possible molecular processes that cause the observed sequence variation; the program Prank uses this criterion by distinguishing between insertions and deletions.

Congruence as a criterion involves the observation of repeated patterns of synapomorphy in a phylogeny. Among alignment algorithms, both Direct Optimization (e.g. POY; MSAM; BeeTLe) and Statistical Alignment (e.g. BAli-Phy; StatAlign) try to simultaneously produce a multiple alignment and a phylogenetic tree, thus optimizing the criterion of congruence.

The fact that essentially none of the current crop of programs applies more than one criterion is, I contend, the principal reason why so many phylogeneticists adjust their alignments manually. Personal judgment may not be perfect, but at least it can be consciously based on homology as a general character concept. Since the different criteria may conflict with each other, at the moment only human judgment is available to compare them and thus make a final decision.

Required program

To make the homology criteria fully operational, we need to compare their inferences by evaluating the comparative evidence. That is, since the different criteria may conflict with each other, we need an automated way to compare them and evaluate their relative probabilities for any alignment column. What we need is a computerized procedure that will include all of the known criteria for homology assessment. Sadly, there are currently no mathematical models for doing this.

I suspect that there are two reasons for the failure of such a program to appear by now. First, biologists have not been clear about homology as a concept, and have not been able to express it in a form that computationalists could use to develop an algorithm. That is, we have criteria but they are not really operational criteria in a computational sense. Second, it will not be easy, because there is no obvious algorithm for inferring inheritance of characters. That is, we cannot easily separate homology from analogy.

Interactive editor

Another proposal is to have an interactive alignment editor. This editor would have the ability to show the conflicting hypotheses of homology (eg. where the homology suggested by structural pairing in a stem conflicts with homology suggested by tandem repeats), and then to annotate each column in the final alignment with the reason for the researcher having chosen to align those particular nucleotides. For example, one could press a button and see the RNA stem pairs in different colours (irrespective of whether the stem nucleotides are aligned), or press again and see the tandem repeats and inversions in different colours (once again, irrespective of how the nucleotides are aligned). One could also choose to see the annotations for the columns (summarized, using some coded schema), or simply look at the unadorned alignment itself.
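
To make the idea of per-column annotation a little more tangible, here is a hypothetical sketch of the kind of data structure such an editor might keep. Nothing here describes an existing program: the class names, the one-letter codes and the example sequences are all invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Criterion(Enum):
    """Homology criteria, each with a hypothetical one-letter display code."""
    COMPOSITIONAL = "S"
    TOPOGRAPHICAL = "T"
    FUNCTIONAL = "F"
    CONJUNCTION = "J"
    CONGRUENCE = "G"


@dataclass
class ColumnAnnotation:
    column: int                  # 0-based alignment column
    chosen: Criterion            # criterion the researcher relied on
    conflicts_with: list = field(default_factory=list)  # criteria that disagreed
    note: str = ""               # free-text reason


@dataclass
class AnnotatedAlignment:
    rows: dict                   # taxon name -> aligned sequence
    annotations: dict = field(default_factory=dict)  # column -> ColumnAnnotation

    def annotate(self, column, chosen, conflicts_with=(), note=""):
        self.annotations[column] = ColumnAnnotation(
            column, chosen, list(conflicts_with), note
        )

    def summary(self):
        """One code per column, with '.' where no reason has been recorded."""
        width = len(next(iter(self.rows.values())))
        return "".join(
            self.annotations[i].chosen.value if i in self.annotations else "."
            for i in range(width)
        )


# Invented two-taxon example.
aln = AnnotatedAlignment(rows={"taxon_a": "ACGU-A", "taxon_b": "ACGUUA"})
aln.annotate(2, Criterion.TOPOGRAPHICAL,
             conflicts_with=[Criterion.COMPOSITIONAL], note="stem pair")
aln.annotate(4, Criterion.COMPOSITIONAL)
print(aln.summary())
```

The `summary` string plays the role of the "coded schema" mentioned above, while the stored conflicts preserve the alternative hypotheses rather than discarding them.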

This seems to me to be an achievable goal in the short-term; and the PhyDE editor already does some of it. Such an editor would also serve as a necessary step on the way to working out how to automate as much of the process as possible. The ultimate goal for some people may be total automation (ie. a black box), but I see no way to achieve that in the immediate term. Besides, I suspect that phylogeneticists will always want some judgemental control over the process, which would be best achieved with a semi-automated interactive editor. That is, we might ask the program to work out what the alternative alignments are for any specified subsequence (in an automated manner), and then we evaluate their relative merits for ourselves.

Note that I am treating the alignment as a set of hypotheses independent of their phylogenetic analysis. Subsequences can still be tentatively aligned even if the researcher intends masking those subsequences out of any subsequent tree-building analysis. Also, subsets of the taxa might be aligned confidently while other subsets are left unaligned. With current editors, this involves having a separate alignment file for each subset, which is very cumbersome, as well as error-prone.

March 15, 2015


Here is a new collection of tattoos based on Charles Darwin's best-known sketch from his Notebooks (the "I think" tree). For other examples, see Tattoo Monday III, Tattoo Monday VI, and Tattoo Monday IX.