The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis



April 14, 2015


This is a guest blog post by:
Johann-Mattis List, Centre des Recherches Linguistiques sur l'Asie Orientale, Paris, France

What we know, what we know we can know, and what we know we cannot know: Ontological facts and epistemological reality in historical linguistics and evolutionary biology
In a recent blog post (Multiple sequence alignment), David wrote about some theoretical issues regarding the concept of homology in evolutionary biology, and specifically its impact on the design of sequence alignment programs. In that post, he mentioned a recently published paper, where he discusses algorithms for sequence alignment and notes that "there is no known objective function for identifying homology" (Morrison 2015: 14).

This statement triggered my interest, since I was immediately reminded of problems that have been occupying historical linguists for a long time now. These problems arise from the fact that in historical disciplines, such as evolutionary biology or historical linguistics (but also in general history or some parts of geology), scholars are not trying to infer general laws of nature, but rather use knowledge of general laws to infer unique events.

The task of scholars working in these disciplines is similar to that of a crime investigator or a doctor: detectives use the evidence from a crime scene to infer the individual events that led to the crime (and arrest the culprit), and doctors use the symptoms of patients to identify their individual diseases (and then look for a way to cure them). Similarly, evolutionary biologists and historical linguists try to identify the evolutionary events that led to the observed diversity of life and languages, respectively.

What unites all these disciplines is the specific mode of reasoning that they employ. Charles Sanders Peirce (1839-1914) was among the first to investigate this reasoning mode in detail (Peirce 1931/1958: 7.202). He called it abduction, and contrasted it with induction and deduction, the traditional modes of logical reasoning. Induction is used to infer a currently unknown general rule from an initial state and its result state, while deduction infers the result state of an initial state and a general rule. On the other hand, abduction seeks to infer initial states from result states by employing a general rule.
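As a rough illustration only, the three reasoning modes can be sketched in a few lines of Python. Everything here is an assumption made for the example: the function names, the toy "sound law" (word-initial p becomes f), and the candidate word forms are hypothetical and are not taken from Peirce or from any real reconstruction.

```python
from typing import Callable, Iterable, List, Tuple

Rule = Callable[[str], str]


def deduce(rule: Rule, initial: str) -> str:
    """Deduction: initial state + general rule -> result state."""
    return rule(initial)


def induce(rules: Iterable[Rule], observations: List[Tuple[str, str]]) -> List[Rule]:
    """Induction: keep the candidate rules that fit every (initial, result) pair."""
    return [r for r in rules if all(r(i) == o for i, o in observations)]


def abduce(rule: Rule, candidates: Iterable[str], result: str) -> List[str]:
    """Abduction: result state + general rule -> possible initial states."""
    return [c for c in candidates if rule(c) == result]


# Hypothetical toy rule: a word-initial "p" becomes "f".
def p_to_f(word: str) -> str:
    return "f" + word[1:] if word.startswith("p") else word


# Hypothetical competing rule: nothing changes.
def no_change(word: str) -> str:
    return word


print(deduce(p_to_f, "pater"))
print([r.__name__ for r in induce([p_to_f, no_change], [("pater", "fater")])])
print(abduce(p_to_f, ["pater", "fater", "mater"], "fater"))
```

In this toy setting the abductive step returns two possible initial states ("pater" and "fater") for the single observed result. That is the difficulty described here: the rule alone does not single out one unique past event.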

What further complicates the task of evolutionary biologists and historical linguists is that we have only limited means to verify or falsify a given hypothesis, since, in contrast to detectives and doctors, our research objects usually do not confess, nor do they give positive feedback when we propose the right hypothesis. We never know whether we found the true murderer or whether we proposed the right cure.

Historical linguistics and the limits of knowledge

In historical linguistics, discussions regarding the limits of our knowledge have been centered around the question of the "nature of the proto-language". Using comparative techniques, in the second half of the 19th century linguists started to reconstruct ancestral words of languages that are not attested in any written source. Thus, linguists would first try to identify cognate (homologous) words in Indo-European languages, and then infer how these words were pronounced in the Indo-European language which was spoken some 8,000 years ago. This technique, which was originally introduced by August Schleicher (1821-1868) in 1861, became very popular, and has remained the standard way of knowledge representation in historical linguistics. Whenever linguists propose such a reconstructed form, based on various pieces of evidence, they use an asterisk symbol * to indicate that the word has been inferred, and that there is no written source that would confirm its existence.

As an example, consider some of the words for "sun" in Indo-European languages (discussed in detail in List 2014: 136):
According to modern historical linguistic theory, these words are all assumed to go back to the same ancestral word in Indo-European. The reconstructed pronunciation of the ancestral form is traditionally represented as *séh₂u̯el- "sun" and an approximate pronunciation of the nominative singular would be [soxwl] (with [x] indicating the same sound as the ch in German Rauch "smoke").

These techniques are generally thought to be quite reliable, and they provided concrete help in the decipherment of many ancient languages (including the Egyptian hieroglyphs, Linear B, and Hittite). The status of the reconstructions that scholars produced was, however, controversially debated. While some scholars claimed that there was a high probability that the proposed reconstructions would come close to the original pronunciation, others would classify them as pure fiction (Schmidt 1872).

[Figure: Linear B]
While it is obvious that reconstructions represent hypotheses and not indisputable truths, it is less clear how they relate to the actual historical facts. First of all, we know for sure that our hypotheses are not stable over time. As our knowledge of the evidence increases, as we include more languages in our comparison, or get deeper insights into the major processes underlying language history, our hypotheses will also constantly be changed and refined. This is nicely reflected in August Schleicher's Fable (a short parable called "The Sheep and the Horses"), a text that he wrote in his reconstructed version of Proto-Indo-European, in order to illustrate what was by then known about the origin of the Indo-European language. When looking at the many later versions, written by scholars in order to illustrate how our knowledge of Indo-European had changed since then, the differences in the pronunciations are really striking (see this summary in Wikipedia), but so are the similarities.

Judging from the degree to which these reconstruction hypotheses evolved over about 150 years, we can reach an important, apparently paradoxical, conclusion: While our reconstructions in historical linguistics are far from being realistic (in the sense of representing actual pronunciations of an Indo-European people), they are by no means fictions, as Johannes Schmidt claimed long ago. The reconstructions are not (and never will be) realistic, since they will always be preliminary, depending on our currently available data and the theoretical development in our field. On the other hand, the reconstructions are also not necessarily unrealistic, since they reflect scientific hypotheses that have been constantly refined and independently developed using the best knowledge we have at that moment. So, although we know that our hypotheses do not truly reflect what really happened, we have good reasons to assume that they come much closer to the real story than any random hypothesis.

As reflected in David's aforementioned statement regarding the lack of an objective function for homology identification in evolutionary biology, the problem of assessing the realism of our hypotheses is not unique to historical linguistics. Just as we discuss the realism of our reconstructed forms in historical linguistics, one may discuss the realism behind any multiple sequence alignment in evolutionary biology. The objects of investigation in historical linguistics and evolutionary biology are not directly accessible to the researchers, but can only be inferred by tests and theories.

Interestingly, this problem also occurs in the social sciences. In psychology, for example, such attributes of people as "intelligence" cannot be directly observed, but have to be inferred by measuring what they provoke or how they are "reflected in test performance" (Cronbach and Meehl 1955: 178). What is inferred by psychological tests is usually called a construct, and is strictly separated from the underlying quality that scholars originally wanted to measure. The construct is thereby understood as the "fiction or story put forward by a theorist to make sense of a phenomenon" (Statt 1981 [1998]: 67). As in the case of reconstruction in linguistics or homology assessment in biology, it is not the "real" object or process.


What can we conclude from this? Or, to put it differently, why should we care about constructs or the degree of fiction behind our claims in historical linguistics and evolutionary biology? I see two important reasons to do so.

First, we can avoid confusion in our fields by strictly separating ontological facts and epistemological reality. In evolutionary biology, this would help to avoid the confusion that often arises when scholars talk about homologous genes, when in practice what they mean is that they applied some similarity threshold and some clustering procedure to group genes into sets of presumed homologs. In historical linguistics, on the other hand, it would help us to get rid of the tiresome debate between formalists (who emphasize that reconstructed forms are simple formulas) and realists (who take reconstructed forms as realistic representations) in reconstruction.

Second, from a broader viewpoint, as scientists, we should always try to be explicit in our claims, and we should also always try to be honest about what we know, what we know we can know, and what we know we cannot know.


Cronbach LJ, Meehl PE (1955) Construct validity in psychological tests. Psychological Bulletin 52: 281-302.

List J-M (2014) Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.

Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.

Peirce CS (1931/1958) Collected papers of Charles Sanders Peirce. Ed. by C Hartshorne and P Weiss. Cont. by AW Burks. 8 vols. Cambridge MA: Harvard University Press.

Schleicher A (1861) Compendium der vergleichenden Grammatik der indogermanischen Sprachen. Vol. 1: Kurzer Abriss einer Lautlehre der indogermanischen Ursprache. Weimar: Böhlau.

Schmidt J (1872) Die Verwantschaftsverhältnisse der indogermanischen Sprachen. Weimar: Hermann Böhlau.

Statt DA, comp. (1981 [1998]) Concise Dictionary of Psychology, 3rd ed. London and New York: Routledge.

April 12, 2015


I have noted before (Evolution and timelines) that any history can be represented as a timeline, but a timeline diagram does not necessarily show an evolutionary history. Unfortunately, this does not stop people from putting the word "evolution" on their timeline diagrams.

One ambitious example is The Evolution of the Web. Two images are shown below, which illustrate some of the transformational history of web browsers and technology, depicted as complex timelines. This represents complex transformational evolution (see The evolutionary March of Progress in popular culture), rather than variational evolution.

The full majesty, and complexity, of the timeline can be seen at the interactive version linked above.

April 7, 2015


Phylogenetic networks are intended to display reticulate evolutionary histories, rather than strictly divergent or transformational histories. This idea applies both to species and higher taxa (where the ancestors might be inferred), and to individuals and populations (where some of the ancestors might be sampled). However, the literature is still replete with studies that use one or more phylogenetic trees for displaying reticulate phylogenies.

A recent example is shown by: Umer Chaudhry, Elizabeth M. Redman, Muhammad Abbas, Raman Muthusamy, Kamran Ashraf, John S. Gilleard (2015) Genetic evidence for hybridisation between Haemonchus contortus and Haemonchus placei in natural field populations and its implications for interspecies transmission of anthelmintic resistance. International Journal for Parasitology 45: 149-159.

These authors sampled nematode parasites from sheep, goats, cattle and buffaloes at abattoirs in Pakistan and southern India. These parasites were morphologically characterized as being predominantly either Haemonchus contortus or Haemonchus placei. The worms were then genotyped in several ways, including: SNPs of rDNA ITS-2, microsatellite markers, sequences of nuclear isotype-1 of β-tubulin, and sequences of mitochondrial NADH dehydrogenase subunit 4. The genotyping revealed several individual worms that were considered to be inter-species F1 hybrids.

The phylogenetic tree from the β-tubulin sequences is shown in the first figure. There were 25 haplotypes identified among the worms. Most of the worms were homozygous, with haplotypes that were identified as either H. contortus or H. placei. However, five worms were discovered to be heterozygous, with one haplotype considered to have come from each of the species.

The hybrid status of the worms is shown in the phylogenetic tree by having the hybrids appear twice, once for each of their haplotypes, with the other worms appearing only once. Thus, the actual reticulate history is not made visually obvious.

A better approach would be to use a phylogenetic network. This is straightforward in this case. From the perspective of the worms (rather than the haplotypes), the phylogenetic tree is a so-called MUL-tree, in which some of the taxon labels appear multiple times (and some appear only once). The labels that appear once represent homozygous worms, which can be seen as being "monoploid" for this locus. The labels that appear twice represent heterozygous worms, which can be seen as being "diploid".
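To make the MUL-tree idea concrete, here is a minimal Python sketch that reads a multi-labelled tree and separates the labels that occur once from those that occur more than once. It is not the algorithm used by the Padre program. The nested-tuple tree encoding and the worm names are hypothetical, chosen only for illustration.

```python
from collections import Counter


def leaf_labels(tree):
    """Yield every leaf label of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree:
            yield from leaf_labels(child)


def classify_labels(tree):
    """Return (labels appearing once, labels appearing more than once)."""
    counts = Counter(leaf_labels(tree))
    once = sorted(label for label, n in counts.items() if n == 1)
    multiple = sorted(label for label, n in counts.items() if n > 1)
    return once, multiple


# Hypothetical MUL-tree: worm_H1 carries one haplotype in each of the two clades.
mul_tree = ((("worm_A", "worm_H1"), "worm_B"),
            (("worm_C", "worm_H1"), "worm_D"))

once, multiple = classify_labels(mul_tree)
print("appear once (homozygous):", once)
print("appear more than once (candidate hybrids):", multiple)
```

Each label in the second list is a worm that would need two parent edges in a network, which is where a reticulation would be drawn.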

MUL-trees where the labels represent different ploidy levels can easily be turned into a network using the Padre program. The result is shown in the next figure, which is therefore a hybridization network.

The actual history of the worms is now clear. Interestingly, one of the hybridization events seems to be older than the other four.

As an aside, it is also worth pointing out a mis-interpretation of the phylogenetic tree produced from the mitochondrial ND4 sequences. This tree is shown in the next figure — I have added the annotations at the right.

The phylogeny shows 12 haplotypes considered to be H. contortus and 14 haplotypes considered to be H. placei. One of the hybrids clearly has a H. contortus haplotype, indicating that its maternal parent came from this species. However, the other four hybrids cannot be unequivocally identified as having H. placei mothers (as claimed by the authors), as their haplotypes are all sisters to the H. placei haplotypes — all of the H. placei haplotypes share a common ancestor that is not shared with the hybrids. Given the root of the tree, H. placei is a more likely identification than is H. contortus, but the tree does not provide unequivocal evidence.

April 5, 2015


The cost of renting or leasing office space differs dramatically around the world. This is obviously of great importance to businesses, as their profitability depends on the balance between income and costs. Their expenditure on office space can thus determine whether or not it is profitable for them to do business in certain cities.

The CBRE Group Inc. is an American commercial real estate company, and they provide an annual Global Prime Office Occupancy Costs report that addresses this business cost. It is a survey of office occupancy costs for prime office space in a large number of cities worldwide. Occupancy costs for business premises represent rent, plus local taxes and service charges. The report notes that: "The occupation cost figures have also been adjusted to reflect different measurement practices from market to market."

Each report lists the top 50 most expensive office locations in the world during the previous year, along with the average occupancy cost (in US$ / sq ft / annum). The locations examined may be the central business district of each city or several parts of some cities, depending on how much office space is available. The list of locations continues to expand every year, but only the top 50 are ever listed in each report.

The CBRE web site currently contains the data for the years 2008-2010 and 2012-2014. There are 71 locations that have appeared in these six top-50 lists, although only 30 of them have appeared in the top 50 in all six years (and seven have appeared only once).

Of course, a phylogenetic network could be used to visualize the data for each location across the six reports, as a tool for exploratory data analysis. To create the network, I first calculated the similarity of the 30 locations using the Gower similarity; and a Neighbor-net analysis was then used to display the between-location similarities as a phylogenetic network. So, locations that are closely connected in the network are similar to each other based on their office costs across the six years, and those that are further apart are progressively more different from each other.
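For readers unfamiliar with the Gower similarity, the following Python sketch shows the usual range-normalised form for quantitative variables. It is an illustration under stated assumptions, not the exact computation used for the network: the location names and cost figures are hypothetical, and the treatment of missing values (skipped) and of constant variables (counted as identical) is my own choice. The Neighbor-net step is not reproduced here; it would be run on the resulting matrix in a program such as SplitsTree.

```python
def gower_similarity(rows):
    """Pairwise Gower similarity for numeric rows of equal length.

    rows maps a name to a list of numbers; None marks a missing value.
    Returns {(name_a, name_b): similarity between 0 and 1}.
    """
    names = list(rows)
    n_vars = len(next(iter(rows.values())))

    # Range of each variable over all rows, used to scale the differences.
    ranges = []
    for k in range(n_vars):
        values = [rows[n][k] for n in names if rows[n][k] is not None]
        ranges.append(max(values) - min(values) if values else 0.0)

    sims = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            scores = []
            for k in range(n_vars):
                x, y = rows[a][k], rows[b][k]
                if x is None or y is None:
                    continue  # skip variables missing in either row
                if ranges[k] == 0:
                    scores.append(1.0)  # constant variable: treat as identical
                else:
                    scores.append(1.0 - abs(x - y) / ranges[k])
            sims[(a, b)] = sum(scores) / len(scores) if scores else float("nan")
    return sims


# Hypothetical occupancy costs (US$ / sq ft / annum) for three report years.
costs = {
    "loc_A": [200.0, 220.0, 210.0],
    "loc_B": [100.0, 120.0, 110.0],
    "loc_C": [150.0, 170.0, 160.0],
}
sims = gower_similarity(costs)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

A distance for the network analysis can then be taken as one minus the similarity.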

The network shows a gradient of decreasing office costs, from bottom-left to top-right. So, the consistently most expensive locations have been the West End of London and central Hong Kong, followed by Moscow and central Tokyo. London City and Kowloon, in Hong Kong, are not far behind, showing that you cannot avoid high costs for prime office space in these two cities.

Across the locations, the most expensive ones cost on average 3.4 times as much as the cheapest locations. Note that Midtown Manhattan is not nearly as expensive for office rental as it is for living accommodation. Switzerland has only two cities, and both of them are in the middle of the network; so it is not cheap, either.

In the network, Dubai and central Mumbai are isolated from the other locations because their office rents have decreased over the six reports. In the case of Mumbai, the most expensive offices recently have been in the Bandra Kurla complex, instead of Nariman Point.

March 31, 2015


It is tolerably well known that Alfred Russel Wallace developed the idea of evolution via natural selection quite independently of Charles Darwin, and that, indeed, it was Wallace's revelation of this fact that prompted Darwin to finally publish his ideas (Bannister et al. 2014).

Some people are even aware that Wallace developed the Tree of Life metaphor independently, as well (Wallace 1855), a fact of which Darwin himself was perfectly well aware (eg. Bradman and Bartlett 1998):
"the analogy of a branching tree [is] the best mode of representing the natural arrangement of species ... a complicated branching of the lines of affinity, as intricate as the twigs of a gnarled oak ... we have only fragments of this vast system, the stem and main branches being represented by extinct species of which we have no knowledge, while a vast mass of limbs and boughs and minute twigs and scattered leaves is what we have to place in order, and determine the true position each originally occupied with regard to the others."

What is less well known is Wallace's contribution to phylogenetic imagery.

The Darwinian version of a phylogenetic tree is, of course, something usually considered to post-date 1859, when Darwin published his best-known book. However, producing such a tree was apparently a rather slow process. For example, in 1863, Franz Hilgendorf wrote a PhD thesis for which he produced a hand-drawn phylogeny, but he did not actually include this in the thesis; and he significantly modified it for its publication in 1866. In 1864 Fritz Müller published a couple of three-taxon trees. Also in 1864, Ernst Haeckel claimed to have started work on his series of phylogenetic trees, but the resulting book was not published until 1866. This means that the first substantial tree to appear in print was that of Mivart (1865).

However, long before this, Wallace was already moving ahead. In 1856 Wallace took the tree imagery from his 1855 publication and applied it to the relationships among bird groups. This publication was his first clearly evolutionary empirical contribution. He adapted the unrooted diagram of Strickland (1841), which represented "the natural system" of bird relationships, and gave it a clearly evolutionary interpretation. So, while Strickland's work was strictly atemporal and non-evolutionary, Wallace produced an evolutionary view of the world, with his two trees representing the end-product of change through time.

Wallace was in South-East Asia at the time of this work, collecting specimens among the islands of what is now Indonesia. He returned to England in 1862, thus having been absent during Darwin's rise to fame. However, he did return before anyone else had tackled Darwin's ideas empirically, and he was in an ideal position to do so himself (Beckenbauer et al. 2010). It would therefore be surprising if he had not done so.

Recently, it has become clear, as a result of the work done for the Wallace Correspondence Project, that Wallace did, indeed, produce a post-Darwinian phylogenetic diagram before any of his contemporaries, although it remained unpublished (Becker and Borg 2014). Not unexpectedly, it also refers to the relationships among birds. What is most interesting for us, however, is that it was a phylogenetic network, not a tree.

You will note that it is an unrooted network, in the same manner as his unrooted bird trees from 1856. In this, his presentation differed from that of Müller, Hilgendorf, Mivart and Haeckel, who all indicated a common ancestor. On the other hand, the branch lengths represent the "relative amount of affinity" between the named taxa, unlike the diagrams of his contemporaries. This means that the diagram can, indeed, be interpreted (in modern terms) as an unrooted phylogenetic network.

In his bird paper, Wallace (1856) had noted that producing the tree diagrams is not easy, as "you will most likely find that you have set down some conflicting affinities, or that you have mistaken some mere analogies for affinities". This seems to be the origin of his interest in the alternative model of a network, rather than a tree (Brabham and Berger 2014), thus making him the first person to use a data-display network to represent conflicting character data.

This post was inspired by the work of Torvill and Dean (1996). Happy April 1.


Bannister RG, Ballesteros-Sota S, Bjørndalen OE (2014) Running, swinging and skiing — the private life of Alfred Russel Wallace. Studia Wallaceana 6: 82-96.

Becker BF, Borg BR (2014) The phylogenetics of A.R. Wallace, and its relation to the science of tennis. Journal of Phylogenetic Inference 13: 101-110.

Beckenbauer FA, Best G, Bruyneel J (2010) Association football as a metaphor for phylogenetics. Is it a sport or a science? Phyloinformatics 7:1.

Brabham JA, Berger G (2014) The speed required to achieve the publication rate of A.R. Wallace. Philosophy and History of Biology 102: 89-92.

Bradman DG, Bartlett KC (1998) Wallace Down Under: the work of Alfred Russel Wallace in the southern hemisphere. Systematic Zoology 47: 767-780.

Haeckel E (1866) Generelle Morphologie der Organismen. Verlag von Georg Reimer, Berlin.

Hilgendorf F (1866) Planorbis multiformis im Steinheimer Süßwasserkalk: ein Beispiel von Gestaltveränderung im Laufe der Zeit. Buchhandlung von W. Weber, Berlin.

Mivart StG (1865) Contributions towards a more complete knowledge of the axial skeleton in the primates. Proceedings of the Zoological Society of London 33: 545-592.

Müller F (1864) Für Darwin. Verlag von Wilhelm Engelmann, Leipzig.

Strickland HE (1841) On the true method of discovering the natural system in zoology and botany. Annals and Magazine of Natural History 6: 184-194.

Torvill J, Dean CC (1996) Skating on thin ice. Systematic Biology 45: 641-650.

Wallace AR (1855) On the law which has regulated the introduction of new species. Annals and Magazine of Natural History 16 (2nd series): 184-196.

Wallace AR (1856) Attempts at a natural arrangement of birds. Annals and Magazine of Natural History 18 (2nd series): 193-216.

March 29, 2015


NeighborNet produces splits graphs based on distances between the taxa, rather than using the original character data. This approach can produce what we might call inconsequential splits in the graph — that is, splits that are not explicitly supported by the character data. Here, I present a simple example to illustrate the extent to which this can occur.

The data are taken from: Nanette Thomas, Jeremy J. Bruhl, Andrew Ford, Peter H. Weston (2014) Molecular dating of Winteraceae reveals a complex biogeographical history involving both ancient Gondwanan vicariance and long-distance dispersal. Journal of Biogeography 41: 894-904.

This dataset consists of a set of eight morphological features of the pollen from 31 extant plant taxa plus two fossil samples, as shown in this data matrix:

T_lanceolata        00111011
T_stipitata         00111011
T_purpurescens      00111011
T_xerophila_x       00111011
T_xerophila_r       00111011
T_vickeriana        00111011
T_glaucifolia       00111011
T_membranea         00111011
T_insipida          00111011
----------------------------
T_perrieri          00111010
D_winteri           00111010
D_grenadensis       00111010
----------------------------
B_comptonii         00011010
B_howeana           00011010
B_semicarpoides     00011010
B_whiteana          00011010
B_queenslandiana_q  00011010
B_queenslandiana_1  00011010
----------------------------
P_axillaris         00011011
P_colorata          00011011
Pseudowinterapollis 00011011
----------------------------
B_pancheri          01001011
----------------------------
Harrisipollenites   01001100
----------------------------
Z_acsmithii         01001101
E_stipitatum        01001101
Z_bicolor           01001101
----------------------------
Z_balansae          11001101
----------------------------
C_dinisii           1-111101
C_madagascariensis  1-111101
W_salutaris         1-111101
P_macranthum        1-111101
C_ekmanii           1-111101
C_winterana         1-111101

Note that there are only nine groups of taxa (separated by the dashed lines) — within each group the data are identical. Each character has two states: present / absent.

The resulting NeighborNet, as produced by default using the SplitsTree4 program, is shown in the first graph.

As expected, the taxa form nine groups. There are a number of apparently well-supported splits (ie. with long edges) separating these groups. There are also a number of smaller splits, and a whole series of very tiny splits. None of these latter two groupings are explicitly present in the dataset — the only splits supported by the characters are plotted onto the graph using the character numbers. (Note that character 5 is uninformative.)
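The splits that the characters themselves support can be checked directly from the data matrix. The Python sketch below is only a helper for that check, and it is not part of SplitsTree4 or the NeighborNet algorithm. The function names are hypothetical, and treating "-" as missing (assigned to neither side of a split) is an assumption.

```python
from collections import defaultdict

# The pollen data matrix from the post (taxon name, eight character states).
MATRIX = """
T_lanceolata        00111011
T_stipitata         00111011
T_purpurescens      00111011
T_xerophila_x       00111011
T_xerophila_r       00111011
T_vickeriana        00111011
T_glaucifolia       00111011
T_membranea         00111011
T_insipida          00111011
T_perrieri          00111010
D_winteri           00111010
D_grenadensis       00111010
B_comptonii         00011010
B_howeana           00011010
B_semicarpoides     00011010
B_whiteana          00011010
B_queenslandiana_q  00011010
B_queenslandiana_1  00011010
P_axillaris         00011011
P_colorata          00011011
Pseudowinterapollis 00011011
B_pancheri          01001011
Harrisipollenites   01001100
Z_acsmithii         01001101
E_stipitatum        01001101
Z_bicolor           01001101
Z_balansae          11001101
C_dinisii           1-111101
C_madagascariensis  1-111101
W_salutaris         1-111101
P_macranthum        1-111101
C_ekmanii           1-111101
C_winterana         1-111101
"""


def parse(matrix_text):
    """Return {taxon: state string} from whitespace-separated rows."""
    return dict(line.split() for line in matrix_text.strip().splitlines())


def identical_groups(data):
    """Group taxa whose character strings are identical."""
    groups = defaultdict(list)
    for taxon, states in data.items():
        groups[states].append(taxon)
    return groups


def character_splits(data):
    """Map 1-based character number to (taxa with state 0, taxa with state 1).

    Constant characters define no split and are left out; '-' is ignored.
    """
    n_chars = len(next(iter(data.values())))
    splits = {}
    for k in range(n_chars):
        zeros = frozenset(t for t, s in data.items() if s[k] == "0")
        ones = frozenset(t for t, s in data.items() if s[k] == "1")
        if zeros and ones:
            splits[k + 1] = (zeros, ones)
    return splits


data = parse(MATRIX)
groups = identical_groups(data)
splits = character_splits(data)
print(len(data), "taxa in", len(groups), "groups of identical rows")
print("characters that define a split:", sorted(splits))
```

Any split in the NeighborNet graph that does not correspond to one of these character-defined bipartitions comes from the distance calculation, not from a character.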

The series of very tiny splits are present throughout the graph as extremely short edges. For example, a detailed view of the bottom left-hand corner of the graph is shown in the next figure.

Note that these six taxa have identical character data, and therefore their separation into four groups is entirely an artifact of the NeighborNet algorithm.

So, one needs to be careful when interpreting small splits in such a graph: they may have biological support and they may not.

March 24, 2015


In the literature, phylogenetic trees often appear even when the paper is discussing non-tree evolutionary histories.

A case in point is the paper by: Susanne Gallus, Axel Janke, Vikas Kumar, Maria A. Nilsson (2015) Disentangling the relationship of the Australian marsupial orders using retrotransposon and evolutionary network analyses. Genome Biology and Evolution, in press.

The authors discuss the relationship between the four Australian marsupial orders, and use data from transposable element (retrotransposon) insertions for resolving the inter- and intra-ordinal relationships of the Australian and South American orders. They plot the retrotransposon presence/absence onto a tree derived from alignments of 28 nuclear gene fragments. This is shown in the first figure, with the retrotransposons indicated as dots on the internal branches.

For comparison, the next figure is a Median-Joining network based on the presence/absence of the retrotransposons.

With the exception of the Monito del monte, Shrew opossum and Western quoll, the network matches the basic tree structure. However, it emphasizes more strongly the fact that the retrotransposons do not resolve the relationships among the Marsupial orders. As the authors note:
The retrotransposon insertions support three conflicting topologies regarding Peramelemorphia, Dasyuromorphia and Notoryctemorphia, indicating that the split between the three orders may be best understood as a network ... The rapid divergences left conflicting phylogenetic information in the genome possibly generated by incomplete lineage sorting or introgressive hybridisation, leaving the relationship among Australian marsupial orders unresolvable as a bifurcating process million years later.

March 22, 2015


Phylogenetic networks can be used to illustrate the history of any set of objects or concepts, provided that this history is a divergent one (ie. the history is not simply the transformation of objects through time).

Since I have recently been writing about sequence alignments, it is worthwhile to show an example of applying a network to sequence alignment programs. This comes from the paper by Chaisson MJ, Tesler G (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13: 238.

The authors discuss programs that map reads from a sample genome onto a reference sequence. They note: "the relationship between many existing alignment methods is qualitatively illustrated in the figure."

Their legend reads:
The applications / corresponding computational restrictions shown are: (green) short pairwise alignment / detailed edit model; (yellow) database search / divergent homology detection; (red) whole genome alignment / alignment of long sequences with structural rearrangements; and (blue) short read mapping / rapid alignment of massive numbers of short sequences. Although solely illustrative, methods with more similar data structures or algorithmic approaches are on closer branches. The BLASR method combines data structures from short read alignment with optimization methods from whole genome alignment.

The reticulation refers to their new program, which "maps reads using coarse alignment methods developed during WGA [whole genome alignment] studies, while speeding up these methods by using the advanced data structures employed in many NGS [next generation sequencing] mapping studies."

March 17, 2015


Multiple sequence alignment programs have not yet met their primary aim for evolutionary biologists: maximizing homology of characters. If our goal is to develop an automated procedure for homology assessment, then we need someone to produce a program that explicitly implements this aim.

Alignment is just as much a part of phylogenetics as is tree or network building. It is the procedure that expresses the homology relationships among the characters, rather than the historical relationships among the taxa. Therefore, we need a computer program that accurately expresses homology relationships, as well as one that accurately expresses the historical relationships. We have some programs for the latter but currently nothing for the former.

Unfortunately, homology is a rather nebulous concept. It has to do with inheriting characters from a shared ancestor, which is not something that we can directly observe. Therefore we have to infer it. Somehow.

Homology criteria

Systematists have developed criteria for making decisions about potential homologies in an objective and (hopefully) repeatable manner, and these are directly applicable to nucleotide sequences, which these days are the most common form of data used in phylogenetics. These criteria are:

• Similarity
  1. Compositional = apparent likeness or resemblance between sequences (% similarity)
  2. Topographical = apparent likeness or resemblance between sequences (second- and third-order structure of proteins or RNA)
  3. Functional = functional relationship to other characters in the same sequence (annotated function of the sequence in protein or RNA)
• Conjunction = possible within-genome copies of the same sequence (i.e. paralogy)

• Congruence = agreement with other postulated homologies elsewhere in the same sequences (synapomorphy).

Traditionally, characters have been first proposed as homologous using the criteria of similarity and conjunction (together called primary homology), and then tested with the criterion of congruence (secondary homology).

It is important to note that these criteria do not necessarily always agree with each other in their inferences of homology. Changes that occur during evolutionary history can weaken the connection between these criteria so that, for example, nucleotide homology inferred from structural similarity is no longer the same as nucleotide homology inferred from compositional similarity. It is for this reason that compositional similarity of the sequences is insufficient to establish gene orthology, for example. The same limitation applies to nucleotides.
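Compositional similarity is the simplest of these criteria to compute, which may be part of its appeal. As a minimal sketch (the function name, the "-" gap symbol, and the choice to ignore gapped columns are assumptions made for illustration, not the convention of any particular program), percent similarity between two aligned sequences could be calculated like this:

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: columns containing a gap are not scored
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

A high value says only that the residues look alike in these columns; it says nothing about whether they were inherited from a shared ancestor.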

Current computer programs

It is clear that these criteria have been incorporated singly into current computerized procedures for producing multiple sequence alignments, but rarely in combination. For example, compositional similarity is the criterion used by the most popular computer programs, such as CLUSTAL, MAFFT and Muscle. Topographical similarity is being invoked whenever structure-based alignments are produced, such as for RNA-coding sequences (eg. PicXAA-R; PMFastR), or when nucleotide sequences are translated to amino acids before alignment (eg. PROMALS). Functional similarity is used for specialist studies of conserved motifs and binding sites, for instance. Ontogenetic similarity of nucleotide sequences is based on inferring the possible molecular processes that cause the observed sequence variation; the program Prank uses this criterion by distinguishing between insertions and deletions.

Congruence as a criterion involves the observation of repeated patterns of synapomorphy in a phylogeny. Among alignment algorithms, both Direct Optimization (e.g. POY; MSAM; BeeTLe) and Statistical Alignment (e.g. BAli-Phy; StatAlign) try to simultaneously produce a multiple alignment and a phylogenetic tree, thus optimizing the criterion of congruence.

The fact that basically none of the current crop of programs applies more than one criterion is, I contend, the principal reason why so many phylogeneticists adjust their alignments manually. Personal judgment may not be perfect, but at least it can be consciously based on homology as a general character concept. Since the different criteria may conflict with each other, at the moment only human judgment is available to compare them and thus make a final decision.

Required program

To make the homology criteria fully operational, we need to compare their inferences by evaluating the comparative evidence. That is, since the different criteria may conflict with each other, we need an automated way to compare them and evaluate their relative probabilities for any alignment column. What we need is a computerized procedure that will include all of the known criteria for homology assessment. Sadly, there are currently no mathematical models for doing this.

I suspect that there are two reasons for the failure of such a program to appear by now. First, biologists have not been clear about homology as a concept, and have not been able to express it in a form that computationalists could use to develop an algorithm. That is, we have criteria but they are not really operational criteria in a computational sense. Second, it will not be easy, because there is no obvious algorithm for inferring inheritance of characters. That is, we cannot easily separate homology from analogy.

Interactive editor

Another proposal is to have an interactive alignment editor. This editor would have the ability to show the conflicting hypotheses of homology (eg. where the homology suggested by structural pairing in a stem conflicts with homology suggested by tandem repeats), and then to annotate each column in the final alignment with the reason for the researcher having chosen to align those particular nucleotides. For example, one could press a button and see the RNA stem pairs in different colors (irrespective of whether the stem nucleotides are aligned), or press again and see the tandem repeats and inversions in different colours (once again, irrespective of how the nucleotides are aligned). One could also choose to see the annotations for the columns (summarized, using some coded schema), or simply look at the unadorned alignment itself.

This seems to me to be an achievable goal in the short-term; and the PhyDE editor already does some of it. Such an editor would also serve as a necessary step on the way to working out how to automate as much of the process as possible. The ultimate goal for some people may be total automation (ie. a black box), but I see no way to achieve that in the immediate term. Besides, I suspect that phylogeneticists will always want some judgemental control over the process, which would be best achieved with a semi-automated interactive editor. That is, we might ask the program to work out what the alternative alignments are for any specified subsequence (in an automated manner), and then we evaluate their relative merits for ourselves.

Note that I am treating the alignment as a set of hypotheses independent of their phylogenetic analysis. Subsequences can still be tentatively aligned even if the researcher intends masking those subsequences out of any subsequent tree-building analysis. Also, subsets of the taxa might be aligned confidently while other subsets are left unaligned. With current editors, this involves having a separate alignment file for each subset, which is very cumbersome, as well as error-prone.

March 15, 2015


Here is a new collection of tattoos based on Charles Darwin's best-known sketch from his Notebooks (the "I think" tree). For other examples, see Tattoo Monday III, Tattoo Monday VI, and Tattoo Monday IX.

March 11, 2015


Multiple sequence alignment software has not yet met its primary aim for evolutionary biologists: maximizing homology of characters. The many alignment methods have diverse optimization functions, along with assorted heuristics to search for the optimum alignment; and these methods produce detectably different multiple sequence alignments in almost all realistic cases (see The need for a new sequence alignment program). This leaves phylogeneticists wondering what to do. In response, the majority of phylogeneticists use manual alignment or re-alignment at some stage in their procedures.

If our goal is to develop an automated procedure for homology assessment (see Multiple sequence alignment), then we need some means of evaluating the relative success of different alignment methods.

There are four suggestions for benchmarking strategies for sequence alignment (Iantorno S, Gori K, Goldman N, Gil M, Dessimoz C 2014. Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment. Methods in Molecular Biology 1079: 59-73):
  1. Benchmarks based on simulated evolution of biological sequences, to create examples with known homology.
  2. Benchmarks based on consistency among several alignment techniques.
  3. Benchmarks based on the three-dimensional structure of the products encoded by sequence data.
  4. Benchmarks based on knowledge of, or assumption about, the phylogeny of the aligned biological sequences.
These authors list a number of pros and cons for each strategy. For our purposes we need to consider the cons, which I discuss below (not all of these are covered by the authors).


1. Simulation-based approaches adopt a probabilistic model of sequence evolution to describe nucleotide substitution, deletion, and insertion rates, while keeping track of “true” relationships of homology between individual residue positions (see Do biologists over-interpret computer simulations?).
(a) The simulation and analysis methods are not independent. All observations drawn from simulated data depend on the assumptions and simplifications of the model used to generate the data. This means that the results are biased towards those analysis methods that most closely match the assumptions of the simulation model.
(b) Simulations cannot straightforwardly, if at all, account for all evolutionary forces. This means that the simulations are not realistic, and their relevance for the behaviour of real datasets is unknown. The biggest failing in this regard is that, at some stage in the simulation, insertions and deletions are assumed to occur at random along the sequence (IID), and nothing could be further from the truth. Sequence variation occurs as a result of tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, and insertions; and there are strong spatial constraints on variation such as codons and stem-loops. Current simulation methods fall well short of modeling these patterns of sequence variation.

2. The key idea behind consistency-based benchmarks is that different good aligners should tend to agree on a common alignment (namely, the correct one) whereas poor aligners might make different kinds of mistakes, thus resulting in inconsistent alignments.
(a) Two wrongs don't make a right. That is, consistent methods may be collectively biased. Moreover, consistency is not independent of the set of methods used (some may be consistent with each other and not with others).
(b) Consistency scores are a feature of several methods, which means that the benchmark is not independent.
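To make the consistency idea concrete, here is a small sketch that compares two alignments of the same sequences by the residue pairs they place in the same column. The function names and the pair-based measure are illustrative assumptions, not the scoring scheme of any particular benchmark.

```python
from itertools import combinations

def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that an alignment places in the same column.

    `alignment` is a list of equal-length gapped strings, one per sequence.
    Each pair is ((seq_i, pos_i), (seq_j, pos_j)), with positions counted
    in the ungapped sequences.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != gap:
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs

def pair_agreement(aln_x, aln_y):
    """Fraction of aligned pairs in aln_x that are also aligned in aln_y."""
    px, py = aligned_pairs(aln_x), aligned_pairs(aln_y)
    return len(px & py) / len(px) if px else 0.0
```

Note that a high agreement value measures only consistency between the two methods. If both share the same bias, the score will be high even though neither alignment reflects homology, which is exactly objection (a).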

3. Structural benchmarks most commonly employ the superposition of known protein/RNA structures as an independent means of alignment, to which alignments derived from sequence analysis can then be compared (see Edgar RC 2010. Quality measures for protein alignment benchmarks. Nucleic Acids Research 38: 2145-2153). The best known of these include: BAliBASE, OXBench, PREFAB, SABmark, IRMBase, and BRAliBase.
(a) Datasets are limited to structurally conserved regions, and may not be relevant for other alignment objectives.
(b) Deriving the structure-based alignments is problematic. For example, there is inconsistency amongst different structural superpositions.

4. Given a reference tree, the more accurate is the tree resulting from a given alignment, then the more accurate the underlying alignment is assumed to be (see Dessimoz C, Gil M 2010. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biology 11: R37).
(a) False inversion of a proposition: Accurate alignments yield accurate trees, therefore accurate trees must be based on accurate alignments.
(b) Alignment is often involved in constructing the reference tree. If not, the tree may be trivial in terms of taxon relationships.
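A tree-based benchmark needs some measure of how far an estimated tree is from the reference tree. The sketch below uses a clade-based (rooted Robinson-Foulds style) distance, which is my choice for illustration; the nested-tuple tree representation and the function names are likewise assumptions.

```python
def clades(tree):
    """Collect the leaf sets of all internal nodes of a rooted tree.

    A tree is a nested tuple whose leaves are taxon names (strings).
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found

def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two rooted trees."""
    return len(clades(tree_a) ^ clades(tree_b))
```

A distance of zero from the reference tree would then be read as evidence for the alignment, which is where the logical inversion in (a) enters: a poor alignment can still happen to yield the reference tree.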


This evaluation leaves us in the invidious position of not yet having any benchmarking method that is relevant to homology assessment for multiple sequence alignments. This conclusion is at variance with other previous assessments (eg. Aniba MR, Poch O, Thompson JD 2010. Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Research 38: 7353-7363).

We need to consider what such a method might look like, and how we might go about constructing it. If biologists can't give the bioinformaticians a concrete goal for homology alignment then they can expect nothing in return.

It seems clear that we need to follow the idea behind option 3, but base the alignments on homology rather than structure. I once made a start with compiling some suitable datasets (see Morrison DA 2009. A framework for phylogenetic sequence alignment. Plant Systematics and Evolution 282: 127-149); but this was a very minor effort.

As I see it, we need alignments that are explicitly annotated with the reasons for considering the columns to be homologous. One suggestion would be to have relatively short alignments with annotations for "known" features, such as tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, insertions, or stem-loops. These all create sequence variation, and they provide evidence of the homology relations among the sequences. Presumably the alignments would vary in length and number of sequences, and in the complexity of the patterns.
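One way to picture such an annotated alignment is as ordinary aligned rows plus a list of column ranges, each tagged with the feature and the homology criterion that justifies it. The following is only a sketch; the class names, fields, and methods are hypothetical and do not describe an existing file format or editor.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnAnnotation:
    start: int      # first alignment column covered (0-based)
    end: int        # last alignment column covered (inclusive)
    feature: str    # e.g. "tandem repeat", "stem-loop", "inversion"
    criterion: str  # e.g. "similarity", "structure", "ontogeny", "congruence"
    note: str = ""

@dataclass
class AnnotatedAlignment:
    names: list
    rows: list
    annotations: list = field(default_factory=list)

    def reasons_for(self, column):
        """All annotations that cover a given alignment column."""
        return [a for a in self.annotations if a.start <= column <= a.end]

    def multi_criterion_columns(self):
        """Columns annotated under more than one homology criterion."""
        width = len(self.rows[0]) if self.rows else 0
        return [
            col for col in range(width)
            if len({a.criterion for a in self.reasons_for(col)}) > 1
        ]
```

Columns returned by `multi_criterion_columns` are the places where a curator would need to check whether the criteria agree or conflict.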

Perhaps the biggest practical problem will be how to deal with alignments where the homology criteria conflict with each other. That is, there are different types of criteria used to recognize homology — ie. similarity, structure, ontogeny, congruence (see Morrison DA 2015. Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26) — and they do not necessarily agree with each other.

This would allow us to come up with a set of requirements to specify various categories of the database, based on each of the above features. We would then try to accumulate as many example datasets for each category as we can. The database will presumably have protein-coding sequences in one section and RNA-coding, introns, etc in another. This dichotomy is simplistic, but I feel that it needs to be that way in order to be of practical use. Within each of those two sections we would have subsets of varying degrees of difficulty (eg. different degrees of average sequence similarity, or distinct taxon subsets in the same alignment, or orphan sequences).

This organisational approach is similar to that originally adopted for BAliBase, but it was dropped by most of the databases developed subsequently. I believe that it is the best approach for our purposes.

There are also experimentally created datasets where the alignment is known because all of the ancestors were sequenced as well. These would be useful; but their limitation is that the sequence variation was generated more or less at random, and so it does not match normal evolutionary processes. These alignments are more likely to match the IID assumption of the current automated alignment methods.

There is one further issue with this approach. Bioinformaticians often state that a few carefully prepared datasets are of little practical use to them (as opposed to being of use to phylogeneticists). What they need is a large number of datasets, the more the better. This is because they are interested in the percent success of their algorithms, and this cannot be assessed with small sample sizes. So, each alignment probably does not need to have too many taxa or too much sequence length; it is the number of alignments that is important, not their individual sizes. This could be achieved by sub-dividing larger datasets.

March 8, 2015


In a few recent blog posts I have discussed the early history of pedigrees, noting that they were usually presented as descent trees (with an ancestor at the top and the descendants below), although some later ones reversed this arrangement. This does not match our description of them as "family trees", of course, because the root of the pedigree is at the top.

I present here another early example, if for no other reason than that I have spent the past hour trying to decipher it. It is a Genealogy of the Saxon Dynasty, particularly the Ottonians. The picture is from the Chronica Sancti Pantaleonis, produced by the Benedictine monastery of Saint Pantaleon in Cologne in 1237 CE, which was itself based on the Chronica Regia Coloniensis [Royal Chronicle of Cologne], first compiled about 1177 CE in Michaelsberg Abbey, Siegburg.

Heinricus rex and Methildis regina are the founding couple in the double circle. Henry the Fowler did not himself become Holy Roman Emperor, but he created a situation where his descendants could do so, and did. They are numbered in the next diagram in the order in which they ruled. Number 9 is missing, this being Lothair II, who was not part of the family.

There are several things to note:
  • The interesting use of illustrative medallions, which seems to have been not uncommon at the time.
  • The consequent difficulty the illustrator has had in fitting the pedigree into the page, even though most of the descendants have been left out.
  • The pedigree is explicitly designed to establish noble ancestry, but females are included even when they are not in the direct line of descent.
  • The rulers nominally change families, from the Ottonian to the Salian to the Hohenstaufen dynasties, as a result of females in the direct line of descent.
  • Number 4 is Henry II, who made an appearance in an earlier post as the husband of Cunigunde of Luxembourg (The first royal pedigree).
  • Number 11 is Frederick I Barbarossa, who also made an appearance in an earlier post (Does it matter which way up a tree is drawn?).
  • The latter two points make it clear that the earliest written pedigrees were all closely related genealogically, and involved the attempts by certain parts of the German nobility to take control of the Holy Roman Empire, consisting at that time of what is now mostly Germany and Italy. Family descent was an important part of establishing who got to rule next.

March 3, 2015


I started actively working on phylogenetic networks more than 10 years ago, when I gave a talk at the Phylogenetic Combinatorics and Applications meeting in Uppsala in July 2004.

However, before I started working on networks I had for several years been working on multiple sequence alignment methodology, and I still do. This work is also of direct relevance to network construction, of course, since faulty alignments will generate conflicting signals that can confound the biological signals that alone should appear in the network.

This year marks the 20th anniversary of my first publication in the alignment field (see the list appended below). To celebrate this I have some review / commentary articles planned. The first of these has now appeared online, and I would like to draw it to your attention:
  • Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.
This paper relates current sequence alignment procedures to homology assessments as they are practiced for other data. Most algorithms can be seen as implementing only one of the several criteria that are used to identify homologies, which is inadequate. Suggestions are made for improving this situation.

There will also be a couple of upcoming blog posts canvassing a few issues that I see as important for the future development of alignment methods.

Previous Publications


Ellis J, Morrison DA (1995) Effects of sequence alignment on the phylogeny of Sarcocystis deduced from 18S rDNA sequences. Parasitology Research 81: 696-699.

Morrison DA, Ellis JT (1997) Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of Apicomplexa. Molecular Biology and Evolution 14: 428-441. [This has been the most cited of these publications, surprising me by still getting cited about once per month]

Morrison DA (2006) Multiple sequence alignment for phylogenetic purposes. Australian Systematic Botany 19: 479-539.

Morrison DA (2009) A framework for phylogenetic sequence alignment. Plant Systematics and Evolution 282: 127-149. [This was actually accepted for publication in 2007]

Morrison DA (2009) Why would phylogeneticists ignore computerized sequence alignment? Systematic Biology 58: 150-158.

Morrison DA (2010) [Book review of] ‘Sequence Alignment: Methods, Models, Concepts, and Strategies’. Systematic Biology 59: 363-365.

Empirical examples

Mugridge NB, Morrison DA, Johnson AM, Luton K, Dubey JP, Votypka J, Tenter AM (1999) Phylogenetic relationships of the genus Frenkelia: a review of its history and new knowledge gained from comparison of large subunit ribosomal RNA gene sequences. International Journal for Parasitology 29: 957-972.

Mugridge NB, Morrison DA, Heckeroth AR, Johnson AM, Tenter AM (1999) Phylogenetic analysis based on full-length large subunit ribosomal RNA gene sequence comparison reveals that Neospora caninum is more closely related to Hammondia heydorni than to Toxoplasma gondii. International Journal for Parasitology 29: 1545-1556.

Mugridge NB, Morrison DA, Jäkel T, Heckeroth AR, Tenter AM, Johnson AM (2000) Effects of sequence alignment and structural domains of ribosomal DNA on phylogeny reconstruction for the protozoan family Sarcocystidae. Molecular Biology and Evolution 17: 1842-1853.

Beebe NW, Cooper RD, Morrison DA, Ellis JT (2000) Subset partitioning of the ribosomal DNA small subunit and its effects on the phylogeny of the Anopheles punctulatus group. Insect Molecular Biology 9: 515-520.

Beebe NW, Cooper RD, Morrison DA, Ellis JT (2000) A phylogenetic study of the Anopheles punctulatus group of malaria vectors comparing rDNA sequence alignments derived from the mitochondrial and nuclear small ribosomal subunits. Molecular Phylogenetics and Evolution 17: 430-436.

March 1, 2015


I have occasionally mentioned in this blog the fact that phylogenetic trees have made it into the world of art. However, until now I have not really been able to say the same for phylogenetic networks. I am happy to report that I can now do so.

These three watercolours are from the collection of Sandra Black Culliton, a microbial geneticist.

 At the time of writing the originals are still for sale at Etsy.

Alternatively, you can apparently ask her to produce one to order.

February 24, 2015


Today is the third anniversary of starting this blog, and this is post number 325. Thanks to all of our visitors over the past three years — we hope that the next year will be as productive as this past one has been.

I have summarized here some of the accumulated data, in order to document at least some of the productivity.

As of this morning, there have been 238,613 pageviews, with a median of 192 per day. The blog has continued to grow in popularity, with a median of 70 pageviews per day in the first year, 189 per day in the second year, and 353 per day in the third year. The range of pageviews was 172-1148 per day during this past year. The daily pattern for the three years is shown in the first graph.

Line graph of the number of pageviews through time, up to today.
The largest values are off the graph. The green line is the half-way mark.
The inset shows the mean (blue) and standard deviation of the daily number of pageviews.
There are a few general patterns in the data, the most obvious one being the day of the week, as shown in the inset of the above graph. The posts have usually been on Mondays and Wednesdays, and these two days have had the greatest mean number of pageviews.

Some of the more obvious dips include times such as Christmas - New Year; and the biggest peaks are associated with mentions of particular blog posts on popular sites.

Unfortunately, the data are also seriously skewed by visits from troll sites. These have been particularly from the Ukraine, which is solely responsible for the peak between days 900 and 1000. The smaller following peak represents visits from Taiwan.

The posts themselves have varied greatly in popularity, as shown in the next graph. It is actually a bit tricky to assign pageviews to particular posts, because visits to the blog's homepage are not attributed by the counter to any specific post. Since the current two posts are the ones that appear on the homepage, these posts are under-counted until they move off the homepage (after which they can be accessed only by a direct visit to their own pages, and thus always get counted). On average, 30% of the blog's pageviews are to the homepage, rather than to a specific post page, and so there is considerable under-counting.

Scatterplot of post pageviews through time, up to last week; the line is the median.
Note the log scale, and that the values are under-counted (see the text).
It is good to note that the most popular posts were scattered throughout the years. Keeping in mind the initial under-counting, the top collection of posts (with counted pageviews) have been:
The Music Genome Project is no such thing
Charles Darwin's unpublished tree sketches
The acoustics of the Sydney Opera House
Why do we still use trees for the dog genealogy?
How do we interpret a rooted haplotype network?
Carnival of Evolution, Number 52
Who published the first phylogenetic tree?
Phylogenetics with SpongeBob
Charles Darwin's family pedigree network
Faux phylogenies
Evolutionary trees: old wine in new bottles?
Network analysis of scotch whiskies
Tattoo Monday

This list is not very different to the same time last year. Posts 129 (which is linked in Wikipedia) and 172 continue to receive visitors almost every day.

The audience for the blog continues to be firmly in the USA. Based on the number of pageviews, the visitor data are:
United States
Ukraine [spurious]
United Kingdom
Finally, if anyone wants to contribute, then we welcome guest bloggers. This is a good forum to try out all of your half-baked ideas, in order to get some feedback, as well as to raise issues that have not yet received any discussion in the literature. If nothing else, it is a good place to be dogmatic without interference from a referee!

February 22, 2015


As a means of motivating his interest in speciation, in The Origin of Species Charles Darwin highlighted the diversity of morphological forms among the finches of the Galápagos Islands, in the south-eastern Pacific Ocean, which he visited while circumnavigating the world in The Beagle. He considered this to be a prime example of biodiversity related to adaptation and natural selection, what we would now call an adaptive radiation.

Recently, the following paper, which provides a genomic-scale study of these birds, has attracted considerable attention:
Lamichhaney S, Berglund J, Almén MS, Maqbool K, Grabherr M, Martinez-Barrio A, Promerová M, Rubin CJ, Wang C, Zamani N, Grant BR, Grant PR, Webster MT, Andersson L (2015) Evolution of Darwin's finches and their beaks revealed by genome sequencing. Nature 518: 371-375.

The authors note:
Darwin's finches are a classic example of a young adaptive radiation. They have diversified in beak sizes and shapes, feeding habits and diets in adapting to different food resources. The radiation is entirely intact, unlike most other radiations, none of the species having become extinct as a result of human activities.
Here we report results from whole genome re-sequencing of 120 individuals representing all Darwin's finch species and two closely related tanagers. For some species we collected samples from multiple islands. We comprehensively analyse patterns of intra- and inter-specific genome diversity and phylogenetic relationships among species. We find widespread evidence of inter-specific gene flow that may have enhanced evolutionary diversification throughout phylogeny, and report the discovery of a locus with a major effect on beak shape.

Sadly, the authors try to study the intra- and inter-specific variation principally using phylogenetic trees. They do this in spite of noting that:
Extensive sharing of genetic variation among populations was evident, particularly among ground and tree finches, with almost no fixed differences between species in each group.

Clearly, this situation requires a phylogenetic network for adequate study, as a network can always display at least as much phylogenetic information as a tree, and usually considerably more. The authors do recognize this:
A network constructed from autosomal genome sequences indicates conflicting signals in the internal branches of ground and tree finches that may reflect incomplete lineage sorting and/or gene flow ... We used PLINK to calculate genetic distance (on the basis of proportion of alleles identical by state) for all pairs of individuals separately for autosomes and the Z chromosome. We used the neighbour-net method of SplitsTree4 to compute the phylogenetic network from genetic distances.

However, this network is tucked away as Fig. 3 in the appendices. It is shown here in the first figure. The authors attribute the gene flow to introgression, but occasionally refer to hybridization and convergent evolution. Indeed, they suggest both relatively recent hybridization as well as the possibility of more ancient hybridization between warbler finches and other finches.
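For readers unfamiliar with the distance the authors describe, the following sketch shows one common way to compute a distance from the proportion of alleles identical by state. It is an illustration under stated assumptions (biallelic sites, genotypes coded as 0, 1 or 2 copies of one allele, missing calls skipped), not a reproduction of PLINK's own implementation, and the function name is hypothetical.

```python
def ibs_distance(geno_a, geno_b):
    """1 minus the proportion of alleles identical by state between two individuals.

    Genotypes are lists of 0, 1 or 2 (copies of an arbitrarily chosen allele
    at each biallelic site); None marks a missing call, and such sites are skipped.
    """
    shared = 0
    total = 0
    for a, b in zip(geno_a, geno_b):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared at this site
        total += 2
    if total == 0:
        raise ValueError("no sites with calls in both individuals")
    return 1.0 - shared / total
```

A matrix of such pairwise distances is the kind of input that a distance-based network method such as neighbour-net takes.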

Clearly, this network is not particularly tree-like in places, especially with respect to the delimitation of species based on their morphology, as reflected in their current taxonomy. Nevertheless, the authors prefer to present as their main result a:
maximum-likelihood phylogenetic tree based on autosomal genome sequences ... We used FastTree to infer approximately maximum-likelihood phylogenies with standard parameters for nucleotide alignments of variable positions in the data set. FastTree computes local support values with the Shimodaira–Hasegawa test.

This tree is shown in the second figure.

This apparently well-supported tree is not a particularly accurate representation of the pattern shown by the network. Indeed, it makes clear just why it is inadequate to use a tree to study the interplay of intra- and inter-specific variation. Gene flow requires a network for accurate representation, not a tree.

The authors do acknowledge this situation. While they try to date the nodes on their tree, they do note that:
Although these estimates are based on whole-genome data, they should be considered minimum times, as they do not take into account gene flow.

Actually, in the face of gene flow the concept that a node has a specific date is illogical, because the nodes do not represent discrete events (see Representing macro- and micro-evolution in a network). Given the authors' final conclusion, it seems quite inappropriate to rely on trees rather than networks:
Evidence of introgressive hybridization, which has been documented as a contemporary process, is found throughout the radiation. Hybridization has given rise to species of mixed ancestry, in the past and the present. It has influenced the evolution of a key phenotypic trait: beak shape ... The degree of continuity between historical and contemporary evolution is unexpected because introgressive hybridization plays no part in traditional accounts of adaptive radiations of animals.

February 17, 2015


In biology we often distinguish microevolutionary events, which occur at the population level, from macroevolutionary events, which involve species. We have traditionally treated phylogenetics as a study of macroevolution. However, more recently there has been a trend to include population-level events, such as incomplete lineage sorting and introgression.

This is of particular importance for the resulting display diagrams. A phylogenetic tree was originally conceived to represent macroevolution. For example, speciation and extinction occur as single events at particular times, and these events apply to discrete groups of organisms. The taxa can be represented as distinct lineages in a tree graph, and the events by having these lineages stop or branch in the graph.

This idea is easily extended to phylogenetic networks, where the gene-flow events are also treated as singular, so that hybridization or horizontal gene transfer can be represented as single reticulations among the lineages.

These are sometimes called "pulse" events. However, there are also "press" events that are ongoing. That is, a lot of genetic variation is generated where populations repeatedly mix, so that every gene-flow instance is part of a continuous process of mixing. This often occurs, for example, in the context of isolation by distance, such as ring species or clinal variation. Under these circumstances, processes like introgression and HGT can involve ongoing events.

For instance, in an earlier life I once studied three species of plant in the Sydney region (Morrison DA, McDonald M, Bankoff P, Quirico P, Mackay D. 1994. Reproductive isolation mechanisms among four closely-related species of Conospermum (Proteaceae). Botanical Journal of the Linnean Society 116: 13-31). One of the species was ecologically isolated from the other two (it occurred in dry rather than damp habitats), and the other two were geographically isolated from each other (they occurred on separate sandstone uplands with a large valley in between). These species look very different from each other, as shown in the picture above, but looks are deceiving. Where the ecological isolation was incomplete, introgression occurred and admixed populations could be found.

These dynamics are more difficult to represent in a phylogenetic tree or network. We do not have discrete groups that can be represented by lines on a graph, but instead have fuzzy groups with indistinct boundaries. Furthermore, we do not have discrete events, but instead have ongoing (repeated) processes.

Nevertheless, it seems clear that there is a desire in modern biology to integrate macroevolutionary and microevolutionary dynamics in a single network diagram. That is, some parts of the diagram will represent pulse events involving discrete groups and other parts will represent press events among fuzzy groups. Practitioners currently seem to address this by first creating a tree to represent the pulse events (and possibly their times), and then adding imprecisely located dashed lines to represent ongoing gene flow (see the example in Producing trees from datasets with gene flow). This particular mixture of precision and imprecision seems rather unsatisfactory.

Perhaps someone might like to have a think about this aspect of phylogenetic networks, to see if there is some way we can do better.

February 15, 2015


As usual at the beginning of the week, this blog presents something in a lighter vein.

Homologies lie at the heart of phylogenetic analysis. They express the historical relationships among the characters, rather than the historical relationships of the taxa. As such, homology assessment is the first step of a phylogenetic analysis, while building a tree or network is the second step.

With a colleague (Mike Crisp, now retired), I once wrote a tongue-in-cheek article about how to misinterpret homologies, and the consequences of this for any subsequent tree-building analysis. This article appeared in 1989 in the Australian Systematic Botany Society Newsletter 60: 24–26. Since this issue of the Newsletter is not online, presumably no-one has read this article since then. However, you should read it, and so I have linked to a PDF copy [1.2 MB] of the paper:
An Hennigian analysis of the Eukaryotae

February 10, 2015


Recently, a number of computer programs have been released that are intended to produce phylogenetic networks representing introgression (or admixture) (see Admixture graphs – evolutionary networks for population biology).

A recent example of the use of these programs is presented by:
Jónsson H, Schubert M, Seguin-Orlando A, Ginolhac A, Petersen L, Fumagalli M, Albrechtsen A, Petersen B, Korneliussen TS, Vilstrup JT, Lear T, Myka JL, Lundquist J, Miller DC, Alfarhan AH, Alquraishi SA, Al-Rasheid KA, Stagegaard J, Strauss G, Bertelsen MF, Sicheritz-Ponten T, Antczak DF, Bailey E, Nielsen R, Willerslev E, Orlando L (2014) Speciation with gene flow in equids despite extensive chromosomal plasticity. Proceedings of the National Academy of Sciences of the USA 111: 18655-18660.

This study presents a phylogenetic analysis of the extant genomes of the genus Equus, the horses, asses and zebras. This analysis leads the authors to the conclusion that there is "evidence for gene flow involving three contemporary equine species despite chromosomal numbers varying from 16 pairs to 31 pairs." The gene flow is indicated by the light-blue reticulations in the first diagram.

One important issue with these types of analyses is the logic on which the procedure is based. Programs like TreeMix (used in this analysis) were developed to allow modelling of gene flow across the branches of trees at a microevolutionary (population) scale. Specifically, the graph generated by TreeMix models singular (pulse) introgression events in phylogenetic history.

The issue is that a tree is produced first, and then reticulations are added to it. The tree represents descent and the reticulations represent gene flow. But how do we produce a tree from a dataset that contains evidence of both descent and gene flow? The authors' initial tree is shown below.

The procedural logic works as follows:
(i) we assume that the traditionally recognized species exist
(ii) we assume that we have a representative sample of them, with one genome each
(iii) we construct a tree based on the assumption that there is no gene flow among the species
(iv) we then assess the species for gene flow, and discover it.

Isn't this rather circular? Surely (iv) invalidates the assumptions inherent in (i)-(iii)? How can we then assess the reliability of the sampling in (ii) and the analyses in (iii)? Why have we made assumption (i)? At best the species are fuzzy groups to one extent or another, and we do not know where we have sampled within the probabilistic space assigned to the groups.

This seems like a very poor way to go about studying the interaction between descent and gene flow. First we assume descent only, and then we assess gene flow. When we find gene flow we continue to accept the results of the initial analyses based on descent alone.

I would hate to have to justify this philosophy to someone outside phylogenetics, because I have a horrible feeling that they would either smile tolerantly or laugh outright.

This between-species situation is even more extreme for those within-species patterns where groups are recognized. Human races and domesticated breeds are two concepts that have received constant criticism. Neither races nor breeds form clear-cut groups, as there are no sharp boundaries between them, due to gene flow. Their "central locations" in genotype space are usually very different, however. Therefore it is quite possible to perform a tree-based analysis of samples from the central locations, and this would tell us a lot about descent. But it would tell us almost nothing about gene flow; and we would have a very distorted view of the phylogenetic history.

February 8, 2015


Over the past century a number of food styles have become internationalized, including hamburgers and fried chicken. Not all of these foodstuffs are nutritious, and some people have noted that not all of them are even particularly edible. However, perhaps the most interesting of these foods is the venerable pizza, not only because the customer has considerable say in what it looks and tastes like, but also because it is made and cooked fresh, right in front of us.

Pizza originated in Italy, Greece, or Persia, depending on how we define pizza. After all, covering flat bread with a topping is an idea that goes back a very long way. In the ancient world, the Egyptians made flat bread; the Indians baked bread in an oven, but without a topping; and the Persians cooked their bread without an oven, but they did put melted cheese on it. The Passion 4 Pizza site notes this more recent history: "The ancient Greeks had a flat bread called plakountos, on which they placed various toppings [eg. herbs, onion and garlic], and we know also that Naples was founded (as Neopolis) by the Greeks; and Naples is the home of the modern pizza."

In 16th century Naples, a yeast-based flat bread was referred to as a pizza, eaten by poor people as a street food; but the idea that led to modern pizza was the use of tomato as a topping. Tomatoes were introduced to Europe from South America in the 16th century, and by the 18th century it was common for the poor of the area around Naples to add tomato to their bread. Pizza was brought to the United States by the Italian immigrants in the late 19th century, and became popular in places like New York and Chicago.

Kenji López-Alt publishes The Pizza Lab, which is part of the Serious Eats blog, and he has taken a serious interest in pizza styles, at least in New York. He recognizes three main styles of pizza, based on their dough, the way it is treated, and the temperature at which it is cooked (see the picture above, left to right):
  • New York
  • Sicilian
  • Neapolitan
He also has several variants on these styles.

As a basis for discussion, I have analyzed the dough ingredients of these three styles, using a phylogenetic network as a tool for exploratory data analysis. To create the network, I first calculated the similarity of the pizzas using the Manhattan distance, and a Neighbor-net analysis was then used to display the between-dough similarities as a phylogenetic network. So, pizza-dough styles that are closely connected in the network are similar to each other based on their ingredients, and those that are further apart are progressively more different from each other.
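
The distance step described above can be sketched as follows. The ingredient values are hypothetical baker's percentages invented for illustration; they are not the recipe data used for the network in this post, and the function names are likewise assumptions. The sketch only computes the pairwise Manhattan distances (the sum of absolute differences across ingredients). The resulting matrix would then be passed to a Neighbor-net implementation, which is not reproduced here.

```python
from itertools import combinations

# Hypothetical dough profiles (percent of flour weight); NOT the post's data.
doughs = {
    "Neapolitan": {"water": 65.0, "salt": 2.0, "yeast": 1.0, "oil": 0.0, "sugar": 0.0},
    "New York": {"water": 67.0, "salt": 2.0, "yeast": 1.0, "oil": 5.0, "sugar": 2.0},
    "Sicilian": {"water": 70.0, "salt": 2.0, "yeast": 1.0, "oil": 8.0, "sugar": 0.0},
}


def manhattan(a, b):
    """Sum of absolute differences; an ingredient missing from a dough counts as 0."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)


def distance_matrix(profiles):
    """Return the sorted names and a symmetric nested dict of pairwise distances."""
    names = sorted(profiles)
    dist = {x: {y: 0.0 for y in names} for x in names}
    for x, y in combinations(names, 2):
        d = manhattan(profiles[x], profiles[y])
        dist[x][y] = d
        dist[y][x] = d
    return names, dist


names, dist = distance_matrix(doughs)
for x in names:
    print(x.ljust(12), [dist[x][y] for y in names])
```

With real recipes, the ingredients would normally be expressed on a common scale (for example, relative to flour weight) before the distances are calculated, so that no single ingredient dominates simply because of its units.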

The Neapolitan-style dough is the simplest in terms of ingredients. The dough is not kneaded, but instead is allowed to rise for 3-5 days in the refrigerator, although it remains a thin-crust pizza. It is cooked quickly at a high temperature. The New York-style dough is an offshoot of this that is slightly thicker, and is cooked cooler and slower. The unkneaded dough stands in the fridge for only 1 day. Like all of the styles except the Neapolitan, it uses olive oil in the dough, but unlike any of the others it also contains sugar (to help the crust brown more evenly). The Sicilian-style dough is intended for a thick-crust pizza. It requires only a little kneading, after which it is allowed to rise for 2 hours at room temperature. It is essentially fried in olive oil while baking.

The Sfincione is the original Sicilian pizza style, thinner and chewier than the New York Sicilian. It is also cooked at a lower temperature. The Deep Pan pizza is, of course, another thick-crust style. It is allowed to rise for longer than the Sicilian, and is cooked at a higher temperature. The network shows that these all have closely related doughs.

The Greek-style pizza is allegedly a style "found mostly in the 'Pizza Houses' and 'Houses of Pizza' in New England". As shown by the reticulation in the network, it has characteristics of the Neapolitan pizza dough (relatively low water content) and the Sicilian (relatively high oil content). It is left to rise at room temperature overnight, and is cooked like the New York and Deep Pan pizzas.

There are many other pizza styles, of course, but I do not have recipes for them. For example, there is another Deep Dish style found in Chicago.