The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis



March 31, 2015


It is tolerably well known that Alfred Russel Wallace developed the idea of evolution via natural selection quite independently of Charles Darwin, and that, indeed, it was Wallace's revelation of this fact that prompted Darwin to finally publish his ideas (Bannister et al. 2014).

Some people are even aware that Wallace developed the Tree of Life metaphor independently, as well (Wallace 1855), a fact of which Darwin himself was perfectly well aware (eg. Bradman and Bartlett 1998):
"the analogy of a branching tree [is] the best mode of representing the natural arrangement of species ... a complicated branching of the lines of affinity, as intricate as the twigs of a gnarled oak ... we have only fragments of this vast system, the stem and main branches being represented by extinct species of which we have no knowledge, while a vast mass of limbs and boughs and minute twigs and scattered leaves is what we have to place in order, and determine the true position each originally occupied with regard to the others."

What is less well known is Wallace's contribution to phylogenetic imagery.

The Darwinian version of a phylogenetic tree is, of course, something usually considered to post-date 1859, when Darwin published his best-known book. However, producing such a tree was apparently a rather slow process. For example, in 1863, Franz Hilgendorf wrote a PhD thesis for which he produced a hand-drawn phylogeny, but he did not actually include this in the thesis; and he significantly modified it for its publication in 1866. In 1864 Fritz Müller published a couple of three-taxon trees. Also in 1864, Ernst Haeckel claimed to have started work on his series of phylogenetic trees, but the resulting book was not published until 1866. This means that the first substantial tree to appear in print was that of Mivart (1865).

However, long before this, Wallace was already moving ahead. In 1856 Wallace took the tree imagery from his 1855 publication and applied it to the relationships among bird groups. This publication was his first clearly evolutionary empirical contribution. He adapted the unrooted diagram of Strickland (1841), which represented "the natural system" of bird relationships, and gave it a clearly evolutionary interpretation. So, while Strickland's work was strictly atemporal and non-evolutionary, Wallace produced an evolutionary view of the world, with his two trees representing the end-product of change through time.

Wallace was in South-East Asia at the time of this work, collecting specimens among the islands of what is now Indonesia. He returned to England in 1862, thus having been absent during Darwin's rise to fame. However, he did return before anyone else had tackled Darwin's ideas empirically, and he was in an ideal position to do so himself (Beckenbauer et al. 2010). It would therefore be surprising if he had not done so.

Recently, it has become clear, as a result of the work done for the Wallace Correspondence Project, that Wallace did, indeed, produce a post-Darwinian phylogenetic diagram before any of his contemporaries, although it remained unpublished (Becker and Borg 2014). Not unexpectedly, it also refers to the relationships among birds. What is most interesting for us, however, is that it was a phylogenetic network, not a tree.

You will note that it is an unrooted network, in the same manner as his unrooted bird trees from 1856. In this, his presentation differed from that of Müller, Hilgendorf, Mivart and Haeckel, who all indicated a common ancestor. On the other hand, the branch lengths represent the "relative amount of affinity" between the named taxa, unlike the diagrams of his contemporaries. This means that the diagram can, indeed, be interpreted (in modern terms) as an unrooted phylogenetic network.

In his bird paper, Wallace (1856) had noted that producing the tree diagrams is not easy, as "you will most likely find that you have set down some conflicting affinities, or that you have mistaken some mere analogies for affinities". This seems to be the origin of his interest in the alternative model of a network, rather than a tree (Brabham and Berger 2014), thus making him the first person to use a data-display network to represent conflicting character data.

This post was inspired by the work of Torvill and Dean (1996). Happy April 1.


Bannister RG, Ballesteros-Sota S, Bjørndalen OE (2014) Running, swinging and skiing — the private life of Alfred Russel Wallace. Studia Wallaceana 6: 82-96.

Becker BF, Borg BR (2014) The phylogenetics of A.R. Wallace, and its relation to the science of tennis. Journal of Phylogenetic Inference 13: 101-110.

Beckenbauer FA, Best G, Bruyneel J (2010) Association football as a metaphor for phylogenetics. Is it a sport or a science? Phyloinformatics 7:1.

Brabham JA, Berger G (2014) The speed required to achieve the publication rate of A.R. Wallace. Philosophy and History of Biology 102: 89-92.

Bradman DG, Bartlett KC (1998) Wallace Down Under: the work of Alfred Russel Wallace in the southern hemisphere. Systematic Zoology 47: 767-780.

Haeckel E (1866) Generelle Morphologie der Organismen. Verlag von Georg Reimer, Berlin.

Hilgendorf F (1866) Planorbis multiformis im Steinheimer Süßwasserkalk: ein beispiel von gestaltveränderung im laufe der zeit. Buchhandlung von W. Weber, Berlin.

Mivart StG (1865) Contributions towards a more complete knowledge of the axial skeleton in the primates. Proceedings of the Zoological Society of London 33: 545-592.

Müller F (1864) Für Darwin. Verlag von Wilhelm Engelmann, Leipzig.

Strickland HE (1841) On the true method of discovering the natural system in zoology and botany. Annals and Magazine of Natural History 6: 184-194.

Torvill J, Dean CC (1996) Skating on thin ice. Systematic Biology 45: 641-650.

Wallace AR (1855) On the law which has regulated the introduction of new species. Annals and Magazine of Natural History 16 (2nd series): 184-196.

Wallace AR (1856) Attempts at a natural arrangement of birds. Annals and Magazine of Natural History 18 (2nd series): 193-216.

March 29, 2015


NeighborNet produces splits graphs based on distances between the taxa, rather than using the original character data. This approach can produce what we might call inconsequential splits in the graph — that is, splits that are not explicitly supported by the character data. Here, I present a simple example to illustrate the extent to which this can occur.

The data are taken from: Nanette Thomas, Jeremy J. Bruhl, Andrew Ford, Peter H. Weston (2014) Molecular dating of Winteraceae reveals a complex biogeographical history involving both ancient Gondwanan vicariance and long-distance dispersal. Journal of Biogeography 41: 894-904.

This dataset consists of a set of eight morphological features of the pollen from 31 extant plant taxa plus two fossil samples, as shown in this data matrix:

T_lanceolata        00111011
T_stipitata         00111011
T_purpurescens      00111011
T_xerophila_x       00111011
T_xerophila_r       00111011
T_vickeriana        00111011
T_glaucifolia       00111011
T_membranea         00111011
T_insipida          00111011
T_perrieri          00111010
D_winteri           00111010
D_grenadensis       00111010
B_comptonii         00011010
B_howeana           00011010
B_semicarpoides     00011010
B_whiteana          00011010
B_queenslandiana_q  00011010
B_queenslandiana_1  00011010
P_axillaris         00011011
P_colorata          00011011
Pseudowinterapollis 00011011
B_pancheri          01001011
Harrisipollenites   01001100
Z_acsmithii         01001101
E_stipitatum        01001101
Z_bicolor           01001101
Z_balansae          11001101
C_dinisii           1-111101
C_madagascariensis  1-111101
W_salutaris         1-111101
P_macranthum        1-111101
C_ekmanii           1-111101
C_winterana         1-111101

Note that there are only nine groups of taxa (each group being a run of consecutive rows in the matrix); within each group the data are identical. Each character has two states: present / absent.
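The grouping can be checked mechanically. The following is a minimal Python sketch, not part of the original analysis, that collects the taxa sharing each character pattern in the matrix above; the function name `group_by_pattern` is simply an illustrative choice.

```python
from collections import defaultdict

# The pollen-character matrix from the post, one taxon per line.
MATRIX = """
T_lanceolata        00111011
T_stipitata         00111011
T_purpurescens      00111011
T_xerophila_x       00111011
T_xerophila_r       00111011
T_vickeriana        00111011
T_glaucifolia       00111011
T_membranea         00111011
T_insipida          00111011
T_perrieri          00111010
D_winteri           00111010
D_grenadensis       00111010
B_comptonii         00011010
B_howeana           00011010
B_semicarpoides     00011010
B_whiteana          00011010
B_queenslandiana_q  00011010
B_queenslandiana_1  00011010
P_axillaris         00011011
P_colorata          00011011
Pseudowinterapollis 00011011
B_pancheri          01001011
Harrisipollenites   01001100
Z_acsmithii         01001101
E_stipitatum        01001101
Z_bicolor           01001101
Z_balansae          11001101
C_dinisii           1-111101
C_madagascariensis  1-111101
W_salutaris         1-111101
P_macranthum        1-111101
C_ekmanii           1-111101
C_winterana         1-111101
"""

def group_by_pattern(matrix_text):
    """Map each distinct character pattern to the list of taxa that carry it."""
    groups = defaultdict(list)
    for line in matrix_text.strip().splitlines():
        taxon, pattern = line.split()
        groups[pattern].append(taxon)
    return dict(groups)

groups = group_by_pattern(MATRIX)
for pattern, taxa in groups.items():
    print(pattern, len(taxa), ", ".join(taxa))
```

Running this lists one line per distinct pattern, with the number of taxa and their names.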

The resulting NeighborNet, as produced by default using the SplitsTree4 program, is shown in the first graph.

As expected, the taxa form nine groups. There are a number of apparently well-supported splits (ie. with long edges) separating these groups. There are also a number of smaller splits, and a whole series of very tiny splits. None of these latter two groupings are explicitly present in the dataset — the only splits supported by the characters are plotted onto the graph using the character numbers. (Note that character 5 is uninformative.)
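To see which splits the characters themselves support, one can read each column of the matrix as a bipartition of the groups. The sketch below is an illustration only: the labels G1 to G9 are hypothetical names for the nine distinct patterns (in matrix order), and missing data ("-") are simply left out of both sides of a split.

```python
# One representative pattern per group; G1..G9 are illustrative labels only.
PATTERNS = {
    "G1": "00111011",
    "G2": "00111010",
    "G3": "00011010",
    "G4": "00011011",
    "G5": "01001011",
    "G6": "01001100",
    "G7": "01001101",
    "G8": "11001101",
    "G9": "1-111101",
}

def character_splits(patterns):
    """For each character (1-based), return the groups scored 0 and those scored 1."""
    n_chars = len(next(iter(patterns.values())))
    splits = {}
    for i in range(n_chars):
        zeros = frozenset(g for g, p in patterns.items() if p[i] == "0")
        ones = frozenset(g for g, p in patterns.items() if p[i] == "1")
        splits[i + 1] = (zeros, ones)
    return splits

splits = character_splits(PATTERNS)

# A character with an empty side is constant and so defines no split.
constant = [c for c, (zeros, ones) in splits.items() if not zeros or not ones]
print("constant characters:", constant)

for c, (zeros, ones) in splits.items():
    if c not in constant:
        print(c, sorted(zeros), "|", sorted(ones))
```

Any split in the NeighborNet that does not correspond to one of these character bipartitions comes from the distance calculation rather than directly from a character.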

The series of very tiny splits are present throughout the graph as extremely short edges. For example, a detailed view of the bottom left-hand corner of the graph is shown in the next figure.

Note that these six taxa have identical character data, and therefore their separation into four groups is entirely an artifact of the NeighborNet algorithm.

So, one needs to be careful when interpreting small splits in such a graph: they may have biological support and they may not.

March 24, 2015


In the literature, phylogenetic trees often appear even when the paper is discussing non-tree evolutionary histories.

A case in point is the paper by: Susanne Gallus, Axel Janke, Vikas Kumar, Maria A. Nilsson (2015) Disentangling the relationship of the Australian marsupial orders using retrotransposon and evolutionary network analyses. Genome Biology and Evolution, in press.

The authors discuss the relationship between the four Australian marsupial orders, and use data from transposable element (retrotransposon) insertions for resolving the inter- and intra-ordinal relationships of the Australian and South American orders. They plot the retrotransposon presence/absence onto a tree derived from alignments of 28 nuclear gene fragments. This is shown in the first figure, with the retrotransposons indicated as dots on the internal branches.

For comparison, the next figure is a Median-Joining network based on the presence/absence of the retrotransposons.

With the exception of the Monito del monte, Shrew opossum and Western quoll, the network matches the basic tree structure. However, it emphasizes more strongly the fact that the retrotransposons do not resolve the relationships among the Marsupial orders. As the authors note:
The retrotransposon insertions support three conflicting topologies regarding Peramelemorphia, Dasyuromorphia and Notoryctemorphia, indicating that the split between the three orders may be best understood as a network ...The rapid divergences left conflicting phylogenetic information in the genome possibly generated by incomplete lineage sorting or introgressive hybridisation, leaving the relationship among Australian marsupial orders unresolvable as a bifurcating process million years later.

March 22, 2015


Phylogenetic networks can be used to illustrate the history of any set of objects or concepts, provided that this history is a divergent one (ie. the history is not simply the transformation of objects through time).

Since I have recently been writing about sequence alignments, it is worthwhile to show an example of applying a network to sequence alignment programs. This comes from the paper by Chaisson MJ, Tesler G (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13: 238.

The authors discuss programs that map reads from a sample genome onto a reference sequence. They note: "the relationship between many existing alignment methods is qualitatively illustrated in the figure."

Their legend reads:
The applications / corresponding computational restrictions shown are: (green) short pairwise alignment / detailed edit model; (yellow) database search / divergent homology detection; (red) whole genome alignment / alignment of long sequences with structural rearrangements; and (blue) short read mapping / rapid alignment of massive numbers of short sequences. Although solely illustrative, methods with more similar data structures or algorithmic approaches are on closer branches. The BLASR method combines data structures from short read alignment with optimization methods from whole genome alignment.

The reticulation refers to their new program, which "maps reads using coarse alignment methods developed during WGA [whole genome alignment] studies, while speeding up these methods by using the advanced data structures employed in many NGS [next generation sequencing] mapping studies."

March 17, 2015


Multiple sequence alignment software has not yet met its primary aim for evolutionary biologists: maximizing homology of characters. If our goal is to develop an automated procedure for homology assessment, then we need someone to produce a program that explicitly implements this aim.

Alignment is just as much a part of phylogenetics as is tree or network building. It is the procedure that expresses the homology relationships among the characters, rather than the historical relationships among the taxa. Therefore, we need a computer program that accurately expresses homology relationships, as well as one that accurately expresses the historical relationships. We have some programs for the latter but currently nothing for the former.

Unfortunately, homology is a rather nebulous concept. It has to do with inheriting characters from a shared ancestor, which is not something that we can directly observe. Therefore we have to infer it. Somehow.

Homology criteria

Systematists have developed criteria for making decisions about potential homologies in an objective and (hopefully) repeatable manner, and these are directly applicable to nucleotide sequences, which these days are the most common form of data used in phylogenetics. These criteria are:

• Similarity
  1. Compositional = apparent likeness or resemblance between sequences (% similarity)
  2. Topographical = apparent likeness or resemblance between sequences (second- and third-order structure of proteins or RNA)
  3. Functional = functional relationship to other characters in the same sequence (annotated function of the sequence in protein or RNA)
• Conjunction = possible within-genome copies of the same sequence (i.e. paralogy)

• Congruence = agreement with other postulated homologies elsewhere in the same sequences (synapomorphy).

Traditionally, characters have been first proposed as homologous using the criteria of similarity and conjunction (together called primary homology), and then tested with the criterion of congruence (secondary homology).

It is important to note that these criteria do not necessarily always agree with each other in their inferences of homology. Changes that occur during evolutionary history can weaken the connection between these criteria so that, for example, nucleotide homology inferred from structural similarity is no longer the same as nucleotide homology inferred from compositional similarity. It is for this reason that compositional similarity of the sequences is insufficient to establish gene orthology, for example. The same limitation applies to nucleotides.
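As a concrete reminder of how little the compositional criterion actually measures, here is a minimal Python sketch of a percent-identity calculation for two already-aligned sequences. Skipping columns that contain a gap is only one of several conventions in use, and the function name and example sequences are illustrative assumptions.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent of gap-free aligned columns in which the two residues match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap or y == gap:
            continue  # convention assumed here: gapped columns are not counted
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

identity = percent_identity("ACGT-ACGT", "ACGTTACGA")
print(identity)  # 7 matches out of 8 compared columns -> 87.5
```

The number says nothing about structure, function, paralogy or congruence, which is why it cannot stand in for homology on its own.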

Current computer programs

It is clear that these criteria have been incorporated singly into current computerized procedures for producing multiple sequence alignments, but rarely in combination. For example, compositional similarity is the criterion used by the most popular computer programs, such as CLUSTAL, MAFFT and Muscle. Topographical similarity is being invoked whenever structure-based alignments are produced, such as for RNA-coding sequences (eg. PicXAA-R; PMFastR), or when nucleotide sequences are translated to amino acids before alignment (eg. PROMALS). Functional similarity is used for specialist studies of conserved motifs and binding sites, for instance. Ontogenetic similarity of nucleotide sequences is based on inferring the possible molecular processes that cause the observed sequence variation; the program Prank uses this criterion by distinguishing between insertions and deletions.

Congruence as a criterion involves the observation of repeated patterns of synapomorphy in a phylogeny. Among alignment algorithms, both Direct Optimization (e.g. POY; MSAM; BeeTLe) and Statistical Alignment (e.g. BAli-Phy; StatAlign) try to simultaneously produce a multiple alignment and a phylogenetic tree, thus optimizing the criterion of congruence.

The fact that essentially none of the current crop of programs applies more than one criterion is, I contend, the principal reason why so many phylogeneticists adjust their alignments manually. Personal judgment may not be perfect, but at least it can be consciously based on homology as a general character concept. Since the different criteria may conflict with each other, at the moment only human judgment is available to compare them and thus make a final decision.

Required program

To make the homology criteria fully operational, we need to compare their inferences by evaluating the comparative evidence. That is, since the different criteria may conflict with each other, we need an automated way to compare them and evaluate their relative probabilities for any alignment column. What we need is a computerized procedure that will include all of the known criteria for homology assessment. Sadly, there are currently no mathematical models for doing this.

I suspect that there are two reasons for the failure of such a program to appear by now. First, biologists have not been clear about homology as a concept, and have not been able to express it in a form that computationalists could use to develop an algorithm. That is, we have criteria but they are not really operational criteria in a computational sense. Second, it will not be easy, because there is no obvious algorithm for inferring inheritance of characters. That is, we cannot easily separate homology from analogy.

Interactive editor

Another proposal is to have an interactive alignment editor. This editor would have the ability to show the conflicting hypotheses of homology (eg. where the homology suggested by structural pairing in a stem conflicts with homology suggested by tandem repeats), and then to annotate each column in the final alignment with the reason for the researcher having chosen to align those particular nucleotides. For example, one could press a button and see the RNA stem pairs in different colors (irrespective of whether the stem nucleotides are aligned), or press again and see the tandem repeats and inversions in different colours (once again, irrespective of how the nucleotides are aligned). One could also choose to see the annotations for the columns (summarized, using some coded schema), or simply look at the unadorned alignment itself.

This seems to me to be an achievable goal in the short-term; and the PhyDE editor already does some of it. Such an editor would also serve as a necessary step on the way to working out how to automate as much of the process as possible. The ultimate goal for some people may be total automation (ie. a black box), but I see no way to achieve that in the immediate term. Besides, I suspect that phylogeneticists will always want some judgemental control over the process, which would be best achieved with a semi-automated interactive editor. That is, we might ask the program to work out what the alternative alignments are for any specified subsequence (in an automated manner), and then we evaluate their relative merits for ourselves.

Note that I am treating the alignment as a set of hypotheses independent of their phylogenetic analysis. Subsequences can still be tentatively aligned even if the researcher intends masking those subsequences out of any subsequent tree-building analysis. Also, subsets of the taxa might be aligned confidently while other subsets are left unaligned. With current editors, this involves having a separate alignment file for each subset, which is very cumbersome, as well as error-prone.

March 15, 2015


Here is a new collection of tattoos based on Charles Darwin's best-known sketch from his Notebooks (the "I think" tree). For other examples, see Tattoo Monday III, Tattoo Monday VI, and Tattoo Monday IX.

March 11, 2015


Multiple sequence alignment software has not yet met its primary aim for evolutionary biologists: maximizing homology of characters. The many available alignment methods have diverse optimization functions, along with assorted heuristics to search for the optimum alignment; and these methods produce detectably different multiple sequence alignments in almost all realistic cases (see The need for a new sequence alignment program). This leaves phylogeneticists wondering what to do. In response, the majority of phylogeneticists use manual alignment or re-alignment at some stage in their procedures.

If our goal is to develop an automated procedure for homology assessment (see Multiple sequence alignment), then we need some means of evaluating the relative success of different alignment methods.

There are four suggestions for benchmarking strategies for sequence alignment (Iantorno S, Gori K, Goldman N, Gil M, Dessimoz C 2014. Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment. Methods in Molecular Biology 1079: 59-73):
  1. Benchmarks based on simulated evolution of biological sequences, to create examples with known homology.
  2. Benchmarks based on consistency among several alignment techniques.
  3. Benchmarks based on the three-dimensional structure of the products encoded by sequence data.
  4. Benchmarks based on knowledge of, or assumption about, the phylogeny of the aligned biological sequences.
These authors list a number of pros and cons for each strategy. For our purposes we need to consider the cons, which I discuss below (not all of these are covered by the authors).


1. Simulation-based approaches adopt a probabilistic model of sequence evolution to describe nucleotide substitution, deletion, and insertion rates, while keeping track of “true” relationships of homology between individual residue positions (see Do biologists over-interpret computer simulations?).
(a) The simulation and analysis methods are not independent. All observations drawn from simulated data depend on the assumptions and simplifications of the model used to generate the data. This means that the results are biased towards those analysis methods that most closely match the assumptions of the simulation model.
(b) Simulations cannot straightforwardly, if at all, account for all evolutionary forces. This means that the simulations are not realistic, and their relevance for the behaviour of real datasets is unknown. The biggest failing in this regard is that, at some stage in the simulation, insertions and deletions are assumed to occur at random along the sequence (IID), and nothing could be further from the truth. Sequence variation occurs as a result of tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, and insertions; and there are strong spatial constraints on variation such as codons and stem-loops. Current simulation methods fall well short of modeling these patterns of sequence variation.
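To make the criticism concrete, here is a deliberately simplistic Python sketch of the kind of simulation being described: substitutions, deletions and single-base insertions are applied independently at every site (the IID assumption), while each descendant base remembers which ancestral site it came from, so the "true" alignment can be written down afterwards. The probabilities, function names and sequence are all assumptions chosen for illustration; no published simulator is being reproduced.

```python
import random

def evolve(ancestor, rng, p_sub=0.1, p_del=0.05, p_ins=0.05, alphabet="ACGT"):
    """Return the descendant as a list of (ancestral_site, base) pairs.

    Inserted bases have ancestral_site None. Every event is drawn
    independently per site, which is exactly the IID assumption.
    """
    descendant = []
    for site, base in enumerate(ancestor):
        if rng.random() < p_ins:
            descendant.append((None, rng.choice(alphabet)))  # insertion before this site
        if rng.random() < p_del:
            continue                                          # site deleted
        if rng.random() < p_sub:
            base = rng.choice([b for b in alphabet if b != base])
        descendant.append((site, base))
    return descendant

def true_alignment(ancestor, descendant):
    """Build the known pairwise alignment from the recorded site history."""
    top, bottom = [], []
    next_site = 0
    for site, base in descendant:
        if site is None:                  # inserted base: gap in the ancestor
            top.append("-")
            bottom.append(base)
            continue
        while next_site < site:           # deleted ancestral sites: gap in the descendant
            top.append(ancestor[next_site])
            bottom.append("-")
            next_site += 1
        top.append(ancestor[site])
        bottom.append(base)
        next_site = site + 1
    while next_site < len(ancestor):      # trailing deletions
        top.append(ancestor[next_site])
        bottom.append("-")
        next_site += 1
    return "".join(top), "".join(bottom)

rng = random.Random(1)
ancestor = "ACGTACGTACGTACGTACGT"
descendant = evolve(ancestor, rng)
top, bottom = true_alignment(ancestor, descendant)
print(top)
print(bottom)
```

Nothing in this sketch knows about codons, stem-loops, tandem repeats or inversions, which is the point of the objection: the homology is known, but the variation is not generated the way real sequences vary.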

2. The key idea behind consistency-based benchmarks is that different good aligners should tend to agree on a common alignment (namely, the correct one) whereas poor aligners might make different kinds of mistakes, thus resulting in inconsistent alignments.
(a) Two wrongs don't make a right. That is, consistent methods may be collectively biased. Moreover, consistency is not independent of the set of methods used (some may be consistent with each other and not with others).
(b) Consistency scores are a feature of several methods, which means that the benchmark is not independent.
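Agreement between two alignments of the same sequences is usually measured on pairs of residues placed in the same column. The Python sketch below is an illustration under stated assumptions: it reports the shared fraction of all aligned residue pairs (a Jaccard index), which is a simple symmetric stand-in and not the score used by any particular benchmark.

```python
from itertools import combinations

def aligned_pairs(alignment, gap="-"):
    """Set of residue pairs that share a column.

    Each residue is identified as (sequence index, position in the
    ungapped sequence), so the pairs are comparable across alignments.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for seq_index, ch in enumerate(column):
            if ch != gap:
                present.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs

def pair_agreement(aln_a, aln_b):
    """Fraction of aligned residue pairs common to both alignments."""
    a = aligned_pairs(aln_a)
    b = aligned_pairs(aln_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Two hypothetical alignments of the same two sequences (ACGT and AGCT).
aln_1 = ["ACG-T", "A-GCT"]
aln_2 = ["ACG-T", "AG-CT"]
agreement = pair_agreement(aln_1, aln_2)
print(agreement)
```

Note that a high value only says the two methods agree with each other; as argued above, it does not say that either alignment reflects homology.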

3. Structural benchmarks most commonly employ the superposition of known protein/RNA structures as an independent means of alignment, to which alignments derived from sequence analysis can then be compared (see Edgar RC 2010. Quality measures for protein alignment benchmarks. Nucleic Acids Research 38: 2145-2153). The best known of these include: BAliBASE, OXBench, PREFAB, SABmark, IRMBase, and BRAliBase.
(a) Datasets are limited to structurally conserved regions, and may not be relevant for other alignment objectives.
(b) Deriving the structure-based alignments is problematic. For example, there is inconsistency amongst different structural superpositions.

4. Given a reference tree, the more accurate is the tree resulting from a given alignment, then the more accurate the underlying alignment is assumed to be (see Dessimoz C, Gil M 2010. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biology 11: R37).
(a) False inversion of a proposition: Accurate alignments yield accurate trees, therefore accurate trees must be based on accurate alignments.
(b) Alignment is often involved in constructing the reference tree. If not, the tree may be trivial in terms of taxon relationships.


This evaluation leaves us in the invidious position of not yet having any benchmarking method that is relevant to homology assessment for multiple sequence alignments. This conclusion is at variance with other previous assessments (eg. Aniba MR, Poch O, Thompson JD 2010. Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Research 38: 7353-7363).

We need to consider what such a method might look like, and how we might go about constructing it. If biologists can't give the bioinformaticians a concrete goal for homology alignment then they can expect nothing in return.

It seems clear that we need to follow the idea behind option 3, but base the alignments on homology rather than structure. I once made a start with compiling some suitable datasets (see Morrison DA 2009. A framework for phylogenetic sequence alignment. Plant Systematics and Evolution 282: 127-149); but this was a very minor effort.

As I see it, we need alignments that are explicitly annotated with the reasons for considering the columns to be homologous. One suggestion would be to have relatively short alignments with annotations for "known" features, such as tandem repeats, inverted repeats, substitutions, inversions, translocations, transpositions, deletions, insertions, or stem-loops. These all create sequence variation, and they provide evidence of the homology relations among the sequences. Presumably the alignments would vary in length and number of sequences, and in the complexity of the patterns.
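One possible shape for such an annotated alignment is sketched below in Python. This is purely a hypothetical data structure: the class name, field names and the free-text "reason" labels are assumptions, and a real benchmark would need an agreed vocabulary for the annotations.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedAlignment:
    """An alignment whose columns carry the stated reasons for homology."""
    names: list
    rows: list
    column_notes: dict = field(default_factory=dict)  # column index -> list of reasons

    def __post_init__(self):
        if len(self.names) != len(self.rows):
            raise ValueError("one name is needed per row")
        if len({len(row) for row in self.rows}) > 1:
            raise ValueError("all rows must have the same length")

    def annotate(self, start, end, reason):
        """Attach a reason (eg. 'tandem repeat') to columns start..end-1."""
        for col in range(start, end):
            self.column_notes.setdefault(col, []).append(reason)

    def unannotated_columns(self):
        """Columns for which no homology evidence has been recorded."""
        width = len(self.rows[0]) if self.rows else 0
        return [col for col in range(width) if col not in self.column_notes]

aln = AnnotatedAlignment(
    names=["taxon_a", "taxon_b"],
    rows=["ACGTACGT", "ACG--CGT"],
)
aln.annotate(3, 5, "deletion")
print(aln.unannotated_columns())
```

Allowing several reasons per column leaves room for recording cases where the criteria disagree.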

Perhaps the biggest practical problem will be how to deal with alignments where the homology criteria conflict with each other. That is, there are different types of criteria used to recognize homology — ie. similarity, structure, ontogeny, congruence (see Morrison DA 2015. Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26) — and they do not necessarily agree with each other.

This would allow us to come up with a set of requirements to specify various categories of the database, based on each of the above features. We would then try to accumulate as many example datasets for each category as we can. The database will presumably have protein-coding sequences in one section and RNA-coding, introns, etc in another. This dichotomy is simplistic, but I feel that it needs to be that way in order to be of practical use. Within each of those two sections we would have subsets of varying degrees of difficulty (eg. different degrees of average sequence similarity, or distinct taxon subsets in the same alignment, or orphan sequences).

This organisational approach is similar to that originally adopted for BAliBase, but it was dropped by most of the databases developed subsequently. I believe that it is the best approach for our purposes.

There are also experimentally created datasets where the alignment is known because all of the ancestors were sequenced as well. These would be useful; but their limitation is that the sequence variation was generated more or less at random, and so it does not match normal evolutionary processes. These alignments are more likely to match the IID assumption of the current automated alignment methods.

There is one further issue with this approach. Bioinformaticians often state that a few carefully prepared datasets are of little practical use to them (as opposed to being of use to phylogeneticists). What they need is a large number of datasets, the more the better. This is because they are interested in the percent success of their algorithms, and this cannot be assessed with small sample sizes. So, each alignment probably does not need to have too many taxa or too much sequence length; it is the number of alignments that is important, not their individual sizes. This could be achieved by sub-dividing larger datasets.

March 8, 2015


In a few recent blog posts I have discussed the early history of pedigrees, noting that they were usually presented as descent trees (with an ancestor at the top and the descendants below), although some later ones reversed this arrangement. This does not match our description of them as "family trees", of course, because the root of the pedigree is at the top.

I present here another early example, if for no other reason than that I have spent the past hour trying to decipher it. It is a Genealogy of the Saxon Dynasty, particularly the Ottonians. The picture is from the Chronica Sancti Pantaleonis, produced by the Benedictine monastery of Saint Pantaleon in Cologne in 1237 CE, which was itself based on the Chronica Regia Coloniensis [Royal Chronicle of Cologne], first compiled about 1177 CE in Michaelsberg Abbey, Siegburg.

Heinricus rex and Methildis regina are the founding couple in the double circle. Henry the Fowler did not himself become Holy Roman Emperor, but he created a situation where his descendants could do so, and did. They are numbered in the next diagram in the order in which they ruled. Number 9 is missing, this being Lothair II, who was not part of the family.

There are several things to note:
  • The interesting use of illustrative medallions, which seems to have been not uncommon at the time.
  • The consequent difficulty the illustrator has had in fitting the pedigree into the page, even though most of the descendants have been left out.
  • The pedigree is explicitly designed to establish noble ancestry, but females are included even when they are not in the direct line of descent.
  • The rulers nominally change families, from the Ottonian to the Salian to the Hohenstaufen dynasties, as a result of females in the direct line of descent.
  • Number 4 is Henry II, who made an appearance in an earlier post as the husband of Cunigunde of Luxembourg (The first royal pedigree).
  • Number 11 is Frederick I Barbarossa, who also made an appearance in an earlier post (Does it matter which way up a tree is drawn?).
  • The latter two points make it clear that the earliest written pedigrees were all closely related genealogically, and involved the attempts by certain parts of the German nobility to take control of the Holy Roman Empire, consisting at that time of what is now mostly Germany and Italy. Family descent was an important part of establishing who got to rule next.

March 3, 2015


I started actively working on phylogenetic networks more than 10 years ago, when I gave a talk at the Phylogenetic Combinatorics and Applications meeting in Uppsala in July 2004.

However, before I started working on networks I had for several years been working on multiple sequence alignment methodology, and I still do. This work is also of direct relevance to network construction, of course, since faulty alignments will generate conflicting signals that can confound the biological signals that alone should appear in the network.

This year marks the 20th anniversary of my first publication in the alignment field (see the list appended below). To celebrate this I have some review / commentary articles planned. The first of these has now appeared online, and I would like to draw it to your attention:
  • Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.
This paper relates current sequence alignment procedures to homology assessments as they are practiced for other data. Most algorithms can be seen as implementing only one of the several criteria that are used to identify homologies, which is inadequate. Suggestions are made for improving this situation.

There will also be a couple of upcoming blog posts canvassing a few issues that I see as important for the future development of alignment methods.

Previous Publications


Ellis J, Morrison DA (1995) Effects of sequence alignment on the phylogeny of Sarcocystis deduced from 18S rDNA sequences. Parasitology Research 81: 696-699.

Morrison DA, Ellis JT (1997) Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of Apicomplexa. Molecular Biology and Evolution 14: 428-441. [This has been the most cited of these publications, surprising me by still getting cited about once per month]

Morrison DA (2006) Multiple sequence alignment for phylogenetic purposes. Australian Systematic Botany 19: 479-539.

Morrison DA (2009) A framework for phylogenetic sequence alignment. Plant Systematics and Evolution 282: 127-149. [This was actually accepted for publication in 2007]

Morrison DA (2009) Why would phylogeneticists ignore computerized sequence alignment? Systematic Biology 58: 150-158.

Morrison DA (2010) [Book review of] ‘Sequence Alignment: Methods, Models, Concepts, and Strategies’. Systematic Biology 59: 363-365.

Empirical examples

Mugridge NB, Morrison DA, Johnson AM, Luton K, Dubey JP, Votypka J, Tenter AM (1999) Phylogenetic relationships of the genus Frenkelia: a review of its history and new knowledge gained from comparison of large subunit ribosomal RNA gene sequences. International Journal for Parasitology 29: 957-972.

Mugridge NB, Morrison DA, Heckeroth AR, Johnson AM, Tenter AM (1999) Phylogenetic analysis based on full-length large subunit ribosomal RNA gene sequence comparison reveals that Neospora caninum is more closely related to Hammondia heydorni than to Toxoplasma gondii. International Journal for Parasitology 29: 1545-1556.

Mugridge NB, Morrison DA, Jäkel T, Heckeroth AR, Tenter AM, Johnson AM (2000) Effects of sequence alignment and structural domains of ribosomal DNA on phylogeny reconstruction for the protozoan family Sarcocystidae. Molecular Biology and Evolution 17: 1842-1853.

Beebe NW, Cooper RD, Morrison DA, Ellis JT (2000) Subset partitioning of the ribosomal DNA small subunit and its effects on the phylogeny of the Anopheles punctulatus group. Insect Molecular Biology 9: 515-520.

Beebe NW, Cooper RD, Morrison DA, Ellis JT (2000) A phylogenetic study of the Anopheles punctulatus group of malaria vectors comparing rDNA sequence alignments derived from the mitochondrial and nuclear small ribosomal subunits. Molecular Phylogenetics and Evolution 17: 430-436.

March 1, 2015


I have occasionally mentioned in this blog the fact that phylogenetic trees have made it into the world of art. However, until now I have not really been able to say the same for phylogenetic networks. I am happy to report that I can now do so.

These three watercolours are from the collection of Sandra Black Culliton, a microbial geneticist.

At the time of writing the originals are still for sale at Etsy.

Alternatively, you can apparently ask her to produce one to order.

February 24, 2015


Today is the third anniversary of starting this blog, and this is post number 325. Thanks to all of our visitors over the past three years — we hope that the next year will be as productive as this past one has been.

I have summarized here some of the accumulated data, in order to document at least some of the productivity.

As of this morning, there have been 238,613 pageviews, with a median of 192 per day. The blog has continued to grow in popularity, with a median of 70 pageviews per day in the first year, 189 per day in the second year, and 353 per day in the third year. The range of pageviews was 172-1148 per day during this past year. The daily pattern for the three years is shown in the first graph.

Line graph of the number of pageviews through time, up to today.
The largest values are off the graph. The green line is the half-way mark.
The inset shows the mean (blue) and standard deviation of the daily number of pageviews.
There are a few general patterns in the data, the most obvious one being the day of the week, as shown in the inset of the above graph. The posts have usually been on Mondays and Wednesdays, and these two days have had the greatest mean number of pageviews.

Some of the more obvious dips include times such as Christmas - New Year; and the biggest peaks are associated with mentions of particular blog posts on popular sites.

Unfortunately, the data are also seriously skewed by visits from troll sites. These have been particularly from the Ukraine, which is solely responsible for the peak between days 900 and 1000. The smaller following peak represents visits from Taiwan.

The posts themselves have varied greatly in popularity, as shown in the next graph. It is actually a bit tricky to assign pageviews to particular posts, because visits to the blog's homepage are not attributed by the counter to any specific post. Since the current two posts are the ones that appear on the homepage, these posts are under-counted until they move off the homepage (after which they can be accessed only by a direct visit to their own pages, and thus always get counted). On average, 30% of the blog's pageviews are to the homepage, rather than to a specific post page, and so there is considerable under-counting.

Scatterplot of post pageviews through time, up to last week; the line is the median.
Note the log scale, and that the values are under-counted (see the text).
It is good to note that the most popular posts were scattered throughout the years. Keeping in mind the initial under-counting, the top collection of posts (with counted pageviews) have been:
  • The Music Genome Project is no such thing
  • Charles Darwin's unpublished tree sketches
  • The acoustics of the Sydney Opera House
  • Why do we still use trees for the dog genealogy?
  • How do we interpret a rooted haplotype network?
  • Carnival of Evolution, Number 52
  • Who published the first phylogenetic tree?
  • Phylogenetics with SpongeBob
  • Charles Darwin's family pedigree network
  • Faux phylogenies
  • Evolutionary trees: old wine in new bottles?
  • Network analysis of scotch whiskies
  • Tattoo Monday

This list is not very different to the same time last year. Posts 129 (which is linked in Wikipedia) and 172 continue to receive visitors almost every day.

The audience for the blog continues to be firmly in the USA. Based on the number of pageviews, the visitor data are:
  • United States
  • Ukraine [spurious]
  • United Kingdom

Finally, if anyone wants to contribute, then we welcome guest bloggers. This is a good forum to try out all of your half-baked ideas, in order to get some feedback, as well as to raise issues that have not yet received any discussion in the literature. If nothing else, it is a good place to be dogmatic without interference from a referee!

February 22, 2015


As a means of motivating his interest in speciation, in The Origin of Species Charles Darwin highlighted the diversity of morphological forms among the finches of the Galápagos Islands, in the south-eastern Pacific Ocean, which he visited while circumnavigating the world in The Beagle. He considered this to be a prime example of biodiversity related to adaptation and natural selection, what we would now call an adaptive radiation.

Recently, the following paper, which provides a genomic-scale study of these birds, has attracted considerable attention:
Lamichhaney S, Berglund J, Almén MS, Maqbool K, Grabherr M, Martinez-Barrio A, Promerová M, Rubin CJ, Wang C, Zamani N, Grant BR, Grant PR, Webster MT, Andersson L (2015) Evolution of Darwin's finches and their beaks revealed by genome sequencing. Nature 518: 371-375.

The authors note:
Darwin's finches are a classic example of a young adaptive radiation. They have diversified in beak sizes and shapes, feeding habits and diets in adapting to different food resources. The radiation is entirely intact, unlike most other radiations, none of the species having become extinct as a result of human activities.
Here we report results from whole genome re-sequencing of 120 individuals representing all Darwin's finch species and two closely related tanagers. For some species we collected samples from multiple islands. We comprehensively analyse patterns of intra- and inter-specific genome diversity and phylogenetic relationships among species. We find widespread evidence of inter-specific gene flow that may have enhanced evolutionary diversification throughout phylogeny, and report the discovery of a locus with a major effect on beak shape.

Sadly, the authors try to study the intra- and inter-specific variation principally using phylogenetic trees. They do this in spite of noting that:
Extensive sharing of genetic variation among populations was evident, particularly among ground and tree finches, with almost no fixed differences between species in each group.

Clearly, this situation requires a phylogenetic network for adequate study, as a network can always display at least as much phylogenetic information as a tree, and usually considerably more. The authors do recognize this:
A network constructed from autosomal genome sequences indicates conflicting signals in the internal branches of ground and tree finches that may reflect incomplete lineage sorting and/or gene flow ... We used PLINK to calculate genetic distance (on the basis of proportion of alleles identical by state) for all pairs of individuals separately for autosomes and the Z chromosome. We used the neighbour-net method of SplitsTree4 to compute the phylogenetic network from genetic distances.

However, this network is tucked away as Fig. 3 in the appendices. It is shown here in the first figure. The authors attribute the gene flow to introgression, but occasionally refer to hybridization and convergent evolution. Indeed, they suggest both relatively recent hybridization as well as the possibility of more ancient hybridization between warbler finches and other finches.
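The distance described in the quoted methods (one minus the proportion of alleles identical by state) can be illustrated with a minimal sketch. This is not PLINK's implementation. It assumes diploid genotypes coded as the count of the alternate allele (0, 1 or 2), with None for a missing call, and the function name and toy values are hypothetical.

```python
def ibs_distance(g1, g2):
    """Return 1 minus the proportion of alleles identical by state (IBS).

    g1, g2: genotypes for two diploid individuals, coded per site as the
    count of the alternate allele (0, 1 or 2); None marks a missing call.
    This coding is an assumption made for the sketch.
    """
    shared = 0      # alleles shared identical by state, summed over sites
    compared = 0    # alleles compared (two per usable site)
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip sites lacking data in either individual
        shared += 2 - abs(a - b)
        compared += 2
    if compared == 0:
        raise ValueError("no sites with data in both individuals")
    return 1 - shared / compared


# Toy genotypes (invented values, for illustration only).
bird_1 = [0, 1, 2, 2, None, 0]
bird_2 = [0, 2, 2, 1, 1, 0]
print(ibs_distance(bird_1, bird_2))
```

A matrix of such pairwise values is the kind of input that a distance-based method like neighbour-net takes.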

Clearly, this network is not particularly tree-like in places, especially with respect to the delimitation of species based on their morphology, as reflected in their current taxonomy. Nevertheless, the authors prefer to present their main result as a:
maximum-likelihood phylogenetic tree based on autosomal genome sequences ... We used FastTree to infer approximately maximum-likelihood phylogenies with standard parameters for nucleotide alignments of variable positions in the data set. FastTree computes local support values with the Shimodaira–Hasegawa test.

This tree is shown in the second figure.

This apparently well-supported tree is not a particularly accurate representation of the pattern shown by the network. Indeed, it makes clear just why it is inadequate to use a tree to study the interplay of intra- and inter-specific variation. Gene flow requires a network for accurate representation, not a tree.

The authors do acknowledge this situation. While they try to date the nodes on their tree, they do note that:
Although these estimates are based on whole-genome data, they should be considered minimum times, as they do not take into account gene flow.

Actually, in the face of gene flow the concept that a node has a specific date is illogical, because the nodes do not represent discrete events (see Representing macro- and micro-evolution in a network). Given the authors' final conclusion, it seems quite inappropriate to rely on trees rather than networks:
Evidence of introgressive hybridization, which has been documented as a contemporary process, is found throughout the radiation. Hybridization has given rise to species of mixed ancestry, in the past and the present. It has influenced the evolution of a key phenotypic trait: beak shape ... The degree of continuity between historical and contemporary evolution is unexpected because introgressive hybridization plays no part in traditional accounts of adaptive radiations of animals.

February 17, 2015


In biology we often distinguish microevolutionary events, which occur at the population level, from macroevolutionary events, which involve species. We have traditionally treated phylogenetics as a study of macroevolution. However, more recently there has been a trend to include population-level events, such as incomplete lineage sorting and introgression.

This is of particular importance for the resulting display diagrams. A phylogenetic tree was originally conceived to represent macroevolution. For example, speciation and extinction occur as single events at particular times, and these events apply to discrete groups of organisms. The taxa can be represented as distinct lineages in a tree graph, and the events by having these lineages stop or branch in the graph.

This idea is easily extended to phylogenetic networks, where the gene-flow events are also treated as singular, so that hybridization or horizontal gene transfer can be represented as single reticulations among the lineages.

These are sometimes called "pulse" events. However, there are also "press" events that are ongoing. That is, a lot of genetic variation is generated where populations repeatedly mix, so that every gene-flow instance is part of a continuous process of mixing. This often occurs, for example, in the context of isolation by distance, such as ring species or clinal variation. Under these circumstances, processes like introgression and HGT can involve ongoing events.

For instance, in an earlier life I once studied three species of plant in the Sydney region (Morrison DA, McDonald M, Bankoff P, Quirico P, Mackay D. 1994. Reproductive isolation mechanisms among four closely-related species of Conospermum (Proteaceae). Botanical Journal of the Linnean Society 116: 13-31). One of the species was ecologically isolated from the other two (it occurred in dry rather than damp habitats), and the other two were geographically isolated from each other (they occurred on separate sandstone uplands with a large valley in between). These species look very different from each other, as shown in the picture above, but looks are deceiving. Where the ecological isolation was incomplete, introgression occurred and admixed populations could be found.

These dynamics are more difficult to represent in a phylogenetic tree or network. We do not have discrete groups that can be represented by lines on a graph, but instead have fuzzy groups with indistinct boundaries. Furthermore, we do not have discrete events, but instead have ongoing (repeated) processes.

Nevertheless, it seems clear that there is a desire in modern biology to integrate macroevolutionary and microevolutionary dynamics in a single network diagram. That is, some parts of the diagram will represent pulse events involving discrete groups and other parts will represent press events among fuzzy groups. This situation seems to be currently addressed by practitioners by first creating a tree to represent the pulse events (and possibly their times), and then adding imprecisely located dashed lines as a representation of ongoing gene flow — see the example in Producing trees from datasets with gene flow. This particular mixture of precision and imprecision seems rather unsatisfactory.

Perhaps someone might like to have a think about this aspect of phylogenetic networks, to see if there is some way we can do better.

February 15, 2015


As usual at the beginning of the week, this blog presents something in a lighter vein.

Homologies lie at the heart of phylogenetic analysis. They express the historical relationships among the characters, rather than the historical relationships of the taxa. As such, homology assessment is the first step of a phylogenetic analysis, while building a tree or network is the second step.

With a colleague (Mike Crisp, now retired), I once wrote a tongue-in-cheek article about how to mis-interpret homologies, and the consequences of this for any subsequent tree-building analysis. This article appeared in 1989 in the Australian Systematic Botany Society Newsletter 60: 24–26. Since this issue of the Newsletter is not online, presumably no-one has read this article since then. However, you should read it, and so I have linked to a PDF copy [1.2 MB] of the paper:
An Hennigian analysis of the Eukaryotae

February 10, 2015


Recently, a number of computer programs have been released that are intended to produce phylogenetic networks representing introgression (or admixture) (see Admixture graphs – evolutionary networks for population biology).

A recent example of the use of these programs is presented by:
Jónsson H, Schubert M, Seguin-Orlando A, Ginolhac A, Petersen L, Fumagalli M, Albrechtsen A, Petersen B, Korneliussen TS, Vilstrup JT, Lear T, Myka JL, Lundquist J, Miller DC, Alfarhan AH, Alquraishi SA, Al-Rasheid KA, Stagegaard J, Strauss G, Bertelsen MF, Sicheritz-Ponten T, Antczak DF, Bailey E, Nielsen R, Willerslev E, Orlando L (2014) Speciation with gene flow in equids despite extensive chromosomal plasticity. Proceedings of the National Academy of Sciences of the USA 111: 18655-18660.

This study presents a phylogenetic analysis of the extant genomes of the genus Equus, the horses, asses and zebras. This analysis leads the authors to the conclusion that there is "evidence for gene flow involving three contemporary equine species despite chromosomal numbers varying from 16 pairs to 31 pairs." The gene flow is indicated by the light-blue reticulations in the first diagram.

One important issue with these types of analyses is the logic on which the procedure is based. Programs like TreeMix (used in this analysis) were developed to allow modelling of gene flow across the branches of trees at a microevolutionary (population) scale. Specifically, the graph generated by TreeMix models singular (pulse) introgression events in phylogenetic history.

The issue is that a tree is produced first, and then reticulations are added to it. The tree represents descent and the reticulations represent gene flow. But how do we produce a tree from a dataset that contains evidence of both descent and gene flow? The authors' initial tree is shown below.

The procedural logic works as follows:
(i) we assume that the traditionally recognized species exist
(ii) we assume that we have a representative sample of them, with one genome each
(iii) we construct a tree based on the assumption that there is no gene flow among the species
(iv) we then assess the species for gene flow, and discover it.

Isn't this rather circular? Surely (iv) invalidates the assumptions inherent in (i)-(iii)? How can we then assess the reliability of the sampling in (ii) and the analyses in (iii)? Why have we made assumption (i)? At best the species are fuzzy groups to one extent or another, and we do not know where we have sampled within the probabilistic space assigned to the groups.

This seems like a very poor way to go about studying the interaction between descent and gene flow. First we assume descent only, and then we assess gene flow. When we find gene flow we continue to accept the results of the initial analyses based on descent alone.

I would hate to have to justify this philosophy to someone outside phylogenetics, because I have a horrible feeling that they would either smile tolerantly or laugh outright.

This between-species situation is even more extreme for those within-species patterns where groups are recognized. Human races and domesticated breeds are two concepts that have received constant criticism. Neither races nor breeds form clear-cut groups, as there are no sharp boundaries between them, due to gene flow. Their "central locations" in genotype space are usually very different, however. Therefore it is quite possible to perform a tree-based analysis of samples from the central locations, and this would tell us a lot about descent. But it would tell us almost nothing about gene flow; and we would have a very distorted view of the phylogenetic history.

February 8, 2015


Over the past century a number of food styles have become internationalized, including hamburgers and fried chicken. Not all of these foodstuffs are nutritious, and some people have noted that not all of them are even particularly edible. However, perhaps the most interesting of these foods is the venerable pizza, not least because the customer has considerable say in what it looks and tastes like, but also because it is made and cooked fresh, right in front of us.

Pizza originated in Italy, Greece, or Persia, depending on how we define pizza. After all, covering flat bread with a topping is an idea that goes back a very long way. In the ancient world, the Egyptians made flat bread; the Indians baked bread in an oven, but without a topping; and the Persians cooked their bread without an oven, but they did put melted cheese on it. The Passion 4 Pizza site notes this more recent history: "The ancient Greeks had a flat bread called plakountos, on which they placed various toppings [eg. herbs, onion and garlic], and we know also that Naples was founded (as Neopolis) by the Greeks; and Naples is the home of the modern pizza."

In 16th century Naples, a yeast-based flat bread was referred to as a pizza, eaten by poor people as a street food; but the idea that led to modern pizza was the use of tomato as a topping. Tomatoes were introduced to Europe from South America in the 16th century, and by the 18th century it was common for the poor of the area around Naples to add tomato to their bread. Pizza was brought to the United States by the Italian immigrants in the late 19th century, and became popular in places like New York and Chicago.

Kenji López-Alt publishes The Pizza Lab, which is part of the Serious Eats blog, and he has taken a serious interest in pizza styles, at least in New York. He recognizes three main styles of pizza, based on their dough, the way it is treated, and the temperature at which it is cooked (see the picture above, left to right):
  • New York
  • Sicilian
  • Neapolitan
He also has several variants on these styles.

As a basis for discussion, I have analyzed the dough ingredients of these three styles, using a phylogenetic network as a tool for exploratory data analysis. To create the network, I first calculated the similarity of the pizzas using the Manhattan distance, and a Neighbor-net analysis was then used to display the between-dough similarities as a phylogenetic network. So, pizza-dough styles that are closely connected in the network are similar to each other based on their ingredients, and those that are further apart are progressively more different from each other.
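The first step of that analysis can be sketched as follows. The ingredient values below are invented purely for illustration (they are not the recipes analyzed here), and the names DOUGHS, manhattan and pairwise are hypothetical. The Neighbor-net step itself is not reproduced; the sketch only produces the pairwise distances that such a program would take as input.

```python
from itertools import combinations

# Hypothetical dough profiles, as percentages of flour weight.
# These numbers are made up for the example, not taken from any recipe.
DOUGHS = {
    "Dough A": {"water": 65.0, "salt": 2.0, "yeast": 1.5, "oil": 0.0, "sugar": 0.0},
    "Dough B": {"water": 67.0, "salt": 2.0, "yeast": 1.5, "oil": 5.0, "sugar": 2.0},
    "Dough C": {"water": 70.0, "salt": 2.5, "yeast": 1.0, "oil": 8.0},
}


def manhattan(p, q):
    """Sum of absolute differences over all ingredients (absent ones count as 0)."""
    ingredients = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in ingredients)


def pairwise(recipes):
    """Manhattan distance for every unordered pair of recipes."""
    return {
        (a, b): manhattan(recipes[a], recipes[b])
        for a, b in combinations(sorted(recipes), 2)
    }


for pair, dist in pairwise(DOUGHS).items():
    print(pair, dist)
```

Whether to rescale the ingredients before computing distances is a separate choice that this sketch does not make; as written, ingredients with larger raw values (such as water) dominate the sum.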

The Neapolitan-style dough is the simplest in terms of ingredients. The dough is not kneaded, but instead is allowed to rise for 3-5 days in the refrigerator, although it remains a thin-crust pizza. It is cooked quickly at a high temperature. The New York-style dough is an offshoot of this that is slightly thicker, and is cooked cooler and slower. The unkneaded dough stands in the fridge for only 1 day. Like all of the styles except the Neapolitan, it uses olive oil in the dough, but unlike any of the others it also contains sugar (to help the crust brown more evenly). The Sicilian-style dough is intended for a thick-crust pizza. It requires only a little kneading, after which it is allowed to rise for 2 hours at room temperature. It is essentially fried in olive oil while baking.

The Sfincione is the original Sicilian pizza style, thinner and chewier than the New York Sicilian. It is also cooked at a lower temperature. The Deep Pan pizza is, of course, another thick-crust style. It is allowed to rise for longer than the Sicilian, and is cooked at a higher temperature. The network shows that these all have closely related doughs.

The Greek-style pizza is allegedly a style "found mostly in the 'Pizza Houses' and 'Houses of Pizza' in New England". As shown by the reticulation in the network, it has characteristics of the Neapolitan pizza dough (relatively low water content) and the Sicilian (relatively high oil content). It is left to rise at room temperature overnight, and is cooked like the New York and Deep Pan pizzas.

There are many other pizza styles, of course, but I do not have recipes for them. For example, there is another Deep Dish style found in Chicago.

February 3, 2015


Computer simulations are an important part of phylogenetics, not least because people use them to evaluate analytical methods, for example for alignment strategies or network and tree-building algorithms.

For this reason, biologists often seem to expect that there is some close connection between simulation "experiments" and the performance of data-analysis methods in phylogenetics, and yet the experimental results often have little to say about the methods' performance with empirical data.

There are two reasons for the disconnection between simulations and reality, the first of which is tolerably well known. This is that simulations are based on a mathematical model, and the world isn't (in spite of the well-known comment from James Jeans that "God is a mathematician"). Models are simplifications of the world with certain specified characteristics and assumptions. Perhaps the most egregious assumption is that variation associated with the model involves independent and identically distributed (IID) random variables. For example, simulation studies of molecular sequences make the IID assumption, by generating substitutions and indels at random in the simulated sequences (called stochastic modeling). This IID assumption is rarely true, and therefore simulated sequences deviate strongly from real sequences, where variation occurs distinctly non-randomly and non-independently, both in space and time.
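To make the IID assumption concrete, here is a minimal sketch of the kind of stochastic sequence simulation described above. Every site receives a substitution independently and with the same probability. The function name and parameters are hypothetical, and indels are left out to keep the example short.

```python
import random

BASES = "ACGT"


def evolve_iid(seq, p_sub, rng):
    """Return a copy of seq in which each site is substituted independently.

    Every site has the same substitution probability p_sub, and a substituted
    site changes to one of the other three bases with equal chance. No site
    depends on its neighbours or on its own history, which is exactly the
    IID assumption.
    """
    out = []
    for base in seq:
        if rng.random() < p_sub:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(42)
ancestor = "".join(rng.choice(BASES) for _ in range(60))
descendant = evolve_iid(ancestor, 0.1, rng)
print(ancestor)
print(descendant)
```

Real sequences would need site-to-site rate variation, context dependence and lineage-specific effects, none of which this sketch contains.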

The second problem with simulations seems to be less well understood. This is that they are not intended to tell you anything about which data-analysis method is best. Instead, whatever analysis method matches the simulation model most closely will almost always do best, irrespective of any characteristics of the model.

To take a statistical example, consider assessing the t-test versus the Mann-Whitney test — this is the simplest form of statistical analysis, comparing two groups of data. If we simulate the data using a normal probability distribution, then we know a priori that the t-test will do best, because its assumptions perfectly match the model. What the simulation will tell us is how well the t-test does under perfect conditions; and indeed we find that its success is 100%. Furthermore, the Mann-Whitney test scores about 95%, which is pretty good. But we know a priori that it will do worse than the t-test; what we want to know is how much worse. All of this tells us nothing about which test we should use. It only tells us which method most closely matches the simulation model, and how close it gets to perfection. If we change the simulation model to one where we do not know a priori which analysis method is closest (eg. a lognormal distribution), then the simulation will tell us which it is.

This is what mathematicians intended simulations for — to compare methods relative to the models for which they were designed, and to deviations from those models. So, simulations evaluate models as much as methods. They will mainly tell you which model assumptions are important for your chosen analysis method. To continue the example, non-normality matters for the t-test when the null hypothesis being tested is true, but not when it is false. Instead, inequality of variances matters for the t-test when the null hypothesis is false. This is easily demonstrated using simulations, as it also is for the Mann-Whitney test. But does it tell you whether to use t-tests or Mann-Whitney tests?

This is not a criticism of simulations as such, because mathematicians are interested in the behaviour of their methods, such as their consistency, efficiency, power, and robustness. Simulations help with all of these things. Instead it is a criticism of the way simulations are used (or interpreted) by biologists. Biologists want to know about "accuracy" and about which method to use. Simulations were never intended for this.

To take a first phylogenetic example, people simulate sequence data under likelihood models, and then note that maximum likelihood tree-building does better than parsimony. Maximum likelihood matches the model better than parsimony, so we know a priori that it will do better. What we learn is how well maximum likelihood does under perfect conditions (it is some way short of 100%) and how well parsimony does relative to maximum likelihood.

As a second example, we might simulate sequence-alignment data with the gaps in multiples of three nucleotides. We then discover that an alignment method that puts gaps in multiples of three does better than ones that allow any size of gap. So what? We know a priori which method matches the model. What we don't know is how well it does (it is not 100%), and how close to it the other methods will get. But this is all we learn. We learn nothing about which method we should use.

So, it seems to me that biologists often over-interpret computer simulations. They are tempted to read more into the results than is there, rather than seeing them for what they are, which is simply an exploration of one set of models versus other models within the specified simulation framework. The results have little to say about the data-analysis methods' performance with empirical data in phylogenetics.

February 1, 2015


Here is a new collection of interesting tattoos.

For other examples of circular trees see Tattoo Monday, Tattoo Monday V and Tattoo Monday VII. For circular trees with pictures see Tattoo Monday II, and for DNA trees see Tattoo Monday IV. For other March of Progress tattoos see Tattoo Monday VIII.

January 27, 2015


We don't normally discuss individual papers in this blog (except as example datasets), but today I am simply drawing your attention to what appears to be a little-known paper on phylogenetic networks.

Naruya Saitou has not contributed much to the theory of networks, being instead best known for the development of the neighbor-joining method for phylogenetic trees (the 20th most cited paper ever; see Massive citations of bioinformatics in biology papers). However, this recent paper is of interest:
Naruya Saitou, Takashi Kitano (2013) The PNarec method for detection of ancient recombinations through phylogenetic network analysis. Molecular Phylogenetics and Evolution 66: 507-514.

The paper presents a new method for detecting ancient recombinations through phylogenetic network analysis. Recent recombinations are easily detectable using alternative methods, although splits graphs can also be used, but older recombinations are more tricky.

Importantly, I particularly like the opening paragraph of the paper:
The good old days of constructing phylogenetic trees from relatively short sequences are over. Reticulated or "non-tree" structures are omnipresent in genome sequences, and the construction of phylogenetic networks is now the default for describing these complex realities. Recombinations, gene conversions, and gene fusions are biological mechanisms to produce non-tree structures to gene phylogenies, while gene flow is a well known factor for creating reticulations within population phylogenies.

These are heart-warming words from the developer of the most commonly used tree-building method!

January 25, 2015


It might be nice to live in a world where the mere fact that you are male or female does not attract attention to you within your profession. But while we are waiting for that day, you might like to ask yourself about women in systematics. David Archibald suggests that the tree produced by Anna Maria Redfield is "the first tree – creationist or evolutionary – by a woman and may well be the only such tree by a woman until well into the twentieth century."

Anna Maria Redfield (1800-1888, née Treadwell) is described in these terms by Michon Scott's Strange Science web site:
Born at the dawn of the 19th century, Anna Maria Redfield earned the equivalent of a master's degree from the first U.S. institution of higher learning devoted to female students: Ingham University, and became perhaps the first woman to design a tree-like diagram of animal life. Although tree-like, her diagram didn't show common ancestry but instead showed the "embranchements" established by Georges Cuvier: vertebrates, arthropods, mollusks, and "radiata" (today classified as cnidarian and echinoderm phyla). To be fair, this diagram was published before Darwin's Origin of Species but later editions of her work made no mention of evolution either. Instead, she wrote about our simian cousins, "The teeth, bones and muscles of the monkey decisively forbid the conclusion that he could by any ordinary natural process, ever be expanded into a Man." Still, her elegant work is great fun to behold even now.

The tree-like diagram (shown in miniature above) was a wall chart (1.56 x 1.56 m) called A General View of the Animal Kingdom, published in 1857 by E.B. and E.C. Kellogg, New York. It is heavily illustrated with images of the taxa, their names, and brief notes: eg. "Man alone can articulate sounds, and is capable of improving his faculties or advancing his condition". Only three lithograph copies of the original tree are now known, one of which was sold at auction by Christie's in 2005 for £7,200.

The following year the same publishers produced a companion volume to the chart, called Zoölogical Science, or Nature in Living Forms: Adapted to Elucidate the Chart of the Animal Kingdom, and designed for the higher seminaries, common schools, libraries, and the family circle (1858, reprinted 1860, 1865, 1874). A copy is available in the Biodiversity Heritage Library. Only 57 original copies of the book are now known.

This book of 743 pages is richly illustrated, the artist being unacknowledged in the first edition but credited as E.D. Maltbie from then on. (He is presumably responsible for the chart as well.) The book has the frontispiece shown below, which is an edited version of the base of the tree.

Redfield and her chart have recently been discussed by Susan Butts (2011. Conservation of the Anna Maria Redfield wall chart: A General View of the Animal Kingdom. Society for the Preservation of Natural History Collections Newsletter 25(1): 18-19). She notes:
The wall chart is a masterpiece, with intricate and accurate illustrations of representatives of the animal kingdom portrayed as a Tree of Life, which illuminates the relationships of the major groups of organisms. It is an important document in the study of biology and in the pioneering work of women in science. The wall chart has eloquent phrases, which express a Victorian humanistic view of nature (often intermingled with anthropomorphism, biblical overtones, and the biological superiority of humans).

Redfield's views on evolution are clear from her book, indicating that the relationships shown represent affinity not evolution:
There is no evidence whatever that one species has succeeded, or been the result of transmutation of a former species.

Butts notes that unfortunately Redfield "remains a relatively minor and poorly recorded figure in the history of women in science, let alone biological and evolution studies in general."