The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis



November 25, 2014


This is the 300th post on this blog, and so I thought we might have a bit of a summary. Here is the early history of phylogenetic trees and networks as we currently know it. There may, of course, be as yet undetected sources. Details of each of these historical notes (including illustrations) can be found elsewhere in this blog; you can use the search feature in the right side-bar to find them.


Genealogies as pedigrees (the history of individuals) have a long history. For example, they appear in inscriptions concerning the pharaohs of Ancient Egypt, although these are very imprecise and have caused many headaches for modern scholars. They appear as chains of ancestors and descendants in the Old Testament of the Christian Bible, often contradicting each other and claiming impossible lifespans. Most importantly for modern usage, they were employed in the New Testament to legitimize Jesus as the messiah foretold in the Old Testament. The first known illustration of this appeared in c.400 AD, and it was actually a network, as there were two lineages leading to Jesus (via both Joseph and Mary).

The apparent success of this application (later called the Tree of Jesse, pictures of which started appearing in the 10th century) has meant that both royalty and the nobility have subsequently used pedigrees to assert their own right to be regal and noble. The first known illustration of this is from c.1000 AD, in which Cunigunde of Luxembourg's ancestry was traced in a tree-like manner to include Charlemagne, thus legitimizing her claim to being royal.

Also, up until 1215 AD marriage within seven degrees of separation was not allowed by the Christian church, and intestate inheritance applied the same relationship limit. So, a record of blood ties among relatives was often needed; and these started appearing in family bibles, for example. The first recorded tree-like illustrated pedigree was for Lambert of Saint-Omer, which appeared in 1122 AD in his personal copy of his book Liber Floridus.

It seems obvious, then, to also construct genealogies for groups of organisms, which we now call phylogenies (a word coined by Ernst Haeckel in 1866). The Great Chain of Being was for a long time the most popular iconography for relationships, mainly because it neatly tied in with the Christian philosophy of a chain of intellectual ideas, leading from pragmatic earthly concerns and culminating in the idealistic heavens. Humans were, of course, at the head of the chain of earthly beings, and capable of ascending to the heavens.

However, this did not work from a purely observational point of view. Observed pedigrees were not linear, but branched with each generation and often fused again via marriage. Furthermore, biodiversity (the patterns among groups of organisms) also seemed to have multiple relationships. This led Vitaliano Donati in 1750 (Della Storia Naturale Marina dell' Adriatico) to suggest that:
In addition, the links of the chain are joined in such a way within the links of another chain, that the natural progressions should have to be compared more to a net than to a chain, that net being, so to speak, woven with various threads which show, between them, changing communications, connections, and unions. [from the original Italian]

He was not alone in this thought, although others chose different metaphors. For example, Carl von Linné in 1751 (Philosophia Botanica) wrote this:
All plants show affinities on either side, like territories in a geographical map. [from the original Latin]

Neither author published a reticulating diagram to illustrate their thoughts, although one of Linné's students subsequently produced a version of his ideas in 1792 (Caroli a Linné, Praelectiones in Ordines Naturales Plantarum).

So, it was Georges-Louis Leclerc, Comte de Buffon, who produced the first empirical phylogeny in 1755 (Histoire Naturelle Générale et Particulière, Tome V). This was a network showing the evolutionary origin of domesticated dog breeds. This was followed by Antoine Nicolas Duchesne in 1766 (Histoire Naturelle des Fraisiers), who produced a network showing the evolutionary origin of strawberry cultivars. In both cases the evolutionary process illustrated by the reticulations in the network was hybridization. Note that both of these diagrams refer to within-species genealogies, rather than to relationships between species; and neither author seems to have contemplated the idea of among-species phylogenies.

Thus, in both theory and practice modern phylogenetic metaphors started as networks, not trees. It was Peter Simon Pallas in 1776 (Elenchus Zoophytorum) who first suggested using a tree as a simplified metaphor:
As Donati has already judiciously observed, the works of Nature are not connected in series in a Scale, but cohere in a Net. On the other hand, the whole system of organic bodies may be well represented by the likeness of a tree that immediately from the root divides both the simplest plants and animals, [but they remain] variously contiguous as they advance up the trunk, Animals and Vegetables; [from the original Latin]

Again, no diagram was forthcoming to illustrate this. It was Jean-Baptiste Pierre Antoine de Monet, Chevalier de Lamarck, who finally produced an empirical phylogeny in 1809 (Philosophie Zoologique). This was a small tree showing the evolutionary relationships among the major groups of animals. However, it represented what we would now call transformational evolution, as Lamarck did not believe in extinction, and thus he showed one group transforming into another. This differed from both Buffon and Duchesne, who were illustrating a process of increasing diversity of groups. It also differed by referring to supra-species relationships.

For the next 50 years, diagrams showing biodiversity relationships illustrated what we now call patterns of affinity, rather than showing historical relationships. These affinity diagrams showed apparent similarities among groups of organisms, without any implication that the relationships were the result of evolutionary history. The majority of these diagrams were networks rather than trees, indicating that groups of organisms had observed similarities with several other groups.

It is Charles Darwin and Alfred Russel Wallace who are credited with introducing, in 1858, the idea that natural selection could be the important process by which new species arise, although the idea of natural selection itself had been "in the air" for more than half a century with respect to within-species variation. (In the case of Patrick Matthew, he had also suggested a role in the origin of new species; 1831, On Naval Timber and Arboriculture; with Critical Notes on Authors who have Recently Treated the Subject of Planting).

As was by now becoming a tradition, neither Darwin nor Wallace (nor Matthew) produced a diagram to illustrate their thoughts. Darwin did draw a theoretical diagram in his subsequent 1859 book (On the Origin of Species by Means of Natural Selection), but he used it to illustrate continuity of evolutionary descent and the processes of extinction and diversification, rather than strictly as representing a phylogeny. His famous "Tree of Life" metaphor had nothing to do with the diagram (it was a Biblical metaphor, to stimulate the imagination of his readers).

The first person to get into print what we could call an empirical diagram representing Darwin's idea was Johann Friedrich Theodor Müller in 1864 (Für Darwin), who drew a small (three-species) tree of amphipods. This was followed by St George Jackson Mivart in 1865 (Contributions towards a more complete knowledge of the axial skeleton in the primates. Proceedings of the Zoological Society of London 33: 545-592). This was a much more extensive diagram illustrating possible evolutionary relationships among primate species (including humans) based solely on their body skeleton.

Confusion between trees and networks reappeared at this time. In particular, Franz Martin Hilgendorf had produced an unpublished PhD thesis in 1863 (Beiträge zur Kenntniß des Süßwasserkalkes von Steinheim) during which he constructed an empirical network of relationships among extinct snail species; but he rejected this because it did not match the Darwinian idea of an evolutionary tree. He later collected more data, and instead published a phylogenetic tree in 1866 (Planorbis multiformis im Steinheimer Süßwasserkalk: ein beispiel von gestaltveränderung im laufe der zeit).

Thus, we last saw an explicit evolutionary network in 1766, referring to within-species variation. The first person to publish an evolutionary network showing relationships among species was apparently Ferdinand Albin Pax in 1888 (Monographische übersicht über die arten der gattung Primula. Botanische Jahrbücher für Systematik, Pflanzengeschichte und Pflanzengeographie 10: 75-241). He produced 14 networks of various primula species, apparently showing affinity relationships, but three of these also illustrate hybridization, which is strictly an evolutionary process.


Genealogies appear in anthropology as well as in biology. Any human creation can be considered to have a history of "descent with modification" if copies are passed from generation to generation (eg. languages, books, tales). For our purposes here, the most important historical developments were in linguistics (language studies) and in stemmatology (manuscript studies).

Georg Stiernhielm appears to have been the first linguist to draw a genealogy, when he produced a small network of Germanic languages in 1671 (De Linguarum Origine Præfatio, the preface to his edition of Evangelia ab Ulfila Gothorum). This was followed by Félix Gallet in c.1800 (Arbre Généalogique des Langues Mortes et Vivantes), who produced a single broadsheet with a network of Indo-European languages.

Note that, as for biology, the modern metaphors started as networks, not trees. More importantly, note that Stiernhielm's diagram pre-dated Buffon's dog network by more than 80 years — evolutionary ideas were less revolutionary in linguistics than they were in biology.

Darwin explicitly noted a connection between language genealogies and biology genealogies in 1859. However, the first people to get into print what we could call empirical diagrams representing Darwin's idea did so before Darwin published anything on the subject. In 1853 František Ladislav Čelakovský published a tree depicting a history of the Slavic languages (Čtení o Srovnávací Mluvnici Slovanské na Universitě Pražskě), and Auguste Schleicher published one on the development of the Indo-Germanic language family (Die ersten Spaltungen des Indogermanischen Urvolkes. Allgemeine Monatsschrift für Wissenschaft und Literatur 1853: 786-787).

Stemmatology differs from linguistics and biology in first producing a tree rather than a network. Hans Samuel Collin and Carl Johan Schlyter produced this in 1827 (first volume of Corpus Iuris Sueo-Gotorum Antiqui), with a tree of relationships among hand-written copies of documents containing the Medieval laws of Sweden. This was also a tree that represented Darwin's genealogical idea, and so it may be considered to be the first one of that type to be published (ie. 26 years before Čelakovský and Schleicher, and 32 years before Darwin).

This early lead was followed by the first network in 1832, when Friedrich Wilhelm Ritschl's stemma of a book by Thomas Magister (Thomae Magistri sive Theoduli Monachi Ecloga vocum Atticarum) explicitly showed sources of contamination among the manuscript copies; that is, different parts of a manuscript were copied from different sources, rather than by strict ancestor-descendant copying.

Interestingly, the tree metaphor didn’t endure in anthropology as well as it did in biology. It was quickly replaced by alternative metaphors, such as wave, web, warp & weft, lattice and other continuously reticulating images. Horizontal flow of information has always been seen as a dominant force in anthropological histories.



1671 Georg Stiernhielm — small language network
1750 Vitaliano Donati — biology network suggestion
1751 Carl von Linné — biology map suggestion
1755 Georges-Louis Leclerc, Comte de Buffon — intra-species network
1766 Antoine Nicolas Duchesne — intra-species network
1792 Carl von Linné — map
1800 Félix Gallet — language network
1832 Friedrich Wilhelm Ritschl — small manuscript network
1863 Franz Martin Hilgendorf — unpublished inter-species network
1888 Ferdinand Albin Pax — inter-species network


1776 Peter Simon Pallas — biology tree suggestion
1809 Jean-Baptiste Pierre Antoine de Monet, Chevalier de Lamarck — small inter-species tree
1827 Hans Samuel Collin and Carl Johan Schlyter — manuscript tree
1853 František Ladislav Čelakovský — language tree
1853 Auguste Schleicher — language tree
1859 Charles Robert Darwin — generalized tree
1864 Johann Friedrich Theodor Müller — small inter-species tree
1865 St George Jackson Mivart — large inter-species tree
1866 Franz Martin Hilgendorf — large inter-species tree

November 23, 2014


Infographics have become very popular in recent decades, with the advent of computer graphics packages. Infographics combine data and pictures, trying to produce an aesthetically pleasing but still informative presentation of numeric information. Recently, the following book appeared:
The Infographic History of the World (2013)
by Valentina D’Efilippo & James Ball
HarperCollins (UK) / Firefly Books (US)

A selection of the infographics can be perused at the senior author's web page. In the blog, the author also explains her intentions:
The Infographic History of the World is a new book that continues to push the field of infographics forward. Our task required research, organization and the selection of topics. Then, we needed to decide how to display data in order to tell a coherent and compelling story. We have never considered this to be an alternative to tons of books of history, but hopefully a refreshing interpretation of what history is about. With this book, we hope to lead readers on a journey, to interpret the data and find the implications that resonate with them. We don’t pretend that every set of data presents an unquestionable truth. And, rather than looking to define the world’s history, we were looking to present readers with an unconventional interpretation of the subject.

Sadly, these good intentions have not always been achieved. As noted by a review at Amazon:
the book showcases *clever* ways of displaying data, not *clear* ways of displaying it ... Far too often I had to pore over the graphic to figure out what it was trying to say.

What is worse for the readers of this blog, the information is not always correct. Consider this version of the Tree of Life, which has a long-standing tradition in systematics as one of the world's first examples of an infographic:

Quite a number of the taxonomic labels are misplaced. You can check them for yourselves, but here is a selection of some of the surprising information contained in this infographic:
  • Amphibians are not Tetrapods
  • Humans are not Mammals
  • Mammals are not Amniotes
  • Turtles are not Reptiles
  • Lobe-finned fishes are not Sarcopterygians
  • Ray-finned fishes are not Bony Vertebrates
  • Charophytes are Land Plants
  • Hornworts are Vascular Plants
  • Ferns and Horsetails are not only Seed Plants, they are Gymnosperms
  • Conifers, Gnetophytes, Ginkgo and Cycads are not Gymnosperms
Clearly, little has been done to check the veracity of the information in this infographic, which completely defeats its purpose.

November 18, 2014


In a previous post I introduced the Great Stemma as the earliest known pedigree, being a genealogical view of biblical history (The first infographic was a genealogy). In it I noted that people were enclosed in circles, which were connected by lines showing relationships, much as we still do today. However, the lines combined marriage, parent-offspring and brotherly relationships without distinction. So, while it is a good first attempt, the Great Stemma leaves room for informational confusion, and this was not corrected at any time during its centuries of being copied. (In fact, confusion was increased through embellishments, deletions and modifications; but that is another story.)

To illustrate the potential problem of interpreting this early type of genealogy, I have included here a specific example.

The above excerpt from the Stemma shows the children of Jacob by his wife Leah (who is shown at the top centre), and their subsequent children (ie. Leah's grandchildren). I have annotated the diagram to show parent-offspring (P), brother (B) and half-brother (HB) relationships. Note that all relationships are between males unless specified otherwise (so, half-brothers have the same father).

Leah is at the top [generation 1], with her six sons in a row below her (in birth order left to right), and her daughter to the side [generation 2]. Below this is the first-born son of each of the sons [generation 3], followed in columns down the page by their later sons, in birth order. Sons by later relationships are shown as half-brothers. At the bottom are two of Leah's great-grandchildren [generation 4].

Thus, the genealogical diagram does not effectively separate the generations visually, and parental and fraternal relationships are depicted in the same way. These days we solve this, of course, by keeping each generation as a single row and linking each child directly to the parent. It is easy to get used to the Stemma way of doing it, because it is fairly consistent about the arrangement. If there is confusion, then each circle does specify the relationship in words.

So, as I noted, this is a good first attempt, but some of the things that we now feel need distinguishing were not distinguished by the (unknown) original author.

However, the 24 extant copies of the Stemma are not identical, and two of them try to fit more information into Leah's family tree than is shown above. This information concerns the origin of the fourth generation, which is accurately depicted as far as it goes, but the above figure leaves out a lot. Some of the extra information is shown in the Stemma version below, which adds two extra people, both of them wives. I have annotated this version the same as the previous one, except that this pedigree adds one more relationship to the mix — marriage (M).

The extra details come from Genesis 38, which describes a set of relationships that would make a modern television soap-opera scriptwriter jealous. The story goes something like this (I have indicated the named people with letters in the diagram above, with Leah as L):
Judah (J) marries the [unnamed] daughter (W) of Shua. Judah and his wife have three children, Er (E), Onan (O), and Shelah (S). Er marries Tamar (T), but God kills him because he "was wicked in the sight of the Lord" (Gen. 38:7). Tamar becomes Onan's wife in accordance with the custom of the time, but he too is killed by God after he refuses to father children for his older brother's childless widow, and "spills his seed on the ground" instead (Gen. 38:8-10). Although Tamar should marry Shelah, the remaining brother, Judah does not consent, for fear of his son's life (Gen. 38:11). In response, after Judah's wife has died, Tamar deceives Judah into having intercourse with her, by pretending to be a prostitute (Gen. 38:12-23). When Judah discovers that Tamar is pregnant he prepares to have her killed, but recants and confesses when he finds out that he is the father (Gen. 38:24-26). The result is twin boys, Zerah (Z) and Perez (P) (Gen. 38:27), who are accepted as Judah's sons.

Biblically, this story is important because Judah became the founder of the Tribe of Judah, one of the twelve Tribes of Israel. Their land encompassed most of the southern portion of the Land of Israel, including Jerusalem. Both the Book of Ruth and the Gospel of Matthew identify Tamar's son Perez as an ancestor of King David, which makes Judah and Tamar also ancestors of Jesus.

For our purposes here, though, the interesting thing is the confusion caused by trying to add the two marriage relationships to the pedigree. These are in no way distinguished visually from the paternal and fraternal relationships, although the circled text does specify the relationship in words. Today, we solve this potential confusion by using horizontal lines for marriage relationships and vertical lines for parent-offspring relationships.

Equally importantly, note that Tamar's (legal) relationship supplants the (biological) parent-offspring relationship between Judah and her sons — you would never conclude from the diagram that Perez was Judah's son, for example, rather than Er's. However, note the neat attempt to keep Tamar's children in a single column by putting one twin above her and one below (perhaps also signifying simultaneous birth).
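
The distinction that the Stemma leaves implicit can be made explicit by labelling each link with its relationship type. What follows is only a minimal Python sketch of that idea, not anything used by the scribes or by Piggin. The names LINKS, parents_of and spouses_of are invented here for illustration, "Wife" stands for Judah's unnamed wife (W in the figure), and the relationships are taken from the Genesis 38 summary above.

```python
# Hypothetical typed-link pedigree: each tuple is (person, person, relationship).
# For "parent" links the first person is the parent and the second the child.
LINKS = [
    ("Judah", "Wife", "marriage"),
    ("Er", "Tamar", "marriage"),
    ("Onan", "Tamar", "marriage"),
    ("Judah", "Er", "parent"),
    ("Wife", "Er", "parent"),
    ("Judah", "Onan", "parent"),
    ("Wife", "Onan", "parent"),
    ("Judah", "Shelah", "parent"),
    ("Wife", "Shelah", "parent"),
    ("Judah", "Perez", "parent"),
    ("Tamar", "Perez", "parent"),
    ("Judah", "Zerah", "parent"),
    ("Tamar", "Zerah", "parent"),
]


def parents_of(person):
    """Return the recorded parents of a person, in alphabetical order."""
    return sorted(a for a, b, kind in LINKS if kind == "parent" and b == person)


def spouses_of(person):
    """Return everyone linked to a person by a marriage link."""
    partners = set()
    for a, b, kind in LINKS:
        if kind == "marriage" and person in (a, b):
            partners.add(b if a == person else a)
    return sorted(partners)


print(parents_of("Perez"))  # ['Judah', 'Tamar']
print(spouses_of("Tamar"))  # ['Er', 'Onan']
```

Because marriage and parentage are stored as different link types, Tamar's marriages to Er and Onan cannot be mistaken for the parentage of her sons, which is exactly the confusion that the Stemma diagram invites.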

The above part of this post was inspired by a blog post from Jean-Baptiste Piggin (The Tamar Storyboard). The first picture above is from an unnamed manuscript in the Biblioteca Medicea Laurenziana, Florence, Plut.20.54, dated c. 1050 AD. The second picture is from an unnamed manuscript in the Pierpont Morgan Library, New York, M.644, dated 940-945 AD.

Moving on, the scribes of that time tried to go even further in complicating simple genealogies, as shown in the next figure. This was drawn by Stephanus Garsia Placidus, and is taken from the Saint-Sever Beatus in the Bibliothèque Nationale de France, Paris, ms. lat 8878, dated c. 1060 AD.

It shows the non-Semitic (ie. polytheistic) part of Noah's family. Noah is at the top right (sacrificing two doves), with his son Japheth (J) to the left and son Ham (H) below. Their wives (W) are indicated by intersecting circles, rather than by lines, which is a more successful approach than in the Stemma. Their descendants are shown in roughly the same style as above, with the first-born son followed by the later ones in order (so that the P and B relationships are not clearly distinguished) — Japheth has seven sons and Ham has four.

However, the illustrator has also tried to include a lot of history in this genealogy. For example, the sons of Ham's son Cush end with Nimrod (N), who has a small essay attached to his name. Among other things, he founded Babel, the city that plays an important role later in the Bible. Moreover, the sons of Ham's son Canaan (C) are shown as a reticulating network rather than as a simple chain. This apparently represents their roles as founders of the 11 tribes who originally occupied the ancient Land of Canaan, and who were later driven out and enslaved by the Israelites. These lines thus represent later history rather than parental or fraternal relationships.

This diagram is thus not a simple pedigree, as we would usually draw it today.

November 16, 2014


The New Testament was originally written in Greek, and it apparently did not occur to the writers that a visualization of the many (and lengthy) Biblical genealogies would be helpful. They knew a lot about geometry but nothing about infographics.

Given the importance of the New Testament genealogies for the foundation of Christianity (see The role of biblical genealogies in phylogenetics), it is not at all surprising that eventually someone had a go at summarizing them all in one place. However, this did not happen until several centuries later, when the Bible was being translated into Latin. Perhaps this delay had something to do with the biblical prohibition on images.

The first known attempt to draw a biblical pedigree, rather than writing out the relationships as text, also appears to have been the first attempt at a genealogy of any sort. Jean-Baptiste Piggin has been researching this document since 2009, and he has remarkably extensive notes about it at his web site Macro-Typography. Piggin dates the document to sometime in the decades before 427 AD, which is surprisingly early and thus unique in its historical context (Late Antiquity).

Importantly, the pedigree is actually an infographic in the modern sense, in that the figure itself conveys almost all of the information, with the text acting as a supplement. Thus, a single image allows the viewer to grasp the overview (of biblical history in this case), as well as providing access to the details. This is an idea that did not really catch on until the Medieval period, when Latin manuscripts started to use images as pedagogic devices, in addition to their textual descriptions. An obvious example is the so-called Tree of Porphyry in logic, which was first described in words by Porphyry of Tyre in c. 270 AD (Isagoge), sketched by Boëthius c. 520 AD (In Porphyrium Commentariorum), and finally reproduced as an actual tree diagram in Medieval manuscripts (being named arbor Porphyrii by Petrus Hispanus in 1240, in Summulae Logicales).

Sadly, there is no extant copy of this early biblical pedigree, and so we do not know who produced it or exactly when; nor do we have any of the copies made during the following 500 years. We do, however, have 24 complete or partial copies from the period 950-1250, many of them incorporated into Spanish editions of the Bible. Piggin has studied these copies extensively, and tried to reconstruct what he thinks the original document most probably looked like.

Piggin reconstructs the document (shown above), which he calls the Great Stemma, as a single scroll made from papyrus, designed to be unrolled and read from the upper left towards the middle right. All extant copies, however, break the figure up into sections, for inclusion as pages in a parchment manuscript (a codex) typical of the Medieval period.

Reconstruction was not an easy task, given the later modifications, digressions and embellishments made with each successive hand-drawn copy. In particular, the process of reducing the long scroll to sequential pages apparently introduced many errors; and subsequent modifications degraded the logic of the original intention. Incidentally, embellishments do not improve the communication of information (see Mistaken improvements), and neither, necessarily, do modifications, since in this case they often created contradictions.

Above is a schematic overview of the reconstructed original scroll, but you can zoom in to all of the details by visiting Piggin's original reconstruction. Each circle represents one person (out of 540), with connecting lines showing their genealogical relationships — marriage, parent-offspring or brotherly (these are inter-mixed). Time is read left to right along the top (Adam is at the top-left), with vertical excursions downwards for lineages that do not lead to Jesus (who is at the middle-right). Note that the pedigree is drawn using nodes and lines, as we still do, but it is not drawn anything like a tree (ie. a "family tree"). Indeed, it is actually a network, since two ancestral lineages converge on Jesus (via Joseph and Mary).

The diagram also has a distinct timeline superimposed, shown as the elements without circles, which attempts to synchronize biblical events with contemporaneous secular history. So, Piggin notes that the Stemma is "not just a genealogy, but a graphic version of the universal chronicles which attempted in antiquity to cross reference the histories of different civilizations to establish an overview of Middle Eastern and Graeco-Roman history." However, the timeline is not calibrated in any way (ie. time changes are not constant).

Below, I have included pages from some of the extant manuscripts, to show their variety after more than 500 years of scribes making copies.

The above figure is the first page from the Roda Codex, in the Real Academia de la Historia (Madrid) cod.78 (dated 990 AD). This is the start of the genealogy, with Adam at the top-left, and illustrating his family.

The above figure is the third page from an unnamed manuscript in the Pierpont Morgan Library (New York) M.644 (dated 940-945 AD). This one shows Noah and his non-Semite descendants.

The above figure is the final page from an unnamed manuscript in the Plutei collection at the Biblioteca Medicea Laurenziana (Florence) Plut.20.54 (dated 1050 AD). This shows the incarnation of Jesus, at the end of the genealogy, illustrating the confluence of the lineages described by Matthew (at the top) and Luke (at the bottom).

Piggin notes that there may actually have been few early copies of the Stemma, because of the difficulty of transcribing illustrations by hand. That is, it is very difficult to accurately hand-copy a diagram, as opposed to copying text (where only the words matter, not their visual style). Indeed, to what extent did the scribes actually understand that they needed a precise copy? Copying complex technical drawings requires careful measurement and layout, and yet some of the copies seem to have been very badly planned. Piggin suggests that "the serious corruption done to the Great Stemma early in its diffusion led to it ultimately being discarded and begun all over again by medieval writers such as Peter of Poitiers." The reference is to the Compendium Historiae in Genealogia Christi by Petrus Pictaviensis (Peter of Poitiers) produced in c.1185 AD, and for which there are many extant copies dated from that time to 1650 AD; he used long rolls for his genealogies.

Finally, Piggin even has a suggestion for a small ancient board game that might have provided inspiration for the form of the infographic (see Board Game). This is important, because there are no known prior models for constructing such a diagram — apart from geometry, no-one had previously produced an image that illustrated non-corporeal ideas.

Footnote: The word stemma referred originally to an ancient Roman genealogy (sometimes displayed in homes), which is roughly how it is used by Piggin. However, these days the word is more commonly used in anthropology to refer to a genealogy of manuscript copies. A genealogy of manuscripts is more properly called a stemma codicum.

November 11, 2014


The draft Minimum Information about a Phylogenetic Analysis standard (Leebens-Mack et al. 2006) suggests that all relevant information about each and every published phylogenetic analysis should be archived, so that it can be scrutinized by later researchers, either for validation or for re-use. The issues here are both preservation of the information (data and analysis protocols) and open access to it.

In this blog we have already pointed out that there has been criticism of the bioinformatics part of this archiving, where there have been repeated claims that many computer programs are poorly maintained (Poor bioinformatics?) as well as poorly archived (Archiving of bioinformatics software).

Anyone who has ever tried to get data out of a biologist will know that the data-related part of the standard is no better. My own success rate, at requesting data from all areas of biology not just phylogenetics, is less than 20% over the past 25 years. The responses have been, in order: (i) no response (>50%), (ii) "a student / postdoc / colleague has the data not me", and (iii) "I have moved recently and don't know where the data are". My most recent attempt, to get the data from Collard et al. (2006), was ultimately unsuccessful even after several attempts.

For phylogenetics, this situation has recently been quantified and analyzed by Magee et al. (2014). They tried to collect phylogenetic data (comprising nucleotide sequence alignment and tree files) from 217 published studies. Of these, 54 (25%) had at least some part of the data (alignment or tree) archived in an online repository, and 91 (42%) were obtained by direct solicitation, but in 72 (33%) of cases nothing could be obtained even after three requests. Overall, complete datasets (both tree and alignment) were available for only 40% of the studies.
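
The percentages quoted above can be re-derived from the raw counts. This is just a small Python check; only the counts come from the text, and the variable names are chosen here for illustration.

```python
# Counts as reported above for Magee et al. (2014); labels are illustrative.
TOTAL_STUDIES = 217
outcomes = {
    "archived online": 54,
    "obtained by request": 91,
    "not obtained": 72,
}


def percent(count, total):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


# The three outcomes should account for every study.
assert sum(outcomes.values()) == TOTAL_STUDIES

shares = {label: percent(n, TOTAL_STUDIES) for label, n in outcomes.items()}
for label, share in shares.items():
    print(f"{label}: {share}%")
```

The three categories sum to the 217 studies, and the rounded shares match the 25%, 42% and 33% given in the text.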

The authors note that the data were more likely to be deposited in online archives and/or shared upon request when the publishing journal has a strong data-sharing policy. Furthermore, there has been a positive impact of recent policy initiatives and infrastructural changes involving data repositories. The TreeBASE phylogenetic-data repository has existed for more than 20 years, but its use has been sporadic. However, the recent establishment of the Joint Data Archiving Policy by a consortium of journals, which requires the submission of data to online archives as a condition of publication, and the concomitant establishment of the Dryad repository for evolutionary and ecological data, have seen a surge in the archiving of data.

So, all in all, things have been no better on the bio side than the informatics side of bioinformatics.

Stoltzfus et al. (2012) have identified a number of possible barriers to successful data archiving, including lack of awareness of options and policies, perception that benefits do not justify burden, and an active desire to restrict data access. Importantly, there are also a number of practical issues even for those people who do wish to archive their data:
  • inconvenience of gathering complete data and metadata
  • inconvenience of format conversions needed for archiving
  • frustration when some data don't fit the archive's data model
  • poor and undocumented archive submission interfaces.
For the readers of this blog, issue three is possibly the most important one — all current repositories are based on a tree model for phylogenetics, and therefore network phylogenies are frustrating to deal with.

In order to improve the overall situation, there are explicit suggestions from Cranston et al. (2014) for best practices when archiving. They have ten simple guidelines that, if followed, will result in you providing open access to your data and analyses, even if the publishing journal does not force you to do it.

Footnote: I have been reminded that archiving data in PDF format is inappropriate. Trying to extract text (such as a dataset) from a PDF file can be difficult, because there is no standard format for storing the text. Consequently, different PDF readers will extract the text in different ways, and it is possible that in all cases the output will need extensive manual re-formatting, in order to recover the original text formatting that went into the PDF file. In my experience, Google Chrome may do the least-worst job.


Collard M, Shennan SJ, Tehrani JJ (2006) Branching, blending, and the evolution of cultural similarities and differences among human populations. Evolution and Human Behavior 27: 169-184.

Cranston K, Harmon LJ, O'Leary MA, Lisle C (2014) Best practices for data sharing in phylogenetic research. PLoS Currents Jun 19;6.

Leebens-Mack J, Vision T, Brenner E, Bowers JE, Cannon S, Clement MJ, Cunningham CW, dePamphilis C, deSalle R, Doyle JJ, Eisen JA, Gu X, Harshman J, Jansen RK, Kellogg EA, Koonin EV, Mishler BD, Philippe H, Pires JC, Qiu YL, Rhee SY, Sjölander K, Soltis DE, Soltis PS, Stevenson DW, Wall K, Warnow T, Zmasek C (2006) Taking the first steps towards a standard for reporting on phylogenies: Minimum Information About a Phylogenetic Analysis (MIAPA). OMICS 10: 231-237.

Magee AF, May MR, Moore BR (2014) The dawn of open access to phylogenetic data. PLoS One 9: e110268.

Stoltzfus A, O'Meara B, Whitacre J, Mounce R, Gillespie EL, Kumar S, Rosauer DF, Vos RA (2012) Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis. BMC Research Notes 5: 574.

November 9, 2014


Trees can be many things: objects, symbols, art, or information.

As objects, they act as homes and shelter, they provide food and oxygen, and they bind soil to hold topography in place. They even provide somewhere to sit while you are waiting to discover gravity. Their most famous use as symbols is the Tree of Life, which recurs in many cultures throughout the world. This was later extended to the Tree of Knowledge, a potent intellectual symbol throughout Western history. In the modern world this latter use has been expanded, so that trees are mathematical representations of the relationships among information.

Trees have also long played a role in art, which continues in the modern works of, for example, Vincent van Gogh and Gustav Klimt.

My first introduction to this was the book The Tree (1979, Aurum Press, UK / Little, Brown and Co, USA) by John Fowles (text) and Frank Horvat (photographs). This is a meditation on the connection between the natural world and human creativity. Horvat provides moody views of trees with (almost) no human objects in sight, and Fowles (the novelist) provides a provocative essay on trees as representations of art, revealing in his usual erudite manner that he particularly dislikes the "taming the wild" aspects of horticulture and science.

More recently, there has been the hand-lithographed book The Night Life of Trees (2006, Tara Books, Chennai, India). This contains a series of tribal-art images from three Gond people of central India (Bhajju Shyam, Durga Bai and Ramsingh Urveti). (And yes, the land of the Gond is Gondwanaland, which was the source of our name for the southern land masses.)

The Gond people have previously decorated their house walls and floors with traditional tattoos and motifs; and these motifs have made their way onto paper as modern representations of the tribal art form. Other tribal art forms that have followed a similar transformation include the Aboriginal art of Australia, which bears a strong stylistic resemblance to some of the Gond art.

The Gonds are traditionally forest dwellers, and so the lives of humans and trees have been seen as closely entwined. Their lore suggests that trees are hard at work during the day providing shelter and nourishment, but at night they finally rest and their spirits are revealed. It is these spirits that the artists have tried to capture in their book.

I have reproduced two of the images here, because it is clear that the inter-twining reveals a very network-like aspect of the trees. The accompanying text is taken from the book.

Snakes and Earth

The earth is held in the coils of the snake goddess. And the roots of trees coil around the earth too, holding it in place. If you want to depict the earth, you can show it in the form of a snake. It is the same thing.

The Binding Tree

Mahalain trees are found deep inside the thickest jungles, holding each other in a tight embrace. Because it clings and binds so well, Mahalain bark is known for its strength. Our ancestors from earliest times searched for it in the deep jungles and used it to build houses. A house built well with Mahalain bark is said to last a hundred years.

Both books are worth seeking out if you value art as well as science. The Gond book is now in its 9th hardback edition, and is widely available in bookstores. The Fowles book (without the photographs) is currently available as a 30th anniversary paperback edition; but you are better off finding a second-hand hardback with the pictures.

Finally, just by way of contrast, here is the Albero Trinità from Joachim of Fiore's Liber Figurarum (published in 1202), a book that uses many different visualizations to display human knowledge.

My daughter was the inspiration for writing this blog post.

November 5, 2014


For those of you who have missed it, the magazine Nature has recently looked at the 100 most highly cited science papers of all time (across all fields):
van Noorden R, Maher B, Nuzzo R (2014) The top 100 papers: Nature explores the most-cited research of all time. Nature 514: 550-553.

The list is dominated by biology papers, with biochemical laboratory techniques taking all of the top spots. However, it is also worth noting that bioinformatics papers make a very good showing, and so I have extracted 10 of them here.

If you have ever wondered what phylogenetic tree-building method is most used then it is at #20, while the most-used tree-building program is at #45 (having got there in only 7 years). You may also wonder why sequence alignment programs (#10 & #28 for Clustal; #12 & #14 for BLAST) do much better than tree-building programs (#45 for MEGA; #75 for GCG; #100 for MrBayes).

As for journals, the papers appeared in Nucleic Acids Research (4), Molecular Biology & Evolution (2), Bioinformatics (2), Journal of Molecular Biology (1) and Evolution (1). This list only partially matches their Journal Citation Reports current 5-Year Impact Factors: 8.378, 10.494, 6.968, 3.795 and 5.469, respectively.

Rank: 10 Citations: 40,289
Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
Thompson, J. D., Higgins, D. G. & Gibson, T. J.
Nucleic Acids Res. 22, 4673–4680 (1994).

Rank: 12 Citations: 38,380
Basic local alignment search tool.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J.
J. Mol. Biol. 215, 403–410 (1990).

Rank: 14 Citations: 36,410
Gapped BLAST and PSI-BLAST: A new generation of protein database search programs.
Altschul, S. F. et al.
Nucleic Acids Res. 25, 3389–3402 (1997).

Rank: 20 Citations: 30,176
The neighbor-joining method: A new method for reconstructing phylogenetic trees.
Saitou, N. & Nei, M.
Mol. Biol. Evol. 4, 406–425 (1987).

Rank: 28 Citations: 24,098
The CLUSTAL_X Windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools.
Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F. & Higgins, D. G.
Nucleic Acids Res. 25, 4876–4882 (1997).

Rank: 41 Citations: 21,373
Confidence limits on phylogenies: an approach using the bootstrap.
Felsenstein, J.
Evolution 39, 783–791 (1985).

Rank: 45 Citations: 18,286
MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0.
Tamura, K., Dudley, J., Nei, M. & Kumar, S.
Mol. Biol. Evol. 24, 1596–1599 (2007).

Rank: 75 Citations: 14,226
A comprehensive set of sequence analysis programs for the VAX.
Devereux, J., Haeberli, P. & Smithies, O.
Nucleic Acids Res. 12, 387–395 (1984).

Rank: 76 Citations: 14,099
MODELTEST: Testing the model of DNA substitution.
Posada, D. & Crandall, K. A.
Bioinformatics 14, 817–818 (1998).

Rank: 100 Citations: 12,209
MrBayes 3: Bayesian phylogenetic inference under mixed models.
Ronquist, F. & Huelsenbeck, J. P.
Bioinformatics 19, 1572–1574 (2003).

November 3, 2014


In a recent article (by myself, Leo van Iersel, Nela Lekić and Simone Linz) we stumbled upon the following problem which appears to touch upon some interesting biological issues.

A rooted triplet xy|z is a rooted binary tree in which x and y have a common parent p, p is a child of the root, and z is the other child of the root. A rooted phylogenetic tree T displays (informally: agrees with) xy|z if the common ancestor of x and y is a strict descendant of the common ancestor of x and z (or y and z). See the figure below: the tree on the right displays triplet xy|z.
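
The "displays" relation can be made concrete with a few lines of Python. This is only a minimal sketch of the definition above, not code from the article: it assumes a rooted tree stored as nested tuples with strings as leaf labels, and the function names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf labels below a node (a leaf is a string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def lca_cluster(tree, a, b):
    """Return the leaf set below the lowest common ancestor of leaves a and b."""
    node = tree
    while True:
        # Descend for as long as a single child still contains both a and b.
        for child in (() if isinstance(node, str) else node):
            if {a, b} <= leaves(child):
                node = child
                break
        else:
            return leaves(node)


def displays(tree, x, y, z):
    """True if the rooted tree displays the triplet xy|z."""
    xy = lca_cluster(tree, x, y)
    xz = lca_cluster(tree, x, z)
    # Both ancestors lie on the path from x to the root, so "strict descendant"
    # is the same as "proper subset of leaf sets".
    return xy < xz


tree = ((("a", "b"), "c"), "d")
print(displays(tree, "a", "b", "c"))  # True: ab|c
print(displays(tree, "a", "c", "b"))  # False: the tree shows ab|c, not ac|b
```

Recomputing leaf sets at every step is wasteful for large trees, but it keeps the sketch close to the wording of the definition.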

Suppose we are given a set of rooted triplets S on a set X of taxa. Suppose we have reason to believe that the set of triplets S have been obtained from different sources (e.g. genes), where the genes have different evolutionary histories due to reticulate phenomena. This means that, for a given subset of 3 taxa {x,y,z} from X, S will contain zero, one, two or three of the possible triplets {xy|z, xz|y, yz|x}.

Crucially, suppose we do not know which gene generated each triplet in S. This might sound artificial, but if some of the rooted triplets have been generated from phenotypic data, or have been obtained from inherently complex data (such as metagenomic data), then the genomic origins of the triplets might not be readily available.

Under such circumstances it is tempting to obtain a lower bound on the number of incongruent gene topologies by answering the following question. What is the minimum number of blocks that we can partition the triplets into, such that the triplets in each block are compatible with a tree (i.e. can all be displayed by the same tree)? It's easy to see that the worst case is when all $3\binom{n}{3}$ possible triplet topologies are present in S (three topologies for each of the $\binom{n}{3}$ subsets of three taxa), where n is the number of elements in X. Let tau(n) denote this worst case.
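
To experiment with this question, one first needs a test for whether a block of triplets is compatible. The sketch below uses the classical BUILD procedure of Aho et al. (1981), which is a standard technique and not necessarily what was used in the article. Triplets are written as tuples (x, y, z), read as xy|z, and all function names are hypothetical. The greedy partition at the end only gives an upper bound on the minimum number of blocks; it does not compute tau(n).

```python
from itertools import combinations
from math import comb


def components(taxa, edges):
    """Connected components of an undirected graph, via a small union-find."""
    parent = {t: t for t in taxa}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for t in taxa:
        groups.setdefault(find(t), set()).add(t)
    return list(groups.values())


def compatible(triplets, taxa=None):
    """True if some rooted tree displays every triplet (BUILD recursion)."""
    if taxa is None:
        taxa = {t for trip in triplets for t in trip}
    if len(taxa) <= 2:
        return True
    # Only triplets whose three taxa are all still present matter here.
    inside = [trip for trip in triplets if set(trip) <= taxa]
    parts = components(taxa, [(x, y) for x, y, _ in inside])
    if len(parts) == 1:
        return False  # no way to split the taxa at the root
    return all(compatible(inside, part) for part in parts)


def all_triplets(taxa):
    """All three topologies for every subset of three taxa."""
    out = []
    for a, b, c in combinations(sorted(taxa), 3):
        out += [(a, b, c), (a, c, b), (b, c, a)]
    return out


def greedy_partition(triplets):
    """Put each triplet in the first block it is compatible with (upper bound only)."""
    blocks = []
    for trip in triplets:
        for block in blocks:
            if compatible(block + [trip]):
                block.append(trip)
                break
        else:
            blocks.append([trip])
    return blocks


print(len(all_triplets("abcd")) == 3 * comb(4, 3))  # True: 12 triplets for n = 4
print(len(greedy_partition(all_triplets("abc"))))   # 3
```

For n = 3 the three topologies are pairwise incompatible, so three blocks are needed. For larger n the greedy result depends on the order in which the triplets are processed.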

We computed tau(n) exactly for small n. For n equal to 3 or 4, tau(n) is 3. For 5

October 28, 2014


It is well known that reticulations in phylogenetic networks can reflect variation in data sets from many sources, not only gene flow during evolutionary history. These other sources are presumably unwanted in the analysis when they are due to estimation errors. Such errors include incorrect data, inappropriate sampling, and model mis-specification.

For molecular data, one of the more obvious sources of model mis-specification is an incorrect multiple sequence alignment. This reflects wrong assessments of primary homology among the characters, so that the wrong residues are aligned in the columns. This particular issue seems not to have been addressed in the network literature in any systematic way.

However, it is obviously rather important. After all, who needs a phylogenetic network that reflects mis-alignment rather than evolutionary history? One approach to this issue would be to have some sort of measurement of our confidence in the alignment columns, which could be taken into account when the network is constructed.

One practical problem with this approach is that there has been a veritable cottage industry developing such measurements, which would need to be assessed for their suitability. So, I thought that I might list some of them here, along with a brief description of what they measure. The list is comprehensive but not necessarily exhaustive — it consists of ones for which there was at some stage a computer program (there are others that have never been named). Most of the methods are designed specifically for amino-acid sequences, so that not all of them can be used for nucleotides.

There are basically two types of measurement: (1) quantitative scoring schemes, which provide a reliability score for each aligned position, and (2) selection schemes, which select a subset of the aligned positions as being reliably aligned. So, I have divided the list roughly into these two groups.
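
The difference between the two types can be illustrated with a toy example. The sketch below is not the method of any program listed here: it assumes an unweighted Shannon entropy per column as the "score" and two arbitrary thresholds (hypothetical parameters `max_entropy` and `max_gaps`) for the "selection".

```python
import math
from collections import Counter


def column_scores(alignment, gap="-"):
    """Scoring scheme: (entropy in bits, gap fraction) for each alignment column."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        gap_fraction = 1 - len(residues) / len(column)
        total = len(residues)
        # Shannon entropy of the residue frequencies; 0 means fully conserved.
        entropy = sum((n / total) * math.log2(total / n)
                      for n in Counter(residues).values())
        scores.append((entropy, gap_fraction))
    return scores


def select_columns(alignment, max_entropy=0.5, max_gaps=0.5):
    """Selection scheme: indices of columns that pass both thresholds."""
    return [i for i, (entropy, gaps) in enumerate(column_scores(alignment))
            if entropy <= max_entropy and gaps <= max_gaps]


seqs = ["ACGT-A",
        "ACGA-A",
        "ACCT-C",
        "ACGTTA"]
print(select_columns(seqs))  # [0, 1]
```

A scoring scheme stops at `column_scores` and leaves the decision to the user, whereas a selection scheme also applies a cutoff. A network method could in principle use either: the scores as column weights, or the selection as a mask.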


Dopazo J (1997) A new index to find regions showing an unexpected variability or conservation in sequence alignments. Computer Applications in the Biosciences 13: 313-317.
— evolutionary index is based on conservativeness of amino acid differences as predicted from nucleotide differences

Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL-X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25: 4876-4882.
— quality is based on conservativeness of amino acid differences

Notredame C, Holm L, Higgins DG (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
— score represents consistency among global and local alignments

Pei J, Grishin NV (2001) AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics 17: 700-712.
— conservation is based on weighted entropy

Redelings BD, Suchard MA (2005) Joint Bayesian estimation of alignment and phylogeny. Systematic Biology 54: 401-418.
— approximate probability that the letter is homologous to the ancestral residue in its column

Lassmann T, Sonnhammer EL (2005) Automatic assessment of alignment quality. Nucleic Acids Research 33: 7120-7128.
— consistency based on overlap of alignments from several programs

HoT score
Landan G, Graur D (2007) Heads or tails: a simple reliability check for multiple sequence alignments. Molecular Biology and Evolution 24: 1380-1383.
— measures uncertainty due to co-optimal alignments

Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast Statistical Alignment. PLoS Computational Biology 5: e1000392.
— several scores based on HMM consistency, certainty, expected accuracy, expected sensitivity, expected specificity

Penn O, Privman E, Landan G, Graur D, Pupko T (2010) An alignment confidence score capturing robustness to guide tree uncertainty. Molecular Biology and Evolution 27: 1759-1767.
— robustness to guide tree uncertainty

Kim J, Ma J (2011) PSAR: measuring multiple sequence alignment reliability by probabilistic sampling. Nucleic Acids Research 39: 6359-6368.
— agreement with probabilistic sampling of suboptimal alignments

Wu M, Chatterji S, Eisen JA (2012) Accounting for alignment uncertainty in phylogenomics. PLoS One 7: e30288.
— pair Hidden Markov Model to model the sequence evolution and uses the model to calculate the posterior probabilities that residues of a column are correctly aligned

Chang J-M, Di Tommaso P, Notredame C (2014) TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Molecular Biology and Evolution 31: 1625-1637.
— transitive consistency score is an extended version of the Coffee scoring scheme


Martin MJ, González-Candelas F, Sobrino F, Dopazo J (1995) A method for determining the position and size of optimal sequence regions for phylogenetic analysis. Journal of Molecular Evolution 41: 1128-1138.
— locates the smallest blocks with similar pairwise genetic distances to the whole alignment

Castresana J (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution 17: 540-552.
— selected blocks are based on conservation of identity

Löytynoja A, Milinkovitch MC (2001) SOAP, cleaning multiple alignments from unstable blocks. Bioinformatics 17: 573-574.
— stability is measured with respect to variation in the Clustal gap-opening and gap-extension penalties

Thompson JD, Plewniak F, Ripp R, Thierry J-C, Poch O (2001) Towards a reliable objective function for multiple sequence alignments. Journal of Molecular Biology 314: 937-951.
— normalized mean distance is based on pairwise distances

Shift score
Cline M, Hughey R, Karplus K (2002) Predicting reliable regions in protein sequence alignments. Bioinformatics 18: 306-314.
— uses information from near-optimal alignments

Lawrence CJ, Zmasek CM, Dawe RK, Malmberg RL (2004) LumberJack: a heuristic tool for sequence alignment exploration and phylogenetic inference. Bioinformatics 20: 1977-1979.
— identifies blocks that have their phylogenetic tree being most similar to that of the whole alignment

Dress AW, Flamm C, Fritzsch G, Grünewald S, Kruspe M, Prohaska SJ, Stadler PF (2008) Noisy: identification of problematic columns in multiple sequence alignments. Algorithms in Molecular Biology 3: 7.
— identification of phylogenetically uninformative homoplastic sites from compatibilities in a circular split system

Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25: 1972-1973.
— proportion of sequences with a gap, level of amino acid similarity, level of consistency across different (user-provided) alignments

Blouin C, Perry S, Lavell A, Susko E, Roger AJ (2009) Reproducing the manual annotation of multiple sequence alignments using a SVM classifier. Bioinformatics 25: 3093-3098.
— support vector machine reproduces manual annotations from other alignments

Criscuolo A, Gribaldo S (2010) BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology 10: 210.
— calculates entropy-like scores weighted by similarity matrices

Kück P, Meusemann K, Dambach J, Thormann B, von Reumont BM, Wägele JW, Misof B (2010) Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Frontiers in Zoology 7: 10.
— consensus profiles identify dominating patterns of nonrandom similarity

Rajan V (2013) A method of alignment masking for refining the phylogenetic signal of multiple sequence alignments. Molecular Biology and Evolution 30: 689-712.
— compatible subsplits define clusters of sites which are then removed based on evolutionary rate

October 26, 2014


Charles Darwin and Alfred Russel Wallace are usually credited with independently developing the idea that natural selection could be the important process by which new species arise, although history has apportioned most of the fame to Darwin alone.

In the first edition of his most famous book Darwin (1859) cited no sources, and credited no-one except Thomas Malthus as a source of ideas. He was criticized for this, and from the third edition onwards he provided a historical essay mentioning a few more names.

The basic issue is that the idea of natural selection had been "in the air" for more than half a century, but only with respect to within-species variation. It was Darwin and Wallace who took the leap to consider between-species variation, on the basis that there is no historical boundary defining species — all individuals trace their ancestry back through a whole series of ancestors, including those who existed before the origin of their current species. That is, phylogenies trace back to the origin of life not just to the origin of each species.

So, who were the people who published, however briefly, a comment noting the idea of within-species natural selection? Joachim Dagg, of the Natural History Apostils blog, has recently been writing a series of posts discussing many of those publications that contain a clear description of selection. Here I have provided a convenient overview, in time order, with links to Joachim's blog for those of you who want more information.

Joseph Townsend
  • (1786, republished in 1817) A Dissertation on the Poor Laws, by a Well-wisher to Mankind. London: Ridgways.
— a brief mention of selection in relation to the Poor Laws, not organic evolution, but he seems to have inspired Thomas Malthus (1798) Essay on the Principle of Population, the critical work cited by both Darwin and Wallace (Malthus does not write about heritable variation, and therefore does not cover selection)
Link 1 - Link 2

James Hutton
  • (1794) Investigation of the Principles of Knowledge and of the Progress of Reason, from Sense to Science and Philosophy. Volume 2. Edinburgh: Strahan & Cadell. [section 13, chapter 3]
— advocated the idea of what we now call microevolution, especially in relation to agriculture, and suggested natural selection as the mechanism
Link 1

William Charles Wells
  • (1813) An Account of a White Female, Part of Whose Skin Resembles that of a Negro. [talk]
  • (1818) Two Essays: One Upon Single Vision with Two Eyes; the other on Dew. [plus] An Account of a Female of the White Race of Mankind, Part of Whose Skin Resembles that of a Negro. Edinburgh: Archibald Constable.
— a talk read before the Royal Society of London in 1813, and apparently referenced by Adams, but not put into print until 1818 — discusses selection in relation to human skin color
Link 1 - Link 2

Joseph Adams
  • (1814) A Treatise on the Supposed Hereditary Properties of Diseases. London: J. Callow.
— does not actually use the expression "selection" but briefly describes the process in relation to climate-related human variation, tucked away in the notes
Link 1 - Link 2 - Link 3

Patrick Matthew
  • (1831) On Naval Timber and Arboriculture; with Critical Notes on Authors who have Recently Treated the Subject of Planting. Edinburgh: Adam Black.
— explicitly used the phrase "natural process of selection" in relation to the origin of timber varieties, with a discussion tucked away in an appendix — as noted by Joachim Dagg, Matthew explicitly included the possible origin of new species via selection, thus being a literal predecessor of Darwin and Wallace, although they appear to have been unaware of his work
Link 1 - Link 2 - Link 3

John C. Loudon
  • (1832) [Book review of] Matthew, Patrick: On Naval Timber and Arboriculture; with Critical Notes on Authors who have recently treated the Subject of Planting. The Gardener's Magazine 8: 702-703.
— a book review mentioning Matthew's idea of natural selection (he was the only contemporary commenter to do so) and noted it explicitly as being concerned with "the origin of species and varieties"
Link 1 - Link 2

Edward Blyth
  • (1835) An attempt to classify the "varieties" of animals, with observations on the marked seasonal and other changes which naturally take place in various British species, and which do not constitute varieties. The Magazine of Natural History 8: 40-53.*
  • (1836) Observations on the various seasonal and other external changes which regularly take place in birds, more particularly in those which occur in Britain; with remarks on their great importance in indicating the true affinities of species; and upon the natural system of arrangement. The Magazine of Natural History 9: 393-409.*
  • (1837) On the psychological distinctions between man and all other animals; and the consequent diversity of human influence over the inferior ranks of creation, from any mutual or reciprocal influence exercised among the latter. The Magazine of Natural History, new series, 1: 1-9.*
— discusses the effects of artificial selection, but describes the process in nature as restoring organisms in the wild to their archetype (rather than forming new species)
Link 1

Herbert Spencer
  • (1852) A theory of population, deduced from the general law of animal fertility. Westminster Review 57: 468-501.
— published his article in order to show that the adaptedness or fitness of organisms results from the principle discussed by Malthus — Spencer later coined the expression "survival of the fittest" as a synonym of natural selection (in 1862)
Link 1

* Full title: The Magazine of Natural History and Journal of Zoology, Botany, Mineralogy, Geology, and Meteorology

October 21, 2014


Phylogenomics, the idea of applying genomic data to phylogenetic studies, has been around for quite a while now (Eisen 1998), although it was probably Rokas et al. (2003) who drew the first widespread attention among phylogeneticists. Molecular phylogenetics started off using the sequence of a single locus (often small-subunit rRNA) as the data, and slowly progressed from there to multiple loci. Currently, it is considered good practice to use half-a-dozen loci, sampling the main genomes (nucleus, mitochondrion, plastid); and genomics offers the possibility of a fast and cost-effective means of generating large amounts of multi-locus sequence data.

Review papers are beginning to appear based explicitly on next-generation sequencing (NGS), such as those of Lemmon & Lemmon (2013) and McCormack et al. (2013), replacing the earlier work of Philippe et al. (2005), and there are suggestions for how phylogenetic analyses might need to change in response to NGS data (Chan & Ragan 2013). These all treat phylogenomics as being very similar to traditional molecular phylogenetics, in the sense that many people are expecting phylogenomics to provide tree-like resolution of questions that remain unresolved with the current smaller datasets. In the words of Rokas et al. (2003), phylogenomics is intent on "resolving incongruence in molecular phylogenies". That is, incongruent gene trees are seen as the major obstacle to be overcome by phylogenetic data analysis (see also Jeffroy et al. 2006).

However, this might be a naive expectation. After all, the existing phylogenetic conflicts are there for a reason. If we cannot resolve certain parts of organismal history in terms of a phylogenetic tree when we use the current levels of multi-locus data (say

October 19, 2014


Some time ago I wrote a blog post about The bourbon family forest, which contained a collection of trees that, rather than being genealogical trees, instead showed the corporate ownership of American whiskey.

Here is a similar arrangement for "the six companies that make 50% of the world's beer", produced by David Yanofsky at the Quartz blog. As before, the vertical axis is actually a time scale, but the trees are only marginally family trees in the genealogical sense. Note that there is a reticulation between two of the trees for the "Scottish & Newcastle" entry, although this was apparently followed immediately by a subsequent divergence.

Nevertheless, roughly the same sort of information could actually be presented as proper genealogies. Here is an example from Philip Howard's blog, restricted to American beer. Note that the genealogies refer to the joining of branches through time, rather than their splitting. There are two reticulation events, one of which also refers to the "Scottish & Newcastle" entry.

It is also worth noting the use of other types of network by Philip Howard, to look at:

October 14, 2014


Periodically, mathematicians and other computationalists produce lists of what they refer to as "Open Problems" in their particular field. Phylogenetics is no exception. We have had a few on this blog before today (e.g. An open question about computational complexity; Phylogenetic network Millennium problems).

I thought that I should draw your attention to the fact that last year, Barbara Holland produced a few of her own (2013. The rise of statistical phylogenetics. Australian and New Zealand Journal of Statistics 55: 205-220). These are:

Open problem 1: What is the natural analogue of a confidence interval for a phylogenetic tree?

Open problem 2: What are useful residual diagnostics for phylogenetic models?

Open problem 3: What makes a good phylogenetic model?

Open problem 4: Should DAGs be acceptable objects for inference or should network methods be restricted to exploratory data analysis?

It is obviously the last of these problems that is of most interest to us here:
DAGs [directed acyclic graphs] can be constructed by beginning with a good tree and then progressively adding edges until the fit between the model and the data is deemed good enough or there is no sufficient improvement in fit by continuing to add edges. The trouble with using DAGs to define mixture models is that this approach doesn’t actually capture the biological processes of interest within the model. The sorts of things we’d like the data to tell us are what is the relative rate of recombination events or hybridisation events to mutation events or speciation events. The danger with using phylogenetic networks in an "add an extra edge until the fit is good enough" approach is that by giving ourselves the capacity to explain everything we risk explaining nothing. At some point have we stopped doing inference and got back to just summarising our data? In phylogenetics we rely on our models for their explanatory power — in the context of network evolution we need to make careful decisions about what biological processes should be included within the model such that inferences about reticulate (non-treelike) processes of evolution can be brought within the realm of stochastic uncertainty rather than being left as a source of inductive uncertainty. This is not a straightforward task, and will require the collaboration of evolutionary biologists and statisticians.

One of the principal issues here is that it is almost impossible to consistently distinguish one reticulation process from another based on the structure of the resulting network. These processes all produce gene flow in the biological world, and they all appear as reticulations in the graphical representation of a network. In practice, phylogenetic analysis may boil down to only two biological processes in the model (vertical gene inheritance and horizontal gene flow), followed by biologists trying to sort out the details with post hoc analyses.
Deep coalescence and gene duplication are part of the vertical inheritance, while hybridization, introgression, horizontal gene flow and recombination are part of gene flow. It would be nice to think that this model would simplify network analyses.

October 12, 2014


Some years ago Larisa Lehmer, Bruce Ragsdale, John Daniel, Edwin Hayashi and Robert Kvalstad published a medical report about an ingested plastic bag closure caught in someone's colon (Plastic bag clip discovered in partial colectomy accompanying proposal for phylogenic plastic bag clip classification. BMJ Case Reports 2011). This sounds quite painful.

What is more interesting, though, is that the report was accompanied by a phylogenetic and taxonomic evaluation of plastic ties in general, which the authors named Occlupanids.

Note that the proposed morphological changes in the phylogeny match Cope's Rule of phyletic size increase, as discussed in a previous blog post (Steven Jay Gould was wrong).

Shortly afterwards, one of the authors, John Daniel, set up a web page with a more detailed analysis, under the guise of the Holotypic Occlupanid Research Group (HORG).

Among a lot of other interesting information, there is a revised phylogenetic analysis.

Given the data, it seems fairly clear that the genealogical relationship among these objects is reticulate, and that the trees should thus actually be networks. This follows from the simple fact that these phylogenies are rather uninformative (they are bushes showing a few character transformation series). Also, note that contemporary taxa are ancestors, so that the diagrams are more like population networks than species networks.

These ties are used for packets of sliced bread (a relatively recent invention), and so there has been an explosion of Occlupanid forms as they occupy a new adaptive zone. This is a classic instance of recent speciation that is not yet complete. Occlupanids have now reached pest proportions, except where governments have instituted eradication programmes (such as Europe, where they are no longer found).

Part of the difficulty of analysis is that the objects shown constitute only a small part of the known diversity of Occlupanids (e.g. see this photo and this one). There are a number of manufacturers, and their products constitute separate historical lineages. Morphological features have been transferred from one lineage to another, which is a classic case of reticulate history that has not been taken into account in the above phylogenies.

Indeed, the HORG page is not the only detailed web resource about bread ties — see also the now-defunct but fascinating Transactoid page.

October 7, 2014


I noted recently that the best documented human genealogies are those for the various Anabaptist populations (including the Mennonites, Hutterites and Amish) (The importance of the Amish for reticulate genealogies). They have mostly closed populations (ie. marriages occur solely within a population), and they are thus inbred, and most importantly they maintain detailed written genealogies. This makes them ideal for genealogical studies involving reticulation, including being a source of "known" reticulate histories for testing network algorithms.

If we move outside of Homo sapiens then a genealogy that is equally well documented (if not better) is that of English Thoroughbred horses. This breed was developed as a result of the enthusiasm of the British aristocracy for racing in the 17th century. Thoroughbred pedigree records are regarded as the most comprehensive records detailing ancestral relationships among domestic animal breeds, and they have been formally catalogued since the appearance of the first edition of the General Stud Book in 1791.

As noted by Binns et al. (2011):
The Thoroughbred horse breed was established in England in the early 1700s based on crosses between stallions of Arabian origin and indigenous mares. The founder population was small, with all current males tracing back to one of three stallions, the Godolphin Arabian, the Byerley Turk and the Darley Arabian; in contrast, on the female side, about 70 foundation mares have been identified. A stud book for Thoroughbred horses was initiated in 1791, and pedigree records for the breed, which now number about five hundred thousand horses, are maintained by Thoroughbred registries worldwide.

For the males, the story is continued by Bower et al. (2012):
All living Thoroughbreds trace paternally to just three stallions imported into England in the late 17th and early 18th centuries: Byerley Turk (1680s), Darley Arabian (1704) and Godolphin Arabian (1729). Furthermore, a small number of stallions exerted disproportionate influence on early Classic races resulting in their greater popularity at stud. Therefore, the Thoroughbred gene pool has been restricted by small foundation stock and subsequent limited paternal contributions as a result of sire preference and selection. [Our] historic samples were related largely via the Darley Arabian sire line to which 95% of all living Thoroughbreds can be traced in their paternal lineage.

Actually, 95% of living Thoroughbreds trace their male lineage to Eclipse (1764), a great-great grandson of the Darley Arabian, so that it is Eclipse who appears as the progenitor in most published genealogies (eg. see the one below). Information about these early males is available at this Thoroughbred Heritage page.

Females have been of less interest to horse breeders, and so in many cases we do not know who they were, and in many others we have only a generic name (eg. "Miss Darcy's pet mare", "old Montagu mare", "royal mare", etc). This means that in modern horses there is a high level of mtDNA diversity due to multiple female lineages but there is very little sequence diversity on the Y chromosome (Wallner et al. 2013). Nevertheless, Hill et al. (2002) have tried to trace the influence of the early females on current genotypes, singling out 19 of them as having large influence (on the mitochondrial genealogy), while Bower et al. (2011) provide a broader analysis. Information about these early females is available at this Thoroughbred Heritage page.

The relevance of this information for genealogy studies is that it tells us the Thoroughbred genealogy is effectively closed (little outside breeding), and it is thoroughly documented. This is potentially another source of known reticulate genealogies.

Of particular interest to horse breeders is inbreeding (see Binns et al. 2012). In suitable doses this is seen as a Good Thing, because it can produce the homozygous appearance of desirable racing characteristics. However, inbreeding should not be too recent. For example, if we look at the list of the Blood-Horse Top 100 Thoroughbreds of the 20th Century then none of them have inbreeding in the previous generation and only one has inbreeding in the one before that. However, 54% of the horses have inbreeding in the fourth ancestral generation, and 18% in each of the third and fifth generations. Only 9 horses had no inbreeding during the five previous generations.
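To make the idea of "inbreeding in a given ancestral generation" concrete, here is a minimal Python sketch. It assumes a pedigree stored as a dictionary from each horse to its (sire, dam) pair; the horse names, the dictionary and the function names are all invented for illustration, and real stud-book data would need handling of unknown parents. It lists every ancestor that occurs on both the sire side and the dam side within five generations, along with the generations at which it occurs.

```python
from collections import defaultdict

# Hypothetical pedigree: horse -> (sire, dam). All names are invented.
PEDIGREE = {
    "Foal": ("SireA", "DamA"),
    "SireA": ("GrandsireX", "MareP"),
    "DamA": ("StallionQ", "MareR"),
    "StallionQ": ("GrandsireX", "MareS"),
}

def collect_ancestors(horse, pedigree, max_gen, gen, found):
    """Record each known ancestor of `horse` with the generation it occupies."""
    if gen > max_gen:
        return
    for parent in pedigree.get(horse, ()):
        found[parent].append(gen)
        collect_ancestors(parent, pedigree, max_gen, gen + 1, found)

def common_ancestors(horse, pedigree, max_gen=5):
    """Ancestors found on both the sire side and the dam side of `horse`."""
    sides = []
    for parent in pedigree[horse]:
        side = defaultdict(list)
        side[parent].append(1)  # the parent itself is generation 1
        collect_ancestors(parent, pedigree, max_gen, 2, side)
        sides.append(side)
    sire_side, dam_side = sides
    return {name: (sorted(sire_side[name]), sorted(dam_side[name]))
            for name in sire_side if name in dam_side}

print(common_ancestors("Foal", PEDIGREE))
# {'GrandsireX': ([2], [3])}
```

In this toy pedigree the shared ancestor sits in the second generation on the sire side and the third on the dam side, which breeders would usually write as "2x3" inbreeding.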

For this reason, the standard version of horse genealogies only goes back five generations. This is the stage at which the inbreeding coefficient becomes

October 5, 2014


There is a tolerably well-known exercise for illustrating the graphical superiority of a Non-Metric Multidimensional Scaling (NMDS) ordination over a Principal Components Analysis (PCA) ordination. The latter is often subject to distortions, so that the relative positions in the scatter-plot of points do not represent the original measured distances between those points (see the post Distortions and artifacts in Principal Components Analysis analysis of genome data). The exercise consists of using the geographical distances between locations on a map as the input distances to the analyses. The NMDS ordination will re-create the map quite accurately while the PCA ordination will usually not do so.

Some time ago I had the idea of doing this same exercise using a data-display network. Unfortunately, I was beaten to it by Barbara Holland (2013. The rise of statistical phylogenetics. Australian and New Zealand Journal of Statistics 55: 205-220). I will go ahead, anyway, disappointed though I am.

I have chosen the Ukraine as my map. The road distances between 25 of the cities were taken from Ukraine Connections (the same data occur on several other sites, as well).

The geographical data were processed in SplitsTree to produce both a Neighbor-Joining tree and a NeighborNet network.

If these techniques are to be effective as data displays, then the positions of the cities in the line graphs should be approximately the same as those in the map. This is, indeed, roughly so, although I had to spend some time manually adjusting the branch angles in the tree (for the best match). The two graphs are more rectangular in overall shape than is the Ukraine, which is somewhat closer to a square, but the relative locations of the points in the graphs do tell you where to look for the cities on the map.

However, the network is the better of the two representations on two grounds. First, the points are constrained to certain locations, and do not need manual adjustment. Second, the network more accurately gives a sense that these are road distances, and there are multiple roads from one city to another — the tree incorrectly implies that there is only one way to get between the cities.
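One way to see why road distances rarely fit a tree is the four-point condition: distances fit a tree exactly only if, for every four places, the two largest of the three pairwise sums are equal. The following Python sketch checks this. The two distance tables are invented for illustration (they are not the Ukraine data), and the function names are hypothetical.

```python
from itertools import combinations

def quartet_is_tree_like(dist, quartet, tol=1e-9):
    """Four-point condition: the two largest pairwise sums must be equal."""
    a, b, c, d = quartet
    sums = sorted([
        dist[frozenset((a, b))] + dist[frozenset((c, d))],
        dist[frozenset((a, c))] + dist[frozenset((b, d))],
        dist[frozenset((a, d))] + dist[frozenset((b, c))],
    ])
    return abs(sums[2] - sums[1]) <= tol

def is_tree_metric(dist, places):
    """True if every set of four places satisfies the four-point condition."""
    return all(quartet_is_tree_like(dist, q) for q in combinations(places, 4))

def make_dist(pairs):
    """Store each distance under an unordered pair of place names."""
    return {frozenset(k): v for k, v in pairs.items()}

# Invented example 1: four places at the corners of a square (map-like).
ROADS = make_dist({("A", "B"): 100, ("B", "C"): 100, ("C", "D"): 100,
                   ("A", "D"): 100, ("A", "C"): 140, ("B", "D"): 140})

# Invented example 2: path lengths on a tree with a single internal edge.
TREE = make_dist({("A", "B"): 3, ("C", "D"): 9, ("A", "C"): 8,
                  ("A", "D"): 9, ("B", "C"): 9, ("B", "D"): 10})

print(is_tree_metric(ROADS, "ABCD"))  # False
print(is_tree_metric(TREE, "ABCD"))   # True
```

When the condition fails, as it does for the square, any tree must distort some of the distances, whereas a network can keep the conflicting signals as parallel edges.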

September 30, 2014


It would be nice to think that genealogical history can be reconstructed with ease. However, this is known not to be so. In particular, being able to reconstruct an overall history from a collection of sub-histories, which can be thought of as the "building blocks", is not necessarily guaranteed.

That is, even given a complete collection of all of the sub-histories it is not necessarily possible to reconstruct a unique overall history. In other words, there can be pairs of graphs that do not represent the same evolutionary histories, but still display exactly the same collection of building blocks. ("Display" means roughly that a building block can be obtained by simply deleting some of the edges and vertices in the graph.) Mathematically, the sub-histories do not determine (or encode) the history.

For example, it is known that pedigrees cannot necessarily be reconstructed from a collection of all of the sub-pedigrees (Thatte 2008). Pedigrees are the traditional "family trees" showing the ancestry of individuals. Pedigrees differ from phylogenies in that all of the individuals have two parents (rather than possibly having a single immediate ancestor) and there are probably multiple roots (unless there is considerable inbreeding).

Phylogenetic trees, on the other hand, can be uniquely reconstructed from a collection of all of the possible sub-trees (see Dress et al. 2012). This is one of the things that makes trees valuable as a phylogenetic model: it is theoretically possible to collect enough information to construct a unique phylogenetic tree.

Rooted phylogenetic networks do not, however, share this property. For some time it has been known that networks cannot necessarily be built from their building blocks, whether those blocks are rooted trees (Willson 2011) or triplets (= rooted 3-taxon trees) or clusters (= rooted sub-trees = clades) (Gambette and Huber 2012).

This is illustrated in the next figure (adapted from Huber et al.), which shows two networks at the top and below that the four trees that are displayed by both of them (by deleting one of each pair of incoming edges at the two reticulation nodes). Given these four trees we cannot reconstruct a unique network, and yet they are the only four trees associated with either network.
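The idea of a network "displaying" a tree can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the small network is invented (it is not one of the networks in the figure), it is stored as a hypothetical child-to-parents dictionary, and each displayed tree is summarised by its set of non-trivial clusters. For every reticulation node the code keeps exactly one incoming edge, then reads off the clusters of the tree that remains.

```python
from itertools import product

# Hypothetical rooted network: child -> list of parents.
# Node "h" is a reticulation because it has two parents.
NETWORK = {
    "u": ["r"], "v": ["r"],
    "a": ["u"], "c": ["v"],
    "h": ["u", "v"],
    "b": ["h"],
}

def displayed_trees(parents):
    """Return each displayed tree as a frozenset of its non-trivial clusters."""
    parent_nodes = {p for ps in parents.values() for p in ps}
    leaves = set(parents) - parent_nodes
    retics = sorted(n for n, ps in parents.items() if len(ps) > 1)
    trees = set()
    # Choose one incoming edge to keep at every reticulation node.
    for choice in product(*(parents[h] for h in retics)):
        kept = {n: ps[0] for n, ps in parents.items()}
        kept.update(zip(retics, choice))
        children = {}
        for child, parent in kept.items():
            children.setdefault(parent, []).append(child)

        def cluster(node):
            """Leaves reachable below `node` once the other edges are deleted."""
            if node in leaves:
                return frozenset([node])
            below = [cluster(c) for c in children.get(node, [])]
            return frozenset().union(*below)

        clusters = {cluster(n) for n in set(kept.values())}
        trees.add(frozenset(cl for cl in clusters if 1 < len(cl) < len(leaves)))
    return trees

for tree in displayed_trees(NETWORK):
    print(sorted(sorted(cl) for cl in tree))
```

For this network the two displayed trees are the ones grouping a with b and b with c. Two different networks that return the same set from a function like this are exactly the kind of pair that cannot be told apart from their trees.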

To make matters worse, Huber et al. (in press) have now revealed that we can't reconstruct rooted phylogenetic networks even from sub-networks. To do this they show that networks cannot necessarily be built from trinets (= rooted 3-taxon networks). Certain types of networks (e.g. level-1, level-2, tree-child) can be reconstructed (van Iersel and Moulton 2014), but Huber et al. show the example in the second figure, which shows two networks at the top and below that the four trinets that are displayed by both of them. Given these four trinets we cannot reconstruct a unique network, and yet they are the only four trinets associated with either network.

This means that "even if all of the building blocks for some reticulate evolutionary history were to be taken as the input for any given network building method, the method might still output an incorrect history." The best analogy here is Humpty Dumpty — even given all of the pieces, we literally might not be able to put him back together again. We could if he is a rooted tree, but we cannot guarantee it if he is a rooted network or pedigree.

This may not matter in practice, given that we don't yet know the circumstances under which it is possible to uniquely reconstruct networks, but it does mean that we acquire a certain degree of uncertainty as we move from "tree thinking" to "network thinking".


Dress A, Huber KT, Koolen J, Moulton V, Spillner A (2012) Basic Phylogenetic Combinatorics. Cambridge University Press.

Gambette P, Huber K (2012) On encodings of phylogenetic networks of bounded level. Journal of Mathematical Biology 65: 157-180.

Huber KT, van Iersel L, Moulton V, Wu T (in press) How much information is needed to infer reticulate evolutionary histories? Systematic Biology.

van Iersel L, Moulton V (2014) Trinets encode tree-child and level-2 phylogenetic networks. Journal of Mathematical Biology 68: 1707-1729.

Thatte BD (2008) Combinatorics of pedigrees I: counterexamples to a reconstruction problem. SIAM Journal on Discrete Mathematics 22: 961-970.

Willson SJ (2011) Regular networks can be uniquely constructed from their trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8: 785-796.

September 28, 2014


Family pedigrees seem to be confusing things, because there are two distinct interpretations of the expression "family tree".

First, the pedigree tree could be drawn with a particular contemporary person at the root of the tree, so that the tree expands backwards in time to increasing numbers of ancestors at the leaves (ie. an "ascent tree"). In some ways this seems quite illogical as an analogy, given that the base of a real tree is the origin of its growth.

Second, the pedigree tree could be drawn with a particular ancestor at the root of the tree, so that the tree expands forwards in time to increasing numbers of descendants at the leaves (ie. a "descent tree"). This is more logical, although we often draw the root at the top. (The following example is actually a network, rather than strictly a tree; see also Pedigrees and phylogenies are networks not trees.)

Pedigrees are generally somewhat different from phylogenies, but in phylogenetics we do choose the latter option for interpreting trees — we start with a collection of contemporary leaves and try to reconstruct the tree backwards towards the common ancestor. Thus the root is at the "base" of the tree, even when we draw the root at the top of the diagram.

In popular usage these distinctions are often blurred. Consider this "family tree" of the Disney character Goofy. It is taken from Gilles R. Maurice's Calisota web page, where the character names are listed clearly.

This is based on the first usage described above, since Goofy himself is at the base and his ancestors are at the leaves. This is actually closer to a lineage than to a tree, especially as no females seem to be involved at any stage.

However, roughly the same information can be presented the other way around. This cartoon is taken from a different Calisota page.

Here, Goofy is now at the top of the tree and his ancestry proceeds downwards, with the oldest ancestor at the base (except for his son!). This really is confusing.

September 23, 2014


I have written before about How to interpret splits graphs. However, it is worth emphasizing a few points, so that people don't keep Mis-interpreting splits graphs.

A splits graph can potentially represent two main types of pattern. First, like a clustering analysis, it represents groups in the data that are in some way similar. Each group is represented by an explicit split in the graph (see Recognizing groups in splits graphs). The clusters may be hierarchically arranged (each group nested within another group), and they may overlap, so that objects can simultaneously be a member of more than one group. If the clusters do not overlap then the graph will be a tree.
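The statement that non-overlapping clusters give a tree can be checked mechanically. Two splits are compatible if at least one of the four intersections between their sides is empty, and a collection of splits fits on a tree only if every pair is compatible. Here is a minimal Python sketch, with invented taxa and hypothetical function names:

```python
from itertools import combinations

def make_split(side, taxa):
    """A split is an unordered pair: one side and everything else."""
    side = frozenset(side)
    return (side, frozenset(taxa) - side)

def compatible(split1, split2):
    """Compatible if some side of one split misses some side of the other."""
    return any(not (x & y) for x in split1 for y in split2)

def fits_on_a_tree(splits):
    """A set of splits can be drawn as a tree only if all pairs are compatible."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))

TAXA = "ABCDE"
nested = [make_split("AB", TAXA), make_split("ABC", TAXA)]
overlapping = nested + [make_split("BC", TAXA)]

print(fits_on_a_tree(nested))       # True
print(fits_on_a_tree(overlapping))  # False
```

In the second case the groups AB and BC overlap without one containing the other, so the splits graph needs a box (a pair of parallel edges) rather than a single branch.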

Second, like an ordination analysis, a splits graph can summarize the multi-dimensional neighborhoods of the different objects. That is, the relative distance between the points on the graph summarizes the relationships among the objects: closer objects, as measured along the edges of the graph, are more similar.

These two patterns often appear in the same splits graph. Unfortunately, many published papers mis-interpret neighborhoods as splits. If there is an explicit split representing a cluster of interest, then the data can be said to support that possible cluster. However, if no such split exists, then the graph is agnostic with respect to that cluster — there might be no support for it in the data, or the split might be left out of the graph because other splits out-weigh it. So, graph objects occupying a particular neighborhood might not be well-supported by the original data, contrary to the interpretation sometimes seen in the literature.

This can be illustrated with a specific example, taken from: Sicoli MA, Holton G (2014) Linguistic phylogenies support back-migration from Beringia to Asia. PLOS One 9: e91722.

The splits graph is a consensus network, summarizing all of the splits with at least 10% support in 3000 MCMC Bayesian trees. The authors note that the dashed line represents a "primary division" between the groups, and that the differently colored objects represent "clear groupings".

However, the dashed line is supported only by a small split, which has a larger contradictory split (that puts the North PCA group with the Plains-Apachean group). This split thus cannot be said to be well supported. Furthermore, the South Alaska grouping is not supported by any split shown in the graph (there are, however, two splits that combine uniquely to support it). That is, the South Alaska grouping represents a neighborhood rather than a supported cluster. Finally, the Alaska-Canada-1 grouping is also not supported by an uncontradicted split (ie. the tcb taa tau samples could as easily be part of the West Alaska grouping). All of the other identified groups are supported by unique and uncontradicted splits.

So, there are three types of pattern in this splits graph with respect to the groups of interest to the authors: uncontradicted splits, contradicted splits, and neighborhoods, representing good support, medium support and agnosticism, respectively. It is important to recognize these three possibilities, and to interpret them correctly with respect to "support" for any conclusions.
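The three cases can be expressed as a small decision procedure. The Python sketch below is an illustration only: the taxa and splits are invented (they are not taken from the Sicoli and Holton graph), the function names are hypothetical, and split weights are ignored. A group of interest is classified by asking whether the graph contains a split for it and, if so, whether any other split in the graph conflicts with that split.

```python
def make_split(side, taxa):
    """A split is an unordered pair: one side and everything else."""
    side = frozenset(side)
    return frozenset([side, frozenset(taxa) - side])

def compatible(split1, split2):
    """Compatible if some side of one split misses some side of the other."""
    return any(not (x & y) for x in split1 for y in split2)

def classify_group(group, splits, taxa):
    """Report how the splits in a graph relate to a proposed group."""
    target = make_split(group, taxa)
    if target not in splits:
        return "neighborhood"  # no split: the graph is agnostic
    if all(compatible(target, s) for s in splits if s != target):
        return "uncontradicted split"
    return "contradicted split"

TAXA = "ABCDEF"
SPLITS = [make_split(s, TAXA) for s in ("AB", "ABC", "BC")]

for group in ("ABC", "AB", "DE"):
    print(group, "->", classify_group(group, SPLITS, TAXA))
# ABC -> uncontradicted split
# AB -> contradicted split
# DE -> neighborhood
```

A fuller version would also compare the weights of a split and its contradicting splits, since a small split with a larger contradictory split deserves less confidence than the reverse.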

As an aside, I will point out that in the other splits graph in the same paper (a NeighborNet): the dashed line is not supported by any split, two of the colored groupings are not supported by any split, and two of the others have only a small contradicted split. Thus, the "primary division" and the "clear groupings" mostly represent neighborhoods, and are thus only dubiously supported.

September 21, 2014


I have commented before about the perceived tendency to resist thinking about evolutionary relationships as networks (Resistance to network thinking), and even to present reticulating evolutionary relationships as trees rather than as networks (The dilemma of evolutionary networks and Darwinian trees). Charles Darwin seems to be the guilty party in starting this phenomenon.

This behavior becomes particularly obvious when we consider family genealogies. A good example appears when we consider the family relationships of the Olympian gods of Ancient Greece. Several illustrations of these relationships are gathered together on the Olympian Gods Family Tree web page.

Noteworthy is the particularly frisky nature of Zeus, who "got around a bit", to put it mildly. As shown in the first diagram, Zeus was the offspring of Cronus and Rhea. However, he then fathered children with at least nine people, including two of his own sisters, an aunt, a first cousin, and several first cousins once removed, among others. This creates the complex network shown.

However, not everyone wants to draw family genealogies as reticulating networks. After all, they are usually called "family trees". As shown by the examples below, the most common way to reduce a network to a tree is simply to repeat people's names as often as necessary. That is, rather than have them appear once (representing their birth) with multiple reticulating connections representing their reproductive relationships, they appear repeatedly, once for their birth and once for each relationship, so that there are no reticulations. I will leave it to you to count how often Zeus appears in each of these so-called family trees.
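The amount of repetition needed can be estimated with a short Python sketch. It assumes the convention just described (one appearance for a person's birth plus one for each partner) and uses a deliberately small, simplified subset of the Olympian genealogy stored in a hypothetical child-to-parents dictionary; the diagrams on the web page contain many more relationships.

```python
# child -> (father, mother); a small, simplified subset of the genealogy.
PARENTS = {
    "Zeus": ("Cronus", "Rhea"),
    "Hera": ("Cronus", "Rhea"),
    "Demeter": ("Cronus", "Rhea"),
    "Ares": ("Zeus", "Hera"),
    "Persephone": ("Zeus", "Demeter"),
    "Apollo": ("Zeus", "Leto"),
    "Artemis": ("Zeus", "Leto"),
}

def partners_of(person, parents):
    """Everyone recorded as having a child with `person`."""
    return {mother if father == person else father
            for father, mother in parents.values()
            if person in (father, mother)}

def appearances_in_tree(person, parents):
    """Copies needed in a tree drawing: one for birth, one per partner."""
    born = 1 if person in parents else 0
    return born + len(partners_of(person, parents))

print(appearances_in_tree("Zeus", PARENTS))  # 4
print(appearances_in_tree("Hera", PARENTS))  # 2
```

In a network drawing each of these people would be a single node, with the extra relationships shown as reticulate connections instead of duplicated names.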

Clearly, this is misleading, and it makes no sense to obscure the fact that a so-called tree is actually a reticulate network. If relationships are reticulate then it is best to illustrate them that way, rather than to disguise the networks as trees.