The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis


October 21, 2014


Phylogenomics, the idea of applying genomic data to phylogenetic studies, has been around for quite a while now (Eisen 1998), although it was probably Rokas et al. (2003) who drew the first widespread attention among phylogeneticists. Molecular phylogenetics started off using the sequence of a single locus (often small-subunit rRNA) as the data, and slowly progressed from there to multiple loci. Currently, it is considered good practice to use half-a-dozen loci, sampling the main genomes (nucleus, mitochondrion, plastid); and genomics offers the possibility of a fast and cost-effective means of generating large amounts of multi-locus sequence data.

Review papers are beginning to appear based explicitly on next-generation sequencing (NGS), such as those of Lemmon & Lemmon (2013) and McCormack et al. (2013), replacing the earlier work of Philippe et al. (2005), and there are suggestions for how phylogenetic analyses might need to change in response to NGS data (Chan and Ragan 2013). These all treat phylogenomics as being very similar to traditional molecular phylogenetics, in the sense that many people are expecting phylogenomics to provide tree-like resolution of questions that remain unresolved with the current smaller datasets. In the words of Rokas et al. (2003), phylogenomics is intent on "resolving incongruence in molecular phylogenies". That is, incongruent gene trees are seen as the major obstacle to be overcome by phylogenetic data analysis (see also Jeffroy et al. 2006).

However, this might be a naive expectation. After all, the existing phylogenetic conflicts are there for a reason. If we cannot resolve certain parts of organismal history in terms of a phylogenetic tree when we use the current levels of multi-locus data (say

October 19, 2014


Some time ago I wrote a blog post about The bourbon family forest, which contained a collection of trees that, rather than being genealogical trees, instead showed the corporate ownership of American whiskey.

Here is a similar arrangement for "the six companies that make 50% of the world's beer", produced by David Yanofsky at the Quartz blog. As before, the vertical axis is actually a time scale, but the trees are only marginally family trees in the genealogical sense. Note that there is a reticulation between two of the trees for the "Scottish & Newcastle" entry, although this was apparently followed immediately by a subsequent divergence.

Nevertheless, roughly the same sort of information could actually be presented as proper genealogies. Here is an example from Philip Howard's blog, restricted to American beer. Note that the genealogies refer to the joining of branches through time, rather than their splitting. There are two reticulation events, one of which also refers to the "Scottish & Newcastle" entry.

It is also worth noting the use of other types of network by Philip Howard, to look at:

October 14, 2014


Periodically, mathematicians and other computationalists produce lists of what they refer to as "Open Problems" in their particular field. Phylogenetics is no exception. We have had a few on this blog before today (e.g. An open question about computational complexity; Phylogenetic network Millennium problems).

I thought that I should draw your attention to the fact that last year, Barbara Holland produced a few of her own (2013. The rise of statistical phylogenetics. Australian and New Zealand Journal of Statistics 55: 205-220). These are:

Open problem 1: What is the natural analogue of a confidence interval for a phylogenetic tree?

Open problem 2: What are useful residual diagnostics for phylogenetic models?

Open problem 3: What makes a good phylogenetic model?

Open problem 4: Should DAGs be acceptable objects for inference or should network methods be restricted to exploratory data analysis?

It is obviously the last of these problems that is of most interest to us here:
DAGs [directed acyclic graphs] can be constructed by beginning with a good tree and then progressively adding edges until the fit between the model and the data is deemed good enough or there is no sufficient improvement in fit by continuing to add edges. The trouble with using DAGs to define mixture models is that this approach doesn’t actually capture the biological processes of interest within the model. The sorts of things we’d like the data to tell us are what is the relative rate of recombination events or hybridisation events to mutation events or speciation events. The danger with using phylogenetic networks in an "add an extra edge until the fit is good enough" approach is that by giving ourselves the capacity to explain everything we risk explaining nothing. At some point have we stopped doing inference and got back to just summarising our data? In phylogenetics we rely on our models for their explanatory power; in the context of network evolution we need to make careful decisions about what biological processes should be included within the model such that inferences about reticulate (non-treelike) processes of evolution can be brought within the realm of stochastic uncertainty rather than being left as a source of inductive uncertainty. This is not a straightforward task, and will require the collaboration of evolutionary biologists and statisticians.

One of the principal issues here is that it is almost impossible to consistently distinguish one reticulation process from another based on the structure of the resulting network. These processes all produce gene flow in the biological world, and they all appear as reticulations in the graphical representation of a network. In practice, phylogenetic analysis may boil down to only two biological processes in the model (vertical gene inheritance and horizontal gene flow), followed by biologists trying to sort out the details with post hoc analyses. Deep coalescence and gene duplication are part of the vertical inheritance, while hybridization, introgression, horizontal gene flow and recombination are part of gene flow. It would be nice to think that this model would simplify network analyses.

October 12, 2014


Some years ago Larisa Lehmer, Bruce Ragsdale, John Daniel, Edwin Hayashi and Robert Kvalstad published a medical report about an ingested plastic bag closure caught in someone's colon (Plastic bag clip discovered in partial colectomy accompanying proposal for phylogenic plastic bag clip classification. BMJ Case Reports 2011). This sounds quite painful.

What is more interesting, though, is that the report was accompanied by a phylogenetic and taxonomic evaluation of plastic ties in general, which the authors named Occlupanids.

Note that the proposed morphological changes in the phylogeny match Cope's Rule of phyletic size increase, as discussed in a previous blog post (Steven Jay Gould was wrong).

Shortly afterwards, one of the authors, John Daniel, set up a web page with a more detailed analysis, under the guise of the Holotypic Occlupanid Research Group (HORG).

Among a lot of other interesting information, there is a revised phylogenetic analysis.

Given the data, it seems fairly clear that the genealogical relationship among these objects is reticulate, and that the trees should thus actually be networks. This follows from the simple fact that these phylogenies are rather uninformative (they are bushes showing a few character transformation series). Also, note that contemporary taxa are ancestors, so that the diagrams are more like population networks than species networks.

These ties are used for packets of sliced bread (a relatively recent invention), and so there has been an explosion of Occlupanid forms as they occupy a new adaptive zone. This is a classic instance of recent speciation that is not yet complete. Occlupanids have now reached pest proportions, except where governments have instituted eradication programmes (such as Europe, where they are no longer found).

Part of the difficulty of analysis is that the objects shown constitute only a small part of the known diversity of Occlupanids (e.g. see this photo and this one). There are a number of manufacturers, and their products constitute separate historical lineages. Morphological features have been transferred from one lineage to another, which is a classic case of reticulate history that has not been taken into account in the above phylogenies.

Indeed, the HORG page is not the only detailed web resource about bread ties — see also the now-defunct but fascinating Transactoid page.

October 7, 2014


I noted recently that the best documented human genealogies are those for the various Anabaptist populations (including the Mennonites, Hutterites and Amish) (The importance of the Amish for reticulate genealogies). They have mostly closed populations (i.e. marriages occur solely within a population), and they are thus inbred, and most importantly they maintain detailed written genealogies. This makes them ideal for genealogical studies involving reticulation, including being a source of "known" reticulate histories for testing network algorithms.

If we move outside of Homo sapiens then a genealogy that is equally well documented (if not better) is that of English Thoroughbred horses. This breed was developed as a result of the enthusiasm of the British aristocracy for racing in the 17th century. Thoroughbred pedigree records are regarded as the most comprehensive records detailing ancestral relationships among domestic animal breeds, and they have been formally catalogued since the appearance of the first edition of the General Stud Book in 1791.

As noted by Binns et al. (2011):
The Thoroughbred horse breed was established in England in the early 1700s based on crosses between stallions of Arabian origin and indigenous mares. The founder population was small, with all current males tracing back to one of three stallions, the Godolphin Arabian, the Byerley Turk and the Darley Arabian; in contrast, on the female side, about 70 foundation mares have been identified. A stud book for Thoroughbred horses was initiated in 1791, and pedigree records for the breed, which now number about five hundred thousand horses, are maintained by Thoroughbred registries worldwide.

For the males, the story is continued by Bower et al. (2012):
All living Thoroughbreds trace paternally to just three stallions imported into England in the late 17th and early 18th centuries: Byerley Turk (1680s), Darley Arabian (1704) and Godolphin Arabian (1729). Furthermore, a small number of stallions exerted disproportionate influence on early Classic races resulting in their greater popularity at stud. Therefore, the Thoroughbred gene pool has been restricted by small foundation stock and subsequent limited paternal contributions as a result of sire preference and selection. [Our] historic samples were related largely via the Darley Arabian sire line to which 95% of all living Thoroughbreds can be traced in their paternal lineage.

Actually, 95% of living Thoroughbreds trace their male lineage to Eclipse (1764), a great-great grandson of the Darley Arabian, so that it is Eclipse who appears as the progenitor in most published genealogies (e.g. see the one below). Information about these early males is available at this Thoroughbred Heritage page.

Females have been of less interest to horse breeders, and so in many cases we do not know who they were, and in many others we have only a generic name (e.g. "Miss Darcy's pet mare", "old Montagu mare", "royal mare", etc). This means that in modern horses there is a high level of mtDNA diversity due to multiple female lineages but there is very little sequence diversity on the Y chromosome (Wallner et al. 2013). Nevertheless, Hill et al. (2002) have tried to trace the influence of the early females on current genotypes, singling out 19 of them as having large influence (on the mitochondrial genealogy), while Bower et al. (2011) provide a broader analysis. Information about these early females is available at this Thoroughbred Heritage page.

The relevance of this information for genealogy studies is that it tells us the Thoroughbred genealogy is effectively closed (little outside breeding), and it is thoroughly documented. This is potentially another source of known reticulate genealogies.

Of particular interest to horse breeders is inbreeding (see Binns et al. 2012). In suitable doses this is seen as a Good Thing, because it can produce the homozygous appearance of desirable racing characteristics. However, inbreeding should not be too recent. For example, if we look at the list of the Blood-Horse Top 100 Thoroughbreds of the 20th Century then none of them have inbreeding in the previous generation and only one has inbreeding in the one before that. However, 54% of the horses have inbreeding in the fourth ancestral generation, and 18% in each of the third and fifth generations. Only 9 horses had no inbreeding during the five previous generations.

For this reason, the standard version of horse genealogies only goes back five generations. This is the stage at which the inbreeding coefficient becomes

October 5, 2014


There is a tolerably well-known exercise for illustrating the graphical superiority of a Non-Metric Multidimensional Scaling (NMDS) ordination over a Principal Components Analysis (PCA) ordination. The latter is often subject to distortions, so that the relative positions in the scatter-plot of points do not represent the original measured distances between those points (see the post Distortions and artifacts in Principal Components Analysis analysis of genome data). The exercise consists of using the geographical distances between locations on a map as the input distances to the analyses. The NMDS ordination will re-create the map quite accurately while the PCA ordination will usually not do so.

Some time ago I had the idea of doing this same exercise using a data-display network. Unfortunately, I was beaten to it by Barbara Holland (2013. The rise of statistical phylogenetics. Australian and New Zealand Journal of Statistics 55: 205-220). I will go ahead, anyway, disappointed though I am.

I have chosen the Ukraine as my map. The road distances between 25 of the cities were taken from Ukraine Connections (the same data occur on several other sites, as well).

The geographical data were processed in SplitsTree to produce both a Neighbor-Joining tree and a NeighborNet network.

If these techniques are to be effective as data displays, then the positions of the cities in the line graphs should be approximately the same as those in the map. This is, indeed, roughly so, although I had to spend some time manually adjusting the branch angles in the tree (for the best match). The two graphs are more rectangular in overall shape than is the Ukraine, which is somewhat closer to a square, but the relative locations of the points in the graphs do tell you where to look for the cities on the map.

However, the network is the better of the two representations on two grounds. First, the points are constrained to certain locations, and do not need manual adjustment. Second, the network more accurately gives a sense that these are road distances, and there are multiple roads from one city to another — the tree incorrectly implies that there is only one way to get between the cities.
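One way to make the second point concrete is the four-point condition: a set of distances fits a tree exactly only if, for every four objects, the two largest of the three pairwise sums are equal. The sketch below is only an illustration under stated assumptions. The function names and the two toy distance sets are hypothetical, and they are not the Ukrainian road data.

```python
from itertools import combinations


def symmetric(pairs):
    """Build a symmetric nested dict from {(a, b): distance} (hypothetical helper)."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist


def four_point_violations(names, dist, tol=1e-9):
    """Return the quartets whose distances cannot be fitted exactly by a tree."""
    bad = []
    for a, b, c, d in combinations(names, 4):
        sums = sorted([
            dist[a][b] + dist[c][d],
            dist[a][c] + dist[b][d],
            dist[a][d] + dist[b][c],
        ])
        # Tree-like distances: the two largest sums are equal.
        if sums[2] - sums[1] > tol:
            bad.append((a, b, c, d))
    return bad


# Toy example 1: distances read off a tree ((A,B),(C,D)) with
# pendant edges of length 1 and an internal edge of length 2.
tree_like = symmetric({
    ("A", "B"): 2, ("C", "D"): 2,
    ("A", "C"): 4, ("A", "D"): 4, ("B", "C"): 4, ("B", "D"): 4,
})

# Toy example 2: four towns on a ring road with sides of length 1,
# so there are two equally short routes between opposite towns.
road_like = symmetric({
    ("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1, ("A", "D"): 1,
    ("A", "C"): 2, ("B", "D"): 2,
})

print(four_point_violations("ABCD", tree_like))
print(four_point_violations("ABCD", road_like))
```

The ring-road distances fail the condition, which is the kind of signal that a network can display and a tree cannot.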

September 30, 2014


It would be nice to think that genealogical history can be reconstructed with ease. However, this is known not to be so. In particular, being able to reconstruct an overall history from a collection of sub-histories, which can be thought of as the "building blocks", is not necessarily guaranteed.

That is, even given a complete collection of all of the sub-histories it is not necessarily possible to reconstruct a unique overall history. In other words, there can be pairs of graphs that do not represent the same evolutionary histories, but still display exactly the same collection of building blocks. ("Display" means roughly that a building block can be obtained by simply deleting some of the edges and vertices in the graph.) Mathematically, the sub-histories do not determine (or encode) the history.

For example, it is known that pedigrees cannot necessarily be reconstructed from a collection of all of the sub-pedigrees (Thatte 2008). Pedigrees are the traditional "family trees" showing the ancestry of individuals. Pedigrees differ from phylogenies in that all of the individuals have two parents (rather than possibly having a single immediate ancestor) and there are probably multiple roots (unless there is considerable inbreeding).

Phylogenetic trees, on the other hand, can be uniquely reconstructed from a collection of all of the possible sub-trees (see Dress et al. 2012). This is one of the things that makes trees valuable as a phylogenetic model: it is theoretically possible to collect enough information to construct a unique phylogenetic tree.

Rooted phylogenetic networks do not, however, share this property. For some time it has been known that networks cannot necessarily be built from their building blocks, whether those blocks are rooted trees (Willson 2011) or triplets (= rooted 3-taxon trees) or clusters (= rooted sub-trees = clades) (Gambette and Huber 2012).

This is illustrated in the next figure (adapted from Huber et al.), which shows two networks at the top and below that the four trees that are displayed by both of them (by deleting one of each pair of incoming edges at the two reticulation nodes). Given these four trees we cannot reconstruct a unique network, and yet they are the only four trees associated with either network.
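The idea of the trees "displayed" by a network can be sketched in a few lines of code. This is a minimal illustration, assuming a network stored as a list of (parent, child) edges; the function names and the small one-reticulation network are hypothetical, and are not the networks in the figure. Each displayed tree is summarized by its set of clusters, which is enough to tell rooted trees apart.

```python
from itertools import product


def clusters(edges, leaves):
    """Summarize a rooted tree as the set of leaf sets found below its nodes."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    memo = {}

    def below(node):
        if node not in memo:
            if node in leaves:
                memo[node] = frozenset([node])
            else:
                kids = children.get(node, [])
                memo[node] = frozenset().union(*(below(k) for k in kids))
        return memo[node]

    nodes = {n for edge in edges for n in edge}
    # Nodes left with no labelled leaf below them are ignored.
    return frozenset(below(n) for n in nodes if below(n))


def displayed_trees(edges, leaves):
    """Keep one incoming edge per reticulation node, in every possible way."""
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, []).append(parent)
    reticulations = [n for n, ps in parents.items() if len(ps) > 1]
    trees = set()
    for choice in product(*(parents[r] for r in reticulations)):
        kept_parent = dict(zip(reticulations, choice))
        kept = [(p, c) for p, c in edges
                if c not in kept_parent or kept_parent[c] == p]
        trees.add(clusters(kept, leaves))
    return trees


# Hypothetical network: leaf b sits below a reticulation node h,
# which has two parents, u (also the parent of a) and v (also the parent of c).
network = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
           ("v", "h"), ("v", "c"), ("h", "b")]
trees = displayed_trees(network, {"a", "b", "c"})
for tree in trees:
    print(sorted(sorted(cluster) for cluster in tree))
```

Two networks display the same trees exactly when the sets returned for them are equal, which is the situation described above: the trees alone then cannot tell the networks apart.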

To make matters worse, Huber et al. (in press) have now revealed that we can't reconstruct rooted phylogenetic networks even from sub-networks. To do this they show that networks cannot necessarily be built from trinets (= rooted 3-taxon networks). Certain types of networks (e.g. level-1, level-2, tree-child) can be reconstructed (van Iersel and Moulton 2014), but Huber et al. show the example in the second figure, which shows two networks at the top and below that the four trinets that are displayed by both of them. Given these four trinets we cannot reconstruct a unique network, and yet they are the only four trinets associated with either network.

This means that "even if all of the building blocks for some reticulate evolutionary history were to be taken as the input for any given network building method, the method might still output an incorrect history." The best analogy here is Humpty Dumpty — even given all of the pieces, we literally might not be able to put him back together again. We could if he is a rooted tree, but we cannot guarantee it if he is a rooted network or pedigree.

This may not matter in practice, given that we don't yet know the circumstances under which it is possible to uniquely reconstruct networks, but it does mean that we acquire a certain degree of uncertainty as we move from "tree thinking" to "network thinking".


Dress A, Huber KT, Koolen J, Moulton V, Spillner A (2012) Basic Phylogenetic Combinatorics. Cambridge University Press.

Gambette P, Huber K (2012) On encodings of phylogenetic networks of bounded level. Journal of Mathematical Biology 65: 157-180.

Huber KT, van Iersel L, Moulton V, Wu T (in press) How much information is needed to infer reticulate evolutionary histories? Systematic Biology

van Iersel L, Moulton V (2014) Trinets encode tree-child and level-2 phylogenetic networks. Journal of Mathematical Biology 68: 1707-1729.

Thatte BD (2008) Combinatorics of pedigrees I: counterexamples to a reconstruction problem. SIAM Journal on Discrete Mathematics 22: 961-970.

Willson SJ (2011) Regular networks can be uniquely constructed from their trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8: 785-796.

September 28, 2014


Family pedigrees seem to be confusing things, because there are two distinct interpretations of the expression "family tree".

First, the pedigree tree could be drawn with a particular contemporary person at the root of the tree, so that the tree expands backwards in time to increasing numbers of ancestors at the leaves. In some ways this seems quite illogical as an analogy, given that the base of a real tree is the origin of its growth.

Second, the pedigree tree could be drawn with a particular ancestor at the root of the tree, so that the tree expands forwards in time to increasing numbers of descendants at the leaves. This is more logical, although we often draw the root at the top. (The following example is actually a network, rather than strictly a tree; see also Pedigrees and phylogenies are networks not trees.)

Pedigrees are generally somewhat different from phylogenies, but in phylogenetics we do choose the latter option for interpreting trees — we start with a collection of contemporary leaves and try to reconstruct the tree backwards towards the common ancestor. Thus the root is at the "base" of the tree, even when we draw the root at the top of the diagram.

In popular usage these distinctions are often blurred. Consider this "family tree" of the Disney character Goofy. It is taken from Gilles R. Maurice's Calisota web page, where the character names are listed clearly.

This is based on the first usage described above, since Goofy himself is at the base and his ancestors are at the leaves. This is actually closer to a lineage than to a tree, especially as no females seem to be involved at any stage.

However, roughly the same information can be presented the other way around. This cartoon is taken from a different Calisota page.

Here, Goofy is now at the top of the tree and his ancestry proceeds downwards, with the oldest ancestor at the base (except for his son!). This really is confusing.

September 23, 2014


I have written before about How to interpret splits graphs. However, it is worth emphasizing a few points, so that people don't keep Mis-interpreting splits graphs.

A splits graph can potentially represent two main types of pattern. First, like a clustering analysis, it represents groups in the data that are in some way similar. Each group is represented by an explicit split in the graph (see Recognizing groups in splits graphs). The clusters may be hierarchically arranged (each group nested within another group), and they may overlap, so that objects can simultaneously be a member of more than one group. If the clusters do not overlap then the graph will be a tree.

Second, like an ordination analysis, a splits graph can summarize the multi-dimensional neighborhoods of the different objects. That is, the relative distance between the points on the graph summarizes the relationships among the objects: closer objects, as measured along the edges of the graph, are more similar.

These two patterns often appear in the same splits graph. Unfortunately, many published papers mis-interpret neighborhoods as splits. If there is an explicit split representing a cluster of interest, then the data can be said to support that possible cluster. However, if no such split exists, then the graph is agnostic with respect to that cluster — there might be no support for it in the data, or the split might be left out of the graph because other splits out-weigh it. So, graph objects occupying a particular neighborhood might not be well-supported by the original data, contrary to the interpretation sometimes seen in the literature.

This can be illustrated with a specific example, taken from: Sicoli MA, Holton G (2014) Linguistic phylogenies support back-migration from Beringia to Asia. PLOS One 9: e91722.

The splits graph is a consensus network, summarizing all of the splits with at least 10% support in 3000 MCMC Bayesian trees. The authors note that the dashed line represents a "primary division" between the groups, and that the differently colored objects represent "clear groupings".
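The selection step behind such a consensus network is simple to sketch: count how often each split occurs among the sampled trees, and keep those that reach the threshold. The code below is an illustration only, not the authors' procedure. The function name and the toy sample of 20 four-taxon trees are hypothetical, and each split is assumed to be written as the side that excludes a fixed reference taxon (here "a").

```python
from collections import Counter


def frequent_splits(trees, threshold=0.10):
    """Return {split: frequency} for splits found in at least `threshold` of the trees."""
    counts = Counter(split for tree in trees for split in set(tree))
    n_trees = len(trees)
    return {split: count / n_trees
            for split, count in counts.items()
            if count / n_trees >= threshold}


# Toy sample: each four-taxon tree has a single non-trivial split.
cd = frozenset({"c", "d"})   # ab | cd
bd = frozenset({"b", "d"})   # ac | bd
bc = frozenset({"b", "c"})   # ad | bc
sample = [[cd]] * 15 + [[bd]] * 4 + [[bc]] * 1

kept = frequent_splits(sample, threshold=0.10)
for split, freq in sorted(kept.items(), key=lambda item: -item[1]):
    print(sorted(split), freq)
```

In this toy sample the two retained splits contradict each other, so a graph displaying both of them cannot be a tree.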

However, the dashed line is supported only by a small split, which has a larger contradictory split (that puts the North PCA group with the Plains-Apachean group). This split thus cannot be said to be well supported. Furthermore, the South Alaska grouping is not supported by any split shown in the graph (there are, however, two splits that combine uniquely to support it). That is, the South Alaska grouping represents a neighborhood rather than a supported cluster. Finally, the Alaska-Canada-1 grouping is also not supported by an uncontradicted split (i.e. the tcb taa tau samples could as easily be part of the West Alaska grouping). All of the other identified groups are supported by unique and uncontradicted splits.

So, there are three types of pattern in this splits graph with respect to the groups of interest to the authors: uncontradicted splits, contradicted splits, and neighborhoods, representing good support, medium support and agnosticism, respectively. It is important to recognize these three possibilities, and to interpret them correctly with respect to "support" for any conclusions.
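These three cases can be told apart mechanically. Two splits are compatible when at least one of the four intersections between their sides is empty; a group is then an uncontradicted split if it appears in the graph and is compatible with every other split, a contradicted split if it appears but some other split conflicts with it, and a neighborhood if no split represents it at all. The following is a minimal sketch, with hypothetical function names and a made-up set of six taxa rather than the language data.

```python
def compatible(a, b, taxa):
    """Two splits (each given by one side) are compatible if some intersection is empty."""
    a_rest, b_rest = taxa - a, taxa - b
    return (not (a & b) or not (a & b_rest)
            or not (a_rest & b) or not (a_rest & b_rest))


def classify(group, splits, taxa):
    """Classify a group of interest against the splits shown in a splits graph."""
    group = frozenset(group)
    rest = taxa - group
    if group not in splits and rest not in splits:
        return "neighborhood"
    others = [s for s in splits if s != group and s != rest]
    if all(compatible(group, s, taxa) for s in others):
        return "uncontradicted split"
    return "contradicted split"


# Made-up example: six taxa and three non-trivial splits,
# two of which ({c,d} and {d,e}) conflict with each other.
taxa = frozenset("abcdef")
splits = {frozenset("ab"), frozenset("cd"), frozenset("de")}

for group in ("ab", "cd", "ef"):
    print(group, "->", classify(group, splits, taxa))
```

Note that the weights of the splits are ignored here, so the sketch says nothing about how strongly a contradicted split is out-weighed.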

As an aside, I will point out that in the other splits graph in the same paper (a NeighborNet): the dashed line is not supported by any split, two of the colored groupings are not supported by any split, and two of the others have only a small contradicted split. Thus, the "primary division" and the "clear groupings" mostly represent neighborhoods, and are thus only dubiously supported.

September 21, 2014


I have commented before about the perceived tendency to resist thinking about evolutionary relationships as networks (Resistance to network thinking), and even to present reticulating evolutionary relationships as trees rather than as networks (The dilemma of evolutionary networks and Darwinian trees). Charles Darwin seems to be the guilty party in starting this phenomenon.

This behavior becomes particularly obvious when we consider family genealogies. A good example appears when we consider the family relationships of the Olympian gods of Ancient Greece. Several illustrations of these relationships are gathered together on the Olympian Gods Family Tree web page.

Noteworthy is the particularly frisky nature of Zeus, who "got around a bit", to put it mildly. As shown in the first diagram, Zeus was the offspring of Cronus and Rhea. However, he then fathered children with at least nine people, including two of his own sisters, an aunt, a first cousin, and several first cousins once removed, among others. This creates the complex network shown.

However, not everyone wants to draw family genealogies as reticulating networks. After all, they are usually called "family trees". As shown by the examples below, the most common way to reduce a network to a tree is simply to repeat people's names as often as necessary. That is, rather than have them appear once (representing their birth) with multiple reticulating connections representing their reproductive relationships, they appear repeatedly, once for their birth and once for each relationship, so that there are no reticulations. I will leave it to you to count how often Zeus appears in each of these so-called family trees.

Clearly, this is misleading, and it makes no sense to obscure the fact that a so-called tree is actually a reticulate network. If relationships are reticulate then it is best to illustrate them that way, rather than to disguise the networks as trees.

September 16, 2014


Phylogenetic networks are of two types: those that produce direct evolutionary inferences about gene flow (e.g. hybridization networks, HGT networks), and those that display multiple patterns in multivariate datasets without any necessary evolutionary implications. The latter (called data-display networks) can be used both a priori as tools for exploratory data analysis (EDA), and a posteriori as a means of evaluating (or cross-checking) the support for inferences derived from other analyses (such as evolutionary networks).

Here, I present an example of the a posteriori usage.

The data and initial analysis come from:
Fu Q, Meyer M, Gao X, Stenzel U, Burbano HA, Kelso J, Pääbo S. (2013) DNA analysis of an early modern human from Tianyuan Cave, China. Proceedings of the National Academy of Sciences of the USA 110: 2223-2227.

They describe their genome data and evolutionary analysis like this:
We have extracted DNA from a 40,000-year-old anatomically modern human from Tianyuan Cave outside Beijing, China.

To investigate the relationship of the Tianyuan individual to present-day populations, we compared it to chromosome 21 sequences from 11 present-day humans from different parts of the world (a San, a Mbuti, a Yoruba, a Mandenka, and a Dinka from Africa; a French and a Sardinian from Europe; a Papuan, a Dai, and a Han from Asia; and a Karitiana from South America) and a Denisovan individual, each sequenced to 24- to 33-fold genomic coverage. Denisovans are an extinct group of Asian hominins related to Neandertals [and used as an outgroup]. In the combined dataset, 86,525 positions variable in at least one individual are of high quality in all 13 individuals.

To more accurately gauge how the population from which the Tianyuan individual is derived was related to Eurasian populations, while taking gene flow between populations into account, we used a recent approach that estimates a maximum-likelihood tree of populations and then identifies relationships between populations that are a poor fit to the tree model and that may be due to gene flow [using the TreeMix program] ... The maximum-likelihood tree [reproduced above] shows that the branch leading to the Tianyuan individual is long, due to its lower sequence quality. However, among Eurasian populations, Tianyuan clearly falls with Asian rather than European populations (bootstrap support 100%). The strongest signal not compatible with a bifurcating tree is an inferred gene-flow event that suggests that 6.7% of chromosome 21 in the Papuan individual is derived from Denisovans ... When this is taken into account, the Tianyuan individual appears ancestral to all Asian individuals studied. We note, however, that the relationship of the Tianyuan and Papuan individuals is not resolved (bootstrap support 31%).

Setting aside the faux pas about the Tianyuan individual being "ancestral" to the others (it is shown in the tree-based figure as the sister group not the ancestor), most of the other interpretations can be assessed by looking at the multivariate data independently of any evolutionary inference. This can be done using the pairwise nucleotide differences among the samples (provided in Table 1 of the paper) and a NeighborNet data-display network, as shown in the splits graph below.

We can note the following points, some of which support the authors' conclusions and some of which don't (note that the authors refer to their figure as a "tree", although it is an introgression network):
  • All terminal edges in the network are long, and so there is actually not much genomic information on chromosome 21 about relationships.
  • The network splits do roughly match the tree splits, and so the network apparently does reflect some evolutionary information.
  • The identified gene flow from the Denisovan to the Papuan is represented by a clear split in the network. The weight (0.7335) makes it the fifth largest non-trivial split. That is, it is larger than some of the splits that purportedly represent tree-like evolution.
  • The largest split (weight = 2.8942) separates the non-African samples from the African samples + Denisovan outgroup, which does accord with the postulated dispersal of humans out of Africa.
  • The second (1.1459) and third (0.8073) largest splits are near the root of the tree.
  • The European split is the fourth largest (0.7670). The South American sample is included with the Asian group, reflecting the idea that the native people of the Americas migrated there from Asia across the Bering Strait.
  • The relationships among the Asian samples in the network do not all match those in the tree. Notably, the Han+Dai split (0.5124) is smaller than the Han+Karitiana split (0.6292), and yet the former appears in the tree with 100% bootstrap support.
  • The Han+Dai+Karitiana split is well supported (0.4450), but the Han+Dai+Karitiana+Papuan split is not (0.0152), as reflected in the 31% bootstrap value for the latter in the tree.
  • The Han+Dai+Karitiana+Papuan+Tianyuan split is not displayed in the network, although it has a long edge in the tree. The closest network split, as displayed, includes the Denisovan sample. Thus, the network emphasizes the reticulate Denisovan-Papuan relationship at the expense of showing all of the tree-like relationships among the Asian samples.
  • The Tianyuan edge is not long in the network whereas it is long in the tree. This is likely to be because of uncertainty in its placement in the tree, rather than poor sequence quality, as claimed by the authors.

Thus, the data-display network questions some of the details of the authors' evolutionary network. However, it does support placing the Tianyuan sample with the Asian ones, as well as possible gene flow from the Denisovan sample to the Papuan one.
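
The split comparisons in the list above can be checked mechanically. The following is a minimal sketch, not the analysis used for the figure: it takes the split weights quoted in the list and tests which pairs of splits cannot both appear in a single tree. The taxon names come from the quoted study; the exact membership of the "European" split (French+Sardinian) and of the Denisovan+Papuan split is an assumption made for illustration, as are the names `TAXA`, `SPLITS`, `compatible` and `conflicts`.

```python
# Taxa as listed in the quoted study (13 samples).
TAXA = frozenset([
    "San", "Mbuti", "Yoruba", "Mandenka", "Dinka",
    "French", "Sardinian", "Papuan", "Dai", "Han",
    "Karitiana", "Tianyuan", "Denisovan",
])

# Split weights quoted in the list above; each split is named by one of its sides.
# ASSUMPTION: the "European" split is French+Sardinian versus the rest, and the
# gene-flow split is Denisovan+Papuan versus the rest.
SPLITS = {
    frozenset(["French", "Sardinian"]): 0.7670,
    frozenset(["Denisovan", "Papuan"]): 0.7335,
    frozenset(["Han", "Karitiana"]): 0.6292,
    frozenset(["Han", "Dai"]): 0.5124,
    frozenset(["Han", "Dai", "Karitiana"]): 0.4450,
    frozenset(["Han", "Dai", "Karitiana", "Papuan"]): 0.0152,
}


def compatible(a, b, taxa=TAXA):
    """Two splits fit on one tree if at least one of the four pairwise
    intersections of their sides is empty."""
    a_rest, b_rest = taxa - a, taxa - b
    return any(not (x & y) for x in (a, a_rest) for y in (b, b_rest))


def conflicts(splits):
    """Return every pair of mutually incompatible splits, heaviest first."""
    items = sorted(splits.items(), key=lambda kv: -kv[1])
    found = []
    for i, (a, weight_a) in enumerate(items):
        for b, weight_b in items[i + 1:]:
            if not compatible(a, b):
                found.append((sorted(a), weight_a, sorted(b), weight_b))
    return found


for pair in conflicts(SPLITS):
    print(pair)
```

Under these assumptions, two pairs are reported as conflicting: Denisovan+Papuan against Han+Dai+Karitiana+Papuan, and Han+Karitiana against Han+Dai. These are the same two disagreements with the tree that are described in the list.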

It thus seems to be a valuable procedure to cross-check any evolutionary analysis with a data-display network. As I have noted before (Networks and bootstraps as tree-support criteria; How networks differ from bootstrapped trees), bootstrap values on a tree are insufficient as a means of assessing the robustness of evolutionary diagrams.

September 14, 2014


I have noted before that the evolutionary history of musical instruments is likely to be a reticulating network rather than being tree-like (Cornets: from a tree to a network). As another illustration of the pattern, we can consider the evolution over the past few centuries of the Spanish or flamenco guitar (taken from the Origem do nome Violão blog post).

This genealogy (with time proceeding from left to right) shows three basic characteristics that seem to be common in anthropological histories. First, there are multiple roots: in this case, three different instruments from the 16th century have provided input into the modern acoustic guitar. Second, there is an early history of reticulation, with ideas for new instrumentation being taken freely from among the existing instruments, in this case presumably in the search for better sound reproduction. Third, there is simple transformational evolution, with new models replacing the previous ones in popularity; for example, over the past 100 years the Spanish guitar has simply gotten larger (an instance of Cope's Rule).

September 9, 2014


I noted in my previous blog post (Charles Darwin and the coalescent) that the multispecies coalescent needs to be based on a network model not a tree model. This is because reticulation processes occur both within species and between species — there is gene flow within genealogies and within phylogenies.

Reticulate genealogies are nothing new, and I have blogged about some of the best-known human genealogies with reticulations due to consanguinity (marriage between close relatives):
King Charles II of Spain
Charles Darwin
Henri Toulouse-Lautrec
Albert Einstein
Pharaoh Tutankhamun
Pharaoh Cleopatra

Importantly, in the modern world there are quite a few genealogical datasets available for study. For example, the Kinsources repository has c. 100 datasets from around the world, covering multi-generational histories for nearly 350,000 individuals. These data are actively used for research (eg. Bailey et al. 2014).

However, the best documented human genealogies are those for the various Anabaptist populations, who moved from Europe to North America during the 18th and 19th centuries. Anabaptists have mostly closed populations (ie. marriages occur solely within a population), and they are thus inbred, and most importantly they maintain detailed written genealogies. These populations include the Mennonites, Hutterites and Amish, the latter being the best known.

As noted by Agarwala et al. (2001):
The term "Anabaptist" literally means "rebaptizer" and is used to refer to a Christian movement that arose in central Europe in the first half of the 16th century. Adherents support adult baptism, pacifism, and separation of church and state. Among the large Anabaptist groups existing today are Mennonites (who were originally followers of Menno Simons), Amish (originally followers of Jakob Ammann who split away from the Mennonites at the end of the 17th century), and Hutterites (originally followers of Jakob Hutter). Amish and Mennonites emigrated to North America in multiple waves in the 18th and 19th centuries. The Hutterites began emigrating to the northern and western parts of North America in the late 1800s.

Distribution of Amish settlements in North America
Note the rapid expansion over the past 25 years.
The Mennonites originated in the Swiss Alps, and diffused northward into Germany and the Netherlands. The Dutch/North German Mennonites began the migration to America in the 1680s, followed by a much larger migration of Swiss/South German Mennonites beginning in 1707. The Amish are an early split from the Swiss/South German group that occurred in 1693. There are now at least 200,000 Amish in the eastern United States and eastern Canada (see the map above, taken from here), with the numbers apparently growing rapidly with recently increasing movement westward. There are various subgroups (eg. Old Order Amish, New Order Amish). There are about 1.7 million Mennonites worldwide, with c. 150,000 in the eastern United States and eastern Canada. The genealogies of 295,000 Mennonite and Amish individuals from the eastern USA have been databased (Agarwala et al. 2001).

The Hutterites originated as an Anabaptist offshoot in the Tyrolean Alps in the 1500s, but now there are c. 135,000 Hutterites living on 1,350 communal farms in the northern United States (principally South Dakota) and western Canada. Genealogical records trace all extant Hutterites to 90 ancestors who lived during the early 1700s to the early 1800s (see Ober et al. 1999).

These Anabaptist groups are frequently used in medical studies, because it is possible to relate disease occurrences to the recorded genealogy, and thus to assess the genetic component of the disease (eg. Dorsten et al. 1999, Hou et al. 2013). So, the literature is replete with figures showing the distribution of different diseases plotted onto the genealogy. I have included some of the Amish ones here, to illustrate the extreme reticulation that results when inbreeding is ongoing over many generations.

This first one is from Georgi et al. (2014). The diseased people are marked in red.

The next one is from Garner et al. (2001).

This one is from Lee et al. (2008).

The final one is from Racette et al. (2002).

Here is one small part of this genealogy, which emphasizes that between-generation marriages are an important component of the consanguinity.


Agarwala R, Schaffer A, Tomlin J (2001) Towards a complete North American Anabaptist genealogy II: analysis of inbreeding. Human Biology 73: 533-545.

Bailey DH, Hill KR, Walker RS (2014) Fitness consequences of spousal relatedness in 46 small-scale societies. Biology Letters 10: 20140160.

Dorsten L, Hotchkiss L, King T (1999) The effect of inbreeding on early childhood mortality: twelve generations of an Amish settlement. Demography 36: 263-271.

Garner C, McInnes LA, Service SK, Spesny M, Fournier E, Leon P, Freimer NB (2001) Linkage analysis of a complex pedigree with severe bipolar disorder, using a Markov chain Monte Carlo method. American Journal of Human Genetics 68: 1061-1064.

Georgi B, Craig D, Kember RL, Liu W, Lindquist I, Nasser S, Brown C, Egeland JA, Paul SM, Bućan M (2014) Genomic view of bipolar disorder revealed by whole genome sequencing in a genetic isolate. PLoS Genetics 10: e1004229.

Hou L, Faraci G, Chen DT, Kassem L, Schulze TG, Shugart YY, McMahon FJ (2013) Amish revisited: next-generation sequencing studies of psychiatric disorders among the Plain people. Trends in Genetics 29: 412-418.

Lee SL, Murdock DG, McCauley JL, Bradford Y, Crunk A, McFarland L, Jiang L, Wang T, Schnetz-Boutaud N, Haines JL (2008) A genome-wide scan in an Amish pedigree with parkinsonism. Annals of Human Genetics 72: 621-629.

Ober C, Hyslop T, Hauck WW (1999) Inbreeding effects on fertility in humans: evidence for reproductive compensation. American Journal of Human Genetics 64: 225-231.

Racette BA, Rundle M, Wang JC, Goate A, Saccone NL, Farrer M, Lincoln S, Hussey J, Smemo S, Lin J, Suarez B, Parsian A, Perlmutter JS (2002) A multi-incident, Old-Order Amish family with PD. Neurology 58: 568-574.

September 7, 2014


In an earlier blog post (The ultimate phylogenetic network?) I reproduced the lattice network from the anthropologist Franz Weidenreich. This comes close to being as complex as a network can get when applied to groups of organisms. However, when we study the genealogy of individuals, the network can get much more complex. This will be most true when there are marriages between close relatives (consanguinity), which creates inbreeding.

The family pedigree (or family tree!) shown here is for a group of people in a recently isolated population from the southwestern area of The Netherlands. There are 4,645 people involved, covering 18 generations (one row each). The average number of consanguineous loops for the 103 study individuals is 71.7, which is what is creating all of the cross-connections that make the network look so horrendous. (Consanguineous or inbreeding loops are illustrated here.)

The genealogy is from:
Liu F, Arias-Vásquez A, Sleegers K, Aulchenko YS, Kayser M, Sanchez-Juan P, Feng BJ, Bertoli-Avella AM, van Swieten J, Axenovich TI, Heutink P, van Broeckhoven C, Oostra BA, van Duijn CM (2007) A genomewide screen for late-onset Alzheimer disease in a genetically isolated Dutch population. American Journal of Human Genetics 81: 17-31.
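
To give a feel for what a single consanguineous loop contributes, here is a minimal sketch of the standard recursive kinship calculation, applied to a small hypothetical pedigree (the child of two first cousins). Nothing here comes from the Dutch dataset: the individual labels, the `PEDIGREE` layout and the function name `make_kinship` are all assumptions for illustration. The inbreeding coefficient of an individual is taken, as usual, to be the kinship coefficient of its two parents.

```python
from functools import lru_cache

# Hypothetical pedigree: individual -> (parent1, parent2); None marks an unknown
# (founder) parent. Individuals MUST be listed with ancestors before descendants.
PEDIGREE = {
    "GF": (None, None),   # shared grandfather
    "GM": (None, None),   # shared grandmother
    "P1": (None, None),   # unrelated spouse
    "P2": (None, None),   # unrelated spouse
    "S1": ("GF", "GM"),   # sibling 1
    "S2": ("GF", "GM"),   # sibling 2
    "C1": ("S1", "P1"),   # first cousin 1
    "C2": ("S2", "P2"),   # first cousin 2
    "K": ("C1", "C2"),    # child of the two first cousins
}


def make_kinship(pedigree):
    """Build a memoised kinship function for the given pedigree."""
    order = {name: rank for rank, name in enumerate(pedigree)}

    @lru_cache(maxsize=None)
    def kinship(a, b):
        if a is None or b is None:
            return 0.0                      # unknown parents are assumed unrelated
        if a == b:
            p1, p2 = pedigree[a]
            return 0.5 * (1.0 + kinship(p1, p2))
        if order[a] < order[b]:
            a, b = b, a                     # always recurse through the later-listed one
        p1, p2 = pedigree[a]
        return 0.5 * (kinship(p1, b) + kinship(p2, b))

    return kinship


kinship = make_kinship(PEDIGREE)
parent1, parent2 = PEDIGREE["K"]
inbreeding_k = kinship(parent1, parent2)
print(inbreeding_k)
```

For this toy pedigree the result is 1/16, the familiar value for the offspring of first cousins. In a pedigree like the one shown above, each individual's coefficient accumulates contributions from many such loops at once, which is why the calculation is usually left to software.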

September 2, 2014


The full title of Charles Darwin's most famous book was On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. It is important to note that this title juxtaposes the concepts of between-species variation and within-species variation (Darwin usually referred to "races" rather than to "breeds", "subspecies", etc). This was one of his major insights: the idea that there is a continuum of variation in biology through time (or, as he put it, that it is arbitrary whether variants are treated as different races or as different species).

As I recently noted, this paved the way for between-species phylogenies to be seen as directly analogous to within-species genealogies (The role of biblical genealogies in phylogenetics); previous applications of genealogies to non-humans (such as those of Buffon and Duchesne) had been explicitly restricted to within-species relationships.

This conceptual integration of within-species and between-species relationships has become explicit in modern biology by using multispecies coalescent models to integrate population genetics and phylogenetics. As noted by Reid et al. (2014):
These models treat populations, rather than alleles sampled from a single individual, as the focal units in phylogenetic trees. The multispecies coalescent model connects traditional phylogenetic inference, which seeks primarily to infer patterns of divergence between species, and population genetic inference, which has typically focused on intraspecific evolutionary processes. The development of these models was motivated by the common empirical observation that genealogies estimated from different genes are often discordant and the discovery that, if ignored, this discordance can bias parameters of direct interest to systematists, such as the relationships and divergence times among species.

However, as specifically emphasized by Reid et al.:
In order to reconcile discordance among gene trees and uncover true species relationships, the first gene tree/species tree models assumed that discordance is solely the result of stochastic coalescence of gene lineages within a species phylogeny ... Coalescent stochasticity, however, is not the only source of gene tree discordance. Selection, hybridization, horizontal gene transfer, gene duplication/extinction, recombination, and phylogenetic estimation error can also result in discordance.

They examined this situation by studying the fit of the multispecies coalescent model:
to 25 published data sets. We show that poor model fit is detectable in the majority of data sets; that this poor fit can mislead phylogenetic estimation; and that in some cases it stems from processes of inherent interest to systematists ...
Our analyses suggest that poor fit to the multispecies coalescent model can mislead inference in empirical studies. In the case of recent hybridization, the consequences may be severe, as species divergences are forced to post-date gene divergences ... When topological conflict among coalescent genealogies is the result of ancient hybridization, balancing selection, or gene duplication and extinction, the consequences may be less severe.

In other words, tree-based phylogenetics is inadequate in practice because of gene flow. Within-species genealogies and between-species phylogenies intersect in the concept of a network, not a tree. That is, the multispecies coalescent needs to be based on a network model not a tree model:
The biological processes that generate variation in gene tree topologies should be explicitly modeled, as should relevant dynamics of molecular evolution. Increasingly complex multispecies coalescent models are being implemented, but there are tradeoffs. Some examine gene duplication and extinction or migration but cannot estimate divergence times.

So, current models are inadequate. It will be interesting to see how these approaches develop to incorporate gene flow (reticulation) into what has heretofore been a tree model (modeling only ancestor-descendant relationships), as we are still in need of methods for estimating rooted evolutionary networks.
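
As a small illustration of the "stochastic coalescence" that the first gene tree / species tree models assumed to be the only source of discordance, here is a sketch of the standard textbook result for three species. It is not taken from Reid et al.: for a species tree ((A,B),C) whose internal branch has length t in coalescent units, and one sampled lineage per species, the gene tree matches the species tree with probability 1 - (2/3)exp(-t), and each of the two discordant topologies occurs with probability (1/3)exp(-t). The function name and the example branch lengths are assumptions for illustration.

```python
import math


def gene_tree_probs(t):
    """Gene-tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units. Only incomplete
    lineage sorting is modelled: no gene flow, selection or estimation error.
    """
    p_each_discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * p_each_discordant,  # concordant topology
        "((A,C),B)": p_each_discordant,
        "((B,C),A)": p_each_discordant,
    }


for t in (0.0, 0.5, 1.0, 3.0):
    probs = gene_tree_probs(t)
    print(t, {topology: round(p, 3) for topology, p in probs.items()})
```

Under this pure-coalescent model the two discordant topologies are always equally frequent. A marked excess of one of them in real data is therefore one simple sign that some other process, such as gene flow, may be involved.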


Reid NM, Hird SM, Brown JM, Pelletier TA, McVay JD, Satler JD, Carstens BC (2014) Poor fit to the multispecies coalescent is widely detectable in empirical data. Systematic Biology 63: 322-333.

August 31, 2014

August 26, 2014


I have previously noted that the first empirical phylogenetic tree apparently was published by St George Jackson Mivart in late 1865, a full 6 years after Charles Darwin released On the Origin of Species (Who published the first phylogenetic tree?). Mivart was not necessarily the first to start producing such a tree, but he got into print first. For example, Franz Martin Hilgendorf wrote a PhD thesis in 1863 for which he produced a hand-drawn tree, but he did not actually include the tree itself in the thesis (The dilemma of evolutionary networks and Darwinian trees). Also, Ernst Heinrich Philipp August Haeckel claimed to have started work on his series of phylogenetic trees in 1864, but the resulting book, Generelle Morphologie der Organismen, was not published until 1866 (Who published the first phylogenetic tree?).

Another actor in this series of events was Fritz Müller, who can also be considered to have published a tree first, in 1864, albeit a very small one.

Johann Friedrich Theodor Müller (1822–1897)

Müller was born in Germany, but in the 1850s he emigrated to southern Brazil with his brother and their wives. As a naturalist in the Atlantic forest, he studied the insects, crustaceans and plants, and he is chiefly remembered today as the describer of what we now call Müllerian mimicry (the phenotypic resemblance between two or more unpalatable species).
Heinrich Bronn's German translation of the Origin appeared in 1860, and Müller read it and agreed with its central thesis (as did Hilgendorf and Haeckel). Indeed, in 1864 he published a book discussing some of the empirical evidence that he adduced with regard to the Crustacea:
Für Darwin
Verlag von Wilhelm Engelmann, Leipzig.

The book has 91 pages and 67 figures, and the Foreword is dated 7th September 1863. Several copies are available in Google Books (here, here, here).

In this book Müller described the development of Crustacea, illustrating that crustaceans and their larvae could be affected by adaptations and natural selection at any growth stage. He discussed in detail how living forms diverged from ancestral ones, based on his study of aerial respiration, larvae morphology, sexual dimorphism, and polymorphism.

Darwin read the book, and began a life-long correspondence with Müller (ultimately some 60 letters having been exchanged between them). Subsequently, Darwin commissioned an English translation of the book, and in 1869 published it with John Murray on commission (ie. taking the risk himself). Darwin printed 1000 copies but it apparently was not a great success:
Facts and Arguments for Darwin
Translated from the German by W.S. Dallas
John Murray, London.

The book has 144 pages and 67 figures, and the Translator's Preface is dated 15th February 1869. A copy is available in the Biodiversity Heritage Library (here).

The following quotes are from this English translation [Note that Müller's unnecessarily convoluted sentences exist in the original German — this writing style is one reason why the book is not as well known as the works of Darwin and Wallace]:
It is not the purpose of the following pages to discuss once more the arguments deduced for and against Darwin's theory of the origin of species, or to weigh them one against the other. Their object is simply to indicate a few facts favourable to this theory ...
When I had read Charles Darwin's book 'On the Origin of Species,' it seemed to me that there was one mode, and that perhaps the most certain, of testing the correctness of the views developed in it, namely, to attempt to apply them as specially as possible to some particular group of animals ...
When I thus began to study our Crustacea more closely from this new stand-point of the Darwinian theory,—when I attempted to bring their arrangements into the form of a genealogical tree, and to form some idea of the probable structure of their ancestors,—I speedily saw (as indeed I expected) that it would require years of preliminary work before the essential problem could be seriously handled ...
But although the satisfactory completion of the "Genealogical tree of the Crustacea" appeared to be an undertaking for which the strength and life of an individual would hardly suffice, even under more favourable circumstances than could be presented by a distant island, far removed from the great market of scientific life, far from libraries and museums—nevertheless its practicability became daily less doubtful in my eyes, and fresh observations daily made me more favourably inclined towards the Darwinian theory.
In determining to state the arguments which I derived from the consideration of our Crustacea in favour of Darwin's views, and which (together with more general considerations and observations in other departments), essentially aided in making the correctness of those views seem more and more palpable to me, I am chiefly influenced by an expression of Darwin's: "Whoever," says he ('Origin of Species' p. 482), "is led to believe that species are mutable, will do a good service by conscientiously expressing his conviction."

So, for the reason stated, Müller did not produce a complete phylogeny in the book. However, of particular interest to us is the figure on page 6 of the original German edition (page 9 of the translation). It turns out to be a pair of three-taxon statements concerning species of Melita (amphipods), as shown in the figure above (original) and below (translation). Müller has this to say:
[There are five] species of Melita ... in which the second pair of feet bears upon one side a small hand of the usual structure, and on the other an enormous clasp-forceps. This want of symmetry is something so unusual among the Amphipoda, and the structure of the clasp-forceps differs so much from what is seen elsewhere in this order, and agrees so closely in the five species, that one must unhesitatingly regard them as having sprung from common ancestors belonging to them alone among known species.

This is as clear a statement of synapomorphy, and its relationship to constructing a phylogeny, as you could get; and so we could credit Müller with having produced an empirical phylogenetic tree (the one on the left in the figures).

Equally interestingly, Müller then goes on to consider a potentially contradictory character: the secondary flagellum of the anterior antennae, which is missing in one species. This would produce a different three-taxon statement (shown on the right in the figures). He resolves the issue by suggesting that the flagellum might be similar to the situation in other species, where it is "reduced to a scarcely perceptible rudiment—nay, that it is sometimes present in youth and disappears at maturity". This is a clear example of the character conflict that arises when trying to construct an empirical phylogeny; and it was also encountered by Mivart in his studies of primate skeletons (Is this the first network from conflicting datasets?).


Müller did not publish a complete phylogeny, but instead discussed how to produce one, and illustrated the practicality (and necessity) of doing so. In the process, he produced a simple three-taxon statement (which is not even numbered as a figure). Nevertheless, this cladogram is technically the first in print, pre-dating Mivart by a year. Darwin was right to recognize its importance, although he seemed to take a while to bring it to the attention of the English-speaking public. Furthermore, Müller was apparently the first to encounter the empirical difficulty of how to deal with conflicting data, which would produce different phylogenetic trees. This is an issue that is just as important today as it was then.

August 24, 2014


For those of you who do not understand the notation:
Homo apriorius ponders the probability of a specified hypothesis, while Homo pragamiticus is interested in the probability of observing particular data. Homo frequentistus wishes to estimate the probability of observing the data given the specified hypothesis, whereas Homo sapients is interested in the joint probability of both the data and the hypothesis. Homo bayesianis estimates the probability of the specified hypothesis given the observed data.
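
The five quantities described above are all related through a single joint distribution, which the following minimal sketch makes concrete. The numbers in `joint` are invented purely for illustration, and the variable names are assumptions; exact fractions are used so that the arithmetic can be checked by hand.

```python
from fractions import Fraction

# Hypothetical joint distribution over (hypothesis is true, data are observed).
joint = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(1, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

# P(H): probability of the hypothesis, summed over the data outcomes.
p_h = sum(p for (h, d), p in joint.items() if h)
# P(D): probability of the data, summed over the hypothesis outcomes.
p_d = sum(p for (h, d), p in joint.items() if d)
# P(H, D): joint probability of hypothesis and data.
p_hd = joint[(True, True)]
# P(D | H): probability of the data given the hypothesis.
p_d_given_h = p_hd / p_h
# P(H | D): probability of the hypothesis given the data.
p_h_given_d = p_hd / p_d

print("P(H)   =", p_h)
print("P(D)   =", p_d)
print("P(D|H) =", p_d_given_h)
print("P(H,D) =", p_hd)
print("P(H|D) =", p_h_given_d)

# Bayes' theorem ties the last quantity to the others.
assert p_h_given_d == p_d_given_h * p_h / p_d
```

With these made-up numbers, P(D|H) is 3/4 but P(H|D) is 3/5, which is a reminder that the two conditional probabilities are not interchangeable.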

August 19, 2014


Phylogeneticists treat the tree image as having special meaning for themselves. Conceptually, the tree is used as a metaphor for phylogenetic relationships among taxa, and mathematically it is used as a model to analyze phenotypic and genotypic data to uncover those relationships. Irrespective of whether this metaphor / model is adequate or not, it has a long history as part of phylogenetics (Pietsch 2012). Of particular interest has been Charles Darwin's reference to the "Tree of Life" as a simile, since that is clearly the key to the understanding of phylogenetics by the general public.

The principle on which phylogenetic trees are based seems to be the same as that for human genealogies. That is, phylogenies are conceptually the between-species homolog of within-species genealogies. As far as Western thought is concerned, human genealogies make their first important appearance in the Bible, with a rather specific purpose. The Bible contains many genealogies, mostly presented as chains of fathers and sons. For example, Genesis 5 lists the descendants of Adam+Eve down to Noah and his sons, which can be illustrated as a pair of chains (as shown in the first figure); and the rest of Genesis gets from there down to Moses' family, for which the genealogy can be illustrated as a complex tree.

The genealogy as listed in Genesis 5.
Cain's lineage was terminated by the Flood.
However, the theologically most important genealogies are those of Jesus, as recorded in Matthew 1:2-16 and Luke 3:23-38. Matthew apparently presents the genealogy through Joseph, who was Jesus' legal father; and Luke apparently traces Jesus' bloodline through Mary's father, Eli. These two lineages coalesce in David+Bathsheba, and from there they have a shared lineage back to Abraham. Their importance lies in the attempt to substantiate that Jesus' ancestry fulfils the biblical prophecies that the Messiah would be descended from Abraham (Genesis 12:3) through Isaac (Genesis 17:21) and Jacob (Genesis 28:14), and that he would be from the tribe of Judah (Genesis 49:8), the family of Jesse (Isaiah 11:1) and the house of David (Jeremiah 23:5).

That is, these genealogies legitimize Jesus as the prophesied Messiah. Following this lead, subsequent use of genealogies has commonly been to legitimize someone as a monarch, so that royal genealogies have been of vital political and social importance throughout recorded history (see the example in the next figure). This importance was not lost on the rest of the nobility, either, so that documented genealogies of most aristocratic families allow us to identify the first-born son of the first-born son, etc, and thus legitimize claimants to noble titles — genealogies are a way for nobles to assert their nobility.

The genealogy of the current royal family of Sweden. [Note: most children are not shown]
The lineage of the recent monarchs is highlighted as a chain, with an aborted side-branch dashed.
If we focus solely on the line of descent involved in legitimization, then genealogies can be represented as a chain (as shown in the genealogy above). If we include the rest of the paternal lines of descent, then family genealogies can be represented as a tree; and if we include some or all of the maternal lineages as well, then they can be represented as a network. For example, the biblical genealogies only rarely name women, but where females are specifically named the genealogies actually form a reticulated network. Jacob produced offspring with both Rachel and Leah, who were his first cousins; and Isaac and Rebekah were first cousins once removed. Even Moses was the offspring of parents who were, depending on the biblical source consulted, either nephew-aunt, first cousins, or first cousins once removed. These relationships cannot be represented in a tree. (See also the complex genealogy of the Spanish branch of the Habsburgs, who were kings of Spain from 1516 to 1700.)

This idea of genealogical chains, trees and networks was straightforward to transfer from humans to other species. Originally, biologists stuck pretty much to the idea of a chain of relationships among organisms, as presented in the early part of Genesis. Human genealogies were traced upwards to Adam and from there to God, and thus species relationships were traced upwards to God via humans. However, by the second half of the 1700s both trees and networks made their appearance as explicit suggestions for representing biological relationships. In particular, Buffon (1755) and Duchesne (1766) presented genealogical networks of dog breeds and strawberry cultivars, respectively.

However, these authors did not take the conceptual leap from within-species genealogies to between-species phylogenies. Indeed, they seem to have explicitly rejected the idea, confining themselves to relationships among "races". It was Charles Darwin and Alfred Russel Wallace, a century later, who first took this leap, apparently seeing the evolutionary continuum that connects genealogies to phylogenies. In this sense, they both took ideas that had been "in the air" for several decades, but previously applied only within species, and applied them to the origin of species themselves. [See the Note below.] Both of them, however, confined themselves to genealogical trees rather than using networks. It seems to me that it was Pax (1888) who first put the whole thing together, and produced inter-species phylogenetic networks (along with some intra-species ones).

In this sense, the biblical Tree of Life has only a peripheral relevance to phylogenetics. Darwin used it as a rhetorical device to arouse the interest of his audience (Hellström 2011), but it was actually the biblical genealogies that were of most practical importance to his evolutionary ideas. Apart from anything else, the original biblical tree was actually the lignum vitae (Tree of Eternal Life) not the arbor vitae (Tree of Life). Similarly, the tree from which Adam and Eve ate the forbidden fruit was the lignum scientiae boni et mali (Tree of Knowledge of Good and Evil), not the arbor scientiae (Tree of Knowledge) that was subsequently used as a metaphor for human knowledge.

Note. Along with phylogenetic trees, Darwin and Wallace did not actually originate the idea of natural selection, which had previously been discussed by people such as James Hutton (1794), William Charles Wells (1818), Patrick Matthew (1831), Edward Blyth (1835) and Herbert Spencer (1852). However, this discussion had been in relation to within-species diversity, whereas Wallace and Darwin applied the idea to the origin of between-species diversity (i.e. the origin of new species).


Buffon G-L de. 1755. Histoire naturelle générale et particulière, tome V. Paris: Imprimerie

Duchesne A.N. 1766. Histoire naturelle des fraisiers. Paris: Didot le Jeune & C.J. Panckoucke.

Hellström N.P. 2011. The tree as evolutionary icon: TREE in the Natural History Museum, London. Archives of Natural History 38: 1-17.

Pax F.A. 1888. Monographische übersicht über die arten der gattung Primula. Bot. Jahrb. Syst. Pflanzeng. Pflanzengeo. 10:75-241.

Pietsch T.W. 2012. Trees of life: a visual history of evolution. Baltimore: Johns Hopkins University Press.

August 17, 2014


These illustrations are from Alper Uzun's Biocomicals web site.

Bioinformaticians' dream

Bioinformaticians' reality