The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis



September 30, 2014


It would be nice to think that genealogical history can be reconstructed with ease. However, this is known not to be so. In particular, being able to reconstruct an overall history from a collection of sub-histories, which can be thought of as the "building blocks", is not necessarily guaranteed.

That is, even given a complete collection of all of the sub-histories it is not necessarily possible to reconstruct a unique overall history. In other words, there can be pairs of graphs that do not represent the same evolutionary histories, but still display exactly the same collection of building blocks. ("Display" means roughly that a building block can be obtained by simply deleting some of the edges and vertices in the graph.) Mathematically, the sub-histories do not determine (or encode) the history.

For example, it is known that pedigrees cannot necessarily be reconstructed from a collection of all of the sub-pedigrees (Thatte 2008). Pedigrees are the traditional "family trees" showing the ancestry of individuals. Pedigrees differ from phylogenies in that all of the individuals have two parents (rather than possibly having a single immediate ancestor) and there are probably multiple roots (unless there is considerable inbreeding).

Phylogenetic trees, on the other hand, can be uniquely reconstructed from a collection of all of the possible sub-trees (see Dress et al. 2012). This is one of the things that makes trees valuable as a phylogenetic model: it is theoretically possible to collect enough information to construct a unique phylogenetic tree.
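A minimal sketch of this idea, assuming the building blocks are the clusters (clades) of a rooted tree: each cluster is attached to the smallest cluster that strictly contains it, and this recovers the tree. The function name and the three-taxon example are hypothetical illustrations, not taken from Dress et al.

```python
def tree_from_clusters(clusters):
    """Return a child -> parent map over the clusters of a rooted tree.

    Assumes the clusters form a hierarchy (any two are nested or disjoint),
    which is the case for the clusters of a single rooted tree.
    """
    ordered = sorted({frozenset(c) for c in clusters}, key=len)
    parent = {}
    for i, cluster in enumerate(ordered):
        # The parent is the smallest cluster that strictly contains this one.
        for bigger in ordered[i + 1:]:
            if cluster < bigger:
                parent[cluster] = bigger
                break
    return parent


# Hypothetical example: the clusters of the rooted tree ((a,b),c).
clusters = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "b", "c"}]
parent = tree_from_clusters(clusters)
for child, par in parent.items():
    print(sorted(child), "->", sorted(par))
```

Because the clusters of a tree are nested, every cluster except the root has exactly one smallest superset, so only one tree fits. It is this uniqueness that fails for networks.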

Rooted phylogenetic networks do not, however, share this property. For some time it has been known that networks cannot necessarily be built from their building blocks, whether those blocks are rooted trees (Willson 2011) or triplets (= rooted 3-taxon trees) or clusters (= rooted sub-trees = clades) (Gambette and Huber 2012).

This is illustrated in the next figure (adapted from Huber et al.), which shows two networks at the top and below that the four trees that are displayed by both of them (by deleting one of each pair of incoming edges at the two reticulation nodes). Given these four trees we cannot reconstruct a unique network, and yet they are the only four trees associated with either network.
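The idea of "displaying" trees can be sketched in a few lines, assuming the network is stored as a map from each node to its list of parents. The small network below, with a single reticulation node `h`, is a hypothetical example and not one of the networks in the figure. Each displayed tree is summarized by its set of clusters.

```python
from collections import defaultdict
from itertools import product


def displayed_trees(parents, leaves):
    """Return the cluster sets of all trees displayed by a rooted network.

    parents maps every non-root node to the list of its parents; a node
    with more than one parent is a reticulation node.
    """
    retics = sorted(n for n, ps in parents.items() if len(ps) > 1)
    trees = set()
    # Keep exactly one incoming edge at each reticulation node.
    for choice in product(*(parents[h] for h in retics)):
        kept = dict(zip(retics, choice))
        tree_parent = {n: kept.get(n, ps[0]) for n, ps in parents.items()}
        below = defaultdict(set)
        for leaf in leaves:
            node = leaf
            while node is not None:  # walk up to the root
                below[node].add(leaf)
                node = tree_parent.get(node)
        trees.add(frozenset(frozenset(s) for s in below.values()))
    return trees


# Hypothetical network: root r, tree nodes u and v, reticulation h above leaf b.
parents = {"u": ["r"], "v": ["r"], "a": ["u"], "h": ["u", "v"],
           "c": ["v"], "b": ["h"]}
leaves = ["a", "b", "c"]
trees = displayed_trees(parents, leaves)
for tree in trees:
    print(sorted(sorted(c) for c in tree if len(c) == 2))
```

With one reticulation node there are two choices and hence two displayed trees, ((a,b),c) and (a,(b,c)). With two reticulation nodes, as in the figure, there are up to four.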

To make matters worse, Huber et al. (in press) have now revealed that we can't reconstruct rooted phylogenetic networks even from sub-networks. To do this they show that networks cannot necessarily be built from trinets (= rooted 3-taxon networks). Certain types of networks (e.g. level-1, level-2, tree-child) can be reconstructed (van Iersel and Moulton 2014), but Huber et al. show the example in the second figure, which shows two networks at the top and below that the four trinets that are displayed by both of them. Given these four trinets we cannot reconstruct a unique network, and yet they are the only four trinets associated with either network.

This means that "even if all of the building blocks for some reticulate evolutionary history were to be taken as the input for any given network building method, the method might still output an incorrect history." The best analogy here is Humpty Dumpty — even given all of the pieces, we literally might not be able to put him back together again. We could if he is a rooted tree, but we cannot guarantee it if he is a rooted network or pedigree.

This may not matter in practice, given that we don't yet know the circumstances under which it is possible to uniquely reconstruct networks, but it does mean that we acquire a certain degree of uncertainty as we move from "tree thinking" to "network thinking".


Dress A, Huber KT, Koolen J, Moulton V, Spillner A (2012) Basic Phylogenetic Combinatorics. Cambridge University Press.

Gambette P, Huber K (2012) On encodings of phylogenetic networks of bounded level. Journal of Mathematical Biology 65: 157-180.

Huber KT, van Iersel L, Moulton V, Wu T (in press) How much information is needed to infer reticulate evolutionary histories? Systematic Biology

van Iersel L, Moulton V (2014) Trinets encode tree-child and level-2 phylogenetic networks. Journal of Mathematical Biology 68: 1707-1729.

Thatte BD (2008) Combinatorics of pedigrees I: counterexamples to a reconstruction problem. SIAM Journal on Discrete Mathematics 22: 961-970.

Willson SJ (2011) Regular networks can be uniquely constructed from their trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8: 785-796.

September 28, 2014


Family pedigrees seem to be confusing things, because there are two distinct interpretations of the expression "family tree".

First, the pedigree tree could be drawn with a particular contemporary person at the root of the tree, so that the tree expands backwards in time to increasing numbers of ancestors at the leaves. In some ways this seems quite illogical as an analogy, given that the base of a real tree is the origin of its growth.

Second, the pedigree tree could be drawn with a particular ancestor at the root of the tree, so that the tree expands forwards in time to increasing numbers of descendants at the leaves. This is more logical, although we often draw the root at the top. (The following example is actually a network, rather than strictly a tree; see also Pedigrees and phylogenies are networks not trees.)

Pedigrees are generally somewhat different from phylogenies, but in phylogenetics we do choose the latter option for interpreting trees — we start with a collection of contemporary leaves and try to reconstruct the tree backwards towards the common ancestor. Thus the root is at the "base" of the tree, even when we draw the root at the top of the diagram.

In popular usage these distinctions are often blurred. Consider this "family tree" of the Disney character Goofy. It is taken from Gilles R. Maurice's Calisota web page, where the character names are listed clearly.

This is based on the first usage described above, since Goofy himself is at the base and his ancestors are at the leaves. This is actually closer to a lineage than to a tree, especially as no females seem to be involved at any stage.

However, roughly the same information can be presented the other way around. This cartoon is taken from a different Calisota page.

Here, Goofy is now at the top of the tree and his ancestry proceeds downwards, with the oldest ancestor at the base (except for his son!). This really is confusing.

September 23, 2014


I have written before about How to interpret splits graphs. However, it is worth emphasizing a few points, so that people don't keep Mis-interpreting splits graphs.

A splits graph can potentially represent two main types of pattern. First, like a clustering analysis, it represents groups in the data that are in some way similar. Each group is represented by an explicit split in the graph (see Recognizing groups in splits graphs). The clusters may be hierarchically arranged (each group nested within another group), and they may overlap, so that objects can simultaneously be a member of more than one group. If the clusters do not overlap then the graph will be a tree.

Second, like an ordination analysis, a splits graph can summarize the multi-dimensional neighborhoods of the different objects. That is, the relative distance between the points on the graph summarizes the relationships among the objects: closer objects, as measured along the edges of the graph, are more similar.

These two patterns often appear in the same splits graph. Unfortunately, many published papers mis-interpret neighborhoods as splits. If there is an explicit split representing a cluster of interest, then the data can be said to support that possible cluster. However, if no such split exists, then the graph is agnostic with respect to that cluster — there might be no support for it in the data, or the split might be left out of the graph because other splits out-weigh it. So, graph objects occupying a particular neighborhood might not be well-supported by the original data, contrary to the interpretation sometimes seen in the literature.

This can be illustrated with a specific example, taken from: Sicoli MA, Holton G (2014) Linguistic phylogenies support back-migration from Beringia to Asia. PLOS One 9: e91722.

The splits graph is a consensus network, summarizing all of the splits with at least 10% support in 3000 Bayesian MCMC trees. The authors note that the dashed line represents a "primary division" between the groups, and that the differently colored objects represent "clear groupings".
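The thresholding step behind a consensus network can be sketched as follows, assuming each sampled tree has already been reduced to its set of splits, and each split is written as the side containing a reference taxon. The function name, the taxa A to D, and the toy sample of 20 trees are hypothetical; this is not the software used by Sicoli and Holton.

```python
from collections import Counter


def consensus_splits(trees, threshold=0.10):
    """Return {split: frequency} for splits found in at least `threshold`
    of the trees. Each tree is a set of splits; each split is a frozenset
    holding the side that contains the reference taxon (here "A")."""
    counts = Counter(split for tree in trees for split in tree)
    total = len(trees)
    return {s: n / total for s, n in counts.items() if n / total >= threshold}


# Toy sample: 20 four-taxon trees, each with a single non-trivial split.
trees = ([{frozenset("AB")}] * 16
         + [{frozenset("AC")}] * 3
         + [{frozenset("AD")}] * 1)
kept = consensus_splits(trees, threshold=0.10)
for split, freq in sorted(kept.items(), key=lambda kv: -kv[1]):
    print("".join(sorted(split)), freq)
```

Here the splits AB (80%) and AC (15%) are both kept, even though they contradict each other, while AD (5%) is dropped. A low threshold therefore lets weakly supported splits into the graph alongside the strong ones they conflict with.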

However, the dashed line is supported only by a small split, which has a larger contradictory split (that puts the North PCA group with the Plains-Apachean group). This split thus cannot be said to be well supported. Furthermore, the South Alaska grouping is not supported by any split shown in the graph (there are, however, two splits that combine uniquely to support it). That is, the South Alaska grouping represents a neighborhood rather than a supported cluster. Finally, the Alaska-Canada-1 grouping is also not supported by an uncontradicted split (ie. the tcb taa tau samples could as easily be part of the West Alaska grouping). All of the other identified groups are supported by unique and uncontradicted splits.

So, there are three types of pattern in this splits graph with respect to the groups of interest to the authors: uncontradicted splits, contradicted splits, and neighborhoods, representing good support, medium support and agnosticism, respectively. It is important to recognize these three possibilities, and to interpret them correctly with respect to "support" for any conclusions.
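These three cases can be told apart mechanically. The sketch below assumes that each split is stored as one side of a bipartition of the taxa, and uses the standard rule that two splits are compatible when at least one of the four pairwise intersections of their sides is empty. The taxa and splits are hypothetical, not those of the paper.

```python
def compatible(s1, s2, taxa):
    """Two splits are compatible if some pair of their sides is disjoint."""
    sides1 = (s1, taxa - s1)
    sides2 = (s2, taxa - s2)
    return any(not (x & y) for x in sides1 for y in sides2)


def classify(group, splits, taxa):
    """Classify a group of interest against the splits shown in a graph."""
    group = frozenset(group)
    shown = any(s == group or taxa - s == group for s in splits)
    if not shown:
        return "neighborhood"          # no split: the graph is agnostic
    if all(compatible(group, s, taxa) for s in splits):
        return "uncontradicted split"  # good support
    return "contradicted split"        # medium support


taxa = frozenset("ABCDE")
splits = [frozenset("AB"), frozenset("BC"), frozenset("DE")]
labels = {name: classify(name, splits, taxa) for name in ("AB", "DE", "CD")}
print(labels)
```

In this toy graph the group AB is shown but conflicts with BC, DE is shown and conflicts with nothing, and CD has no split at all.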

As an aside, I will point out that in the other splits graph in the same paper (a NeighborNet): the dashed line is not supported by any split, two of the colored groupings are not supported by any split, and two of the others have only a small contradicted split. Thus, the "primary division" and the "clear groupings" mostly represent neighborhoods, and are thus only dubiously supported.

September 21, 2014


I have commented before about the perceived tendency to resist thinking about evolutionary relationships as networks (Resistance to network thinking), and even to present reticulating evolutionary relationships as trees rather than as networks (The dilemma of evolutionary networks and Darwinian trees). Charles Darwin seems to be the guilty party in starting this phenomenon.

This behavior becomes particularly obvious when we consider family genealogies. A good example appears when we consider the family relationships of the Olympian gods of Ancient Greece. Several illustrations of these relationships are gathered together on the Olympian Gods Family Tree web page.

Noteworthy is the particularly frisky nature of Zeus, who "got around a bit", to put it mildly. As shown in the first diagram, Zeus was the offspring of Cronus and Rhea. However, he then fathered children with at least nine people, including two of his own sisters, an aunt, a first cousin, and several first cousins once removed, among others. This creates the complex network shown.

However, not everyone wants to draw family genealogies as reticulating networks. After all, they are usually called "family trees". As shown by the examples below, the most common way to reduce a network to a tree is simply to repeat people's names as often as necessary. That is, rather than have them appear once (representing their birth) with multiple reticulating connections representing their reproductive relationships, they appear repeatedly, once for their birth and once for each relationship, so that there are no reticulations. I will leave it to you to count how often Zeus appears in each of these so-called family trees.

Clearly, this is misleading, and it makes no sense to obscure the fact that a so-called tree is actually a reticulate network. If relationships are reticulate then it is best to illustrate them that way, rather than to disguise the networks as trees.

September 16, 2014


Phylogenetic networks are of two types: those that produce direct evolutionary inferences about gene flow (eg. hybridization networks, HGT networks), and those that display multiple patterns in multivariate datasets without any necessary evolutionary implications. The latter (called data-display networks) can be used both a priori as tools for exploratory data analysis (EDA), and a posteriori as a means of evaluating (or cross-checking) the support for inferences derived from other analyses (such as evolutionary networks).

Here, I present an example of the a posteriori usage.

The data and initial analysis come from:
Fu Q, Meyer M, Gao X, Stenzel U, Burbano HA, Kelso J, Pääbo S. (2013) DNA analysis of an early modern human from Tianyuan Cave, China. Proceedings of the National Academy of Sciences of the USA 110: 2223-2227.

They describe their genome data and evolutionary analysis like this:

"We have extracted DNA from a 40,000-year-old anatomically modern human from Tianyuan Cave outside Beijing, China. To investigate the relationship of the Tianyuan individual to present-day populations, we compared it to chromosome 21 sequences from 11 present-day humans from different parts of the world (a San, a Mbuti, a Yoruba, a Mandenka, and a Dinka from Africa; a French and a Sardinian from Europe; a Papuan, a Dai, and a Han from Asia; and a Karitiana from South America) and a Denisovan individual, each sequenced to 24- to 33-fold genomic coverage. Denisovans are an extinct group of Asian hominins related to Neandertals [and used as an outgroup]. In the combined dataset, 86,525 positions variable in at least one individual are of high quality in all 13 individuals. To more accurately gauge how the population from which the Tianyuan individual is derived was related to Eurasian populations, while taking gene flow between populations into account, we used a recent approach that estimates a maximum-likelihood tree of populations and then identifies relationships between populations that are a poor fit to the tree model and that may be due to gene flow [using the TreeMix program] ... The maximum-likelihood tree [reproduced above] shows that the branch leading to the Tianyuan individual is long, due to its lower sequence quality. However, among Eurasian populations, Tianyuan clearly falls with Asian rather than European populations (bootstrap support 100%). The strongest signal not compatible with a bifurcating tree is an inferred gene-flow event that suggests that 6.7% of chromosome 21 in the Papuan individual is derived from Denisovans ... When this is taken into account, the Tianyuan individual appears ancestral to all Asian individuals studied. We note, however, that the relationship of the Tianyuan and Papuan individuals is not resolved (bootstrap support 31%)."

Setting aside the faux pas about the Tianyuan individual being "ancestral" to the others (it is shown in the tree-based figure as the sister group not the ancestor), most of the other interpretations can be assessed by looking at the multivariate data independently of any evolutionary inference. This can be done using the pairwise nucleotide differences among the samples (provided in Table 1 of the paper) and a NeighborNet data-display network, as shown in the splits graph below.

We can note the following points, some of which support the authors' conclusions and some of which don't (note that the authors refer to their figure as a "tree", although it is an introgression network):
  • All terminal edges in the network are long, and so there is actually not much genomic information on chromosome 21 about relationships.
  • The network splits do roughly match the tree splits, and so the network apparently does reflect some evolutionary information.
  • The identified gene flow from the Denisovan to the Papuan is represented by a clear split in the network. The weight (0.7335) makes it the fifth largest non-trivial split. That is, it is larger than some of the splits that purportedly represent tree-like evolution.
  • The largest split (weight = 2.8942) separates the non-African samples from the African samples + Denisovan outgroup, which does accord with the postulated dispersal of humans out of Africa.
  • The second (1.1459) and third (0.8073) largest splits are near the root of the tree.
  • The European split is the fourth largest (0.7670). The South American sample is included with the Asian group, reflecting the idea that the native people of the Americas migrated there from Asia across the Bering Strait.
  • The relationships among the Asian samples in the network do not all match those in the tree. Notably, the Han+Dai split (0.5124) is smaller than the Han+Karitiana split (0.6292), and yet the former appears in the tree with 100% bootstrap support.
  • The Han+Dai+Karitiana split is well supported (0.4450), but the Han+Dai+Karitiana+Papuan split is not (0.0152), as reflected in the 31% bootstrap value for the latter in the tree.
  • The Han+Dai+Karitiana+Papuan+Tianyuan split is not displayed in the network, although it has a long edge in the tree. The closest network split, as displayed, includes the Denisovan sample. Thus, the network emphasizes the reticulate Denisovan-Papuan relationship at the expense of showing all of the tree-like relationships among the Asian samples.
  • The Tianyuan edge is not long in the network whereas it is long in the tree. This is likely to be because of uncertainty in its placement in the tree, rather than poor sequence quality, as claimed by the authors.
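The ordering of the split weights quoted in the list above can be checked with a few lines. The weights are the ones given in the text; the labels are shorthand added for illustration, and any splits in the network that are not quoted above are not included.

```python
# Split weights as quoted in the list above (labels are shorthand).
weights = {
    "non-African | African+Denisovan": 2.8942,
    "second split near root": 1.1459,
    "third split near root": 0.8073,
    "European": 0.7670,
    "Denisovan+Papuan": 0.7335,
    "Han+Karitiana": 0.6292,
    "Han+Dai": 0.5124,
    "Han+Dai+Karitiana": 0.4450,
    "Han+Dai+Karitiana+Papuan": 0.0152,
}

# Rank the quoted splits from largest to smallest weight.
ranked = sorted(weights, key=weights.get, reverse=True)
rank = {name: i + 1 for i, name in enumerate(ranked)}

for name in ranked:
    print(f"{rank[name]:>2}  {weights[name]:.4f}  {name}")
```

Among the quoted splits, the Denisovan+Papuan split ranks fifth, and the Han+Karitiana split outweighs the Han+Dai split that receives 100% bootstrap support in the tree.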

Thus, the data-display network questions some of the details of the authors' evolutionary network. However, it does support placing the Tianyuan sample with the Asian ones, as well as possible gene flow from the Denisovan sample to the Papuan one.

It thus seems to be a valuable procedure to cross-check any evolutionary analysis with a data-display network. As I have noted before (Networks and bootstraps as tree-support criteria; How networks differ from bootstrapped trees), bootstrap values on a tree are insufficient as a means of assessing the robustness of evolutionary diagrams.

September 14, 2014


I have noted before that the evolutionary history of musical instruments is likely to be a reticulating network rather than being tree-like (Cornets: from a tree to a network). As another illustration of the pattern, we can consider the evolution over the past few centuries of the Spanish or flamenco guitar (taken from the Origem do nome Violão blog post).

This genealogy (with time proceeding from left to right) shows three basic characteristics that seem to be common in anthropological histories. First, there are multiple roots: in this case, three different instruments from the 16th century have provided input into the modern acoustic guitar. Second, there is an early history of reticulation, with ideas for new instrumentation being taken freely from among the existing instruments, in this case presumably in the search for better sound reproduction. Third, there is simple transformational evolution, with new models replacing the previous ones in popularity; for example, over the past 100 years the Spanish guitar has simply gotten larger (this is Cope's Rule).

September 9, 2014


I noted in my previous blog post (Charles Darwin and the coalescent) that the multispecies coalescent needs to be based on a network model not a tree model. This is because reticulation processes occur both within species and between species — there is gene flow within genealogies and within phylogenies.

Reticulate genealogies are nothing new, and I have blogged about some of the best-known human genealogies with reticulations due to consanguinity (marriage between close relatives):
King Charles II of Spain
Charles Darwin
Henri Toulouse-Lautrec
Albert Einstein
Pharaoh Tutankhamun
Pharaoh Cleopatra

Importantly, in the modern world there are quite a few genealogical datasets available for study. For example, the Kinsources repository has c. 100 datasets from around the world, covering multi-generational histories for nearly 350,000 individuals. These data are actively used for research (eg. Bailey et al. 2014).

However, the best documented human genealogies are those for the various Anabaptist populations, who moved from Europe to North America during the 18th and 19th centuries. Anabaptists have mostly closed populations (ie. marriages occur solely within a population), and they are thus inbred, and most importantly they maintain detailed written genealogies. These populations include the Mennonites, Hutterites and Amish, the latter being the best known.

As noted by Agarwala et al. (2001):
"The term 'Anabaptist' literally means 'rebaptizer' and is used to refer to a Christian movement that arose in central Europe in the first half of the 16th century. Adherents support adult baptism, pacifism, and separation of church and state. Among the large Anabaptist groups existing today are Mennonites (who were originally followers of Menno Simons), Amish (originally followers of Jakob Ammann who split away from the Mennonites at the end of the 17th century), and Hutterites (originally followers of Jakob Hutter). Amish and Mennonites emigrated to North America in multiple waves in the 18th and 19th centuries. The Hutterites began emigrating to the northern and western parts of North America in the late 1800s."

Distribution of Amish settlements in North America. Note the rapid expansion over the past 25 years.

The Mennonites originated in the Swiss Alps, and diffused northward into Germany and the Netherlands. The Dutch/North German Mennonites began the migration to America in the 1680s, followed by a much larger migration of Swiss/South German Mennonites beginning in 1707. The Amish are an early split from the Swiss/South German group that occurred in 1693. There are now at least 200,000 Amish in the eastern United States and eastern Canada (see the map above, taken from here), with the numbers apparently growing rapidly with recently increasing movement westward. There are various subgroups (eg. Old Order Amish, New Order Amish). There are about 1.7 million Mennonites worldwide, with c. 150,000 in the eastern United States and eastern Canada. The genealogies of 295,000 Mennonite and Amish individuals from the eastern USA have been databased (Agarwala et al. 2001).

The Hutterites originated as an Anabaptist offshoot in the Tyrolean Alps in the 1500s, but now there are c. 135,000 Hutterites living on 1,350 communal farms in the northern United States (principally South Dakota) and western Canada. Genealogical records trace all extant Hutterites to 90 ancestors who lived during the early 1700s to the early 1800s (see Ober et al. 1999).

These Anabaptist groups are frequently used in medical studies, because it is possible to relate disease occurrences to the recorded genealogy, and thus to assess the genetic component of the disease (eg. Dorsten et al. 1999, Hou et al. 2013). So, the literature is replete with figures showing the distribution of different diseases plotted onto the genealogy. I have included some of the Amish ones here, to illustrate the extreme reticulation that results when inbreeding is ongoing over many generations.

This first one is from Georgi et al. (2014). The diseased people are marked in red.

The next one is from Garner et al. (2001).

This one is from Lee et al. (2008).

The final one is from Racette et al. (2002).

Here is one small part of this genealogy, which emphasizes that between-generation marriages are an important component of the consanguinity.


Agarwala R, Schaffer A, Tomlin J (2001) Towards a complete North American Anabaptist genealogy II: analysis of inbreeding. Human Biology 73: 533-545.

Bailey DH, Hill KR, Walker RS (2014) Fitness consequences of spousal relatedness in 46 small-scale societies. Biology Letters 10: 20140160.

Dorsten L, Hotchkiss L, King T (1999) The effect of inbreeding on early childhood mortality: twelve generations of an Amish settlement. Demography 36: 263-271.

Garner C, McInnes LA, Service SK, Spesny M, Fournier E, Leon P, Freimer NB (2001) Linkage analysis of a complex pedigree with severe bipolar disorder, using a Markov chain Monte Carlo method. American Journal of Human Genetics 68: 1061-1064.

Georgi B, Craig D, Kember RL, Liu W, Lindquist I, Nasser S, Brown C, Egeland JA, Paul SM, Bućan M (2014) Genomic view of bipolar disorder revealed by whole genome sequencing in a genetic isolate. PLoS Genetics 10: e1004229.

Hou L, Faraci G, Chen DT, Kassem L, Schulze TG, Shugart YY, McMahon FJ (2013) Amish revisited: next-generation sequencing studies of psychiatric disorders among the Plain people. Trends in Genetics 29: 412-418.

Lee SL, Murdock DG, McCauley JL, Bradford Y, Crunk A, McFarland L, Jiang L, Wang T, Schnetz-Boutaud N, Haines JL (2008) A genome-wide scan in an Amish pedigree with parkinsonism. Annals of Human Genetics 72: 621-629.

Ober C, Hyslop T, Hauck WW (1999) Inbreeding effects on fertility in humans: evidence for reproductive compensation. American Journal of Human Genetics 64: 225-231.

Racette BA, Rundle M, Wang JC, Goate A, Saccone NL, Farrer M, Lincoln S, Hussey J, Smemo S, Lin J, Suarez B, Parsian A, Perlmutter JS (2002) A multi-incident, Old-Order Amish family with PD. Neurology 58: 568-574.

September 7, 2014


In an earlier blog post (The ultimate phylogenetic network?) I reproduced the lattice network from the anthropologist Franz Weidenreich. This comes close to being as complex as a network can get when applied to groups of organisms. However, when we study the genealogy of individuals, the network can get much more complex. This will be most true when there are marriages between close relatives (consanguinity), which creates inbreeding.

The family pedigree (or family tree!) shown here is for a group of people in a recently isolated population from the southwestern area of The Netherlands. There are 4,645 people involved, covering 18 generations (one row each). The average number of consanguineous loops for the 103 study individuals is 71.7, which is what is creating all of the cross-connections that make the network look so horrendous. (Consanguineous or inbreeding loops are illustrated here.)
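A consanguineous loop arises whenever the two parents of an individual share an ancestor. As a minimal sketch, assuming the pedigree is stored as a map from each child to its (father, mother) pair, such shared ancestors can be found as shown below. The tiny first-cousin pedigree is hypothetical, and this is not the loop-counting procedure used by Liu et al.

```python
def ancestors(person, parents):
    """Return the set of all recorded ancestors of a person."""
    found = set()
    stack = list(parents.get(person, ()))
    while stack:
        p = stack.pop()
        if p not in found:
            found.add(p)
            stack.extend(parents.get(p, ()))
    return found


def shared_ancestors(child, parents):
    """Ancestors common to both parents of `child`; a non-empty result
    means that the child closes at least one consanguineous loop."""
    father, mother = parents[child]
    return ancestors(father, parents) & ancestors(mother, parents)


# Hypothetical pedigree: C1 and C2 are first cousins, and K is their child.
parents = {
    "P1": ("G1", "G2"),
    "P2": ("G1", "G2"),
    "C1": ("P1", "X1"),
    "C2": ("P2", "X2"),
    "K": ("C1", "C2"),
}
loops_k = shared_ancestors("K", parents)
loops_c1 = shared_ancestors("C1", parents)
print(sorted(loops_k), sorted(loops_c1))
```

Here K's parents share the grandparents G1 and G2, whereas C1's parents share no recorded ancestor.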

The genealogy is from:
Liu F, Arias-Vásquez A, Sleegers K, Aulchenko YS, Kayser M, Sanchez-Juan P, Feng BJ, Bertoli-Avella AM, van Swieten J, Axenovich TI, Heutink P, van Broeckhoven C, Oostra BA, van Duijn CM (2007) A genomewide screen for late-onset Alzheimer disease in a genetically isolated Dutch population. American Journal of Human Genetics 81: 17-31.

September 2, 2014


The full title of Charles Darwin's most famous book was On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. It is important to note that this title juxtaposes the concepts of between-species variation and within-species variation (Darwin usually referred to "races" rather than to "breeds", "subspecies", etc). This was one of his major insights: the idea that there is a continuum of variation in biology through time (or, as he put it, that it is arbitrary whether variants are treated as different races or as different species).

As I recently noted, this paved the way for between-species phylogenies to be seen as directly analogous to within-species genealogies (The role of biblical genealogies in phylogenetics); previous applications of genealogies to non-humans (such as those of Buffon and Duchesne) had been explicitly restricted to within-species relationships.

This conceptual integration of within-species and between-species relationships has become explicit in modern biology by using multispecies coalescent models to integrate population genetics and phylogenetics. As noted by Reid et al. (2014):
"These models treat populations, rather than alleles sampled from a single individual, as the focal units in phylogenetic trees. The multispecies coalescent model connects traditional phylogenetic inference, which seeks primarily to infer patterns of divergence between species, and population genetic inference, which has typically focused on intraspecific evolutionary processes. The development of these models was motivated by the common empirical observation that genealogies estimated from different genes are often discordant and the discovery that, if ignored, this discordance can bias parameters of direct interest to systematists, such as the relationships and divergence times among species."

However, as specifically emphasized by Reid et al.:

"In order to reconcile discordance among gene trees and uncover true species relationships, the first gene tree/species tree models assumed that discordance is solely the result of stochastic coalescence of gene lineages within a species phylogeny ... Coalescent stochasticity, however, is not the only source of gene tree discordance. Selection, hybridization, horizontal gene transfer, gene duplication/extinction, recombination, and phylogenetic estimation error can also result in discordance."

They examined this situation by studying the fit of the multispecies coalescent model:

"to 25 published data sets. We show that poor model fit is detectable in the majority of data sets; that this poor fit can mislead phylogenetic estimation; and that in some cases it stems from processes of inherent interest to systematists ... Our analyses suggest that poor fit to the multispecies coalescent model can mislead inference in empirical studies. In the case of recent hybridization, the consequences may be severe, as species divergences are forced to post-date gene divergences ... When topological conflict among coalescent genealogies is the result of ancient hybridization, balancing selection, or gene duplication and extinction, the consequences may be less severe."

In other words, tree-based phylogenetics is inadequate in practice because of gene flow. Within-species genealogies and between-species phylogenies intersect in the concept of a network, not a tree. That is, the multispecies coalescent needs to be based on a network model not a tree model:

"The biological processes that generate variation in gene tree topologies should be explicitly modeled, as should relevant dynamics of molecular evolution. Increasingly complex multispecies coalescent models are being implemented, but there are tradeoffs. Some examine gene duplication and extinction or migration but cannot estimate divergence times."

So, current models are inadequate. It will be interesting to see how these approaches develop to incorporate gene flow (reticulation) into what has heretofore been a tree model (modeling only ancestor-descendant relationships), as we are still in need of methods for estimating rooted evolutionary networks.


Reid NM, Hird SM, Brown JM, Pelletier TA, McVay JD, Satler JD, Carstens BC (2014) Poor fit to the multispecies coalescent is widely detectable in empirical data. Systematic Biology 63: 322-333.

August 31, 2014

August 26, 2014


I have previously noted that the first empirical phylogenetic tree apparently was published by St George Jackson Mivart in late 1865, a full 6 years after Charles Darwin released On the Origin of Species (Who published the first phylogenetic tree?). Mivart was not necessarily the first to start producing such a tree, but he got into print first. For example, Franz Martin Hilgendorf wrote a PhD thesis in 1863 for which he produced a hand-drawn tree, but he did not actually include the tree itself in the thesis (The dilemma of evolutionary networks and Darwinian trees). Also, Ernst Heinrich Philipp August Haeckel claimed to have started work on his series of phylogenetic trees in 1864, but the resulting book, Generelle Morphologie der Organismen, was not published until 1866 (Who published the first phylogenetic tree?).

Another actor in this series of events was Fritz Müller, who can also be considered to have published a tree first, in 1864, albeit a very small one.

Johann Friedrich Theodor Müller (1822–1897)

Müller was born in Germany, but in the 1850s he emigrated to southern Brazil with his brother and their wives. As a naturalist in the Atlantic forest, he studied the insects, crustaceans and plants, and he is chiefly remembered today as the describer of what we now call Müllerian mimicry (the phenotypic resemblance between two or more unpalatable species).
Heinrich Bronn's German translation of the Origin appeared in 1860, and Müller read it and agreed with its central thesis (as did Hilgendorf and Haeckel). Indeed, in 1864 he published a book discussing some of the empirical evidence that he adduced with regard to the Crustacea:
Für Darwin
Verlag von Wilhelm Engelmann, Leipzig.
The book has 91 pages and 67 figures, and the Foreword is dated 7th September 1863. Several copies are available in Google Books (here, here, here).

In this book Müller described the development of Crustacea, illustrating that crustaceans and their larvae could be affected by adaptations and natural selection at any growth stage. He discussed in detail how living forms diverged from ancestral ones, based on his study of aerial respiration, larvae morphology, sexual dimorphism, and polymorphism.

Darwin read the book, and began a life-long correspondence with Müller (some 60 letters were ultimately exchanged between them). Subsequently, Darwin commissioned an English translation of the book, and in 1869 published it with John Murray on commission (i.e. taking the risk himself). Darwin printed 1000 copies but it apparently was not a great success:
Facts and Arguments for Darwin
Translated from the German by W.S. Dallas
John Murray, London.
The book has 144 pages and 67 figures, and the Translator's Preface is dated 15th February 1869. A copy is available in the Biodiversity Heritage Library (here).

The following quotes are from this English translation [Note that Müller's unnecessarily convoluted sentences exist in the original German — this writing style is one reason why the book is not as well known as the works of Darwin and Wallace]:
It is not the purpose of the following pages to discuss once more the arguments deduced for and against Darwin's theory of the origin of species, or to weigh them one against the other. Their object is simply to indicate a few facts favourable to this theory ...
When I had read Charles Darwin's book 'On the Origin of Species,' it seemed to me that there was one mode, and that perhaps the most certain, of testing the correctness of the views developed in it, namely, to attempt to apply them as specially as possible to some particular group of animals ...
When I thus began to study our Crustacea more closely from this new stand-point of the Darwinian theory,—when I attempted to bring their arrangements into the form of a genealogical tree, and to form some idea of the probable structure of their ancestors,—I speedily saw (as indeed I expected) that it would require years of preliminary work before the essential problem could be seriously handled ...
But although the satisfactory completion of the "Genealogical tree of the Crustacea" appeared to be an undertaking for which the strength and life of an individual would hardly suffice, even under more favourable circumstances than could be presented by a distant island, far removed from the great market of scientific life, far from libraries and museums—nevertheless its practicability became daily less doubtful in my eyes, and fresh observations daily made me more favourably inclined towards the Darwinian theory.
In determining to state the arguments which I derived from the consideration of our Crustacea in favour of Darwin's views, and which (together with more general considerations and observations in other departments), essentially aided in making the correctness of those views seem more and more palpable to me, I am chiefly influenced by an expression of Darwin's: "Whoever," says he ('Origin of Species' p. 482), "is led to believe that species are mutable, will do a good service by conscientiously expressing his conviction."
So, for the reason stated, Müller did not produce a complete phylogeny in the book. However, of particular interest to us is the figure on page 6 of the original German edition (page 9 of the translation). It turns out to be a pair of three-taxon statements concerning species of Melita (amphipods), as shown in the figure above (original) and below (translation). Müller has this to say:
[There are five] species of Melita ... in which the second pair of feet bears upon one side a small hand of the usual structure, and on the other an enormous clasp-forceps. This want of symmetry is something so unusual among the Amphipoda, and the structure of the clasp-forceps differs so much from what is seen elsewhere in this order, and agrees so closely in the five species, that one must unhesitatingly regard them as having sprung from common ancestors belonging to them alone among known species.
This is as clear a statement of synapomorphy, and its relationship to constructing a phylogeny, as you could get; and so we could credit Müller with having produced an empirical phylogenetic tree (the one on the left in the figures).

Equally interestingly, Müller then goes on to consider a potentially contradictory character: the secondary flagellum of the anterior antennae, which is missing in one species. This would produce a different three-taxon statement (shown on the right in the figures). He resolves the issue by suggesting that the flagellum might be similar to the situation in other species, where it is "reduced to a scarcely perceptible rudiment—nay, that it is sometimes present in youth and disappears at maturity". This is a clear example of the character conflict that arises when trying to construct an empirical phylogeny; and it was also encountered by Mivart in his studies of primate skeletons (Is this the first network from conflicting datasets?).


Müller did not publish a complete phylogeny, but instead discussed how to produce one, and illustrated the practicality (and necessity) of doing so. In the process, he produced a simple three-taxon statement (which is not even numbered as a figure). Nevertheless, this cladogram is technically the first in print, pre-dating Mivart by a year. Darwin was right to recognize its importance, although he seemed to take a while to bring it to the attention of the English-speaking public. Furthermore, Müller was apparently the first to encounter the empirical difficulty of how to deal with conflicting data, which would produce different phylogenetic trees. This is an issue that is just as important today as it was then.

August 24, 2014


For those of you who do not understand the notation:
Homo apriorius ponders the probability of a specified hypothesis, while Homo pragamiticus is interested in the probability of observing particular data. Homo frequentistus wishes to estimate the probability of observing the data given the specified hypothesis, whereas Homo sapiens is interested in the joint probability of both the data and the hypothesis. Homo bayesianis estimates the probability of the specified hypothesis given the observed data.
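In symbols, writing H for the hypothesis and D for the data (labels assumed here, since the cartoon itself is not reproduced), the five quantities described above, and the standard identities that tie them together, can be sketched as:

```latex
\begin{align*}
\text{Homo apriorius:}    \quad & P(H) \\
\text{Homo pragamiticus:} \quad & P(D) \\
\text{Homo frequentistus:}\quad & P(D \mid H) \\
\text{Homo sapiens:}      \quad & P(D, H) = P(D \mid H)\, P(H) \\
\text{Homo bayesianis:}   \quad & P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
\end{align*}
```

The last line is Bayes' theorem, which shows that the bayesian's quantity combines the quantities pondered by the other three.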

August 19, 2014


Phylogeneticists treat the tree image as having special meaning for themselves. Conceptually, the tree is used as a metaphor for phylogenetic relationships among taxa, and mathematically it is used as a model to analyze phenotypic and genotypic data to uncover those relationships. Irrespective of whether this metaphor / model is adequate or not, it has a long history as part of phylogenetics (Pietsch 2012). Of particular interest has been Charles Darwin's reference to the "Tree of Life" as a simile, since that is clearly the key to the understanding of phylogenetics by the general public.

The principle on which phylogenetic trees are based seems to be the same as that for human genealogies. That is, phylogenies are conceptually the between-species homolog of within-species genealogies. As far as Western thought is concerned, human genealogies make their first important appearance in the Bible, with a rather specific purpose. The Bible contains many genealogies, mostly presented as chains of fathers and sons. For example, Genesis 5 lists the descendants of Adam+Eve down to Noah and his sons, which can be illustrated as a pair of chains (as shown in the first figure); and the rest of Genesis gets from there down to Moses' family, for which the genealogy can be illustrated as a complex tree.

The genealogy as listed in Genesis 5.
Cain's lineage was terminated by the Flood.
However, the theologically most important genealogies are those of Jesus, as recorded in Matthew 1:2-16 and Luke 3:23-38. Matthew apparently presents the genealogy through Joseph, who was Jesus' legal father; and Luke apparently traces Jesus' bloodline through Mary's father, Eli. These two lineages coalesce in David+Bathsheba, and from there they have a shared lineage back to Abraham. Their importance lies in the attempt to substantiate that Jesus' ancestry fulfils the biblical prophecies that the Messiah would be descended from Abraham (Genesis 12:3) through Isaac (Genesis 17:21) and Jacob (Genesis 28:14), and that he would be from the tribe of Judah (Genesis 49:8), the family of Jesse (Isaiah 11:1) and the house of David (Jeremiah 23:5).

That is, these genealogies legitimize Jesus as the prophesied Messiah. Following this lead, subsequent use of genealogies has commonly been to legitimize someone as a monarch, so that royal genealogies have been of vital political and social importance throughout recorded history (see the example in the next figure). This importance was not lost on the rest of the nobility, either, so that documented genealogies of most aristocratic families allow us to identify the first-born son of the first-born son, etc, and thus legitimize claimants to noble titles — genealogies are a way for nobles to assert their nobility.

The genealogy of the current royal family of Sweden. [Note: most children are not shown]
The lineage of the recent monarchs is highlighted as a chain, with an aborted side-branch dashed.
If we focus solely on the line of descent involved in legitimization, then genealogies can be represented as a chain (as shown in the genealogy above). If we include the rest of the paternal lines of descent, then family genealogies can be represented as a tree. If we also include some or all of the maternal lineages, then family genealogies can be represented as a network. For example, the biblical genealogies only rarely name women, but where females are specifically named the genealogies actually form a reticulated network. Jacob produced offspring with both Rachel and Leah, who were his first cousins; and Isaac and Rebekah were first cousins once removed. Even Moses was the offspring of parents who were, depending on the biblical source consulted, either nephew-aunt, first cousins, or first cousins once removed. These relationships cannot be represented in a tree. (See also the complex genealogy of the Spanish branch of the Habsburgs, who were kings of Spain from 1516 to 1700.)

This idea of genealogical chains, trees and networks was straightforward to transfer from humans to other species. Originally, biologists stuck pretty much to the idea of a chain of relationships among organisms, as presented in the early part of Genesis. Human genealogies were traced upwards to Adam and from there to God, and thus species relationships were traced upwards to God via humans. However, by the second half of the 1700s both trees and networks made their appearance as explicit suggestions for representing biological relationships. In particular, Buffon (1755) and Duchesne (1766) presented genealogical networks of dog breeds and strawberry cultivars, respectively.

However, these authors did not take the conceptual leap from within-species genealogies to between-species phylogenies. Indeed, they seem to have explicitly rejected the idea, confining themselves to relationships among "races". It was Charles Darwin and Alfred Russel Wallace, a century later, who first took this leap, apparently seeing the evolutionary continuum that connects genealogies to phylogenies. In this sense, they both took ideas that had been "in the air" for several decades, but previously applied only within species, and applied them to the origin of species themselves. [See the Note below.] Both of them, however, confined themselves to genealogical trees rather than using networks. It seems to me that it was Pax (1888) who first put the whole thing together, and produced inter-species phylogenetic networks (along with some intra-species ones).

In this sense, the biblical Tree of Life has only a peripheral relevance to phylogenetics. Darwin used it as a rhetorical device to arouse the interest of his audience (Hellström 2011), but it was actually the biblical genealogies that were of most practical importance to his evolutionary ideas. Apart from anything else, the original biblical tree was actually the lignum vitae (Tree of Eternal Life) not the arbor vitae (Tree of Life). Similarly, the tree from which Adam and Eve ate the forbidden fruit was the lignum scientiae boni et mali (Tree of Knowledge of Good and Evil), not the arbor scientiae (Tree of Knowledge) that was subsequently used as a metaphor for human knowledge.

Note. Along with phylogenetic trees, Darwin and Wallace did not actually originate the idea of natural selection, which had previously been discussed by people such as James Hutton (1794), William Charles Wells (1818), Patrick Matthew (1831), Edward Blyth (1835) and Herbert Spencer (1852). However, this discussion had been in relation to within-species diversity, whereas Wallace and Darwin applied the idea to the origin of between-species diversity (i.e. the origin of new species).


Buffon G-L de. 1755. Histoire naturelle générale et particulière, tome V. Paris: Imprimerie

Duchesne A.N. 1766. Histoire naturelle des fraisiers. Paris: Didot le Jeune & C.J. Panckoucke.

Hellström N.P. 2011. The tree as evolutionary icon: TREE in the Natural History Museum, London. Archives of Natural History 38: 1-17.

Pax F.A. 1888. Monographische übersicht über die arten der gattung Primula. Bot. Jahrb. Syst. Pflanzeng. Pflanzengeo. 10:75-241.

Pietsch T.W. 2012. Trees of life: a visual history of evolution. Baltimore: Johns Hopkins University Press.

August 17, 2014


These illustrations are from Alper Uzun's Biocomicals web site.

Bioinformaticians' dream

Bioinformaticians' reality

August 12, 2014


Sampling bias refers to a statistical sample that has been collected in such a way that some members of the intended statistical population are less likely to be included than are others. The resulting biased sample does not necessarily represent the population (which it should), because the population members were not all equally likely to have been selected for the sample.

This affects scientific work because all scientific questions are about the population not the sample (i.e. we infer from the sample to the population), and we can only answer these questions if the samples we have collected truly represent the populations we are interested in. That is, our results could be due to the method of sampling but erroneously be attributed to the phenomenon under study instead. Bias needs to be accounted for, but it cannot be assessed by looking at the sampled data alone. Bias can only be addressed via the sampling protocol itself.

In genome sequencing, sampling bias is often referred to as ascertainment bias, but clearly it is simply an example of the more general phenomenon. This is potentially a big problem for next generation sequencing (NGS) because there are multiple steps at which sampling is performed during genome sequencing and assembly. These include the initial collection of sequence reads, assembling sequence reads into contigs, and the assembly of these into orthologous loci across genomes. (NB: For NGS technologies, sequence reads are of short lengths, generally

August 10, 2014


In many games of chance the odds of winning or losing remain constant during play, such as in the street coin-game Two-Up and for the casino Roulette wheel. At the other extreme, the odds of winning are sometimes determined by the players to a much greater extent, such as in the card game Poker. This is why poker is such a popular form of gambling — all players are under the delusion that the advantage lies with them alone.

In between these extremes, there are games of chance where the odds of winning vary depending on the circumstances. If a player can identify these circumstances, then they can increase their wagers when the circumstances are favorable and decrease them when they are unfavorable, thus maximizing their chances of making a profit. This is called Advantage Gambling, and it is amenable to formal mathematical analysis. These analyses have kept a number of mathematicians gainfully employed over the centuries.

Some well-known examples of advantage gambling are the use of Arbitrage Bets in sports betting, and of Card Counting in card games. This blog post is about the latter, especially as applied to the casino card-game of Blackjack. [There are also many similar games played both inside and outside casinos, such as Twenty-one, Vingt-et-un, Spanish 21, Pontoon, etc.]

In blackjack the player is betting their card hand against that of the dealer (not any other player). The basic idea is to be dealt a hand of cards whose face values sum to a final score that is higher than that of the dealer's hand without exceeding a sum of 21. There are many variants throughout the world, although they tend to be minor variations on a single basic theme (as described by Wikipedia). In general, the dealer follows a strict set of rules specifying how many cards they can be dealt, while the player has a free choice regarding their own hand.

Clearly, the composition of the cards being dealt must change throughout a series of hands being dealt, because the deck of cards (or more usually several decks) gradually becomes exhausted. If the cards have been shuffled so that the random order of the cards is very even then there will be little change in composition through time, but if the random order is clustered (as it can be by random chance) then the composition of the cards remaining to be dealt may favor either the dealer or the player.

This favoritism happens because the dealer has to follow a fixed strategy, and certain cards favor that strategy. In particular, the dealer must always be dealt another card when their hand sums to a total in the range 12-16 (and sometimes 17). If the card dealt is a 10, J, Q or K (all of which have a value of ten) then the dealer's sum will exceed 21, and the player will win. Thus, if there is a high proportion of these cards remaining in the deck then the dealer is at a disadvantage relative to the player, who can choose not to take the extra card. On the other hand, if there is a high proportion of low cards remaining (especially 4s, 5s and 6s) then the dealer will not be disadvantaged.

In general play, the casino dealer will have an advantage of 0.5-1%, depending on the precise rules of play and how many decks of cards are in use simultaneously. So, in the long term the casino will make a profit, which is why they are in the gambling business in the first place. However, they make a smaller profit from blackjack than from any of their other games (for example, in roulette the casino's advantage is usually 5.3% in the USA and 2.7% in Europe), and this means that for blackjack the advantage gambler doesn't have to move the advantage very far for it to be in their favor instead of the casino's.
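The roulette figures quoted above can be checked by hand. As a sketch, assume a one-unit bet on a single number that pays 35 to 1, on a wheel with 38 pockets (USA) or 37 pockets (Europe); these payout details are standard but are not stated in the post itself. The expected return per unit staked is then:

```latex
\begin{align*}
E_{\text{USA}}    &= \frac{1}{38}(+35) + \frac{37}{38}(-1) = -\frac{2}{38} \approx -5.26\% \\
E_{\text{Europe}} &= \frac{1}{37}(+35) + \frac{36}{37}(-1) = -\frac{1}{37} \approx -2.70\%
\end{align*}
```

These round to the 5.3% and 2.7% advantages mentioned, both of which are several times larger than the 0.5-1% advantage in blackjack.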

There is a Basic Strategy in blackjack, which stipulates what the gambler should do when their hand has any specified total against that of the dealer's — that is, whether they should Stand, Hit, Double Down, or Split. This was first explained by Roger Baldwin, Wilbert E. Cantey, Herbert Maisel and James P. McDermott in 1956 (Optimum strategy in blackjack. Journal of the American Statistical Association 51: 429-439); and Wikipedia provides a simple exposition. For the gambler, this strategy will lose the least amount of money to the casino in the long term (ie. lose only the 0.5-1% referred to above), as determined by mathematical analysis.

The advantage gambler wants to change these odds. The most common advantage play for blackjack is card counting, and it can change the advantage to be up to 2% in the gambler's favor. The essential idea is to keep a running track of whether the remaining undealt cards are biased towards small values (2, 3, 4, 5, 6) or large values (10, J, Q, K, A). To do this, a pre-specified value is added to the running total for each of the small cards that have already been dealt (and therefore can't still be in the deck), and a pre-specified value is subtracted for each of the large cards. The value of the running count will then indicate how much the advantage is in favor of the gambler. The gambler can then bet according to the size of their advantage.
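The running count described above can be sketched in a few lines of code. The point values used here are those commonly published for the Hi-Lo system (+1 for 2 to 6, 0 for 7 to 9, -1 for 10 to A); they are an assumption for illustration, as is the function name, and other systems simply swap in a different table.

```python
# Hi-Lo point values as commonly published (an assumption; the post does not list them):
# low cards (2-6) count +1, middle cards (7-9) count 0, high cards (10, J, Q, K, A) count -1.
HI_LO = {
    "2": 1, "3": 1, "4": 1, "5": 1, "6": 1,
    "7": 0, "8": 0, "9": 0,
    "10": -1, "J": -1, "Q": -1, "K": -1, "A": -1,
}


def running_count(dealt_cards, points=HI_LO):
    """Sum the point values of every card that has been dealt so far."""
    return sum(points[rank] for rank in dealt_cards)


# Four low cards and two high cards have left the deck,
# so the undealt cards are slightly richer in high cards.
dealt = ["2", "5", "K", "6", "9", "A", "3"]
print(running_count(dealt))  # 2
```

A positive count means that more small cards than large cards have been dealt, so the remaining deck favors the gambler; a full deck sums to zero under these values.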

There is nothing unique about this: "anyone who aspires to play Bridge, Stud Poker, Rummy, Gin, Pinochle, or Go Fish knows that you must keep track of the played cards" (Norman Wattenberger. 2009. Modern Blackjack: an Illustrated Guide to Blackjack Advantage Play). It requires no especial mathematical ability, although you do have to pay attention, and not forget what your count currently is (this is far simpler than playing bridge, where to play well you need to keep precise information about the remaining cards). Blackjack has apparently increased in popularity over the last 40 years because it is one of the few casino games that can consistently be won using expert play (maybe also video poker). However, the casinos will, not unexpectedly, try to stop you from winning via card counting.

The idea of counting cards in blackjack has been around since at least the 1950s, but the first popular text on the subject was Edward O. Thorp's book Beat the Dealer: a Winning Strategy for the Game of Twenty-One (1962). Since then, oodles of card counting systems have been devised, which differ in how many points are to be added to or subtracted from the running total for each card that is dealt. They range from relatively easy to implement to unnecessarily difficult.

We can look at the relationship between the different counting systems using a phylogenetic network. The data for 24 of these systems are available at Norman Wattenberger's Card Counting page (see also Popular Card Counting Strategies). The above graph is a NeighborNet (based on the Manhattan distance) of these data. Systems near each other in the network have a similar assignment of points to cards, while systems further apart are progressively more different from each other. The network shows a simple trend of increasing complexity of the systems from the top-right to the bottom-left. [Note that some of the systems use the same points, and thus appear at the same place in the network, but these do differ in other ways.]
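To make the distance calculation concrete, here is a minimal sketch of the Manhattan distance between counting systems. The three systems and their point values are the commonly published ones, entered by hand for illustration; they are not taken from the 24-system data set used for the network, and the NeighborNet step itself is not reproduced.

```python
from itertools import combinations

# Card values in a fixed order; 10, J, Q and K share one entry because they
# have the same value in blackjack.
CARD_VALUES = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "A"]

# Points per card value, in the order above (commonly published values,
# assumed here for illustration).
SYSTEMS = {
    "Hi-Lo":    (1, 1, 1, 1, 1, 0, 0, 0, -1, -1),
    "K-O":      (1, 1, 1, 1, 1, 1, 0, 0, -1, -1),
    "Hi-Opt I": (0, 1, 1, 1, 1, 0, 0, 0, -1, 0),
}


def manhattan(a, b):
    """Sum of absolute differences between two point assignments."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_table(systems):
    """Manhattan distance for every pair of systems."""
    return {
        (m, n): manhattan(systems[m], systems[n])
        for m, n in combinations(systems, 2)
    }


for pair, dist in distance_table(SYSTEMS).items():
    print(pair, dist)
```

Two systems that assign identical points have a distance of zero, which is why they would sit at the same place in the network.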

This trend correlates quite well with the perceived ease of use of the systems, with the hardest ones to use being highlighted in red in the network and the medium ones in blue. The hardest ones do seem to be the most successful at predicting good betting situations. However, the consensus seems to be that the most complex systems are not that much better than some of the simpler ones — these are slightly less powerful but far easier to use. That is, the differences in difficulty are much greater than are the differences in performance, and so the complex ones are rarely recommended these days.

The powerful but simple systems include KISS III, K-O, REKO and Red Seven. Indeed, K-O appears to be becoming one of the most popular card counting systems. However, the older Hi-Lo is probably the most used counting strategy in existence.

Other games

Actually, consistently winning at blackjack is now old hat. What is far more interesting is trying to be an advantage gambler at games like lotto and the lotteries. Advantage gambling at lotto turns out sometimes to be an investment strategy rather than a gamble. For example, there have been times when the prize money has actually been greater than the cost of the betting tickets required to cover all of the needed number combinations (see The International Lotto Fund) and other times when the prize distribution has made each ticket worth more than it costs (see Massachusetts' Cash WinFall). My favorite, though, is trying to work out how to use advantage gambling for scratch lotteries, the gambling that usually has the worst chance of winning (see this article about Joan Ginther, who has clearly tried).

August 5, 2014


Data-display networks are a means of visualizing complex patterns in multivariate data. One particular use is for displaying the patterns in a set of trees. For example, Consensus Networks and SuperNetworks are splits graphs that display the patterns common to some specified subset of a collection of trees (e.g. a set of equally optimal trees, or a set of trees sampled by a Bayesian or bootstrap analysis). Alternatively, Parsimony Networks try to simultaneously display all of the trees in a collection of most-parsimonious trees for a single dataset.

Another display method for multiple trees is what has been called a Cloudogram (see the post Cloudograms and data-display networks). These superimpose the set of all trees arising from an analysis, so that dark areas in such a diagram will be those parts where many of the trees agree on the topology, while lighter areas will indicate disagreement.

Yet another method for combining trees into a graph while retaining all of the original information from the source trees is the Tree Alignment Graph (TAG), an idea introduced by Stephen A. Smith, Joseph W. Brown and Cody E. Hinchliff (2013. Analyzing and synthesizing phylogenies using tree alignment graphs. PLoS Computational Biology 9: e1003223).

The authors note:
These methods address the problem of identifying common nodes and edges across sets of phylogenetic trees and constructing a data structure that efficiently contains this information while retaining original source information ... Mapping trees into a TAG exploits the fact that rooted phylogenetic trees are in fact a specific type of graph: they are directed, acyclic, and require that each node has, at most, one parent. By relaxing these requirements, we can combine multiple trees into a common graph, while minimizing changes to the semantic interpretations of nodes and edges in the trees. Because they contain nodes and edges directly analogous to those from their source trees, TAGs have the desirable quality of retaining the full identifiability of the original source trees they contain. Additionally, because they are not restricted to the bifurcating model of evolution, TAGs may represent conflict among source trees as reticulations in the graph.
The basic principle is illustrated in the first figure (above). Internal nodes represent collections of terminal nodes, and arcs (directed edges) represent their relationships. Nodes and arcs are added to the growing TAG, each of which represents one relationship shown in one of the original trees. TAG A in the figure shows the result of combining the black, blue and orange trees, while TAG B shows the result of then adding the gray and green trees to TAG A (the arcs are colour-coded). The resulting TAG is thus a database of all of the original information, which can then be queried in any way to provide summaries of the data. In particular, standard network summaries can be used, such as node degree, which will highlight parts of the TAG with interesting characteristics.
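The idea of nodes as collections of terminals, joined by arcs that remember their source tree, can be sketched as follows. This is a much-simplified illustration that assumes every tree has the same set of taxa and is written as nested tuples; it is not the authors' algorithm, and the function names are hypothetical.

```python
from collections import defaultdict


def add_tree(tag, tree, source):
    """Add one rooted tree (nested tuples of leaf names) to the graph.

    `tag` maps each arc (parent, child) to the set of source trees showing it.
    Each node is the frozenset of terminal names below it.
    """
    if isinstance(tree, str):
        return frozenset([tree])
    children = [add_tree(tag, subtree, source) for subtree in tree]
    node = frozenset().union(*children)
    for child in children:
        tag[(node, child)].add(source)
    return node


def parents(tag):
    """Map each node to the set of its parent nodes (more than one = conflict)."""
    result = defaultdict(set)
    for parent, child in tag:
        result[child].add(parent)
    return result


# Two conflicting trees on the same four taxa.
tag = defaultdict(set)
add_tree(tag, ((("A", "B"), "C"), "D"), "tree1")
add_tree(tag, ((("A", "C"), "B"), "D"), "tree2")

for node, node_parents in parents(tag).items():
    print(sorted(node), "has", len(node_parents), "parent(s)")
```

In this toy example the terminals A, B and C each end up with two parents, marking where the two source trees disagree, while the arc from {A, B, C, D} to {A, B, C} is supported by both trees.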

The authors provide two empirical examples of applications. The one shown here involves 100 bootstrap trees for 640 species representing the majority of known lineages from the Angiosperm Tree of Life dataset (chloroplast, mitochondrial, and ribosomal data). The TAG is shown lightly in the background. Superimposed on this, the nodes are coloured to represent the effective number of parent nodes, and their size represents node bootstrap support. Highly supported nodes with a low number of effective parents (large blue nodes) are frequently recovered and confidently placed in the source trees, while highly supported nodes with a high number of effective parents (large and pink or orange) are frequently resolved in the source trees but their placement varies among bootstrap replicates. So, the three largest problem areas as illustrated in the TAG correspond to the Malpighiales, Lamiales and Ericales.

For comparison, a NeighborNet analysis of the same data is shown in the blog post When is there support for a large phylogeny? This simply shows an unresolved blob.

August 3, 2014


Cheese making is about 8,000 years old, and there are now about 1,000 distinct types of cheese throughout the world. As with most ancient crafts, the art of making cheese is to get the microbes to do most of the work for you.

To this end, there has been much interest in the microbial communities that occur in cheese rinds (the bit around the outside). Different communities are expected to be associated with different styles of cheese, since the production process can be quite different. This is shown in the first figure, which emphasizes that much of the difference between cheeses is due to different maturation procedures.

From Wolfe et al. (2014).
Recently, Wolfe BE, Button JE, Santarelli M, and Dutton R (2014. Cheese rind communities provide tractable systems for in situ and in vitro studies of microbial diversity. Cell 158: 422-433) had a look at the dominant genera of bacteria and microfungi in the rind communities of 137 different types of cheese. They don't actually tell us much about which cheeses these were, merely claiming:
We attempted to evenly sample across rind type (24 bloomy rind cheeses, 52 washed rind cheeses, and 61 natural rind cheeses) and geographic regions (87 European cheeses across 9 countries; 50 American cheeses across 13 states from the West Coast to the East Coast). We also attempted to sample across different milk types (77 cow milk, 34 goat milk, 21 sheep milk, and 5 mixed milk) and milk treatments (99 raw milk, 38 pasteurized).
Based on sequencing the bacterial 16S and fungal ITS loci, the authors identified 14 bacterial and 10 fungal genera (moulds and yeasts) that occurred with an average abundance of >1%, as shown in the next figure.

The 137 rind samples with their bacterial (middle row) and fungal (bottom row) genera indicated
by different colours. The order of the samples was determined by UPGMA clustering (top row).
The authors also used shotgun metagenomic sequencing to identify a range of genes in the microorganisms. They present a phylogeny of one particular gene (shown in the next figure) that shows a close relationship between some of the cheese microbes and marine bacteria:
The widespread distribution and high abundance of marine-associated gamma-Proteobacteria, enriched in both washed and bloomy rind cheeses, was an unexpected finding in our survey of taxonomic diversity ... One possible source of these marine microbes is the sea salt used in cheese production.
[Note: the other cheese rind bacterium shown in the phylogeny, Brevibacterium linens, is the one responsible for the unbelievable smell of washed-rind cheeses such as Epoisses, Münster and Limburger. It is also responsible for personal-hygiene issues such as foot odour. You can imagine how it first got into cheese making!]

However, Ropars J, Cruaud C, Lacoste S, and Dupont J (2012. A taxonomic and ecological overview of cheese fungi. International Journal of Food Microbiology 155: 199-210), in a related study, have pointed out the usual problem with microbial phylogenies: gene trees are frequently incongruent. So, the gene phylogeny shown above is not likely to be the species phylogeny. It would thus be of great interest to investigate the full microbial network, rather than looking at a single tree.
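To make "incongruent" concrete, here is a minimal sketch in Python of one common way to detect conflict between two rooted gene trees: compare their clusters (clades), and flag any pair that overlaps without one being nested inside the other. The taxa A to D, the two trees and the function names are all hypothetical, invented for illustration; this is not the analysis used in either of the papers discussed above.

```python
from itertools import product


def clusters(tree):
    """Collect the leaf set below every internal node of a rooted tree.

    A tree is a nested tuple; a leaf is a string (hypothetical taxon name).
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def conflicting(a, b):
    """Two clusters conflict if they overlap but neither contains the other."""
    return bool(a & b) and not (a <= b or b <= a)


def incongruences(tree1, tree2):
    """List every pair of clusters (one from each tree) that conflict."""
    return [
        (a, b)
        for a, b in product(clusters(tree1), clusters(tree2))
        if conflicting(a, b)
    ]


# Two hypothetical gene trees on the same four taxa.
gene_tree_1 = (("A", "B"), ("C", "D"))
gene_tree_2 = (("A", "C"), ("B", "D"))

for a, b in incongruences(gene_tree_1, gene_tree_2):
    print(sorted(a), "conflicts with", sorted(b))
```

If the list of conflicts is empty, the two trees can be displayed on a single tree; if it is not, no single tree shows both histories, which is the situation where a network becomes the more natural summary.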

July 29, 2014


This post is just to let everyone know that Dan Gusfield's long-awaited book on the interface between phylogenetics and population genetics is now available.

The book is aimed at mathematically inclined readers. It has a few contributions from Charles H. Langley, Yun S. Song and Yufeng Wu. The title, ReCombinatorics, is described as "a portmanteau word derived from the single-crossover recombination of the words 'recombination' and 'combinatorics'."

Hardcover 448 pp; ISBN: 9780262027526; $60.00 £30.95
More information is available from The MIT Press.

This new book joins these previous contributions to the genre:

Image from Celine Scornavacca.

July 27, 2014


On pages 72-73 of the book Guide to Urban Moonshining: How to Make and Drink Whiskey (written by Colin Spoelman and edited by David Haskell, 2013, published by Harry N. Abrams), there is an illustration of something called the "American whiskeys family tree". This is reproduced in the article The Bourbon Family Tree for GQ magazine, from where I sourced the copy here.

The author describes it as follows:
This chart shows the major distilleries operating in Kentucky, Tennessee, and Indiana, grouped horizontally by corporate owner, then subdivided by distillery. Each tree shows the type of whiskey made, and the various expressions of each style of whiskey or mash bill, in the case of bourbons. For instance, Basil Hayden's is a longer-aged version of Old Grand-Dad, and both are made at the Jim Beam Distillery.
So, while the vertical axis is indeed a time scale, the trees are only marginally family trees in the genealogical sense. This is much more an attempt to illustrate the corporate ownership of American whiskey, which is made principally from corn (and thus is generically called bourbon, although in Tennessee they seem to rarely use this word). The main distinctions among the brands are (i) whether the non-corn part is made from rye, a little bit more rye, or wheat, and (ii) the length of time it is aged between distillation and sale.

The reticulations among the trees apparently refer to blends. The ghost lineages at the right are described thus:
Willett, formerly only a bottler as Kentucky Bourbon Distillers, has been distilling its own product for about a year; I include the brands that it bottles from other sources for reference.