The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis



February 24, 2015


Today is the third anniversary of starting this blog, and this is post number 325. Thanks to all of our visitors over the past three years — we hope that the next year will be as productive as this past one has been.

I have summarized here some of the accumulated data, in order to document at least some of the productivity.

As of this morning, there have been 238,613 pageviews, with a median of 192 per day. The blog has continued to grow in popularity, with a median of 70 pageviews per day in the first year, 189 per day in the second year, and 353 per day in the third year. The range of pageviews was 172-1148 per day during this past year. The daily pattern for the three years is shown in the first graph.

Line graph of the number of pageviews through time, up to today.
The largest values are off the graph. The green line is the half-way mark.
The inset shows the mean (blue) and standard deviation of the daily number of pageviews.

There are a few general patterns in the data, the most obvious one being the day of the week, as shown in the inset of the above graph. The posts have usually been on Mondays and Wednesdays, and these two days have had the greatest mean number of pageviews.

Some of the more obvious dips include times such as Christmas - New Year; and the biggest peaks are associated with mentions of particular blog posts on popular sites.

Unfortunately, the data are also seriously skewed by visits from troll sites. These have been particularly from the Ukraine, which is solely responsible for the peak between days 900 and 1000. The smaller following peak represents visits from Taiwan.

The posts themselves have varied greatly in popularity, as shown in the next graph. It is actually a bit tricky to assign pageviews to particular posts, because visits to the blog's homepage are not attributed by the counter to any specific post. Since the two most recent posts are the ones that appear on the homepage, these posts are under-counted until they move off the homepage (after which they can be accessed only by a direct visit to their own pages, and thus always get counted). On average, 30% of the blog's pageviews are to the homepage, rather than to a specific post page, and so there is considerable under-counting.

Scatterplot of post pageviews through time, up to last week; the line is the median.
Note the log scale, and that the values are under-counted (see the text).

It is good to note that the most popular posts were scattered throughout the years. Keeping in mind the initial under-counting, the top collection of posts (with counted pageviews) have been:
  • The Music Genome Project is no such thing
  • Charles Darwin's unpublished tree sketches
  • The acoustics of the Sydney Opera House
  • Why do we still use trees for the dog genealogy?
  • How do we interpret a rooted haplotype network?
  • Carnival of Evolution, Number 52
  • Who published the first phylogenetic tree?
  • Phylogenetics with SpongeBob
  • Charles Darwin's family pedigree network
  • Faux phylogenies
  • Evolutionary trees: old wine in new bottles?
  • Network analysis of scotch whiskies
  • Tattoo Monday

This list is not very different to the same time last year. Posts 129 (which is linked in Wikipedia) and 172 continue to receive visitors almost every day.

The audience for the blog continues to be firmly in the USA. Based on the number of pageviews, the visitor data are:
  • United States
  • Ukraine [spurious]
  • United Kingdom
Finally, if anyone wants to contribute, then we welcome guest bloggers. This is a good forum to try out all of your half-baked ideas, in order to get some feedback, as well as to raise issues that have not yet received any discussion in the literature. If nothing else, it is a good place to be dogmatic without interference from a referee!

February 22, 2015


As a means of motivating his interest in speciation, in The Origin of Species Charles Darwin highlighted the diversity of morphological forms among the finches of the Galápagos Islands, in the eastern Pacific Ocean, which he visited while circumnavigating the world in The Beagle. He considered this to be a prime example of biodiversity related to adaptation and natural selection, what we would now call an adaptive radiation.

Recently, the following paper, which provides a genomic-scale study of these birds, has attracted considerable attention:
Lamichhaney S, Berglund J, Almén MS, Maqbool K, Grabherr M, Martinez-Barrio A, Promerová M, Rubin CJ, Wang C, Zamani N, Grant BR, Grant PR, Webster MT, Andersson L (2015) Evolution of Darwin's finches and their beaks revealed by genome sequencing. Nature 518: 371-375.

The authors note:
Darwin's finches are a classic example of a young adaptive radiation. They have diversified in beak sizes and shapes, feeding habits and diets in adapting to different food resources. The radiation is entirely intact, unlike most other radiations, none of the species having become extinct as a result of human activities. Here we report results from whole genome re-sequencing of 120 individuals representing all Darwin's finch species and two closely related tanagers. For some species we collected samples from multiple islands. We comprehensively analyse patterns of intra- and inter-specific genome diversity and phylogenetic relationships among species. We find widespread evidence of inter-specific gene flow that may have enhanced evolutionary diversification throughout phylogeny, and report the discovery of a locus with a major effect on beak shape.

Sadly, the authors try to study the intra- and inter-specific variation principally using phylogenetic trees. They do this in spite of noting that:
Extensive sharing of genetic variation among populations was evident, particularly among ground and tree finches, with almost no fixed differences between species in each group.

Clearly, this situation requires a phylogenetic network for adequate study, as a network can always display at least as much phylogenetic information as a tree, and usually considerably more. The authors do recognize this:
A network constructed from autosomal genome sequences indicates conflicting signals in the internal branches of ground and tree finches that may reflect incomplete lineage sorting and/or gene flow ... We used PLINK to calculate genetic distance (on the basis of proportion of alleles identical by state) for all pairs of individuals separately for autosomes and the Z chromosome. We used the neighbour-net method of SplitsTree4 to compute the phylogenetic network from genetic distances.

However, this network is tucked away as Fig. 3 in the appendices. It is shown here in the first figure. The authors attribute the gene flow to introgression, but occasionally refer to hybridization and convergent evolution. Indeed, they suggest both relatively recent hybridization as well as the possibility of more ancient hybridization between warbler finches and other finches.
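For readers curious about what an identical-by-state distance actually computes, here is a minimal Python sketch. The genotype data and sample names are invented for illustration (the authors used PLINK on whole-genome data); genotypes are coded as alternate-allele counts (0, 1, 2) at each biallelic site:

```python
import numpy as np

def ibs_distance(g1, g2):
    """Proportion of alleles NOT identical by state between two individuals.

    At each biallelic site, genotypes coded 0/1/2 share 2 - |g1 - g2| of
    their 2 alleles, so the distance is the mean of |g1 - g2| / 2.
    """
    g1, g2 = np.asarray(g1), np.asarray(g2)
    return float(np.abs(g1 - g2).mean() / 2.0)

# Invented genotypes for three individuals at six biallelic sites
genotypes = {
    "finch_A": [0, 1, 2, 0, 1, 2],
    "finch_B": [0, 1, 2, 1, 1, 2],
    "finch_C": [2, 1, 0, 2, 1, 0],
}

names = list(genotypes)
dist = [[ibs_distance(genotypes[a], genotypes[b]) for b in names]
        for a in names]
# A symmetric distance matrix like this is the input that a
# neighbour-net implementation (e.g. in SplitsTree4) would take.
```

Identical genotypes give a distance of 0, and opposite homozygotes across all sites give 1.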

Clearly, this network is not particularly tree-like in places, especially with respect to the delimitation of species based on their morphology, as reflected in their current taxonomy. Nevertheless, the authors prefer to present as their main result a:
maximum-likelihood phylogenetic tree based on autosomal genome sequences ... We used FastTree to infer approximately maximum-likelihood phylogenies with standard parameters for nucleotide alignments of variable positions in the data set. FastTree computes local support values with the Shimodaira–Hasegawa test.

This tree is shown in the second figure.

This apparently well-supported tree is not a particularly accurate representation of the pattern shown by the network. Indeed, it makes clear just why it is inadequate to use a tree to study the interplay of intra- and inter-specific variation. Gene flow requires a network for accurate representation, not a tree.

The authors do acknowledge this situation. While they try to date the nodes on their tree, they do note that:
Although these estimates are based on whole-genome data, they should be considered minimum times, as they do not take into account gene flow.

Actually, in the face of gene flow the concept that a node has a specific date is illogical, because the nodes do not represent discrete events (see Representing macro- and micro-evolution in a network). Given the authors' final conclusion, it seems quite inappropriate to rely on trees rather than networks:
Evidence of introgressive hybridization, which has been documented as a contemporary process, is found throughout the radiation. Hybridization has given rise to species of mixed ancestry, in the past and the present. It has influenced the evolution of a key phenotypic trait: beak shape ... The degree of continuity between historical and contemporary evolution is unexpected because introgressive hybridization plays no part in traditional accounts of adaptive radiations of animals.

February 17, 2015


In biology we often distinguish microevolutionary events, which occur at the population level, from macroevolutionary events, which involve species. We have traditionally treated phylogenetics as a study of macroevolution. However, more recently there has been a trend to include population-level events, such as incomplete lineage sorting and introgression.

This is of particular importance for the resulting display diagrams. A phylogenetic tree was originally conceived to represent macroevolution. For example, speciation and extinction occur as single events at particular times, and these events apply to discrete groups of organisms. The taxa can be represented as distinct lineages in a tree graph, and the events by having these lineages stop or branch in the graph.

This idea is easily extended to phylogenetic networks, where the gene-flow events are also treated as singular, so that hybridization or horizontal gene transfer can be represented as single reticulations among the lineages.

These are sometimes called "pulse" events. However, there are also "press" events that are ongoing. That is, a lot of genetic variation is generated where populations repeatedly mix, so that every gene-flow instance is part of a continuous process of mixing. This often occurs, for example, in the context of isolation by distance, such as ring species or clinal variation. Under these circumstances, processes like introgression and HGT can involve ongoing events.

For instance, in an earlier life I once studied three species of plant in the Sydney region (Morrison DA, McDonald M, Bankoff P, Quirico P, Mackay D. 1994. Reproductive isolation mechanisms among four closely-related species of Conospermum (Proteaceae). Botanical Journal of the Linnean Society 116: 13-31). One of the species was ecologically isolated from the other two (it occurred in dry rather than damp habitats), and the other two were geographically isolated from each other (they occurred on separate sandstone uplands with a large valley in between). These species look very different from each other, as shown in the picture above, but looks are deceiving. Where the ecological isolation was incomplete, introgression occurred and admixed populations could be found.

These dynamics are more difficult to represent in a phylogenetic tree or network. We do not have discrete groups that can be represented by lines on a graph, but instead have fuzzy groups with indistinct boundaries. Furthermore, we do not have discrete events, but instead have ongoing (repeated) processes.

Nevertheless, it seems clear that there is a desire in modern biology to integrate macroevolutionary and microevolutionary dynamics in a single network diagram. That is, some parts of the diagram will represent pulse events involving discrete groups and other parts will represent press events among fuzzy groups. This situation seems to be currently addressed by practitioners by first creating a tree to represent the pulse events (and possibly their times), and then adding imprecisely located dashed lines as a representation of ongoing gene flow — see the example in Producing trees from datasets with gene flow. This particular mixture of precision and imprecision seems rather unsatisfactory.

Perhaps someone might like to have a think about this aspect of phylogenetic networks, to see if there is some way we can do better.

February 15, 2015


As usual at the beginning of the week, this blog presents something in a lighter vein.

Homologies lie at the heart of phylogenetic analysis. They express the historical relationships among the characters, rather than the historical relationships of the taxa. As such, homology assessment is the first step of a phylogenetic analysis, while building a tree or network is the second step.

With a colleague (Mike Crisp, now retired), I once wrote a tongue-in-cheek article about how to mis-interpret homologies, and the consequences of this for any subsequent tree-building analysis. This article appeared in 1989 in the Australian Systematic Botany Society Newsletter 60: 24–26. Since this issue of the Newsletter is not online, presumably no-one has read this article since then. However, you should read it, and so I have linked to a PDF copy [1.2 MB] of the paper:
An Hennigian analysis of the Eukaryotae

February 10, 2015


Recently, a number of computer programs have been released that are intended to produce phylogenetic networks representing introgression (or admixture) (see Admixture graphs – evolutionary networks for population biology).

A recent example of the use of these programs is presented by:
Jónsson H, Schubert M, Seguin-Orlando A, Ginolhac A, Petersen L, Fumagalli M, Albrechtsen A, Petersen B, Korneliussen TS, Vilstrup JT, Lear T, Myka JL, Lundquist J, Miller DC, Alfarhan AH, Alquraishi SA, Al-Rasheid KA, Stagegaard J, Strauss G, Bertelsen MF, Sicheritz-Ponten T, Antczak DF, Bailey E, Nielsen R, Willerslev E, Orlando L (2014) Speciation with gene flow in equids despite extensive chromosomal plasticity. Proceedings of the National Academy of Sciences of the USA 111: 18655-18660.

This study presents a phylogenetic analysis of the extant genomes of the genus Equus, the horses, asses and zebras. This analysis leads the authors to the conclusion that there is "evidence for gene flow involving three contemporary equine species despite chromosomal numbers varying from 16 pairs to 31 pairs." The gene flow is indicated by the light-blue reticulations in the first diagram.

One important issue with these types of analyses is the logic on which the procedure is based. Programs like TreeMix (used in this analysis) were developed to allow modelling of gene flow across the branches of trees at a microevolutionary (population) scale. Specifically, the graph generated by TreeMix models singular (pulse) introgression events in phylogenetic history.

The issue is that a tree is produced first, and then reticulations are added to it. The tree represents descent and the reticulations represent gene flow. But how do we produce a tree from a dataset that contains evidence of both descent and gene flow? The authors' initial tree is shown below.

The procedural logic works as follows:
(i) we assume that the traditionally recognized species exist
(ii) we assume that we have a representative sample of them, with one genome each
(iii) we construct a tree based on the assumption that there is no gene flow among the species
(iv) we then assess the species for gene flow, and discover it.

Isn't this rather circular? Surely (iv) invalidates the assumptions inherent in (i)-(iii)? How can we then assess the reliability of the sampling in (ii) and the analyses in (iii)? Why have we made assumption (i)? At best the species are fuzzy groups to one extent or another, and we do not know where we have sampled within the probabilistic space assigned to the groups.

This seems like a very poor way to go about studying the interaction between descent and gene flow. First we assume descent only, and then we assess gene flow. When we find gene flow we continue to accept the results of the initial analyses based on descent alone.

I would hate to have to justify this philosophy to someone outside phylogenetics, because I have a horrible feeling that they would either smile tolerantly or laugh outright.

This between-species situation is even more extreme for those within-species patterns where groups are recognized. Human races and domesticated breeds are two concepts that have received constant criticism. Neither races nor breeds form clear-cut groups, as there are no sharp boundaries between them, due to gene flow. Their "central locations" in genotype space are usually very different, however. Therefore it is quite possible to perform a tree-based analysis of samples from the central locations, and this would tell us a lot about descent. But it would tell us almost nothing about gene flow; and we would have a very distorted view of the phylogenetic history.

February 8, 2015


Over the past century a number of food styles have become internationalized, including hamburgers and fried chicken. Not all of these foodstuffs are nutritious, and some people have noted that not all of them are even particularly edible. However, perhaps the most interesting of these foods is the venerable pizza, not least because the customer has considerable say in what it looks and tastes like, but also because it is made and cooked fresh, right in front of us.

Pizza originated in Italy, Greece, or Persia, depending on how we define pizza. After all, covering flat bread with a topping is an idea that goes back a very long way. In the ancient world, the Egyptians made flat bread; the Indians baked bread in an oven, but without a topping; and the Persians cooked their bread without an oven, but they did put melted cheese on it. The Passion 4 Pizza site notes this more recent history: "The ancient Greeks had a flat bread called plakountos, on which they placed various toppings [eg. herbs, onion and garlic], and we know also that Naples was founded (as Neapolis) by the Greeks; and Naples is the home of the modern pizza."

In 16th century Naples, a yeast-based flat bread was referred to as a pizza, eaten by poor people as a street food; but the idea that led to modern pizza was the use of tomato as a topping. Tomatoes were introduced to Europe from South America in the 16th century, and by the 18th century it was common for the poor of the area around Naples to add tomato to their bread. Pizza was brought to the United States by the Italian immigrants in the late 19th century, and became popular in places like New York and Chicago.

Kenji López-Alt publishes The Pizza Lab, which is part of the Serious Eats blog, and he has taken a serious interest in pizza styles, at least in New York. He recognizes three main styles of pizza, based on their dough, the way it is treated, and the temperature at which it is cooked (see the picture above, left to right):
  • New York
  • Sicilian
  • Neapolitan
He also has several variants on these styles.

As a basis for discussion, I have analyzed the dough ingredients of these three styles, using a phylogenetic network as a tool for exploratory data analysis. To create the network, I first calculated the similarity of the pizzas using the Manhattan distance, and a Neighbor-net analysis was then used to display the between-dough similarities as a phylogenetic network. So, pizza-dough styles that are closely connected in the network are similar to each other based on their ingredients, and those that are further apart are progressively more different from each other.
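The first step of that analysis can be sketched in a few lines of Python. The recipe quantities below are invented placeholders expressed as baker's percentages, not the actual values behind the published network:

```python
# Invented dough recipes as baker's percentages (parts per 100 parts flour);
# placeholders for illustration, not the quantities behind the real network.
doughs = {
    "Neapolitan": {"water": 65, "salt": 2, "yeast": 1, "oil": 0, "sugar": 0},
    "New York":   {"water": 63, "salt": 2, "yeast": 1, "oil": 3, "sugar": 2},
    "Sicilian":   {"water": 68, "salt": 2, "yeast": 1, "oil": 5, "sugar": 0},
}

INGREDIENTS = ["water", "salt", "yeast", "oil", "sugar"]

def manhattan(a, b):
    """Manhattan (city-block) distance: the sum of absolute differences."""
    return sum(abs(a[k] - b[k]) for k in INGREDIENTS)

names = list(doughs)
dist = [[manhattan(doughs[x], doughs[y]) for y in names] for x in names]
# The resulting symmetric distance matrix is what a Neighbor-net
# analysis (e.g. in SplitsTree) takes as input to build the network.
```

Doughs with similar ingredient proportions get small distances, and so end up close together in the resulting splits graph.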

The Neapolitan-style dough is the simplest in terms of ingredients. The dough is not kneaded, but instead is allowed to rise for 3-5 days in the refrigerator, although it remains a thin-crust pizza. It is cooked quickly at a high temperature. The New York-style dough is an offshoot of this that is slightly thicker, and is cooked cooler and slower. The unkneaded dough stands in the fridge for only 1 day. Like all of the styles except the Neapolitan, it uses olive oil in the dough, but unlike any of the others it also contains sugar (to help the crust brown more evenly). The Sicilian-style dough is intended for a thick-crust pizza. It requires only a little kneading, after which it is allowed to rise for 2 hours at room temperature. It is essentially fried in olive oil while baking.

The Sfincione is the original Sicilian pizza style, thinner and chewier than the New York Sicilian. It is also cooked at a lower temperature. The Deep Pan pizza is, of course, another thick-crust style. It is allowed to rise for longer than the Sicilian, and is cooked at a higher temperature. The network shows that these all have closely related doughs.

The Greek-style pizza is allegedly a style "found mostly in the 'Pizza Houses' and 'Houses of Pizza' in New England". As shown by the reticulation in the network, it has characteristics of the Neapolitan pizza dough (relatively low water content) and the Sicilian (relatively high oil content). It is left to rise at room temperature overnight, and is cooked like the New York and Deep Pan pizzas.

There are many other pizza styles, of course, but I do not have recipes for them. For example, there is another Deep Dish style found in Chicago.

February 3, 2015


Computer simulations are an important part of phylogenetics, not least because people use them to evaluate analytical methods, for example for alignment strategies or network and tree-building algorithms.

For this reason, biologists often seem to expect that there is some close connection between simulation "experiments" and the performance of data-analysis methods in phylogenetics, and yet the experimental results often have little to say about the methods' performance with empirical data.

There are two reasons for the disconnection between simulations and reality, the first of which is tolerably well known. This is that simulations are based on a mathematical model, and the world isn't (in spite of the well-known comment from James Jeans that "God is a mathematician"). Models are simplifications of the world with certain specified characteristics and assumptions. Perhaps the most egregious assumption is that variation associated with the model involves independent and identically distributed (IID) random variables. For example, simulation studies of molecular sequences make the IID assumption, by generating substitutions and indels at random in the simulated sequences (called stochastic modeling). This IID assumption is rarely true, and therefore simulated sequences deviate strongly from real sequences, where variation occurs distinctly non-randomly and non-independently, both in space and time.
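To make the criticism concrete, here is a toy simulator of my own devising (not from any published study) that places substitutions uniformly at random along a sequence — exactly the IID behaviour described above, and exactly what real sequences do not exhibit:

```python
import random

def simulate_iid_substitutions(seq, n_subs, alphabet="ACGT", seed=1):
    """Apply n_subs substitutions at positions chosen uniformly at random,
    independently of one another -- the IID assumption discussed in the
    text. Real substitutions cluster non-randomly in space and time."""
    rng = random.Random(seed)
    sites = list(seq)
    for _ in range(n_subs):
        pos = rng.randrange(len(sites))  # every position equally likely
        # replace the current base with a different one, chosen uniformly
        sites[pos] = rng.choice([b for b in alphabet if b != sites[pos]])
    return "".join(sites)

ancestor = "ACGT" * 10
derived = simulate_iid_substitutions(ancestor, n_subs=5)
# The derived sequence differs from the ancestor at up to 5 positions
# (fewer if the same site happens to be hit more than once).
```

A realistic simulator would instead need rate variation among sites, context-dependent substitution, and non-independent indels, which is precisely why IID-simulated sequences deviate from empirical ones.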

The second problem with simulations seems to be less well understood. This is that they are not intended to tell you anything about which data-analysis method is best. Instead, whatever analysis method matches the simulation model most closely will almost always do best, irrespective of any characteristics of the model.

To take a statistical example, consider assessing the t-test versus the Mann-Whitney test — this is the simplest form of statistical analysis, comparing two groups of data. If we simulate the data using a normal probability distribution, then we know a priori that the t-test will do best, because its assumptions perfectly match the model. What the simulation will tell us is how well the t-test does under perfect conditions; and indeed we find that its success is 100%. Furthermore, the Mann-Whitney test scores about 95%, which is pretty good. But we know a priori that it will do worse than the t-test; what we want to know is how much worse. All of this tells us nothing about which test we should use. It only tells us which method most closely matches the simulation model, and how close it gets to perfection. If we change the simulation model to one where we do not know a priori which analysis method is closest (eg. a lognormal distribution), then the simulation will tell us which it is.
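This kind of comparison is easy to reproduce. The pure-Python sketch below simulates normally distributed groups with a true difference in means and tallies how often each test detects it; it uses approximate large-sample critical values (about 2.0 for the t statistic, 1.96 for the Mann-Whitney z) rather than exact p-values, so the exact percentages will depend on the chosen effect and sample sizes:

```python
import math
import random

def t_stat(x, y):
    """Welch two-sample t statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

def mw_z(x, y):
    """Mann-Whitney U statistic as a normal-approximation z score.
    Assumes continuous data (no ties)."""
    nx, ny = len(x), len(y)
    ranks = {v: i + 1 for i, v in enumerate(sorted(x + y))}
    u = sum(ranks[v] for v in x) - nx * (nx + 1) / 2
    mean = nx * ny / 2
    sd = math.sqrt(nx * ny * (nx + ny + 1) / 12)
    return (u - mean) / sd

def power(n_sims=500, n=30, shift=1.0, seed=42):
    """Fraction of simulated normal datasets in which each test
    detects a true shift in means of `shift` standard deviations."""
    rng = random.Random(seed)
    t_rej = mw_rej = 0
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(shift, 1) for _ in range(n)]
        if abs(t_stat(x, y)) > 2.0:   # approx. 5% critical value, df ~ 58
            t_rej += 1
        if abs(mw_z(x, y)) > 1.96:    # 5% critical value for a z score
            mw_rej += 1
    return t_rej / n_sims, mw_rej / n_sims

t_power, mw_power = power()
# Both tests detect the shift most of the time; the t-test, whose
# assumptions exactly match the normal simulation model, does at
# least as well as the Mann-Whitney test on average.
```

Swapping `rng.gauss` for a skewed sampler (e.g. `rng.lognormvariate`) is the change described in the text: we then no longer know a priori which test sits closer to the model.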

This is what mathematicians intended simulations for — to compare methods relative to the models for which they were designed, and to deviations from those models. So, simulations evaluate models as much as methods. They will mainly tell you which model assumptions are important for your chosen analysis method. To continue the example, non-normality matters for the t-test when the null hypothesis being tested is true, but not when it is false. Instead, inequality of variances matters for the t-test when the null hypothesis is false. This is easily demonstrated using simulations, as it also is for the Mann-Whitney test. But does it tell you whether to use t-tests or Mann-Whitney tests?

This is not a criticism of simulations as such, because mathematicians are interested in the behaviour of their methods, such as their consistency, efficiency, power, and robustness. Simulations help with all of these things. Instead it is a criticism of the way simulations are used (or interpreted) by biologists. Biologists want to know about "accuracy" and about which method to use. Simulations were never intended for this.

To take a first phylogenetic example: people simulate sequence data under likelihood models, and then note that maximum-likelihood tree-building does better than parsimony. Maximum likelihood matches the model better than parsimony does, so we know a priori that it will do better. What we learn is how well maximum likelihood does under perfect conditions (it is some way short of 100%) and how well parsimony does relative to maximum likelihood.

As a second example, we might simulate sequence-alignment data with the gaps in multiples of three nucleotides. We then discover that an alignment method that puts gaps in multiples of three does better than ones that allow any size of gap. So what? We know a priori which method matches the model. What we don't know is how well it does (it is not 100%), and how close to it the other methods will get. But this is all we learn. We learn nothing about which method we should use.

So, it seems to me that biologists often over-interpret computer simulations, rather than seeing the results for what they are: simply an exploration of one set of models versus other models within the specified simulation framework. The results have little to say about the data-analysis methods' performance with empirical data in phylogenetics.

February 1, 2015


Here is a new collection of interesting tattoos.

For other examples of circular trees see Tattoo Monday, Tattoo Monday V and Tattoo Monday VII. For circular trees with pictures see Tattoo Monday II, and for DNA trees see Tattoo Monday IV. For other March of Progress tattoos see Tattoo Monday VIII.

January 27, 2015


We don't normally discuss individual papers in this blog (except as example datasets), but today I am simply drawing your attention to what appears to be a little-known paper on phylogenetic networks.

Naruya Saitou has not contributed much to the theory of networks, being instead best known for the development of the neighbor-joining method for phylogenetic trees (the 20th most cited paper ever; see Massive citations of bioinformatics in biology papers). However, this recent paper is of interest:
Naruya Saitou, Takashi Kitano (2013) The PNarec method for detection of ancient recombinations through phylogenetic network analysis. Molecular Phylogenetics and Evolution 66: 507-514.

The paper presents a new method for detecting ancient recombinations through phylogenetic network analysis. Recent recombinations are easily detectable using alternative methods, although splits graphs can also be used, but older recombinations are more tricky.

Importantly, I particularly like the opening paragraph of the paper:
The good old days of constructing phylogenetic trees from relatively short sequences are over. Reticulated or "non-tree" structures are omnipresent in genome sequences, and the construction of phylogenetic networks is now the default for describing these complex realities. Recombinations, gene conversions, and gene fusions are biological mechanisms to produce non-tree structures to gene phylogenies, while gene flow is a well known factor for creating reticulations within population phylogenies.

These are heart-warming words from the developer of the most commonly used tree-building method!

January 25, 2015


It might be nice to live in a world where the mere fact that you are male or female does not attract attention to you within your profession. But while we are waiting for that day, you might like to ask yourself about women in systematics. David Archibald suggests that the tree produced by Anna Maria Redfield is "the first tree – creationist or evolutionary – by a woman and may well be the only such tree by a woman until well into the twentieth century."

Anna Maria Redfield (1800-1888, née Treadwell) is described in these terms by Michon Scott's Strange Science web site:
Born at the dawn of the 19th century, Anna Maria Redfield earned the equivalent of a master's degree from the first U.S. institution of higher learning devoted to female students: Ingham University, and became perhaps the first woman to design a tree-like diagram of animal life. Although tree-like, her diagram didn't show common ancestry but instead showed the "embranchements" established by Georges Cuvier: vertebrates, arthropods, mollusks, and "radiata" (today classified as cnidarian and echinoderm phyla). To be fair, this diagram was published before Darwin's Origin of Species but later editions of her work made no mention of evolution either. Instead, she wrote about our simian cousins, "The teeth, bones and muscles of the monkey decisively forbid the conclusion that he could by any ordinary natural process, ever be expanded into a Man." Still, her elegant work is great fun to behold even now.

The tree-like diagram (shown in miniature above) was a wall chart (1.56 x 1.56 m) called A General View of the Animal Kingdom, published in 1857 by E.B. and E.C. Kellogg, New York. It is heavily illustrated with images of the taxa, their names, and brief notes: eg. "Man alone can articulate sounds, and is capable of improving his faculties or advancing his condition". Only three lithograph copies of the original tree are now known, one of which was sold at auction by Christie's in 2005 for £7,200.

The following year the same publishers produced a companion volume to the chart, called Zoölogical Science, or Nature in Living Forms: Adapted to Elucidate the Chart of the Animal Kingdom, and designed for the higher seminaries, common schools, libraries, and the family circle (1858, reprinted 1860, 1865, 1874). A copy is available in the Biodiversity Heritage Library. Only 57 original copies of the book are now known.

This book of 743 pages is richly illustrated, the artist being unacknowledged in the first edition but credited as E.D. Maltbie from then on. (He is presumably responsible for the chart as well.) The book has the frontispiece shown below, which is an edited version of the base of the tree.

Redfield and her chart have recently been discussed by Susan Butts (2011. Conservation of the Anna Maria Redfield wall chart: A General View of the Animal Kingdom. Society for the Preservation of Natural History Collections Newsletter 25(1): 18-19). She notes:
The wall chart is a masterpiece, with intricate and accurate illustrations of representatives of the animal kingdom portrayed as a Tree of Life, which illuminates the relationships of the major groups of organisms. It is an important document in the study of biology and in the pioneering work of women in science. The wall chart has eloquent phrases, which express a Victorian humanistic view of nature (often intermingled with anthropomorphism, biblical overtones, and the biological superiority of humans).

Redfield's views on evolution are clear from her book, indicating that the relationships shown represent affinity not evolution:
There is no evidence whatever that one species has succeeded, or been the result of transmutation of a former species.

Butts notes that unfortunately Redfield "remains a relatively minor and poorly recorded figure in the history of women in science, let alone biological and evolution studies in general."

January 20, 2015


Charles Darwin's metaphor of the Tree of Life was not a tree, even in The Origin of Species. As noted by Franz Hilgendorf (see The dilemma of evolutionary networks and Darwinian trees) "the branches of a tree do not fuse again", and yet in his book Darwin discusses at least one circumstance when they do precisely that — hybridization.

Darwin's discussion of hybridization occupies all of chapter 8 of the Origin. His stated motivation is to address what many people might see as a fatal objection to his theory of species origins by means of natural selection. One of Darwin's main arguments in the book is that "descent with modification" is continuous, and therefore the distinction between species and varieties (and subspecies, etc) is an arbitrary cut in a continuum of biodiversity. However, it was conventionally accepted that varieties within the same species could cross-breed freely, but any attempt to hybridize distinct species would always fail. Darwin opposes this view by citing extensive evidence showing that varying degrees of sterility are encountered in efforts to cross-breed different species of plants (and a few birds) — if the species are closely related then often there will be a small degree of fertility in the hybrid offspring. So, as two related forms diverge from one another in the course of evolution, their ability to inter-breed gradually diminishes and eventually falls to zero (absolute sterility).

It is important to note that his motivation for writing about hybridization was independent of his ideas about phylogeny. So, he seems not to have noticed the consequence of hybridization for phylogenetic patterns.

This is similar to the situation regarding his so-called "tree diagram", in chapter 4. His motivation for the diagram (the only figure in his book) was a discussion of descent with modification, and particularly the continuity of evolutionary processes. He was expressing his idea about uninterrupted historical connections. In particular, this was part of his concern that there is no fundamental distinction between varieties and species, because evolutionary divergence is continuous — it is all a matter of degree, without sharp boundaries. His Tree of Life image expressed the continuity of evolutionary connections, not phylogenetic patterns. This is clear from his poetic invocation of the biblical Tree of Life, which is about the inter-connectedness of all living things along tree branches, not about patterns of biodiversity.

Implicit in this world view is the idea that the Tree of Life is still a tree in spite of hybridization. That is, Darwin failed to see that his "tree simile" (chapter 4) had to ignore hybridization (chapter 8) in order to work. His figure does not show any evidence of hybridization, only divergence. It was not intended to be what we would now call a phylogeny, but merely an idealized view of divergence and continuity of descent. When introducing the Tree of Life, he was using religious imagery to stimulate the imagination of his readers, and in so doing presented a contradictory argument — there is continuity along the branches as well as continuity of inter-connections.

The alternative conception is that Darwin's Tree of Life was never a tree — it was a network. From this world view, Hilgendorf's dilemma was actually irrelevant. He commented:
An observation which, as far as I know, contradicts these previously discussed views, [would be], that formerly separate species approach each other and finally merge with each other. This would not fit the beautiful image that Darwin presented about the connection of species in a branch-rich tree; the branches of a tree do not fuse again.

Well, they do, even in a Darwinian tree.

January 18, 2015


The Tree of Life and the Tree of Knowledge are images that have appeared in many cultures throughout the world. They are often combined as a cosmic or world tree, with the tree of knowledge supporting the heavens and earth and the tree of life connecting all living beings. However, the word "tree" is obviously rather nebulous in these images, and it can take many forms.

In the christian Bible these trees appear in the garden of Eden in a more restricted form as the Tree of Eternal Life and the Tree of Knowledge of Good and Evil. Even here, though, it is not clear whether they are one and the same tree. For example, only one tree is mentioned in the book of Revelation, when promising a new Eden.

The Tree of Knowledge was co-opted in Medieval times as a symbol of learning, and a metaphor for arranging all human knowledge, the Arbor Scientiae (see Relationship trees drawn like real trees). This idea was adopted by biology in the 1700s, where trees were used as metaphors for the relationships among biological species. In modern parlance, these depicted affinity or phenetic relationships, and so they represented knowledge (not life). In the mid 1800s Charles Darwin (in the Origin of Species) took this pre-existing tree idea and instead made it represent evolutionary relationships among species. In the process he re-named it the Tree of Life, thus once again uniting the Tree of Life and the Tree of Knowledge. We have been stuck with the ToL name ever since.

At about the same time as the rise of the Arbor Scientiae, a combined Tree of Life and Tree of Knowledge also appeared as the central mystical symbol of the Kabbalah of esoteric Judaism, consisting of the 10 Sephirot (enumerations). It is shown above in its full modern form. This is a reinterpretation of the Hebrew Bible, conceptually representing a list of the attributes of God (how God emanates).

In the Kabbalist view, both of the trees in the biblical garden of Eden were alternative perspectives of the Sephirot. The 10 Sephirot are arranged into three columns, with 22 Paths of Connection. As a tree, it has roots above and branches below. To quote Wikipedia:
Its diagrammatic representation, arranged in 3 columns/pillars, derives from Christian and esoteric sources and is not known to the earlier Jewish tradition. The tree, visually or conceptually, represents as a series of divine emanations God's creation itself ex nihilo, the nature of revealed divinity, the human soul, and the spiritual path of ascent by man. In this way, Kabbalists developed the symbol into a full model of reality, using the tree to depict a map of Creation.

My main point here is that by combining two conceptual trees this icon is clearly a network, unlike most other conceptual trees such as the dichotomous Tree of Knowledge.

The Kabbalah started without an image, being described solely in words. The diagram of the Tree used by modern Jewish Kabbalists is usually based on the diagram published in the print edition of Rabbi Moses Cordovero's Pardes Rimonim from 1591 [composed 1548], and sometimes called the "Safed Tree". It is shown in the next figure.

One of the earliest illustrations comes from the 1516 Portae Lucis of Paolo Riccio, a Latin translation of Joseph ben Abraham Gikatilla's most influential kabbalistic work, Sha'are Orah (Gates of Light) from the 1300s. It is shown in the next figure.

There are actually two modern versions of the Kabbalah tree. The one shown here in the first illustration has the crossing diagonals lower down than does the one shown in the second illustration. The one with two diagonals at the bottom is an earlier version that is still favoured by Hermetic Kabbalists. Both made their first public appearance in the Pardes Rimonim.

January 13, 2015


BLAST is a computer program that searches a database for similarity matches to a given query sequence, either DNA or amino acid. It is most commonly used to search the GenBank database for matches to any new sequence that we might happen to have, in the hope that we will find one or more homologous sequences.

To most of us BLAST is a black box, in the sense that we have little idea about the details of how it does what it does. So, maybe we should at least look at what it does, just in case we ever need to know.

About 10 years ago I was working with some EST data. For those of you not old enough to know, ESTs (expressed sequence tags) are short single-pass sequencing reads derived from cDNA clones. In the hope of identifying the coding gene represented by each EST, BLASTX is used to search the GenBank protein database using each translated nucleotide query (in all six possible reading frames). BLASTX produces an E-value for each matching sequence, representing the strength of the match to the query: the expected number of matches at least that good that would arise by chance alone. An E-value is thus not a probability (E-values range from 0 to infinity), but at p=0.050 the corresponding E-value happens to be E=0.051. There is no consensus on what E-value should be taken to indicate a "significant" match.
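The correspondence between p=0.050 and E=0.051 follows from the standard relationship between the two quantities in BLAST's extreme-value statistics, p = 1 − e^(−E). A minimal sketch (the function names here are mine, not anything from the BLAST suite):

```python
import math

def evalue_to_pvalue(e: float) -> float:
    """Probability of seeing at least one match this good by chance,
    given an E-value (the expected number of such chance matches)."""
    return 1.0 - math.exp(-e)

def pvalue_to_evalue(p: float) -> float:
    """Inverse conversion: the E-value corresponding to a p-value."""
    return -math.log(1.0 - p)

# p = 0.050 corresponds to E ~= 0.051, as noted in the text
print(round(pvalue_to_evalue(0.050), 3))   # 0.051
```

For small values the two are nearly identical (e^(−E) ≈ 1 − E), which is why the difference only matters for large E-values.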

I decided to find out what happens if a DNA query sequence varies in either length or GC content. I used both random sequences (which were thus not in GenBank) as well as real sequences (which were in GenBank). The short answer is that the BLASTX results vary a lot. I never published these results because I figured the first thing a referee would do is ask me to explain BLASTX's behaviour, and I did not have an explanation (and still don't).

I present the results here for what they are still worth. Obviously, the results are not restricted to EST data, but apply any time that we use BLASTX.


The content of GenBank is quite different today from what it was back in late 2003, and so the results might well differ if the work were to be repeated. For reference, the first graph shows the GC content of the GenBank protein-coding sequences at the time of my work. Also, it is possible that BLASTX itself has changed — I used v. 2.2.6 with default parameters (BLOSUM62, edge correction, length correction, SEG filtering, universal genetic code, gap penalty 11+k). Maybe some intrepid soul will be inspired to find out what happens nowadays.

Random sequences

I generated sets of 1,000 replicate "ESTs" using the perl script Randseq by M. Raymer (5/27/2003). These sets varied in DNA length (100–1,000 nt) and in GC content (0–94%), but were otherwise random sequences of nucleotides. These sequences are not expected to be homologous to anything already in GenBank, and should thus form BLASTX matches only by random chance.
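Randseq itself was a Perl script, but the sampling scheme is easy to approximate. This is a minimal Python sketch of my own (not Raymer's code) for generating a random sequence of a given length and target GC content:

```python
import random

def random_est(length: int, gc: float, rng: random.Random) -> str:
    """Generate a random nucleotide sequence of the given length,
    drawing G and C at a combined frequency of `gc`."""
    weights = [gc / 2, gc / 2, (1 - gc) / 2, (1 - gc) / 2]  # G, C, A, T
    return "".join(rng.choices("GCAT", weights=weights, k=length))

rng = random.Random(42)  # fixed seed, so the run is reproducible
seq = random_est(450, 0.50, rng)
gc_observed = (seq.count("G") + seq.count("C")) / len(seq)
print(len(seq), round(gc_observed, 2))
```

The observed GC content of any single replicate will scatter around the target, which is why sets of 1,000 replicates are needed to get a stable mean E-value.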

The results for varying the sequence length are shown in the next graph, with each point representing the mean E-value observed. The lines represent four somewhat different GC contents; and the anticipated E-value for random data (0.051) is also shown. Clearly, very few points are near the expected value. The lines all show the same shape, with a minimum E-value near 450 nt, and rising slowly with longer lengths and rising rapidly with shorter lengths.

A more detailed assessment of the results for varying the GC content is shown in the third graph. The lines represent two somewhat different sequence lengths; and the anticipated E-value for random data (0.051) is also shown. It is clear that the E-value is capable of varying by up to seven orders of magnitude in response to variation in the GC content of the sequence.

Real sequences

I used the sequences contained in the Poxvirus Orthologous Clusters database (POCs), which is no longer available online, having since been replaced by the Viral Orthologous Clusters database (VOCs). These virus protein sequences are expected to already be in GenBank, and they should thus form good BLASTX matches.

The POCs database could be queried by both sequence length and GC content, and it was the only such database that I could find at the time. For each combination of length (in 50-nt bands) and GC-value (in 10% bands) I gathered a minimum of 20 sequences. There were few sequences for the shortest lengths, so I chopped up the longest sequences (longer than needed) to increase the sample size. There were also few sequences at the greatest GC values, so I used sequence AE004437.1 from GenBank (a Halobacterium sp.) to increase the sample size.
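The banding scheme just described can be sketched as follows (the helper names are mine; the band widths are the 50-nt and 10% bands stated in the text):

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def band(seq: str) -> tuple[int, int]:
    """Assign a sequence to a (length, GC) cell: 50-nt length bands
    and 10% GC bands, as used for the POCs sampling."""
    length_band = (len(seq) // 50) * 50
    gc_band = int(gc_content(seq) * 100) // 10 * 10
    return length_band, gc_band

print(band("GC" * 60))   # (100, 100): a 120-nt, 100% GC sequence
```

Each cell of this grid then needs its minimum of 20 sequences before a mean E-value can be plotted for it.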

The results are shown in the final graph, with each point representing the mean E-value observed. The E-values are all small, since they represent actual database matches. Clearly, variation in sequence length can lead to orders of magnitude variation in E-value, while variation in GC content has an effect only at longer sequence lengths.


For a program that is supposed to produce comparable results, no matter what the sequence, these BLASTX results are disquieting. After all, BLAST is one of the most cited programs ever (see Massive citations of bioinformatics in biology papers), and yet I suspect that most people do not realize that it behaves like this.

The random sequences assess the effect of false positives. That they vary so much in E-value is amazing. Clearly, BLASTX E-values are not comparable between sequences. It is interesting that GC content seems to have a bigger effect than sequence length — for any given GC content the effect of length is relatively small for sequences longer than c. 600 nt. However, variation in GC content can produce orders of magnitude of effect at any given sequence length.

The real sequences assess the effect of true positives. That they vary in E-value is also not good — the E-values all represent true database matches (and presumably exact ones). Nevertheless, the effect of variation in sequence length and GC content is repeated for these real sequences. However, variation in GC content only has a large effect for the longer sequences, and instead it is the sequence length that produces the orders of magnitude variation in E-value.

You can make of this what you will.

January 11, 2015


To a modern phylogeneticist the answer to this question is obviously "no". Phylogenetic trees occur in the literature with their root at the top, the left or the bottom, and more rarely on the right. The graph has the same interpretation no matter where the root is placed, as all of the edges are implicitly directed away from the root. The tree can even be circular, with the root in the centre and the tree radiating outwards.
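The claim that root placement is purely presentational can be made concrete: once a root is chosen, the parent-child relationships are fixed by the graph itself, no matter how it is drawn on the page. A minimal sketch, using a small hypothetical tree of my own:

```python
from collections import deque

def orient(adjacency: dict, root: str) -> dict:
    """Direct every edge of an undirected tree away from the chosen
    root, returning a child -> parent mapping (root maps to None)."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return parent

# A small undirected tree; choosing the root fixes all directions.
tree = {"r": ["a", "b"], "a": ["r", "c", "d"], "b": ["r"],
        "c": ["a"], "d": ["a"]}
print(orient(tree, "r"))
```

Whether the figure is then drawn with "r" at the top, bottom, left, right, or centre of a circle, this mapping, and hence the interpretation, is unchanged.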

However, this was not always so for genealogies, and indeed this freedom seems to be a product of the past 200 years or so. The history of tree orientation has been discussed in detail by Christiane Klapisch-Zuber (1991. The genesis of the family tree. I Tatti Studies in the Italian Renaissance 4: 105-129).

Originally, genealogies were drawn with the root at the top, as shown in previous blog posts: The first royal pedigree, and The first known pedigree of a non-noble family. These pedigree trees (ie. genealogies of individuals) have a particular ancestor at the root of the "tree", so that the tree expands forwards in time down the page, to increasing numbers of descendants at the leaves (ie. a "descent tree"). This made linguistic sense, because people "descended" from the ancestor down the page. In European languages pages are read top to bottom, and so the natural reading order was the same as the time sequence.

However, this arrangement makes no sense if one refers to the graph as a "tree". Trees have their root at the bottom, not the top. Trying to draw the pedigree as a tree while retaining the original orientation could lead to unusual results, as shown in the first figure, from the end of the 1300s CE (from Universitätsbibliothek, Innsbruck, ms. 590, folio 116r). This is actually an Arbor Consanguinitatis rather than an empirical pedigree — it shows the various relatives of a nominated individual (the man pictured in the center) and their degree of relationship to that person. These diagrams have been used to compute which relatives can marry without committing incest, or which can inherit if a person dies intestate. Jean-Baptiste Piggin, at his web site Macro-Typography, has noted that the earliest known examples are from the 400s CE.

In order to match a real tree, the genealogy has to be read from bottom to top. This implies an ascent through time, instead, with a spreading out of the family upwards through time.

The first known empirical pedigree in which the ancestor is at the base is the Genealogia Welforum, the pedigree of a dynasty of German nobles and rulers (Dukes of Bavaria, and Holy Roman Emperors, successors of the Carolingians). The earliest known example, drawn as part of the Historia Welforum [Welf Chronicle], is shown in the second figure (from Hessische Landesbibliothek, Fulda, ms. D.11 folio 13v). The original text version of the pedigree is dated 1167-1184 CE, with the miniatures added sometime from 1185-1191 CE.

Clearly, this diagram is only sketchily like a tree, with many of the people placed along the main trunk, and medallions hanging off for other relatives. This seems to arise from the pedigree's origin as prose, and the subsequent literal illustration of that prose.

The ancestor is labeled "Welf Primus", and he apparently lived in the time of Charlemagne (the best known of the Carolingian dynasty). The empty space at the top of the chart was apparently intended for a picture of Emperor Frederick I Barbarossa, of the House of Hohenstaufen. The woman at the top right is Henry the Black's daughter Judith, who was the mother of Barbarossa. Intriguingly, the final bend of the Welf trunk to the left, combined with Barbarossa at the top, seems to imply that it is the descendants of Barbarossa who continue the Welf lineage, rather than those bearing the Welf name.

Historically, it seems to have been the proliferation, after about 1200 CE, of illustrations of the biblical Tree of Jesse that popularized the idea of "pedigrees as trees". The next figure shows such a tree from c. 1320 CE (from a Speculum Humanae Salvationis manuscript, Kremsen ms. 243/55). Jesse lies at the base of the tree, and the tree actually arises from him. His descendants then ascend to Jesus, shown at the crucifixion, with Heaven illustrated at the top. The tree thus uses Christ's pedigree to symbolize the ascent of humans to heaven (via his crucifixion), rather than simply the descent of humans through time. That is, the tree correctly represents ascent (as well as descent).

This leaves us contemplating just when we added the final twist to the iconography, by putting a single descendant at the base of the tree, and having the ancestors branching out above as leaves (ie. an "ascent tree"). This means that time flows from the top to bottom of the figure, even though the tree is oriented from bottom to top. This is quite illogical as an analogy, given that the base of a real tree is the origin of its growth (see Goofy genealogies). This particular iconography is not used for phylogenies but is very commonly used for pedigrees.

I have no idea when this first occurred. However, David Archibald (2014. Aristotle's Ladder, Darwin's Tree: The Evolution of Visual Metaphors for Biological Order. Columbia University Press) draws attention to a very tree-like pedigree of Ludwig (Louis III), fifth Duke of Württemberg, from the late 1500s, shown here as the final figure (from Württembergisches Landesmuseum, Stuttgart). Ludwig is at the base of the tree, and ironically he had no descendants (although he married twice). His parents are above him in the tree (Christoph, Duke of Württemberg, to the left, and Anna Maria von Brandenburg-Ansbach, to the right), followed by four further ancestral generations. Note the leaves and hanging fruits, which highlight the tree metaphor.

January 6, 2015


Sometimes there has been discussion about the structural complexity of phylogenetic networks. At one extreme, species phylogenies are seen as trees with occasional reticulations, and at the other end there is a whole cobweb of reticulations with no visible tree. In this context, comments are sometimes made about the likeliness of those outputs from network programs that show extensive gene flow. If a biologist does not believe that the history of "their" organisms involves extensive reticulation, then the algorithmic outputs might be dismissed as unrealistic.

Here I present one well-known example of extensive hybridization, in which the computer programs seem to agree on the same complex solution — the history of common bread wheat.

The data and analyses are from:
Marcussen T, Sandve SR, Heier L, Spannagl M, Pfeifer M, International Wheat Genome Sequencing Consortium, Jakobsen KS, Wulff BB, Steuernagel B, Mayer KF, Olsen OA (2014) Ancient hybridizations among the ancestral genomes of bread wheat. Science 345: 1250092.

The hybridization network shown above is a montage of two different phylogenies from the original paper. It shows four splits, one homoploid hybridization, and two polyploid hybridizations. The time is shown in the circles in units of millions of years (note that the scale is not linear).

The first split (6.5 million years ago) is between the genera Triticum (wheat) and Aegilops (goatgrasses), which are morphologically highly distinct, with Aegilops having rounded glumes rather than keeled glumes. There are currently c.20 recognized species in both Aegilops and Triticum, so only a small part of the diversity is shown in the network.

Domesticated Bread wheat (T. aestivum) is a hexaploid species, with the three diploid genomes being known as A, B and D. Their lineages are labeled and colored in the network diagram. The genome D lineage is the result of a homoploid hybridization (which has been taxonomically treated as part of Aegilops). Bread wheat is then the recent result of two successive allopolyploid hybridizations, with a tetraploid lineage as the intermediate.

Of the other species shown in the network, all of the goatgrasses are wild diploid species, as is T. urartu. T. monococcum is also diploid, with domesticated Einkorn wheat being derived from the wild ancestor. T. turgidum is a tetraploid species, with domesticated Emmer wheat being derived from the wild ancestor — it has recently diversified into many modern wheat species.

This is one of the most complex phylogenetic networks known, although that complexity is at least partly the result of leaving out most of the other diploid species in the Triticum and Aegilops clades. Program outputs that are more complex than this are unlikely to be realistic.

January 3, 2015


Networks are visually more complicated than trees, because there are extra edges representing reticulate relationships. Technically this means that some of the nodes have in-degree >1, and that there are one-to-many connections among these nodes. This can create visual clutter. I recently presented one simple way that might alleviate this (Circular phylograms for phylogenetic networks).

Another possibility is to add to the network what are called meta-nodes. These meta-nodes represent groups of nodes, so that the edges between the meta-nodes and the other nodes can represent different types of relationship. This reduces the one-to-many connections in the graph.

As pointed out by Elijah Meeks at the Digital Humanities Specialist blog, pedigrees provide a neat example of this concept. In this example, there are several types of traditional relationship that can be represented: husband, wife and child. Since each relationship type is labeled explicitly (so the direction of the relationship is clear), the figure can be drawn unrooted.

The example shown here (reproduced from Meeks' post) has the meta-nodes in grey, each representing a family. These nodes are unlabeled, while the person-nodes are labeled with the person's name and noble title. Females have pink nodes, and males blue ones. The edges connecting them to the grey nodes are colour-coded as: blue = husband, pink = wife, orange = child.

So, for example, the right-hand family node indicates that Charles I and Henrietta Maria were husband and wife, and that they had three children: Mary Henrietta, James II and Charles II.
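A hypothetical sketch of the meta-node idea in code: each family node groups typed edges to person nodes, so a person reaches many relatives through one intermediate node rather than through many direct person-to-person edges (the data here echoes the royal example above; the structure and names are my own illustration, not Meeks' code):

```python
# Each family meta-node carries typed edges to person nodes.
families = {
    "family_1": {
        "husband": "Charles I",
        "wife": "Henrietta Maria",
        "children": ["Mary Henrietta", "James II", "Charles II"],
    },
}

def relatives(person: str, families: dict) -> set:
    """All people who share a family meta-node with `person`."""
    related = set()
    for fam in families.values():
        members = {fam["husband"], fam["wife"], *fam["children"]}
        if person in members:
            related |= members - {person}
    return related

print(sorted(relatives("Charles I", families)))
```

Without the meta-node, representing this single family would need direct edges between every pair of its five members; with it, five typed edges to one grey node suffice.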

In this case, the reduction in one-to-many connections does make the relationships clearer, so that interpretation is easy. However, it potentially makes the network more complicated (as Meeks notes) because of "just how tangled up certain families can be" — adding the extra meta-nodes exacerbates the tangling. Meeks provides another example in his blog post.

December 29, 2014


This end-of-year post has nothing to do with networks, or even phylogenetics, although the general principle involved might apply to both. My point here is simply that experts sometimes look foolish when they comment on fields outside their own area of expertise.

As an introductory example, I remember reading a paper in a physics journal that tried to convince the readers that humans could potentially live forever. Unfortunately, the authors confused the concepts of lifespan and longevity, which is pretty basic stuff in population biology. Lifespan is the length of time for which humans normally live. We have more than doubled this over the past millennium, due to changes in sanitation, medication, surgery and safety. Longevity is the length of time for which humans are capable of living. We have not changed this by even one year, as it seems to be related to phenomena like programmed cell death. Changes in lifespan do not therefore entail changes in longevity; all that has happened is that our expected lifespan is now closer to our observed longevity than it previously has been.

More recently, an electrical engineer drifted into the field of literature while claiming to be a scientist — Mikhail Simkin (2013) Scientific evaluation of Charles Dickens. Journal of Quantitative Linguistics 20: 68-73. Sadly, his article displays neither of the characteristics of science (replication and control), nor does it appear to contribute anything much to literature.

As noted on his web page, the author had trouble publishing this article, and he has subsequently received "a flood of criticism", which he naively seems to believe he has rebutted at the Significance blog.

His intention was a simple one: a comparison of the writing style of Charles Dickens and that of Edward Bulwer (later known as Edward Bulwer-Lytton). His premise was: "Edward Bulwer-Lytton is the worst writer in history of letters ... In contrast, Charles Dickens is one of the best writers ever." He put online a quiz with "a dozen representative literary passages, written either by Bulwer-Lytton or by Dickens." The takers had to nominate the author of each quote. Simkin discovered that on average the votes were "about 50%, which is on the level of random guessing. This suggests that the quality of Dickens's prose is the same as that of Bulwer-Lytton." The results are shown in the graph above.
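Whether "about 50%" really is indistinguishable from guessing can be checked with a simple binomial calculation. A sketch (the quiz had a dozen questions; the 7-out-of-12 count below is illustrative, not Simkin's actual data):

```python
from math import comb

def binomial_pvalue_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n trials when each
    trial succeeds with chance probability p (one-sided tail)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# e.g. 7 correct out of 12 is entirely consistent with guessing
print(round(binomial_pvalue_at_least(7, 12), 3))   # 0.387
```

With only 12 questions per respondent, even scores well above 50% would be hard to distinguish from chance, which is another way of saying that the quiz has little statistical power.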

Simkin's intention seems to have been to demonstrate that currently revered and non-revered authors do not differ much in style, which is a contention that I see no reason to disagree with, but if so he has gone about showing this in a remarkably unscientific manner.

Let us take the premise first, for which the author provides no personal justification nor any reference to a published one. It seems patently true that the current fashion is for Dickens to be widely read but Bulwer not. This on its own means little, however, as even the Shakespearean works have had a century or so of being out of fashion, although not in the past couple of hundred years (to the dismay of anyone who has had an English-language education).

Was Bulwer a bad writer? Well, first, the results of Simkin's poll imply "no", at least in comparison to Dickens. But more importantly, many other sources say "no" as well. Indeed, Wikipedia makes a strong case both for his popularity in his own time, and for considerable influence on literature since then. He is so 'obscure' that towns in countries as far apart as Canada and Australia are named after him. His works are so 'poorly known' that we continue to use his expressions "pursuit of the almighty dollar" and "the pen is mightier than the sword". His works have been so 'derided' that several operas are based on his books, including one by Richard Wagner; and authors such as Edgar Allan Poe have paraphrased his words. His books are such 'poor examples' of English that people have felt compelled to translate them into Serbian, German, Russian, Norwegian, Swedish, French, Finnish, Spanish and Japanese, among other languages.

Clearly, the premise that Bulwer represents the nadir of English-language literature holds no water. He is currently obscure, but as John S. Moore has noted, the fact that he is not read does not mean that he is not worth reading.

Indeed, a scientist would immediately note the lack of replication here. Why are "best" and "worst" writers not replicated in the experiment? This would immediately address any possible mis-judgements about potential literary worth. It is repeated patterns that provide convincing evidence in science, not isolated pairwise comparisons. This poll is hardly a "scientific evaluation", as claimed by the author.

Now let us consider the experimental procedure. This consisted of choosing "representative literary passages", without any explanation for how this was done or what were the criteria for choice. Clearly, this choice is the key to the experiment. After all, all the experiment does is show that one can find passages by both Dickens and Bulwer that are hard to distinguish. That could very well be true of almost any pair of writers from the same culture (ie. country and century). The experimental comparison has thus not been controlled, as it would be in science.

What would experimental control look like in this case? Clearly, the issue is one of style, since authors vary their writing style depending on the book, the plot situation, and even the character involved. (One of Bulwer's passages is actually taken from the dialog of one of his characters, which hardly represents the author's own writing style!) The objective, then, must be to find passages that represent the range of styles present in the corpus of each writer. One might try grouping the passages into topics or styles, for example, or whether they describe actions or locations, etc.

Without either replication or control, this literary evaluation cannot be considered to be scientific. Sadly, on his website Simkin has several other so-called scientific comparisons within the arts, designed in exactly the same inadequate way.

As a final note, we can ask why was Bulwer chosen for this comparison in the first place? The choice seems to be almost solely due to various extant parodies of the opening of one of his books, Paul Clifford (1830): "It was a dark and stormy night; the rain fell in torrents ..." For example, this was chosen by Charles Schulz in his Peanuts cartoons, as the opening of one of Snoopy's failed attempts to be a world-renowned author. The full sentence does not actually seem bad, although it tries to cram a bit much information into the number of words available. Thomas Hardy later tried the same thing, but with more success, in The Return of the Native (1878): "A Saturday afternoon in November was approaching the time of twilight ..."

However, the award for sheer bravado surely goes to D.H. Lawrence, in his short story Tickets, Please! (1919), which starts with a paragraph consisting of a sentence of 118 words, followed by sentences of 15 words, 27 words and finally 113 words.** A plethora of commas, colons, semi-colons and dashes are needed to keep the meaning coherent in this page-long paragraph. You and I could not get away with this, which is why Lawrence is considered to be one of the great English literary stylists. Apparently, Bulwer did not get away with it, either.

** My count is based on the original publication in The Strand magazine, which is slightly different to subsequent versions.

December 24, 2014


Season's greetings.

For your Christmas reading, this blog usually provides a seasonally appropriate post on fast-food, including to date: nutrition (McDonald's fast-food) and geography (Fast-food maps). This year, we will focus on the effects of fast-food on people.

Defining fast-food is a bit tricky. The U.S. Census of Retail Trade defines a fast-food establishment merely as one that does not offer table service. However, legislation recently passed in Los Angeles defines fast-food establishments as those that have a limited menu, items prepared in advance or heated quickly, no table service, and disposable wrappings or containers. Some people feel that these definitions should include all pizza restaurants, even those that do offer table service in addition to take-away (or take-out). The latter are sometimes distinguished as fast-casual restaurants rather than fast-food restaurants.

About 90% of Americans say they eat fast-food, including those who visit an establishment on average once per day. The main concern about the effect of fast-food, then, is on people's diet. By "diet" I mean the combination of foodstuffs consumed each day, which may or may not match what is known to be required for a healthy human. Fast-food rarely matches this diet, and so there must be some effect of eating the stuff.

In particular, fast-food has been implicated in what is now known within medicine as the "obesity epidemic" — the observation that an increasing proportion of the people in the developed world are formally classified as obese. The usual symptom of obesity is a body mass index (BMI) > 30 (overweight is 25-30, normal is 18.5-25). BMI is an approximate measure of body fat.
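The cut-offs quoted above can be made concrete with a small calculation (BMI is weight in kilograms divided by the square of height in metres). A minimal sketch, using the category boundaries given in the text:

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

def bmi_category(value):
    # Cut-offs quoted in the text: normal 18.5-25, overweight 25-30, obese > 30
    if value >= 30:
        return "obese"
    if value >= 25:
        return "overweight"
    if value >= 18.5:
        return "normal"
    return "underweight"
```

For example, a 90 kg person who is 1.75 m tall has a BMI of about 29.4, and so is classified as overweight but not (quite) obese.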

Obesity has risen rapidly in recent decades, but there is some evidence that the levels are now beginning to stabilize (Obesity Rates & Trends Overview). The main risk with obesity is its strong association with potentially fatal health problems, notably heart disease, stroke, high blood pressure, and diabetes. Indeed, it has been suggested that obesity may be the greatest cause of preventable death in the United States.

Demonstrating a relationship between fast-food and obesity is not hard, given the high sugar, carbohydrate, fat, and salt content of most of the food items. This results in the intake of more energy than the body uses, and this excess is stored as fat. This pattern shows up clearly in large-scale samples of prevalence, such as this one collated on the DataMasher site, where each point represents a state of the USA.

An obvious issue concerning fast-food is our ability, or lack of it, to understand just how many calories (or kilojoules) there are in fast-food meals. The marketing people seem to have a clear idea about how different fast-food chains are presented in terms of their food quality, as shown in this Perceptual Map.

However, this perception is clearly not accurate in terms of calories, especially for Subway. An article in the British Medical Journal evaluated the ability of people to estimate the calorie content of the fast-food meal they had just purchased. As shown in the next graph, clearly in most cases there was a major under-estimate, and this was worst for the highest-calorie meals. The under-estimation of calorie content was largest among Subway diners. Diners at both Subway and Burger King showed greater under-estimation of meal calorie content than those at McDonald's, whereas diners at Dunkin' Donuts had less under-estimation. In other words, Subway is not as healthy for you as you think it is, but you already know how bad those Donuts are.

One response to this situation has been to insist that fast-food places advertise the calorie content of their food on the menu board itself. For example, it has been suggested that nutrition experts can compose apparently healthy meals based on the nutritional information provided in the menus of fast-food restaurant chains.

This will only have an effect, however, if people actually use this information when choosing their meal. An article in the Journal of Public Health suggested that most young people don't, and that those who eat fast-food most often are the least likely to use it. Indeed, a report from Sandelman Associates showed that the only people who are likely to use calorie information regularly are those with a specific "calorie target" for their personal diet, as shown in this next chart.

Nevertheless, an article published in the British Medical Journal has reported a decrease in the energy content of fast-food purchases after the introduction of calorie information on the menu boards, except at Subway, where there was an increase. (Before the labeling, the Subway meals chosen had fewer calories than those at the other chains, but afterwards they had more!)

Another important feature of fast-food is the usually large portion size, which exacerbates the energy imbalance. An article in the Journal of the American Dietetic Association has shown that not only does modern fast-food exceed dietary standard serving sizes by at least a factor of 2, and sometimes by as much as 8, but these serving sizes have also increased dramatically over the past 50 years.

What is perhaps most surprising is the truly vast difference that can occur between servings of what is allegedly the same fast-food product, not only between countries but within a single country. The following graph is from an article in the International Journal of Obesity. It shows, for the named locations, the amounts of total fat in a meal consisting of 171 g McDonald's french fries and 160 g KFC chicken nuggets. The darker colour indicates the added amounts of industrially produced trans fat. The values in parentheses are the amount of trans fat as a percentage of total fat.

On a somewhat different note, one of the main characteristics of fast-food is the focus on a sweet taste, rather than on a diversity of tastes. In contrast, traditional cooking in many cultures has focussed on mixing together a diversity of complementary ingredients. Indeed, this was the impetus for the formation of the Slow Food movement, founded "to prevent the disappearance of local food cultures and traditions ... and combat people's dwindling interest in the food they eat, where it comes from and how our food choices affect the world around us." (It was organized after a public demonstration at the intended site of a McDonald's franchise at the historic Spanish Steps, in Rome.)

This topic was investigated in detail in an article published in Nature Scientific Reports. The authors produced the following network of food flavours.

Interestingly, they conclude that:
We introduce a flavor network that captures the flavor compounds shared by culinary ingredients. Western cuisines show a tendency to use ingredient pairs that share many flavor compounds, supporting the so-called food-pairing hypothesis. By contrast, East Asian cuisines tend to avoid compound-sharing ingredients.

There is diversity even in the amount of diversity.

December 21, 2014


Here are five more tattoos in our compilation of evolutionary tree tattoos from around the internet. For more examples of this circular design for a phylogenetic tree, in a variety of body locations, see Tattoo Monday, Tattoo Monday V, and Tattoo Monday VII.

December 16, 2014


It has been noted before that we have a wide range of mathematical techniques available for producing data-display networks, most notably the many variants of splits graphs (see Huson & Scornavacca 2011). For example, NeighborNets and Consensus networks are commonly encountered in the phylogenetics literature, and Reduced median networks and Median-joining networks are commonly used for haplotype networks in population biology.

However, there are few techniques used to produce evolutionary networks. Studies of reticulate evolutionary histories, which include recombination networks, hybridization networks, introgression networks and HGT networks, have no unifying theme as yet. So, the biological literature has many papers in which biologists struggle with reticulate evolutionary histories using ad hoc collections of techniques, which often boil down to simply presenting incongruent phylogenetic trees from different datasets (see Morrison 2014a).

So, maybe a brief look at the current state of play with evolutionary networks would be useful. There are enough worthwhile techniques out there for people to be using them more often than they are.


Almost all current phylogenetic methods assume that the basic building unit is a non-recombining sequence block, for which the evolutionary history is strictly tree-like. We tend to call these blocks "genes" and their history "gene trees", but this is just for semantic convenience. In practice, we first collect data for various loci, and we then simply make the assumption that there is recombination between the loci but not within them. This is basically the assumption of independence between loci. At the limit, each nucleotide along a chromosome has a tree-like history, but for aggregations of nucleotides it is all assumptions.

Furthermore, we assume that there are no data errors that will confound any reconstruction of the phylogenetic trees. Possible sources of error include: incorrect data (e.g. contamination), inappropriate sampling (taxa or characters), and model mis-specification. Any of these errors will lead to stochastic variation at best and to bias at worst.

Gene-tree incongruence

Reticulate evolutionary processes lead to gene trees that are not all congruent. However, there are two other processes that have been widely recognized as also producing gene-tree incongruence, but which do not involve reticulation in the strict sense: incomplete lineage sorting (ILS; also known as deep coalescence or ancestral polymorphism), and gene duplication-loss (DL).

Many studies have now shown that stochastic variation due to ILS can be very large (see Degnan & Rosenberg 2009), and that this varies in relation to both the population sizes of the taxa and the times between divergence events. The expectation of completely congruent gene trees is thus very naive, even when the evolutionary history of the taxa has been strictly tree-like. A number of methods have been developed to reconstruct species trees in the face of ILS (Nakhleh 2013).

DL involves gene duplication (which can be repeated to create gene families) followed by selective gene loss. The phylogenetic history of the genes is usually presented as an unfolded species tree, where each gene copy has its own part of the tree. A number of methods have been developed to reconstruct gene DL histories given a "known" species tree, which is called gene-tree reconciliation (Szöllősi et al 2015). However, our interest here is in the reverse process, in which reconstructed but incongruent gene trees are combined into a single species tree, given a model of duplication and selective loss, which is called species-tree inference (which is the same as cophylogeny reconstruction; Drinkwater & Charleston 2014).


Known biological processes such as recombination, reassortment, hybridization, introgression and horizontal gene transfer all create reticulate phylogenetic histories. However, it is a moot point as to whether these processes can be distinguished from each other solely in the context of an evolutionary network (Holder et al. 2001; Morrison 2015). These evolutionary processes operate by distinct biological mechanisms, but the evolutionary patterns that they create can all be rather similar. The processes all result in gene flow among contemporaneous organisms (usually called horizontal flow or transfer), whereas other evolutionary processes involve gene flow from parent to offspring (usually called vertical inheritance), including ILS and DL. These gene flows create incongruent gene histories, which we may detect directly in the data or via reconstructed gene trees. The patterns of incongruence do not necessarily allow us to infer the causal process.

There are a number of differences in pattern, but the consistency of these is doubtful. Polyploid hybridization produces the most distinctive pattern, because there is duplication of the genome in the hybrid. However, subsequent aneuploidy will serve to obscure this pattern. Homoploid hybridization nominally involves 50% of the genome coming from different sources, while introgression ultimately involves a smaller percentage. However, in practice, genome mixtures vary continuously from 0 to 50%. HGT also involves a small percentage of the genome, but in theory it also can vary from 0 to 50%. Reassortment produces mixtures of viral genes, which can occur in such a great number that reconstructing the history is severely problematic.

So, in the absence of independent experimental evidence, distinguishing one form of evolutionary network from another is almost a matter of definition. This has become increasingly obvious in the methodological literature, where semantic confusion abounds.

For example, a network produced directly from a set of characters has usually been called a "recombination network", while one produced from a set of trees has usually been called a "hybridization network", irrespective of what processes the gene trees represent. Furthermore, models that add reticulation events to DL trees have usually referred to the horizontal gene flow as "HGT", whereas models that add reticulation events to ILS trees have usually referred to the horizontal gene flow as "hybridization" (Morrison 2014a). Studies of horizontal gene flow during human evolution have usually referred to "admixture", which is a more process-neutral term.

In many, if not most, cases we might all be better off if network methods simply distinguished gene flow among contemporaries (horizontal) from gene inheritance between generations (vertical), rather than trying to infer a process — process inference can often best take place after network construction. This does not help anthropologists, of course, who are dealing with evolutionary networks where oblique gene flow is possible (so that they do not have Time inconsistency in evolutionary networks).


There seems to be a dichotomy of purposes to current method development, which are neatly summarized by the contrasting theoretical views of Mindell (2013) and Morrison (2014b). These views each recognize that evolutionary history involves both vertical and horizontal processes, but they reconstruct the resulting evolutionary patterns as a species tree and a species network, respectively. Obviously, this blog is dedicated to the latter point of view, but it is the former one (the so-called Tree of Life) that seems to currently dominate the literature.

Focussing on gene-tree inference, Szöllősi et al (2015) provide a comprehensive review of the various models that have been used to describe the dependence between gene trees and species trees. Essentially, gene trees are contained within the species tree, and they may differ from it in relative branch lengths and/or topology. The differences between genes and species are the result of population-level processes, often modeled using the coalescent. These authors recognize four current classes of probabilistic model that combine different evolutionary processes:
  • the DLCoal model, which combines coalescence and DL
  • the DTLSR model and the ODT model, both of which combine gene transfer and DL
  • models that combine hybridization and ILS
  • models of allopolyploidization.
When inferring species trees from gene trees (species-tree inference), we basically combine the scores for all of the gene trees, and then search for the species tree with the best overall score. This involves adding the scores in parsimony analyses, or multiplying the conditional probabilities in likelihood analyses (i.e. in a maximum-likelihood or Bayesian context). Many methods have been developed for inferring a species tree based on multi-locus data. These differ in whether the gene and species trees are estimated simultaneously or sequentially, and in how the gene trees are used to infer the species tree. Nakhleh (2013) and Szöllősi et al (2015) discuss both parsimony and likelihood methods for species-tree inference based on either ILS or DL models.
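The combine-and-search idea can be sketched in a few lines. This is an illustrative toy, not any published method: `gene_score` stands in for whatever per-gene parsimony cost or conditional probability a real method would compute, and the candidate "trees" are opaque objects.

```python
import math

def species_tree_score(species_tree, gene_trees, gene_score, likelihood=False):
    """Combine per-gene scores into one species-tree score.

    Parsimony: add the per-gene costs (lower is better).
    Likelihood: multiply the conditional probabilities, done here
    as a sum of logs for numerical stability (higher is better).
    """
    scores = [gene_score(g, species_tree) for g in gene_trees]
    if likelihood:
        return sum(math.log(p) for p in scores)
    return sum(scores)

def best_species_tree(candidates, gene_trees, gene_score, likelihood=False):
    """Search a list of candidate species trees for the best overall score."""
    key = lambda s: species_tree_score(s, gene_trees, gene_score, likelihood)
    return max(candidates, key=key) if likelihood else min(candidates, key=key)

# Toy usage: "trees" are just strings, and the per-gene parsimony cost is
# the number of character mismatches (a stand-in for a real tree distance).
mismatches = lambda gene, species: sum(a != b for a, b in zip(gene, species))
genes = ["AB|C", "AB|C", "AC|B"]
best = best_species_tree(["AB|C", "AC|B"], genes, mismatches)  # majority wins
```

The hard part in practice is of course not this combination step but the search itself, since tree space (and, worse, network space) is far too large to enumerate, which is why the methods cited above rely on heuristics.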

Extending these ideas to infer networks (rather than species trees) is a bit more tricky, and most of the work to date has involved combining hybridization and ILS. There has been no recent summary of the ideas. However, calculating the parsimony score of a network, given a set of gene-tree topologies, has been addressed by Yu et al (2011); and Yu et al (2013a) have extended these ideas to heuristically search the network space for the optimal network (the one that minimizes the number of extra reticulation lineages in a species tree). Furthermore, methods for computing the likelihood of a phylogenetic network, given a set of gene-tree topologies, have been devised by Yu et al (2012, 2013b); and Yu et al (2014) have extended these ideas to heuristically search for the maximum-likelihood network for limited cases of introgression or hybridization (since they differ only in degree).

There are also several methods that simply use gene-tree incongruence to infer reticulation events in a species network (Huson et al. 2010). Basically, these methods combine gene trees into "hybridization networks" by minimizing the number of reticulations required for reconciliation, measured either by counting the reticulations or calculating the network level. The combinatorial optimization can be based on trees, triplets or clusters, using parsimony as the optimality criterion. These methods model homoploid hybridization by assuming that reticulation is the sole cause of all gene-tree incongruence. This means that they are likely to overestimate the amount of reticulation in a dataset when other processes are co-occurring.

The most completely developed network methods involve data for allopolyploid hybrids. Here, there are multiple copies of each gene, one in each copy of the genome, so that allopolyploid hybrids have more copies than do their diploid parent taxa. To construct a hybridization network topology, Huber et al (2006) developed a parsimony method based on first estimating a multi-labeled gene tree, and then searching for the single-labeled network that best accommodates the multiple gene patterns. The model has been extended to heuristically include ILS (Marcussen et al 2012), as well as dates for the internal nodes (Marcussen et al 2015). Jones et al. (2013) have also developed models that incorporate ILS in a Bayesian context, but only for the case of a single hybridization event between two diploid species (an allotetraploid).

Species-tree inference for a pair of gene phylogenies that may be networks rather than trees has been considered in terms of parsimony by Drinkwater & Charleston (2014).

This brings us to the matter of introgression. The massive recent influx of genome-scale data for hominids has led to the development of methods explicitly for the analysis of what is termed admixture among the lineages. These methods basically work by constructing a phylogenetic tree that includes admixture events, the topology inference being based on allele frequencies. There has been no formal comparison of the methods, and not much application to non-humans. Three such methods have been produced so far (Patterson et al 2012; Pickrell & Pritchard 2012; Lipson et al 2013).

Recombination has been somewhat the poor cousin among the causes of reticulation, as most network methods assume it to be absent. Nevertheless, Gusfield (2014) has recently provided an ample survey of the methods available to date.


Degnan JH, Rosenberg NA (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends in Ecology & Evolution 24: 332-340.

Drinkwater B, Charleston MA (2014) An improved node mapping algorithm for the cophylogeny reconstruction problem. Coevolution 2: 1-17.

Gusfield D (2014) ReCombinatorics: the Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks. MIT Press, Cambridge.

Holder MT, Anderson JA, Holloway AK (2001) Difficulties in detecting hybridization. Systematic Biology 50: 978-982.

Huber KT, Oxelman B, Lott M, Moulton V (2006) Reconstructing the evolutionary history of polyploids from multilabeled trees. Molecular Biology & Evolution 23: 1784-1791.

Huson D, Rupp R, Scornavacca C (2010) Phylogenetic Networks: Concepts, Algorithms, and Applications. Cambridge University Press, Cambridge.

Huson DH, Scornavacca C (2011) A survey of combinatorial methods for phylogenetic networks. Genome Biology & Evolution 3: 23-35.

Jones G, Sagitov S, Oxelman B (2013) Statistical inference of allopolyploid species networks in the presence of incomplete lineage sorting. Systematic Biology 62: 467-478.

Lipson M, Loh P-R, Levin A, Reich D, Patterson N, and Berger B (2013) Efficient moment-based inference of population admixture parameters and sources of gene flow. Molecular Biology & Evolution 30: 1788-1802.

Marcussen T, Heier L, Brysting AK, Oxelman B, Jakobsen KS (2015) From gene trees to a dated allopolyploid network: insights from the angiosperm genus Viola (Violaceae). Systematic Biology 64: 84-101.

Marcussen T, Jakobsen KS, Danihelka J, Ballard HE, Blaxland K, Brysting AK, Oxelman B (2012) Inferring species networks from gene trees in high-polyploid north American and Hawaiian violets (Viola, Violaceae). Systematic Biology 61: 107-126.

Mindell DP (2013) The Tree of Life: metaphor, model, and heuristic device. Systematic Biology 62: 479-489.

Morrison DA (2014a) Phylogenetic networks: a review of methods to display evolutionary history. Annual Research and Review in Biology 4: 1518-1543.

Morrison DA (2014b) Is the Tree of Life the best metaphor, model or heuristic for phylogenetics? Systematic Biology 63: 628-638.

Morrison DA (2015, in press) Pattern recognition in phylogenetics: trees and networks. In: Elloumi M, Iliopoulos CS, Wang JTL, Zomaya AY (eds) Pattern Recognition in Computational Molecular Biology: Techniques and Approaches. Wiley, New York.

Nakhleh L (2013) Computational approaches to species phylogeny inference and gene tree reconciliation. Trends in Ecology & Evolution 28: 719-728.

Patterson NJ, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D (2012) Ancient admixture in human history. Genetics 192: 1065-1093.

Pickrell JK, Pritchard JK (2012) Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genetics 8: e1002967.

Szöllősi GJ, Tannier E, Daubin V, Boussau B (2015) The inference of gene trees with species trees. Systematic Biology 64: e42-e62.

Yu Y, Barnett RM, Nakhleh L (2013a) Parsimonious inference of hybridization in the presence of incomplete lineage sorting. Systematic Biology 62: 738-751.

Yu Y, Degnan JH, Nakhleh L (2012) The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection. PLoS Genetics 8:

Yu Y, Dong J, Liu KJ, Nakhleh L (2014) Maximum likelihood inference of reticulate evolutionary histories. Proceedings of the National Academy of Sciences of the USA 111: 16448-16453.

Yu Y, Ristic N, Nakhleh L (2013b) Fast algorithms and heuristics for phylogenomics under ILS and hybridization. BMC Bioinformatics 14: S6.

Yu Y, Than C, Degnan JH, Nakhleh L (2011) Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology 60: 138-149.