The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis


April 21, 2014


It is always interesting to see what the media make of scientific publications. Some time ago, several of us were involved in a paper in Trends in Genetics advocating the more widespread use of phylogenetic networks (Networks: expanding evolutionary thinking), which seemed mild enough. For example, the Idaho State University press release about the paper made it onto the Phys.Org news site reasonably accurately (Amending the Tree of Life).

However, the Intelligent Design site Evolution News and Views had a different take on things (Demolishing Darwin's Tree), reaching a series of conclusions that might surprise, or even stun, the authors of the original Trends in Genetics paper. You will need to read the ID commentary for yourself (and you should, if only for your own education), but the final set of conclusions will give you some of the flavour:
One can only welcome this paper's bold proposal to overturn entrenched dogma ... the "network" diagram seems conducive to ID research inasmuch as it calls into question universal common ancestry via natural selection (i.e., neo-Darwinism), and seeks to portray the evidence honestly ... It's too soon to tell if Darwin security forces will let this band of independent thinkers gather a following. If nothing else, it shows (notwithstanding the insistences of the National Center for Science Education) that insiders know about the fundamental controversies in evolutionary theory, and are calling for some of the same reforms that advocates of intelligent design do.

I am not sure that all of these conclusions are logically consistent with the words of the original paper.

April 16, 2014


The following text was written a few years ago, but much of it never got published. So, I thought that this might be a good opportunity to make it available, since what it says is still true today.

Since a phylogenetic tree is interpreted in terms of the monophyletic groups that it hypothesizes, it is important to quantitatively assess the robustness of all of these groups (i.e. the degree of support for each branch in the tree) — is the support for a particular group any better than would be expected from a random data set? This issue of clade robustness is the same as assessing branch support on the tree, since each branch represents a clade. Many different techniques have been developed, including:
  1. analytical procedures, such as interior-branch tests (Nei et al. 1985; Sneath 1986), likelihood-ratio tests (Felsenstein 1988; Huelsenbeck et al. 1996b), and clade significance (Lee 2000);
  2. resampling procedures, such as the bootstrap (Felsenstein 1985), the jackknife (Lanyon 1985), topology-dependent permutation (Faith 1991), and clade credibility or posterior probability (Larget and Simon 1999); and
  3. non-statistical procedures, such as the decay index (Bremer 1988), clade stability (Davis 1993), and spectral signals (Hendy and Penny 1993).
Of these, far and away the most popular and widely used method has been the bootstrap technique (Holmes 2003; Soltis and Soltis 2003).

The bootstrap

This method was first introduced by Efron (1979) as an alternative method to jackknifing for producing standard errors on estimates of central location other than the mean (e.g. the median), but it has since been expanded to cover probabilistic confidence intervals as well (Efron and Tibshirani 1993; Davison and Hinkley 1997). It was introduced into phylogenetic studies by Penny et al. (1982) and then formalized by Felsenstein (1985), who suggested that it could be implemented by holding the taxa constant and resampling the characters randomly with replacement, the tree-building analysis then being applied to each of the bootstrap resamples.

Bootstrapping is a Monte Carlo procedure that generates "pseudo" data sets from the original data, and uses these new data sets for its inferences. That is, it tries to derive the population inferences (i.e. the "true" answer) from repeated generation of new samples, each sample being constrained by the characteristics of the original data sample. It thus relies on an explicit analogy between the sample and the appropriate population: that sampling from the sample is the same as sampling from the population. Clearly, the strongest requirement for bootstrapping to work is that the sample be a reasonable representation of the population.
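As a minimal sketch of the resampling step described above (taxa held constant, characters drawn randomly with replacement), the following Python fragment generates pseudo data sets from a toy alignment. The alignment, the taxon names and the function name `bootstrap_alignment` are all made up for illustration; a real analysis would then apply the tree-building method to each replicate, which is not shown here.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap pseudo-replicate of an alignment.

    alignment: dict mapping taxon name -> sequence string (all the same length).
    The taxa are held constant; the character columns are sampled with replacement.
    """
    n_chars = len(next(iter(alignment.values())))
    # Draw n_chars column indices with replacement
    cols = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: "".join(seq[i] for i in cols) for taxon, seq in alignment.items()}

# Hypothetical toy data: three ingroup taxa plus an outgroup, 10 characters
alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTTCGTAC",
    "taxon_C": "ACGAACGTTC",
    "outgroup": "TCGAACGATC",
}

rng = random.Random(1)  # fixed seed so that the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
```

Each replicate has the same taxa and the same number of characters as the original, but some columns appear several times and others not at all.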

Bootstrap confidence intervals are only ever approximate, especially for complex data structures, as they are a fundamentally more ambitious measure of accuracy than is a simple standard error (SE). For example, the usual formula for calculating a confidence interval (CI) when the population frequency distribution is assumed to be normal is: CI = t * SE, where t is the Student t-value associated with the particular sample size and confidence percentage required. However, the main use of bootstrapping is in situations where the population frequency distribution is either indeterminate or is difficult to obtain empirically, and so this simple formula cannot be applied. Getting from the standard error to a confidence interval is then not straightforward. As a result, there are actually several quite distinct procedures for performing bootstrapping (Carpenter and Bithell 2000), with varying degrees of expected success.

Types of bootstrap

The original technique is called the percentile bootstrap. It is based on the principle of using the minimum number of ad hoc assumptions, and so it merely counts the percentage of bootstrap resamples that meet the specified criteria. For example, to estimate the standard error of a median, the median can be calculated for each bootstrap resample and then the standard deviation of the resulting frequency distribution will be the estimated standard error of the original median. The method is thus rather simplistic, and is often referred to as the naïve bootstrap, because it assumes no knowledge of how to calculate population estimates. It is a widespread method, as it can be applied even when the other methods cannot. However, it is known to have certain problems associated with the estimates produced, particularly for confidence intervals, such as bias and skewness (especially when the parent frequency distribution is not symmetrical). These were pointed out right from the start (Efron 1979), and efforts have subsequently been made to deal with them. Nevertheless, this is the form of bootstrap introduced by Felsenstein (1985), and it is the one used by most phylogeny computer programs. It is therefore the one that will be discussed in more detail below.
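The median example can be sketched in a few lines of Python. The sample values, the function name `bootstrap_se_median` and the choice of 1000 resamples are assumptions made for illustration only; the percentile interval at the end is the naïve one, with no correction for bias or skewness.

```python
import random
import statistics

def bootstrap_se_median(sample, n_resamples=1000, seed=0):
    """Estimate the standard error of the median by naive bootstrapping."""
    rng = random.Random(seed)
    n = len(sample)
    # Median of each resample (same size as the sample, drawn with replacement)
    medians = [statistics.median(rng.choices(sample, k=n))
               for _ in range(n_resamples)]
    # The standard deviation of the bootstrap medians estimates the standard error
    return statistics.stdev(medians), medians

# Hypothetical, right-skewed sample
sample = [2.1, 3.4, 3.9, 4.4, 5.0, 5.2, 6.8, 7.5, 9.1, 12.3]
se, medians = bootstrap_se_median(sample)

# Naive 95% percentile interval: read off the 2.5% and 97.5% points
medians.sort()
lower = medians[int(0.025 * len(medians))]
upper = medians[int(0.975 * len(medians)) - 1]
print(f"median = {statistics.median(sample)}, SE = {se:.3f}, "
      f"95% percentile interval = ({lower}, {upper})")
```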

These known problems with the naïve bootstrap can be overcome by using bias-corrected (BC) bootstrap estimates — that is, the bias is estimated and removed from the calculation of the confidence interval. Possible dependence of the standard error on the parameter being estimated, which creates skewness, can be dealt with by using bias-corrected and accelerated (BCa) bootstrap estimates, so that the bias and skewness are both estimated and removed from the calculation of the confidence interval. The BCa method is the one usually recommended for use (Carpenter and Bithell 2000), because it corrects for both bias and skewness. This method is much slower to calculate than the simple percentile bootstrap, because it requires an extra parameter to be estimated for each of the bias and skewness corrections, and the latter correction is actually estimated by performing a separate jackknife analysis on each bootstrap resample (which means that the analysis can take 100 times as long as a naïve analysis). There have been several attempts to apply this form of correction methodology to bootstrapping in a phylogenetic context (Rodrigo 1993; Zharkikh and Li 1995; Efron et al. 1996; Shimodaira 2002), but while these can be successful at correcting bias and skewness (Sanderson and Wojciechowski 2000) these have not caught on, possibly because of the time factor involved.

Alternatively, we can decide not to be naïve when calculating confidence intervals, and to therefore calculate them in the traditional manner, using the standard error and the t-distribution. However, we then need to overcome any non-normal distribution problems of these two estimates by estimating both of them using bootstrapping. That is, bootstrapped-t confidence intervals are derived by calculating both the standard error and the t-value using bootstrapping, and then calculating the confidence interval as ±t * SE. To many people, this is the most natural way to calculate confidence intervals, since it matches the usual parametric procedure, and thus it is frequently recommended (Carpenter and Bithell 2000). Once again, this method is much slower to calculate than the percentile bootstrap, because the t-value is actually estimated by performing a separate bootstrap analysis on each bootstrap resample (which means that the analysis can take 100 times as long as a naïve analysis). This methodology seems not to have yet been suggested in a phylogenetic context, and in any case the time factor may be restrictive.
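A rough sketch of the bootstrapped-t idea, for the mean of a small sample, might look like the following. The data, the function names and the resample counts (200 outer, 50 inner) are assumptions chosen to keep the example quick; they are not recommendations. The nested loop is what makes the method so much slower than the percentile bootstrap.

```python
import random
import statistics

def bootstrap_se(sample, stat, n_resamples, rng):
    """Bootstrap estimate of the standard error of stat(sample)."""
    n = len(sample)
    values = [stat(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(values)

def bootstrap_t_interval(sample, stat=statistics.mean,
                         outer=200, inner=50, alpha=0.05, seed=0):
    """Bootstrapped-t confidence interval for stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    estimate = stat(sample)
    se = bootstrap_se(sample, stat, outer, rng)

    # For each outer resample, a separate inner bootstrap estimates its SE,
    # giving one studentized value t* = (stat* - estimate) / SE*
    t_values = []
    for _ in range(outer):
        resample = rng.choices(sample, k=n)
        inner_se = bootstrap_se(resample, stat, inner, rng)
        if inner_se > 0:  # skip degenerate resamples
            t_values.append((stat(resample) - estimate) / inner_se)

    t_values.sort()
    t_lo = t_values[int((alpha / 2) * len(t_values))]
    t_hi = t_values[int((1 - alpha / 2) * len(t_values)) - 1]
    # The upper t quantile sets the lower limit, and vice versa
    return estimate - t_hi * se, estimate - t_lo * se

# Hypothetical sample
sample = [2.1, 3.4, 3.9, 4.4, 5.0, 5.2, 6.8, 7.5, 9.1, 12.3]
lower, upper = bootstrap_t_interval(sample)
print(f"mean = {statistics.mean(sample):.2f}, "
      f"95% bootstrap-t interval = ({lower:.2f}, {upper:.2f})")
```

Because the bootstrap t-values need not be symmetrical about zero, the resulting interval need not be symmetrical about the estimate, unlike the parametric ±t * SE.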

It is also possible to calculate test-inversion confidence intervals. This idea is based on the reciprocal relationship of statistical tests and confidence intervals, where (for example) non-overlapping 95% confidence intervals indicate statistically significant patterns at p [...]

[...] bootstrap values >75% tend to be underestimates of the amount of support while they are overestimates below this level. The graph is based upon 1000 bootstrap resamples of 100 simulated characters for a clade of three taxa plus outgroup (based on data presented by Zharkikh and Li 1992a). The true probability represents the amount of character support for the clade in the simulated data, while the bootstrap probability is the proportion of resamples that included the clade.
These studies have demonstrated that the probability of bootstrap resampling supporting the true tree may be either under- or overestimated, depending on the particular situation. For example, bootstrap values >75% tend to be underestimates of the amount of support, while they may be overestimates below this level, as shown in the first graph (above). That is, when the branch support is strong (i.e. the clade is part of the true tree) there will be an underestimation and when the support is weak (i.e. the clade is not part of the true tree) there will be an overestimation. This situation has been reported time and time again, with various theoretical explanations (e.g. Felsenstein and Kishino 1993; Efron et al. 1996; Newton 1996), although there are dissenting voices (e.g. Taylor and Piel 2004) as would be expected for a complex situation. Unfortunately, practitioners seem to ignore this fact, and to assume incorrectly that bootstrap values are always underestimates.

Just as importantly, the theoretical studies show that the pattern of over- and underestimation depends on (i) the shape of the tree and the branch lengths, (ii) the number of taxa, (iii) the number of characters, (iv) the evolutionary model used, and (v) the number of bootstrap resamples. This was first reported by Zharkikh and Li (1992a), and has been reconfirmed since then. For example, with few characters the bootstrap index tends to overestimate the support for a clade, whereas with more characters it tends to underestimate it. This is particularly true if the number of phylogenetically informative characters is increased or the number of non-independent characters is increased; and the index becomes progressively more conservative (i.e. lower values) as the number of taxa is increased.

Moreover, these patterns of under- and overestimation are increased with an increasing number of bootstrap replications, as shown in the next graph; this is called "being wrong, with confidence".

An example of the relationship between the true clade probability and the observed non-parametric bootstrap proportion for two simulated data sets with different numbers of characters (as shown). The lines are based on data presented by Zharkikh & Li (1995) for 1000 bootstrap resamples of a clade of three taxa plus outgroup.
The following pair of graphs shows the effect of varying the evolutionary model used to generate the data, where under-specification of the analysis model leads to a general over-estimate of the true probability (cross-over at p=0.8, as shown in the first graph of the pair), while matching the generating and analysis models leads to a general under-estimation (cross-over at p=0.3, as shown in the second graph of the pair).

An example of the relationship between the true tree probability and the difference between the observed percentile bootstrap proportion and the true probability for two simulated data sets. The label in the bottom corner shows the substitution model used to simulate the data, then the model assumed in the bootstrap analysis (the sequence length is 100 nucleotides); JC69 = Jukes-Cantor, GTRG = general time-reversible + gamma-distributed among-site rate variation. The points are based on data presented by Huelsenbeck & Rannala (2004).
These are serious issues, which seem to be often ignored by practitioners. We can't just assume that the "true" support value is larger than our observed bootstrap value. In particular, this means that bootstrap values are not directly comparable between trees, even for the same taxa, and thus there can be no "agreed" level of bootstrap support that can be considered to be "statistically significant". A bootstrap value of 90% on a branch on one tree may actually represent less support than a bootstrap value of 85% on another tree, depending on the characteristics of the dataset concerned and the bootstrapping procedure used (although within a single tree the values should be comparable).

This complex situation means that we have to consider carefully how best to interpret bootstrap values in a phylogenetic context (Sanderson 1995). The bootstrap proportion (i.e. the proportion of resampled trees containing the branch/clade of interest) has variously been interpreted as (Berry and Gascuel 1996):
  1. a measure of reliability, telling us what would be expected to happen if we repeated our experiment;
  2. a measure of accuracy, telling us about the probability of our experimental result being true; and
  3. a measure of confidence, interpreted as a conditional probability similar to those in standard statistical hypothesis tests (i.e. measuring Type I errors or false positives).
The bootstrap was originally designed for purpose (1), and all of the problems identified above relate to trying to use it for purposes (2) and (3). The values derived from the naïve bootstrap need correcting for purposes (2) and (3), and the degree of correction depends on the particular data set being examined (Efron et al. 1996; Goloboff et al. 2003).

The issue of support values depending on the number of bootstrap replicates is also of interest. It is usually recommended that at least 1,000–2,000 bootstrap resamples are taken for estimating confidence intervals, and this generality has been applied to phylogenetic trees (Hedges 1992). However, it is important to recognize that these suggestions relate to the precision of the confidence estimates not to their accuracy. Accuracy refers to how close the estimates are to the true value (i.e. correctness) while precision refers to how variable are the estimates (i.e. repeatability). Accuracy depends on a complex set of characteristics many of which have nothing to do with bootstrap replication. Precision, on the other hand, is entirely to do with the number of bootstrap replicates and the expected accuracy of the estimates. As shown in the next graph, 100 replicates at a conventional level of accuracy produces estimates that are expected to be within ±4% of the "true" values while 2,000 replicates produces estimates ±1%. This needs to be borne in mind when deciding whether to call a particular value "significant support" or not.

The number of bootstrap replicates needed to achieve a specified amount of precision, given statistical testing at two different levels of probability. For example (as shown by the dotted line), 100 bootstrap replicates means that, if the bootstrap value is accurate at the 95% confidence level, then the estimated bootstrap percentage will be precise to ±4.3%. In order to get ±1% precision then nearly 2,000 bootstrap replicates are needed.
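The numbers quoted above can be reproduced with a simple binomial approximation, in which each bootstrap replicate either contains the clade or does not. Treating the graph as based on this approximation is an assumption on my part, as are the function names; the sketch is offered only because it recovers the ±4.3% figure for a 95% bootstrap value.

```python
import math

def bootstrap_precision(p, n_replicates, z=1.96):
    """Half-width of the normal-approximation interval for a bootstrap
    proportion p estimated from n_replicates (z=1.96 gives 95%)."""
    return z * math.sqrt(p * (1 - p) / n_replicates)

def replicates_needed(p, half_width, z=1.96):
    """Smallest number of replicates giving the requested half-width."""
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)

print(round(100 * bootstrap_precision(0.95, 100), 1))   # 4.3 (percent)
print(round(100 * bootstrap_precision(0.95, 2000), 1))  # 1.0 (percent)
print(replicates_needed(0.95, 0.01))                    # 1825
```

Note that the half-width depends on the bootstrap value itself: under the same approximation a value near 50% is estimated less precisely than one near 95% for the same number of replicates.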
There have also been attempts to overcome some of the practical limitations of bootstrapping for large data sets by adopting heuristic procedures, including resampling estimated likelihoods for maximum-likelihood analyses (Waddell et al. 2002) and reduced tree-search effort for the bootstrap replicates. However, approaches using reduced tree-search effort produce even more conservative estimates of branch support, and the magnitude of the effect increases with decreasing bootstrap values (DeBry and Olmstead 2000; Mort et al. 2000; Sanderson and Wojciechowski 2000).


Adell J.C., Dopazo J. 1994. Monte Carlo simulation in phylogenies: an application to test the constancy of evolutionary rates. J. Mol. Evol. 38, 305-309.

Alfaro M.E., Zoller S., Lutzoni F. 2003. Bayes or bootstrap? A simulation study comparing the performance of bayesian markov chain monte carlo sampling and bootstrapping in assessing phylogenetic confidence. Mol. Biol. Evol. 20, 255-266.

Berry V., Gascuel O. 1996. On the interpretation of bootstrap trees: appropriate threshold of clade selection and induced gain. Mol. Biol. Evol. 13, 999-1011.

Bremer K. 1988. The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42, 795-803.

Buckley T.R., Cunningham C.W. 2002. The effects of nucleotide substitution model assumptions on estimates of nonparametric bootstrap support. Mol. Biol. Evol. 19, 394-405.

Buckley T.R., Simon C., Chambers G.K. 2001. Exploring among-site rate variation models in a maximum likelihood framework using empirical data: effects of model assumptions on estimates of topology, branch lengths and bootstrap support. Syst. Biol. 50, 67-86.

Carpenter J., Bithell J. 2000. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Stat. Med. 19, 1141-1164.

Davis J.I. 1993. Character removal as a means for assessing the stability of clades. Cladistics 9, 201-210.

Davison A.C., Hinkley D.V. 1997. Bootstrap Methods and Their Applications. Cambridge Uni. Press, Cambridge.

DeBry R.W., Olmstead R.G. 2000. A simulation study of reduced tree-search effort in bootstrap resampling analysis. Syst. Biol. 49, 171-179.

Efron B. 1979. Bootstrapping methods: another look at the jackknife. Ann. Stat. 7, 1-26.

Efron B., Halloran E., Holmes S. 1996. Bootstrap confidence levels for phylogenetic trees. Proc. Nat. Acad. Sci. U.S.A. 93, 7085-7090.

Efron B., Tibshirani R.J. 1993. An Introduction to the Bootstrap. Chapman & Hall, London.

Erixon P., Svennblad B., Britton T., Oxelman B. 2003. Reliability of bayesian probabilities and bootstrap frequencies in phylogenetics. Syst. Biol. 52, 665-673.

Faith D.P. 1991. Cladistic permutation tests for monophyly and nonmonophyly. Syst. Zool. 40, 366-375.

Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783-791.

Felsenstein J. 1988. Phylogenies from molecular sequences: inference and reliability. Annu. Rev. Genet. 22, 521-565.

Felsenstein J., Kishino H. 1993. Is there something wrong with the bootstrap on phylogenies? A reply to Hillis and Bull. Syst. Biol. 42, 193-200.

Galtier N. 2004. Sampling properties of the bootstrap support in molecular phylogeny: influence of nonindependence among sites. Syst. Biol. 53, 38-46.

Goldman N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36, 182-198.

Goloboff P.A., Farris J.S., Källersjö M., Oxelman B., Ramírez M.J., Szumik C.A. 2003. Improvements to resampling measures of group support. Cladistics 19, 324-332.

Hedges S.B. 1992. The number of replications needed for accurate estimation of the bootstrap P value in phylogenetic studies. Mol. Biol. Evol. 9, 366-369.

Hendy M.D., Penny D. 1993. Spectral analysis of phylogenetic data. J. Classific. 10, 5-24.

Hillis D.M., Bull J.J. 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst. Biol. 42, 182-192.

Holmes S. 2003. Bootstrapping phylogenetic trees: theory and methods. Statist. Sci. 18, 241-255.

Huelsenbeck J.P., Hillis D.M., Jones R. 1996a. Parametric bootstrapping in molecular phylogenetics: applications and performance. In: Ferraris, J.D., Palumbi, S.R. (Eds), Molecular

Huelsenbeck J.P., Hillis D.M., Nielsen R. 1996b. A likelihood ratio test of monophyly. Syst. Biol. 45, 546-558.

Huelsenbeck J.P., Rannala B. 2004. Frequentist properties of bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Syst. Biol. 53, 904-913.

Lanyon S.M. 1985. Detecting internal inconsistencies in distance data. Syst. Zool. 34, 397-403.

Larget B., Simon D.L. 1999. Markov chain monte carlo algorithms for the bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16, 750-759.

Lee M.S.Y. 2000. Tree robustness and clade significance. Syst. Biol. 49, 829-836.

Li W.-H., Zharkikh A. 1994. What is the bootstrap technique? Syst. Biol. 43, 424-430.

Mort M.E., Soltis P.S., Soltis D.E., Mabry M.L. 2000. Comparison of three methods for estimating internal support on phylogenetic trees. Syst. Biol. 49, 160-171.

Nei M., Stevens J.C., Saitou M. 1985. Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes. Mol. Biol. Evol. 2, 66-85.

Newton M.A. 1996. Bootstrapping phylogenies: large deviations and dispersion effects. Biometrika 83, 315-328.

Penny D., Foulds L.R., Hendy M.D. 1982. Testing the theory of evolution by comparing phylogenetic trees constructed from five different protein sequences. Nature 297, 197-200.

Rodrigo A.G. 1993. Calibrating the bootstrap test of monophyly. Int. J. Parasitol. 23, 507-514.

Sanderson M.J. 1989. Confidence limits on phylogenies: the bootstrap revisited. Cladistics 5, 113-129.

Sanderson M.J. 1995. Objections to bootstrapping phylogenies: a critique. Syst. Biol. 44, 299-320.

Sanderson M.J., Wojciechowski M.F. 2000. Improved bootstrap confidence limits in large-scale phylogenies, with an example from Neo-Astragalus (Leguminosae). Syst. Biol. 49, 671-685.

Shimodaira H. 2002. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51, 492-508.

Sitnikova T., Rzhetsky A., Nei M. 1995. Interior-branch and bootstrap tests of phylogenetic trees. Mol. Biol. Evol. 12, 319-333.

Sneath P.H.A. 1986. Estimating uncertainty in evolutionary trees from Manhattan-distance triads. Syst. Zool. 35, 470-488.

Soltis P.S., Soltis D.E. 2003. Applying the bootstrap in phylogeny reconstruction. Statist. Sci. 18, 256-267.

Suzuki Y., Glazko G.V., Nei M. 2002. Overcredibility of molecular phylogenies obtained by bayesian phylogenetics. Proc. Nat. Acad. Sci. U.S.A. 99, 16138-16143.

Taylor D.J., Piel W.H. 2004. An assessment of accuracy, error, and conflict with support values from genome-scale phylogenetic data. Mol. Biol. Evol. 21, 1534-1537.

Waddell P.J., Kishino H., Ota R. 2002. Very fast algorithms for evaluating the stability of ML and Bayesian phylogenetic trees from sequence data. Genome Informatics 13, 82-92.

Wilcox T.P., Zwickl D., Heath T.A., Hillis D.M. 2002. Phylogenetic relationships of the dwarf boas and a comparison of bayesian and bootstrap measures of phylogenetic support. Mol. Phylogenet. Evol. 25, 361-371.

Zharkikh A., Li W.-H. 1992a. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Mol. Biol. Evol. 9, 1119-1147.

Zharkikh A., Li W.-H. 1992b. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. II. Four taxa without a molecular clock. J. Mol. Evol. 35, 356-366.

Zharkikh A., Li W.-H. 1995. Estimation of confidence in phylogeny: the complete-and-partial bootstrap technique. Mol. Phylogenet. Evol. 4, 44-63.

April 13, 2014


In Australia at the time I was born, the most popular first name for boys was "David" and the second most popular was "Andrew". Not unexpectedly, the most popular middle name was "Andrew" and number two was "David". It then comes as no surprise to you that I ended up with this pair of given names.

Names come and go in popularity (these are called fads), and if your parents have no imagination then you will grow up knowing that you are not unique, because half the people in your classroom will have the same name as yourself. You may even end up being numbered (David #1, David #2, etc). What's worse, if you are not careful then you may end up doing the same thing to your own children.

Indeed, having a common name has only one known advantage — no matter where you go in the world everyone can recognize it, although they may not always spell it and pronounce it the way you expect (David, Davide, Dawit ...). Therefore, you will have no problems making restaurant bookings wherever you happen to be (see Leonard S. Bernstein. 1981. Never Make a Reservation in Your Own Name. Rand McNally).

These days in Australia, "David" struggles to be in the top 100 in popularity for boys. However, currently it appears to be in the top 10 in places like Armenia, Austria, Hungary, Italy, Spain and Israel (in 2012 or 2013), as well as the top 20 in Poland and Portugal. This information comes from The Baby Name Wizard. This site has current lists for many countries (Popular Names From Around the World), but has historical data only for the USA.

So, let's look at the U.S. data in more detail. As in Australia, the peak popularity in the USA was from 1955 to 1965, as shown in the first graph.

Note that the peak is truncated from 1950-1960.
The site's Name Mapper web page has annual data for each state from 1960-2009, which is precisely 50 years. These data show the ranking of names by popularity within each state. The average rank for the name "David" across the 50 states is shown in the next graph. "David" was one of the top 10 names for boys born from 1936-1992, the #1 name in 1960, and it remains inside the top 20 to this day.

We can also look at the data for each state individually, as shown in the next graph, where darker shading represents greater popularity of the name. From the peak in the 1960s there was a steady decrease in almost every state until 1995, after which the popularity has been more erratic. For example, in 1960 "David" was the #1 boy's name in 28 of the 50 states (and in the top 5 in every state), but by 1968 it was not #1 anywhere. The last time it was ranked #1 was in Utah in 1970, which was also the last year in which it was in the top 6 in every state.

Note that the states are grouped and colored geographically / culturally.
Only in California and Texas has the name stayed in the top 10 over the past 50 years. In the other states it has stayed in the top 50 or so, except for North Dakota, where it is currently struggling to stay in the top 100. In Nevada and Alaska it has even made a bit of a comeback in the past 10 years.

We can look at the relationship between the states using a phylogenetic network. The next graph is a NeighborNet (based on the Manhattan distance) of the 1960-2009 data for the popularity ranking of "David" as a boy's name. States near each other in the network have a similar naming popularity, while states further apart are progressively more different from each other. The network shows a simple trend of increasing average popularity of "David" from the top-left to the bottom-right.
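For readers unfamiliar with it, the Manhattan distance between two states is simply the sum of the absolute differences between their yearly ranks. The following Python sketch computes the pairwise distances for three hypothetical states over five years; the state labels and rank values are invented for illustration and are not the Name Mapper data. Building the NeighborNet from such a distance matrix is a separate step, done in a network program, and is not shown.

```python
from itertools import combinations

def manhattan(a, b):
    """Sum of absolute differences between two equal-length rank series."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical ranks of one name in five sample years (not real data)
ranks = {
    "State_A": [1, 3, 8, 15, 20],
    "State_B": [1, 4, 9, 14, 22],
    "State_C": [2, 10, 30, 60, 95],
}

# Pairwise distance for every pair of states
distances = {(s, t): manhattan(ranks[s], ranks[t])
             for s, t in combinations(ranks, 2)}

for (s, t), d in distances.items():
    print(f"{s} - {t}: {d}")
```

Here State_A and State_B, whose ranks track each other closely, end up much nearer to each other than either is to State_C.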

I have also colored the states using the same color scheme as for the previous graph (i.e. geographically / culturally). Note that the orange, red, yellow and blue states are fairly neatly grouped, indicating that their alleged geographical / cultural similarity extends to the popularity of given names ("David" has continued to be popular in all of these states). The purple, brown and green states are not grouped very much, indicating much more diversity in the popularity of "David". For example, "David" has continued to be popular in New York and New Jersey but not in Maine, New Hampshire or Vermont. The extreme disinterest of North Dakotans in the name is very clear.

The fall of "David" is not as bad as that of "James" and "John", which were in the top 3 most popular names in the USA all the way from the 1880s to the 1950s, but which are now in 17th and 27th place, respectively (see the timeline graph in the Name Voyager).

I am not sure what has led to the eclipse of these names, other than the whims of faddishness. For example, in Britain and Ireland the name "Harry" has shot to the top in recent years (guess why!), while it still languishes near #700 in the USA. Otherwise, "Noah" and "Liam" seem to have the most widespread popularity for boys in the western world at the moment.

Footnote: I actually got the name Andrew because it is my father's middle name, and his father's before him, and his father's first name.

April 8, 2014


Alain Cuerrier, Luc Brouillet and Denis Barabe (1998. Numerical and comparative analyses of the modern systems of classification of the flowering plants. Botanical Review 64: 323-355) have provided a genealogy of the various classifications that have been produced for the angiosperms (flowering plants). This is a theoretical construction, intended to express the lines of intellectual influence, either directly expressed by the authors of the classifications, or inferred by comparison of the classifications themselves.

As shown here, it is a classic directed acyclic graph, most of which is tree-like, although some parts are distinctly bushy. Of interest to us, there are also places where hybridizations are indicated.

Cuerrier et al. analyzed the structure of the four modern classifications (by Cronquist, Dahlgren, Takhtajan, and Thorne) in comparison to their immediate predecessors (by Bessey, Engler, Hallier, and also Gobi). This was a study of affinity relationships, rather than genealogy, and one of their study questions was whether the affinity relationships matched the genealogical ones.

In this regard it is interesting to note that they used clustering and ordination techniques to analyze their quantitative data (comparing the classifications), but they did not use any network techniques. Yet, this would seem to be an obvious strategy, given that they were expecting reticulating relationships.

Unfortunately, none of the datasets shown in the paper is complete, and so I cannot provide a network analysis for them.

The authors summarized their suite of multivariate analyses as an interaction network, as shown next. For each pair of classifications, four statistical tests were performed, and the thickness of the arrows in the network indicates the degree of significant similarity detected: dotted arrow = 2/4 tests show similarity, thin arrow = 3/4 tests, and thick arrow = 4/4 tests.

These semi-quantitative relationships can also be expressed as an unrooted phylogenetic network. I simply took the pairwise similarity scores (0.0, 0.25, 0.5, 0.75, 1.0) and analyzed them using a NeighborNet network. The four modern classifications are highlighted in red.

This more clearly illustrates the various points made by Cuerrier et al. In particular, they note that the intellectual genealogy is not reflected in the affinity relationships of the modern classifications. For example, the Cronquist and Takhtajan classifications are much more similar to that of Hallier than to that of Bessey, whereas Cronquist explicitly cites Bessey as a major influence on his work. Instead, Cronquist's classification is more similar to that of Engler, who does not appear to be genealogically related at all. The distinction between the Thorne and Dahlgren classifications and those of Takhtajan and Cronquist is also obvious.

April 5, 2014


Charles Darwin's sex life is of interest because of his consanguineous marriage (to his first cousin), which seems to have resulted in genetic problems for his children, due to inbreeding (see Charles Darwin's family pedigree network). The children of this marriage have recently been discussed in the book by Tim Berra (Darwin and His Children: His Other Legacy). This book discusses Darwin's children mainly in the context of Darwin's own life. Unfortunately, it does not delve much into his personal relationship with either them or his wife, Emma. His private life remains fairly private.

In particular, the book fails to draw any inference from the obvious fact that there were 10 of these children, plus two possible miscarriages. However, obviously we do learn indirectly about a certain part of Mr Darwin's private life. After all, one does not get a woman pregnant accidentally (no matter what your friends try to tell you) -- there are certain biological procedures that you need to go through, and it is fairly difficult to carry these out accidentally. Clearly, Charles and Emma were familiar with this particular activity, and carried it out successfully on numerous occasions.

Charles Darwin, 2 years before the
birth of his last child
The question is: how many occasions? We know the minimum number, but what about the average rate, for example? The Darwin cottage industry has apparently produced speculations about his sex life before (see Wikipedia), but I have not read about them. Instead, I will provide my own analysis of the situation.


Charles and Emma married on 29 January 1839, when Charles was 29 years and 11 months old and Emma was 30 years and 8 months old. This is pretty late to be starting a family, although not necessarily unusual, and it does have an influence on the calculations.

Emma realized during the following April that she was pregnant (ie. within 3 months); and during the subsequent 18 years she was pregnant a total of 11 more times. On average, there were 500 days between each of the first nine pregnancies, as shown in the first graph. This means that during the first 12 years or so of the marriage she spent about 55% of her days being pregnant and 45% of them not pregnant.
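The percentage can be roughly checked from the 500-day spacing. The sketch below assumes a full-term pregnancy of 40 weeks for every birth, which is a simplification of mine (one of the pregnancies is known to have lasted only 36 weeks).

```python
DAYS_PER_WEEK = 7


def fraction_pregnant(gestation_weeks, interval_days):
    """Share of each start-to-start interval that is spent pregnant."""
    return gestation_weeks * DAYS_PER_WEEK / interval_days


full_term = fraction_pregnant(40, 500)
print(f"pregnant: {full_term:.0%}, not pregnant: {1 - full_term:.0%}")
```

This gives about 56% pregnant, which is close to the 55% quoted above; the shortened pregnancy would pull the figure down slightly.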

Wikipedia paints an interesting picture of marriages in Victorian Britain (Women in the Victorian era):
When a Victorian man and woman married, the rights of the woman were legally given over to her spouse. Under the law the married couple became one entity where the husband would represent this entity, placing him in control of all property, earnings and money. In addition to losing money and material goods to their husbands, Victorian wives became property to their husbands, giving them rights to what their bodies produced: children, sex and domestic labour. Marriage abrogated a woman's right to consent to sexual intercourse with her husband, giving him 'ownership' over her body. Their mutual matrimonial consent therefore became a contract to give herself to her husband as he desired.

The extent to which Emma was involved in the decision to spend more than half of her time pregnant is therefore open to debate. As far as I know, neither her letters nor those of her husband reveal any marital difficulties; indeed, quite the contrary. However, Charles has left us written evidence of his pre-marital ideas about marriage (Darwin’s notes on marriage), which indicate his specific intention to have a family available in his old age.

Note that there are reported to have been two miscarriages between the 9th and 10th births, one in 1852 (when Emma was 44 years old) and one in 1854 (when she was 46). Emma was 48 years and 7 months old when she delivered her final child. Along with the miscarriages, it is worth noting that the final child was born mentally disabled (probably Down's syndrome, for which there is a 1 in 11 chance at age 49), and he died after 18 months. Also, the third child was born after only 36 weeks of pregnancy (instead of the "normal" 40 weeks), and lived for less than a month. Darwin's favorite child was his 2nd (Anne), who unfortunately died of tuberculosis at age 10. The remaining seven children survived to adulthood.

We can also note that the children were born during most periods of the year, as shown in the next graph. However, five of the births were during the 3-month period from early July to late September, implying conception during the period October to December.

In English-speaking countries there is a peak of births in late September, 9 months after the Christmas celebrations (Wellings et al. 1999; Tita et al. 2001). (In Scandinavia, the birth peak is 9 months after the mid-summer celebrations.) Given that two of the births were in this period, we might accuse the Darwins of fitting into this behavioral cliché. However, one of these two births was the shortened pregnancy, so that conception in that case was on or near to their 3rd wedding anniversary, rather than Christmas. The other conception dates do not fit any pattern that I can see.

All of the above data lead me to the conclusion that most, if not all, of the pregnancies were the result of more-or-less continuously ongoing sexual activity, rather than being the result of deliberate attempts to conceive, or being incidental by-products of celebratory activity. That is, the pregnancies occurred as chance dictated, given the night-time activities being undertaken.

This leads us to the key question of how often these activities took place. We can do some general calculations that might be informative.


We now know that the potentially fertile period of human female ovulation is 12 days out of every 28, and vaginal sex during this period should be avoided if you do not wish to be involved in a pregnancy (Arévalo et al. 1999). Within this window of opportunity there is a 6-day period during which conception is most likely (Dunson et al. 2002; Stirnemann et al. 2013), and if you are trying to conceive a child then sex at least twice during this period is the recommended strategy. (Each egg lasts 1 day, but sperm last for 3 days, so that sex more than 2-3 times doesn't seem to improve your chances.) Clearly, sex once during this 6-day period is a reasonable minimum expectation for conception.

However, the probability of conception even under these minimum circumstances is very dependent on the age of the female involved. (The eggs are produced early in the female's life, and the eggs age along with the woman, so that older eggs have reduced fertility; Broekmans et al. 2009.) For example (Siebler 2009; Sozou & Hartshorne 2012), in her early 20s a healthy fertile woman has a 20–25% probability of conception each month. The average time to achieve conception for this age group is 4 months, and the likelihood of conception within one year is 93–97%. More importantly, in her early 30s (as Emma was when she married) the probability of conception each month drops to 10–15%, so that the average time to conception is 10 months and the likelihood of conception within one year is c.72%. The probability keeps dropping until menopause (where it reaches zero), so that, for example, the likelihood of conception within one year is c.65% for a woman in her late 30s.
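These figures hang together if each month is treated as an independent trial with a fixed probability of conception (a geometric model). That model is my assumption for illustration; the cited sources may have used something more refined. Under it, the chance of conceiving within n months is 1 - (1 - p)^n and the average waiting time is 1/p months.

```python
def prob_within(p_month, months=12):
    """Chance of at least one conception in `months` independent monthly trials."""
    return 1.0 - (1.0 - p_month) ** months


def mean_months(p_month):
    """Expected number of months until conception under a geometric model."""
    return 1.0 / p_month


for label, p in [("early 20s, low", 0.20), ("early 20s, high", 0.25), ("early 30s, low", 0.10)]:
    print(f"{label}: within a year {prob_within(p):.0%}, mean wait {mean_months(p):.0f} months")
```

Monthly probabilities of 20% and 25% reproduce the 93–97% range, and the 4-month average corresponds to the 25% end; the 10-month average and the c.72% figure both correspond to the lower (10%) end of the early-30s range.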

Emma, near the time of her marriage
This means that, given her age, Emma had to receive sperm during every ovulation cycle, in order to maintain a 50% chance of getting pregnant within any one year (she got pregnant on average every 9-12 months). If you know the ovulation times, then that rate requires sex 13 times per year. If you don't know the times, or you don't know anything about ovulation cycles (and it seems likely that Victorian women did not), then it requires sex at least once per week in order to hit them all by random chance.
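The arithmetic behind this can be sketched as follows. The 28-day cycle and the 6-day window come from the text above; the assumptions that are mine are that the cycles are perfectly regular, that sex occurs at evenly spaced intervals with a random starting day relative to the cycle, and that each cycle is an independent trial.

```python
def cycles_per_year(cycle_days=28, year_days=365):
    """Number of ovulation cycles in one year."""
    return year_days / cycle_days


def window_hit_probability(window_days, interval_days):
    """Chance that evenly spaced acts, started on a random day, land in a window."""
    return min(1.0, window_days / interval_days)


def per_cycle_probability_needed(target, cycles):
    """Per-cycle conception probability that gives `target` over `cycles` trials."""
    return 1.0 - (1.0 - target) ** (1.0 / cycles)


print(f"cycles per year: {cycles_per_year():.1f}")
print(f"weekly sex hits a 6-day window with probability {window_hit_probability(6, 7):.2f}")
print(f"per-cycle probability needed for 50% in a year: {per_cycle_probability_needed(0.5, 13):.3f}")
```

So there are about 13 cycles per year, and under these assumptions a strictly weekly schedule reaches the 6-day window in about 6 cycles out of 7; anything more frequent than every 6 days reaches it every time.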

So, I arrive at the conclusion of weekly sex for the Darwins throughout the first 12 years of their marriage, and possibly for 18 years. Calculations seem to be much more difficult after that, due to lack of suitable data.

I have no idea whether this weekly rate was normal for Victorian couples, but it certainly seems to be quite normal in the modern world, for people of their age. As shown in the next graph, people in their 30s and 40s currently report having sex every 4-5 days throughout the year (Mosher et al. 2005; Schneidewind-Skibbe et al. 2008). So, Charles' sex life would fit perfectly into the 21st century.

From Mosher et al. (2005)

Finally, it is interesting to note that Charles started writing what he called his "Big Species Book" shortly after the birth of his final child. Furthermore, he converted this incomplete manuscript into what is now known as On the Origin of Species after the early death of that same child. Other events were involved in these decisions, of course, but his changing family life is unlikely to have been the least important of them.


Arévalo M, Sinai I, Jennings V (1999) A fixed formula to define the fertile window of the menstrual cycle as the basis of a simple method of natural family planning. Contraception 60: 357-360.

Broekmans FJ, Soules MR, Fauser BC (2009) Ovarian aging: mechanisms and clinical consequences. Endocrine Reviews 30: 465-493.

Dunson DB, Colombo B, Baird DD (2002) Changes with age in the level and duration of fertility in the menstrual cycle. Human Reproduction 17: 1399-1403.

Mosher WD, Chandra A, Jones J (2005) Sexual behavior and selected health measures: men and women 15–44 years of age, United States, 2002. Advance Data From Vital and Health Statistics 362. National Center for Health Statistics, Hyattsville, MD.

Schneidewind-Skibbe A, Hayes RD, Koochaki PE, Meyer J, Dennerstein L (2008) The frequency of sexual intercourse reported by women: a review of community-based studies and factors limiting their conclusions. Journal of Sexual Medicine 5: 301-335.

Siebler SJ (2009) How to Get Pregnant. Little, Brown and Co, New York, NY.

Sozou PD, Hartshorne GM (2012) Time to pregnancy: a computational method for using the duration of non-conception for predicting conception. PLoS One 7: e46544.

Stirnemann JJ, Samson A, Bernard JP, Thalabard JC (2013) Day-specific probabilities of conception in fertile cycles resulting in spontaneous pregnancies. Human Reproduction 28: 1110-1116.

Tita AT, Hollier LM, Waller DK (2001) Seasonality in conception of births and influence on late initiation of prenatal care. Obstetrics & Gynecology 97: 976-981.

Wellings K, Macdowall W, Catchpole M, Goodrich J (1999) Seasonal variations in sexual activity and their implications for sexual health promotion. Journal of the Royal Society of Medicine 92: 60-64.

April 1, 2014


Today is All Fools' Day. The tradition apparently started in the Netherlands and northern Germany, where on April 1 people would be sent on a long series of purposeless errands, and thus be made to feel increasingly foolish as the day went on. (This is now known as "a wild goose chase".) The Museum of Hoaxes has a detailed history (The Origin of April Fool’s Day), plus supplementary information about a Dutch poem from 1561 and the first German reference in 1618.

This tradition has been modified in the past 150 years or so, to one where outrageous stories are told, usually in public, to see how many people can be made to believe that they are true. The media are often involved, particularly newspapers and television shows. These jokes are usually revealed by the end of the day; indeed, if they continue beyond it, then they are usually referred to as hoaxes rather than as April Fool jokes.

The Museum of Hoaxes has a compilation of what the curator believes to be the Top 100 April Fool's Day Hoaxes of All Time, which makes interesting reading.

Phylogenetics and evolutionary biology are not immune from these activities, of course. I have listed here a few of the jokes perpetrated in recent years on the internet, just in case nothing much happened today and you want to read about something appropriate anyway.

Tetrapod Zoology (April 1 2011)
Science meets the Mokele-Mbembe!

Molecular Phylogenetics and Evolution (April 1 2004)
Molecular phylogenetic analysis of mtDNA sequences from the Yeti

Raptormaniac (April 1 2013)
Hail Volantia

Tetrapod Zoology (April 1 2013)
Welcome to the Squamozoic!

Shit You Didn't Know About Biology (April 1 2012)

The Tree of Life (April 1 2008)
Confessions of an April Fool, and the dope on brain doping

Evolving Thoughts (April 1 2009)
New work on lateral transfer shows that Darwin was wrong

The Genealogical World of Phylogenetic Networks (April 1 2013)
Empedocles, Lucretius and lateral gene transfer

Computational Footnote

There are plenty of computational jokes in the world, mostly involving unsuccessful mathematical proofs, but none of them seem to have much to do with phylogenetics. Is there a message here? For example, physicists, ever the pranksters, typically use the arXiv pre-print repository to post spurious papers on April 1, with some examples noted at MetaFilter for April 2012 (April Fools for physicists).

March 30, 2014


King Digital, the creator of the popular smartphone game Candy Crush Saga, was listed on the New York Stock Exchange two days after this game was shown to be NP-hard [1]. Could these two events be somehow related? Anyway, although the King Digital shares are not doing well, the NP-hardness proof still stands. A different NP-hardness proof for Candy Crush actually appeared on the arXiv a few weeks earlier [2], but was based on rules that are slightly different from the usual rules of Candy Crush.

So what is Candy Crush? It is a smartphone / tablet game having a rectangular board filled with different types of candies. A player can score points by swapping two adjacent candies in order to match three or more candies of the same type. This seems to be even more addictive than eating candies, which made the game the most popular game on Facebook, and led to a 568 million dollar profit for King Digital in 2013.
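The basic move can be sketched in a few lines. This is only an illustration of the swap-and-match rule as described here: the board encoding (one letter per candy type) and the function names are made up, and the real game adds falling candies, cascades and special candies that are ignored below.

```python
def has_match(board):
    """True if any row or column holds three or more equal adjacent candies."""
    rows, cols = len(board), len(board[0])
    for r in range(rows):
        for c in range(cols):
            candy = board[r][c]
            if c + 2 < cols and board[r][c + 1] == candy == board[r][c + 2]:
                return True
            if r + 2 < rows and board[r + 1][c] == candy == board[r + 2][c]:
                return True
    return False


def swap_makes_match(board, a, b):
    """True if swapping adjacent cells a and b creates a run of three or more."""
    (r1, c1), (r2, c2) = a, b
    if abs(r1 - r2) + abs(c1 - c2) != 1:
        raise ValueError("cells must be horizontally or vertically adjacent")
    trial = [list(row) for row in board]  # copy, so the original board is untouched
    trial[r1][c1], trial[r2][c2] = trial[r2][c2], trial[r1][c1]
    return has_match(trial)


board = [
    "RGB",
    "GRR",
    "BGY",
]
print(swap_makes_match(board, (0, 0), (1, 0)))  # moves the top-left R down into the second row
```

Checking a single move is easy, as the sketch shows; the hardness result concerns planning a whole sequence of moves.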

Interestingly, Candy Crush Saga is one of a large family of games that are all based on matching objects. These games all seem to be closely related. Moreover, their genealogy is not tree-like at all, as shown below. Many modern games have been derived by combining ideas from different older games. In other words, the genealogy of such games can best be described by a phylogenetic network.

A phylogenetic network for Bejeweled-type games, taken from [1], which was in turn taken (after modification) from [3].
This network is clearly a rooted, genealogical phylogenetic network (although it does not have a unique root).

So what does the NP-hardness of Candy Crush tell us? Nothing, of course, except that the 97 million people daily playing Candy Crush are pouring all their energy into solving a frivolous, but nevertheless intrinsically hard, problem. This is a pity because, since Candy Crush is NP-hard, one can (at least in theory) encode any NP-complete problem as a Candy Crush episode. This could be used to let all these 97 million people solve more useful NP-complete problems every day. For example, we could encode massive phylogenetic network reconstruction problems as Candy Crush episodes, and use this to construct the Web of Life in a few days!


[1] Luciano Gualà, Stefano Leucci, Emanuele Natale. Bejeweled, Candy Crush and other match-three games are (NP-)hard (24 March 2014)

[2] Toby Walsh. Candy Crush is NP-hard (8 March 2014)

[3] Jesper Juul. A casual revolution: reinventing video games and their players. The MIT Press, 2012.

Later note:
It turns out that the figure shown above is not actually taken from [3], in spite of the claim made in [1]. The figure in [3] is re-drawn from [4], and the genealogy as shown in [1] is edited directly from [4], not [3]. The editing consists of deleting all of the many other descendants of Tetris. The original complete figure is available here.

[4] Jesper Juul. Swap adjacent gems to make sets of three: a history of matching tile games. Artifact 1: 205-216, 2007.

March 25, 2014


Consanguineous relationships involve people who are first cousins or more closely related. Apparently, about 15 percent of all marriages worldwide involve consanguineous partners, although this number has been higher in the past (Bittles 2012).

Our interest for this blog is that such relationships emphasize that so-called family trees (pedigrees) are hybridization networks not trees (see Pedigrees and phylogenies are networks not trees). Everyone can trace their maternal and paternal ancestors back into the past to a point where the lineages fuse again, and consanguineous marriages mean that this happens in the recent past rather than the distant past. To this end, we have had posts about Charles Darwin (Charles Darwin's family pedigree network), Henri Toulouse-Lautrec (Toulouse-Lautrec: family trees and networks) and Albert Einstein (Albert Einstein's consanguineous marriage). Not unexpectedly, it is royalty that provide the best-known examples (see Family trees, pedigrees and hybridization networks).

However, many cultures have taken consanguinity even further, as noted by Dobbs (2010):
While virtually every culture in recorded history has held sibling or parent-child couplings taboo, royalty have been exempted in many societies, including ancient Egypt, Inca Peru, and, at times, Central Africa, Mexico, and Thailand [and also Hawaii].

The reference to ancient Egypt includes both Cleopatra and Tutankhamun, each of whom was part of a dynasty that apparently adopted the practice of incest. As noted by Wikipedia:
In ancient Egypt, royal women carried the bloodlines and so it was advantageous for a pharaoh to marry his sister or half-sister; in such cases a special combination between endogamy and polygamy is found. Normally the old ruler's eldest son and daughter (who could be either siblings or half-siblings) became the new rulers.

Tutankhamun briefly ruled as Pharaoh from 1333-1323 BCE, at the end of the Amarna period, the 18th Dynasty. His failure to leave an heir ended the direct line of succession, and ultimately resulted in the transition to the 19th Dynasty, started by Rameses I. Tutankhamun seems to have been a rather minor king, becoming ruler at age 9 and dying at 19. He was surrounded by the power struggle that resulted from his father's attempt to found the first monotheistic religion, and being a minor he probably had little influence on the events of the time (Antanovskii 2013).

He became famous in 1922, when his near-intact tomb was discovered. He had been buried in a tomb not intended for royalty, and its location and even existence were quickly forgotten at the time; due to the political turmoil, his successors had deleted nearly all traces of the Amarna kings. In a classic case of irony, this situation made Tutankhamun's tomb safe from the robbers who removed much of the contents of other tombs in the Valley of the Kings. Thus, more than 5,000 artifacts were found in his tomb, along with the well-preserved mummies (see the death mask pictured above). This has made Tutankhamun a better-known name ("King Tut") than that of anyone else from his period.

A note on names: Tutankhamun's father was Amenḥotep IV, who tried to replace the polytheistic worship associated with Amun (or Amen) and the other gods of the national pantheon with the monotheistic worship of Aten ("the disk of the sun"). He thus changed his name from Amenhotep ("Amun is satisfied") to Akhenaten ("beneficial to Aten"). His son was named Tutankhaten ("the living spirit of Aten"), but this was changed to Tutankhamun ("the living spirit of Amun") when the state religion was restored during his reign.

The history of the period surrounding Akhenaten and Tutankhamun is particularly confused, as Tutankhamun did not become pharaoh until 2 years after his father's death (Hawass 2010; Gabolde 2011). Nevertheless, the preservation of Tutankhamun's tomb has allowed us to reconstruct a possible genealogy for this period, as shown next.

Hawass et al. (2010) compared the DNA of the mummy of Tutankhamun with that of 10 royal mummies from the same period, ranging from 1410 to 1324 BCE. The mummy of the genetically identified father, found in grave No. 55 of the Valley of the Kings, is considered to be Akhenaten. The identified mother, found in grave No. 35, was also identified as the sister of Akhenaten. This is surprising, because only two wives of Akhenaten, Nefertiti and Kiya, are known to have had the title of Great Royal Wife, which the mother of the royal heir should bear.

Hawass et al. (2010) also looked for evidence of possible genetic effects of the consanguineous relationship (eg. homozygous genetic disorders):
An accumulation of malformations in Tutankhamun's family was evident. Several pathologies including Köhler disease II were diagnosed in Tutankhamun; none alone would have caused death. Genetic testing for genes specific for Plasmodium falciparum revealed indications of malaria tropica in four mummies, including Tutankhamun's. These results suggest avascular bone necrosis in conjunction with the malarial infection as the most likely cause of death in Tutankhamun. Walking impairment and malarial disease sustained by Tutankhamun is supported by the discovery of canes and an afterlife pharmacy in his tomb.

Incestuous marriages were nothing new to the pharaohs of Dynasty 18 (see Mladjov's detailed genealogy). Part of the genealogy of its founding is shown in the next figure. Aahotep I and Sequenenra III were sister and brother, as were Aahmes-Nefertari and Aahmes (or Ahmose II). Aahmes (or Ahmose III) and Thotmes I were either sister and brother or half-siblings (the records are unclear).

Circles refer to females and squares to males.
Finally, it is worth noting that Marc Gabolde has an alternative explanation for the apparent genetic closeness of King Tutankhamun's parents (see Powell 2013). He suggests that Tutankhamun's mother was not his father's sister, but rather his father's first cousin, Nefertiti. The apparent genetic closeness is then not the result of a single brother-sister mating but due to three successive instances of marriage between first cousins. Nefertiti is recorded to have had six daughters with Akhenaten, but no son.


Antanovskii R (2013) Unmasking Tutankhamun: the figure behind the fame. Heritage Daily – Archaeology.

Bittles AH (2012) Consanguinity in Context. Cambridge University Press.

Dobbs D (2010) The risks and rewards of royal incest. National Geographic Magazine.

Gabolde M (2011) The end of the Amarna Period. BBC History – Ancient History in Depth.

Hawass Z (2010) King Tut's family secrets. National Geographic Magazine.

Hawass Z, et al. (2010) Ancestry and pathology in King Tutankhamun's family. Journal of the American Medical Association 303: 638-647.

Ian Mladjov's Genealogical Tables — The pharaohs of the New Kingdom in Egypt c. 1540-1070 BC.

Powell A (2013) A different take on Tut. Harvard Gazette.

March 23, 2014

Hierarchically arranged information has traditionally been represented as a tree. However, this is not the only way that this information can be pictured. As noted by Manuel Lima (Visualization Metaphors: Old & New):
As one of the most hailed methods of modern information visualization, the treemap has truly become an epitome of the recent growth of the field and one of the most widespread methods for visualizing hierarchies.

Isabel Meirelles (Design for Information: An Introduction to the Histories, Theories, and Best Practices Behind Effective Information Visualizations. Rockport Publishers, 2013) provides this illustration as an example of the different ways to represent hierarchies:

So, treemaps display the tree information as a set of nested rectangles — each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. The main advantage of using a map as a representation is that the size and colour of the rectangles can be used to represent other information about each tree leaf. (Note: This treemap concept should not be confused with Mike Charleston's program TreeMap, which maps the relationships between two phylogenetic trees, nor with MLTreemap, which maps an unidentified DNA sequence onto a phylogenetic tree.)

Modern treemaps were developed in 1991 by Ben Shneiderman, who has conveniently provided a description of the history and initial development of the idea (Treemaps for space-constrained visualization of hierarchies). Not unexpectedly, this idea has been adopted in biology. For example, taxonomic hierarchies are sometimes represented using a treemap, such as in BioNames (which displays the taxonomic groups recognised by the Index to Organism Names database), and the Natural Science Museum of Barcelona (which allows interactive access to the database records via a taxonomic hierarchy). It has also been used to display the gene ontology associated with gene expression data from microarray studies (Visualization and analysis of microarray and gene ontology data with treemaps).

In addition, it has been suggested that treemaps could be used to represent phylogenetic trees (Using treemaps to visualize phylogenetic trees. 6th International Symposium on Biological and Medical Data Analysis, 2005. Lecture Notes in Computer Science 3745: 283-293); and there is an associated computer program. An example is shown below, in which the rectangles are coloured by their taxonomy — the circles highlight two sequences that are misplaced in the tree (ie. their tree location does not match their taxonomy).

This approach to displaying phylogenies has not really caught on (ie. phylogeneticists have stuck to the "node-link" layout). The treemap approach works best with a fixed-level hierarchy, such as the taxonomic hierarchy or the gene ontology hierarchy. In phylogenetics, on the other hand, branch lengths are variable, so that there is no fixed-level hierarchy. Treemaps work well for displaying information about groups that might be recognized in the tree, but not for the tree itself.

Nevertheless, similar methods were suggested long before the invention of computers (two early examples are noted by Manuel Lima, in the blog post linked above). Indeed, we end up with a treemap if we simply cut slices out of the tree, as shown by the next picture (taken from Isabel Meirelles' book), which shows Maximilian Fürbringer's tree of bird relationships from 1888 (published in Untersuchungen zur Morphologie und Systematik der Vögel). On the left is the side view of the tree, and on the right are three slices through the tree branches (as viewed from above). This produces a circular treemap rather than a rectangular one, which is admittedly a less efficient use of the visualization space.

Finally, we can consider the relationship of these ideas to phylogenetic networks. A network is not a nested hierarchy, but instead involves a collection of over-lapping sets. This can be represented as a Venn diagram, for example, but not as a treemap. This form of visualization has also been a long-standing suggestion in phylogenetics. The final picture shows Georg August Goldfuss' "system of animals" from 1817 (published in Ueber die Entwicklungsstufen). It is a set of nested egg-shaped sets, expressing his ideas about affinity relationships, with one set over-lapping several of the others, representing a non-nested series of relationships. There is nothing new under the sun!

March 18, 2014


There is nothing in the etymology of the words 'genealogy' and 'phylogeny' that necessarily implies that they must be tree-like. Indeed, all genealogies are networks. For example, a human family "tree" is a tree only if it includes one sex alone. Otherwise, it must be a network when traced backwards from any single individual through both parents, because the lineages must eventually coalesce in a pair of shared common ancestors. This must happen if there is a single origin for Homo sapiens (ie. the species is monophyletic). The coalescence may not occur for thousands of years in the past, or it may be quite recent.

So, all pedigrees of sexually reproducing species involve conjoined lineages at both "ends", one in the common ancestor and one in the contemporary offspring.

Given the extent of inbreeding among royal families, this ancestral coalescence is quite likely to be recent among monarchs. For example, the most recent common ancestors of all of the currently reigning monarchs of Europe are John William Friso, Prince of Orange (1687-1711), and his wife, Marie Louise of Hesse-Kassel, Princess consort of Orange (1688-1765). This situation has existed since the abolition of the Albanian monarchy in 1939 (this particular monarchy was not related to the house of Orange).

Marie Louise (left) and her two children.
There used to be a Wikipedia page listing the contemporary descendants of this royal Dutch couple, but it has been deleted. It is, however, still available in the Internet Archive WayBack Machine (Royal descendants of John William Friso, Prince of Orange). This page shows that the lineages of all of the current monarchs coalesce in this couple in 7-11 generations. This is true of all 10 current monarchs (in Belgium, Denmark, Liechtenstein, Luxembourg, Monaco, the Netherlands, Norway, Spain, Sweden, the United Kingdom), many former monarchies (13 or so), many so-called pretenders or claimants (at least 21), plus two royal consorts. Interestingly, the progenitor couple achieved this set of family relationships even though they had only one daughter (Princess Amalia of Nassau-Dietz) and one son (William IV, Prince of Orange), who was born six weeks after his father's death by drowning.

Family trees were originally devised as a way for nobles to assert their nobility, by tracing their direct male ancestry from some "important" progenitor (see the picture below). The female lineages were usually ignored in such ancestries, with each woman appearing alone, solely as an isolated wife and mother. This was, of course, modelled on the genealogies listed in the Christian Bible, in both Genesis 5 and 11, in which females are mentioned but only males appear to be named. However, the ancestral relationships of the current European monarchs do involve females as part of the direct lines of descent, in all cases (ie. none of the direct lines of descent can be traced solely through males).

On the left is part of a genealogy of Christ (from c. 1130-1205);
on the right is a genealogy of the House of Habsburg (c. 1540).
Reproduced from the Visual Complexity blog.
Thus, in the modern world, we should be constructing family networks not family trees, with all of the male and female lineages sharing equal prominence. This will make it clear that genealogies are networks not trees. This assumes, of course, that enough historical information can be collected to locate the actual points of coalescence. This is unlikely to be so for the likes of you and me, but the nobility seem to be able to do it quite regularly.

Family networks that reticulate within a few generations are not necessarily good things, of course. Sex-linked recessive traits such as haemophilia B are widespread among the royalty of Europe (Stevens 1999, Rogaev et al. 2009), as are autosomal dominant traits such as variegate porphyria (Cox et al. 2005). These diseases are much rarer amongst commoners.

A similar situation applies to phylogenies showing species relationships. If there is a single origin to life, then tracing phylogenies backwards in time must lead to the eventual coalescence of all lineages. Any species whose ancestry involves hybridization, introgression or horizontal gene transfer must form a network. Parts of this network might be tree-like if isolated from the rest, but the whole phylogeny cannot be anything other than a network.

Consider the following points:

A network is a series of overlapping groups
A tree is a set of nested groups

Each evolutionary event defines a group (all of the descendants of the ancestor in which the event occurred)

Dichotomous speciation leads to a tree, by definition
Other processes will lead to a network, by definition

We know that in biology there are both vertical (speciation) and horizontal (reticulation) evolutionary processes. Therefore, no biological data fit a tree perfectly (unless the data are carefully selected to do so). A network analysis will allow you to evaluate the relative contribution of the horizontal and vertical processes that have occurred.
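
The distinction between nested and overlapping groups can be checked mechanically. The following is a minimal sketch, assuming each group is simply a set of taxon names; the taxa (A to D, plus a hybrid H) and the function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def relation(a, b):
    """Classify two groups as 'nested', 'disjoint' or 'overlapping'."""
    a, b = set(a), set(b)
    if a <= b or b <= a:
        return "nested"
    if not (a & b):
        return "disjoint"
    return "overlapping"


def fits_a_tree(groups):
    """True if every pair of groups is nested or disjoint (a hierarchy)."""
    return all(relation(a, b) != "overlapping"
               for a, b in combinations(groups, 2))


# Hypothetical example: dichotomous speciation only.
tree_groups = [{"A", "B"}, {"C", "D"}, {"A", "B", "C", "D"}]

# Hypothetical example: H is a hybrid of B and C, so it is a descendant
# of both parental lineages and belongs to both of their groups.
network_groups = [{"A", "B", "H"}, {"C", "D", "H"}, {"A", "B", "C", "D", "H"}]

print(fits_a_tree(tree_groups))     # True
print(fits_a_tree(network_groups))  # False
```

A single hybrid is enough to make two groups overlap, at which point no tree can display all of the groups at once.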


Cox TM, Jack N, Lofthouse S, Watling J, Haines J, Warren MJ (2005) King George III and porphyria: an elemental hypothesis and investigation. Lancet 366: 332-335.

Rogaev EI, Grigorenko AP, Faskhutdinova G, Kittler ELW, Moliaka YK (2009) Genotype analysis identifies the cause of the "Royal Disease". Science 326: 817.

Stevens R (1999) The history of hemophilia in the royal families of Europe. British Journal of Haematology 105: 25-32.

March 16, 2014


When I look out the window of my workroom at home, several minutes can easily pass without a person being in view. This is quite typical of the Swedish countryside. During the times I have been in the Netherlands, however, I have never experienced even one whole minute without a person walking, or more likely cycling, into view. The Netherlands is the most densely populated country in Europe (excluding all of the really tiny countries), with about 500 people per square kilometre of actual land area.

In spite of this, the Netherlands, according to the Statistics Division of the United Nations, managed to be the largest worldwide exporter and re-exporter of fruit and vegetables (including citrus) between 2009 and 2012. The data show that the Netherlands handled 14.6% of the world's total, followed by Spain (12.1%), China (10.9%), Mexico (9.7%), the United States (8.3%), Canada (5%), France (4.4%), Belgium (3.7%), Italy (2.8%) and Germany (1.9%).

Given this unexpected fact, it seems worthwhile to list some of the other facts about the Netherlands that I bet you never knew. Many of the data come from the Eurostat database and the Statistics Netherlands database, but also from miscellaneous other sources on the web.

Bulb season at the Keukenhof Gardens, in the Netherlands.
In this period the Netherlands exported 4,600 million kilograms of vegetables with a market value of 4,200 million Euros. It is worth noting that a considerable amount of the export is actually re-export, particularly of products imported from south-east Asia. Nevertheless, about 24,000 hectares of the Netherlands is devoted to vegetables in total, plus about 19,000 ha dedicated to fruit.

Hardly anyone realizes that the Netherlands is the world’s biggest producer of onions, with 1,353,000 tonnes in 2012. For comparison, Spain produced only 1,169,700 tonnes. The Netherlands is also the leading exporter of button mushrooms (40% of the export market) followed by China, France, Spain, Hong Kong, Taiwan, Indonesia and South Korea. The U.S.A. is the largest consumer, accounting for one third of world production.

The Dutch are the biggest exporters of seed in the world, exporting some 1,500 million Euros worth every year. About 15,000 hectares of land is given over to nurseries and perennial plants, and about 3,000 ha to floricultural crops. Much of the horticultural production is under glass (c. 10% of the area). For example, within the European Union, Spain has about 65,000 hectares of commercial glasshouses, and Italy has c. 35,000 ha; but the Netherlands is third, with c. 10,000 ha.

However, it is bulbs for which the Netherlands is most famous, both as cut flowers and as whole plants. The Netherlands has 86% of the world area for tulip production, with about 10,800 hectares. It also has about 75% of the lily production area (4,280 ha), out of a Dutch total of 23,500 hectares devoted to bulbs. Sales total 1,320 million tulip bulbs per year (1,300 million as cut flowers) out of a world total of 4,320 million (2,300 million as cut flowers).

Tulip season in South Holland. Note the line of tourist camper vans.
The introduction of the tulip to Europe is usually attributed to Ogier de Busbecq, who sent the first tulip bulbs and seeds to Vienna in 1554 from Turkey in the Ottoman Empire. The Turks still tell the story of the first recipients trying to fry and eat the bulbs, rather than growing them! Tulip popularity and cultivation in the Netherlands probably date from 1593, when the Flemish botanist Carolus Clusius planted his collection of tulip bulbs at the Hortus Botanicus in Leiden. To this day, the area immediately north of Leiden is the heart of the Dutch bulb industry (see the picture above).

However, in spite of all of this plant production, most of the farmland in the Netherlands is actually given over to animals, not plants. Only about 30% of the farmland is dedicated to plants, while 56% is reserved for grazing livestock, as shown in the next graph.

Annual area of Dutch farmland this century. Note the log scale.
Given all of this primary production, it is not unexpected that the Netherlands has the best balance of trade for raw materials within the European Union, with a surplus of 4,500 million Euros per year (see the first network below). This is followed by Sweden (3,000 million). The only other EU countries with a positive balance are Denmark, Latvia, Romania, Ireland and Estonia. Last are Italy and Germany, each with a deficit of 12,000 million Euros per year.

NeighborNet (based on the Manhattan distance) of the 2001-2012 data for international trade of raw materials
for the member countries of the European Union. Countries near each other in the network have a similar
balance of trade, while countries further apart are progressively more different from each other.
Furthermore, the Netherlands has the second best overall balance of trade (i.e. including manufacturing) in the European Union, with a surplus of 37,000 million Euros per year (see the next network). This is way behind Germany (153,000 million), but just ahead of Ireland (35,500 million). The only other countries to have consistently had a positive balance of trade this century are Sweden, Belgium and Denmark. The United Kingdom has fared much the worst, with a deficit of 116,000 million Euros per year.

NeighborNet (based on the Manhattan distance) of the 2001-2012 data for total international trade
for the member countries of the European Union. Countries near each other in the network have a similar
balance of trade, while countries further apart are progressively more different from each other.
The Netherlands is also famous for having 50% of its alleged land area less than 1 metre above sea level (see the map below); and indeed 20% of the land is actually below sea level. Throughout the latter region, the water flows uphill into the rivers and canals, a feat that is not usually achieved anywhere else on the planet. This has traditionally been accomplished with windmills, of course.

This situation has occurred because during the first millennium AD much of the land was washed out into the North Sea, and during the second millennium the Dutch tried to get it back again. In particular, Lake Flevo became the South Sea, and was then reclaimed as Lake IJssel. These days, rather massive sea-dykes are used to keep the water at bay.

The Netherlands with respect to the Amsterdam Ordnance Datum (NAP).
So, where are the Dutch managing to do all of this agricultural production? I suspect that the creators of Dr Who invented the Tardis after a visit to the Netherlands, since clearly the Netherlands is larger on the inside than it appears to be on the outside.

March 11, 2014


Biologists have this idea that when any of us uses a formal name then we should all be talking about the same thing. To this end, various codes of nomenclature have been proposed and agreed to over the centuries, notably those based on hierarchical ranking (eg. International Code of Nomenclature for Algae, Fungi, and Plants; International Code of Zoological Nomenclature; International Code of Nomenclature of Bacteria; International Code of Nomenclature for Cultivated Plants). Others have not been universally agreed to, and are not yet being used (eg. BioCode; PhyloCode).

Of the latter group, the Phylocode is not dead, but it is certainly hibernating. As explained by Mike Keesey (The PhyloCode Has a Deadline):
The PhyloCode (more verbosely, the International Code of Phylogenetic Nomenclature) is a proposed nomenclatural code, intended as an alternative to the rank-based codes. It was first drafted in April 2000, and at that time the starting date was given as "1 January 200n". On this date the code would be enacted and published along with a companion volume, which would provide the first definitions under the code, establishing best practices and defining the most commonly used clade names across all fields of biology.
Well, the '00s came and went without the code being enacted. The hold-up was not the code itself, which has been at least close to its final form since 2007. (The last revision, in January 2010, was minor.) And it hasn't been the software for the registration database, which has been completed. The hold-up was the companion volume, which turned out to be a much more daunting project than expected.
There is a new progress report for Phylonyms, the companion volume to the PhyloCode. There will be at most 268 entries. Currently 186 of those (over two thirds) have already been accepted. The rest are at various stages of review. The contract with University of California Press calls for the manuscript to be submitted by September 1, 2014.

Reticulation

Of interest to us here at this blog is how the Phylocode treats reticulate evolution. In the rank-based nomenclatural codes (eg. ICN, ICZN), reticulate evolution is ignored. Each named group at any given rank is mutually exclusive, so that each taxon can be part of only one of the named groups. This naming scheme can be used to represent hierarchical relationships but not reticulate ones.

With regard to the Phylocode, Philip Cantino explicitly addressed this issue at the Botany 2008 conference (The taxonomic treatment of hybrid derivatives under the ICBN and the PhyloCode):
By convention, ranked taxa must be either nested or mutually exclusive, but clades that include species of hybrid origin may be partially overlapping. Consequently, reticulate evolution presents a challenge for phylogenetic systematists using traditional rank-based taxonomy and nomenclature, where a species can belong to only one taxon at a given rank. Assignment of a species derived from an intersectional (or intersubgeneric or intergeneric) hybrid to only one of its parental sections (or subgenera or genera) renders the other parental taxon at the same rank paraphyletic. When classifying such hybrids using a ranked hierarchy, one must reject either the convention that an organism can only belong to one taxon at a given rank or the convention that paraphyletic groups should not be formally recognized. Phylogenetic nomenclature accurately reflects the complex patterns of descent that result from hybridization, in that a species of hybrid origin belongs to all of the named clades that contain each of its parents. Thus, the expectation that named supraspecific taxa be monophyletic is maintained in spite of hybridization.

Putting aside the obvious suggestion that we could allow named groups to be paraphyletic (which they can be under the rank-based codes but not the Phylocode), the suggestion that organisms can belong to more than one named group (which they can under the Phylocode but not the other codes) is an interesting departure from tradition. It explicitly recognizes the existence of fuzzy groups, which can overlap.


The Phylocode has little to say explicitly about reticulation, but what it does say is clear:
Note 2.1.3. Clades are often either nested or mutually exclusive; however, phenomena such as speciation via hybridization, species fusion, and symbiogenesis can result in clades that are partially overlapping.
Note 2.2.1. Here and elsewhere in this code, "phylogenetic tree" is used loosely to include any directed graph, specifically those with additional connections representing phenomena such as hybridization (see Note 2.1.3).
Note 9.3.2. The application of a phylogenetic definition, and thus also of a phylogenetically defined clade name, requires a hypothesized phylogeny. To accommodate phenomena such as speciation via hybridization, species fusion, and symbiogenesis (see Note 2.1.3), the hypothesized phylogeny that serves as the context for the application of a phylogenetically defined name need not be strictly diverging.
Chapter VI. Provisions for Hybrids
Article 16.
16.1. Hybrid origin of a clade may be indicated by placing the multiplication sign (×) in front of the name. The names of clades of hybrid origin otherwise follow the same rules as for other clades.
16.2. An organism that is a hybrid between named clades may be indicated by placing the multiplication sign between the names of the clades; the whole expression is then called a hybrid formula.
Recommendation 16.2A. In cases in which it is not clear whether a set of hybrid organisms represents a clade (as opposed to independently produced hybrid individuals that do not form a clade), authors should consider whether a name is really needed, bearing in mind that formulae, though more cumbersome, are more informative.

In many ways, the sentiments expressed here about phylogenetics are the same as those engendered in the recent announcement of the NSF Genealogy of Life program (GoLife) (see NSF and reticulating phylogenies): a genealogy does not have to be tree-like.


In one sense, we should applaud the creators of the Phylocode for explicitly addressing an issue that has traditionally been ignored by the creators of the other codes (who have ignored phylogeny), as well as by tree-based phylogeneticists (who seem to think that phylogenies consist only of nested monophyletic groups).

Previous suggestions for dealing with hybrids look a bit like an attempt to sweep all of the problems together into separate piles, and then simply labeling them "problem piles" (see How should we treat hybrids in a taxonomic scheme?). This is very much what is done, for example, under the International Code of Nomenclature for Algae, Fungi, and Plants. Here, hybrids are treated as separate taxa, and are named as such using a "hybrid formula" that applies to distinct "nothotaxa".

Furthermore, species separately derived from the same ancestral gene pool are considered to be distinct species, and are named appropriately. However, hybrids derived independently from crosses of the same two species appear to be treated in botany as being the same taxon, and thus share the same name. For example, the ICN states: "Elymus ×laxus is the correct name applicable to all hybrids between E. farctus and E. repens" and "the correct nothospecific designation for all hybrids between Euphorbia amygdaloides and E. characias is E. ×martini". Multiple origins are not considered.

Unfortunately, while the Phylocode does better than this, the potential consequences of the Phylocode rules may be somewhat messy. For example, introgression is an extensive phenomenon in zoology and especially botany, and if we were to take the Phylocode literally then a huge number of populations would have multiple species names. Moreover, horizontal gene transfer creates relationships between distant taxa, so that species would have names in two unrelated groups (eg. an animal name and a viral name). Finally, symbiogenesis means that all of the eukaryotes would have both a eukaryote name and a proteobacterium name (since that is where their mitochondrion probably originated), and all of the plants would also have a cyanobacterium name (since that is where their chloroplast probably originated).

On one hand, fuzzy groups are a reality in phylogenetics, as a result of reticulate evolutionary histories. On the other hand, there is a good practical reason why the traditional codes of nomenclature are based on mutually exclusive groups. The only complete and accurate representation of group relationships is the phylogeny itself, and trying to name groups that represent only parts of the phylogeny is a poor substitute for that diagram. This is the dilemma faced by the Phylocode, that in practice it is trying to substitute names for relationships.

March 9, 2014


My wife and I recently bought a new old car (it is new for us, but it was 1 year old when we bought it). Being a scientist, part of the choosing procedure involved me trying to find out which cars in my price range might be recommended by their owners. There are several organizations who annually try to find out the same thing, and so I had a look at some of their data. I thought that I might share some of the results with you.

I live in Europe, and so the two data sources that were of most interest are the Vi Bilägare AutoIndex survey, in Sweden, and the Auto Express Driver Power Survey, in the United Kingdom. Every year, these surveys ask car owners how they have fared recently with their near-new cars (up to 5 years old). For the data here, we are concerned solely with the data aggregated by manufacturer (rather than for individual car models).

For the analysis, I have chosen the data for the years 2011-2013 inclusive, because they were available from both organizations, and I rescaled the numbers to the common range 0-100. Several car manufacturers could not be included because of missing data from some years: Alfa Romeo, Chevrolet, Jaguar, Land Rover, Porsche, and Smart. That still leaves 25 manufacturers in the dataset.

As usual, I used the Manhattan distance and a neighbor-net network to produce the graph. Car manufacturers near each other in the network have similar scores across the two countries and three years, while manufacturers that are further apart are progressively more different from each other.
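
For readers who want to try something similar, here is a minimal sketch of the distance step only (the neighbor-net itself is built by a separate network program from the resulting matrix). The manufacturer names and scores are hypothetical, and the linear rescaling shown is an assumption, since the exact rescaling used for the survey data is not described here.

```python
def rescale(values, lo, hi):
    """Linearly map values from the range [lo, hi] onto 0-100 (assumed method)."""
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def manhattan(p, q):
    """Manhattan distance: the sum of absolute differences between two profiles."""
    return sum(abs(x - y) for x, y in zip(p, q))


def distance_matrix(profiles):
    """Return the sorted names and the pairwise Manhattan distances."""
    names = sorted(profiles)
    matrix = [[manhattan(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix


# Hypothetical scores (already on a 0-100 scale), one value per survey-year.
profiles = {
    "MakerA": [88.0, 87.0, 86.0, 85.0],
    "MakerB": [78.0, 77.0, 80.0, 79.0],
    "MakerC": [87.0, 88.0, 85.0, 86.0],
}

names, matrix = distance_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, row)
```

In this toy example MakerA and MakerC are 4 units apart, while each is 32 units from MakerB, so the first two would sit close together in the network.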

The average scores vary from 77 to 88, going from Fiat (at the bottom of the graph) to Lexus (at the top). There is general agreement between the two surveys and the three years, with some notable exceptions.

For example, Skoda stands out in the network because it scores much better in the U.K. than in Sweden. On the other hand, Mini scores much better in Sweden than in the U.K., which explains its involvement in the only major network reticulation.

Nissan, Hyundai and Seat also score somewhat better in the U.K. than in Sweden, while BMW and Suzuki score better in Sweden; however, these patterns are not obvious in the network.

The gradient in scores from the top of the network to the bottom shows some interesting patterns. For example, the three French manufacturers are at the bottom of the graph (Citroën, Peugeot and Renault), along with the two American-owned companies (Ford and Opel / Vauxhall). The two Korean-owned manufacturers are together in the middle (Hyundai and Kia), while the Japanese-owned companies are scattered from top to bottom, along with the European-owned companies.

Not unexpectedly, you pay for what you get. The makers of the most expensive cars are all gathered at the top of the graph (from Volvo and Audi upwards). However, the position of Saab and Volkswagen may surprise some people. Sadly, both manufacturers have a current reputation for designing excellent cars but building rather poor ones.

Finally, our motor mechanic recommended that we buy a Ford rather than a Kia. Clearly, the car owners do not agree with his assessment. Sadly, the car we were replacing was a Peugeot, and we completely agree that it deserves its location in the graph.

March 4, 2014

Splits graphs are produced by distance-based network methods such as NeighborNet and Split Decomposition, by character-based methods such as Median Networks and Parsimony Splits, and by tree-based methods such as Consensus Networks and SuperNetworks. They represent sets of node clusters that may overlap. If the clusters are nested then the graph will be tree-like, but if they overlap then the graph will show complex reticulation patterns. In the latter case, there is no simple way to summarize the patterns as a set of "groups" of nodes, although there is clearly a strong tendency in the literature for practitioners to try to do so.

I have written before about How to interpret splits graphs, in which the edges in the graph represent separation between two clusters of nodes in the network (ie. they split the graph in two). Recognizing groups of nodes should therefore be based on the splits. Ideally, each group of nodes should represent a split in the network, preferably a well-supported split.

However, if the split pattern is complex then recognizing groups of nodes will also be complex. This can be seen in the following splits graph, which is taken from the paper by Robert M. Ross, Simon J. Greenhill and Quentin D. Atkinson (2013. Population structure and cultural geography of a folktale in Europe. Proceedings of the Royal Society B 280: 20123065). The network shows the relationships among 32 ethnolinguistic cultures based on the characteristics of one of their folktales.

This network is not very tree-like, and yet the authors recognize five main ethnolinguistic groups (shown in different colors). Inspection of these groups reveals:
  • The light-orange group represents a well-supported split in the graph, and is thus uncontroversial; but none of the other groups are represented by a single split.
  • The pink group represents two splits, one clustering English, Irish, Scottish and Danish, and one clustering Danish, Latvian and German. These splits are incompatible with only one other minor split, and so the group is relatively uncontroversial.
  • The green group also represents two splits, one clustering Armenian and Turkish and one clustering Turkish and Greek. These are well-supported splits, with only minor incompatibility with other splits, and so perhaps this group is also uncontroversial.
  • The purple group is supported by a single split only if Greek is included in the group. Clearly, this conflicts with the green grouping. However, without Greek there is not much in the way of splits that support the purple grouping.
  • There is a very poorly supported split that unites the dark-orange group only if Bulgarian and Czech are included in the group. There are three well-supported splits that combine to support the group provided that Bulgarian is included. In both cases this conflicts with the purple grouping.
So, at least two of the recognized groups can be considered doubtful, as groups, based on the network alone. The authors' motivation for their groupings is at least partially based on geographical considerations:
The NeighbourNet in figure 2 represents graphically the pattern of regional clustering in folktale variation. The five clusters we identify provide insights into possible cultural spheres of influence in Europe since the folktale’s inception.

Nevertheless, it seems unwise to recognize all of the five colored regions of the network as "groups" or "clusters" of nodes, since it is not obvious that the network actually supports them all as groups. Perhaps we should call them "neighborhoods" or some other similar term, so as not to be misleading. We could define a neighborhood as a collection of nodes in close proximity in the splits graph but not necessarily representing any unique combination of well-supported splits.
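
The idea of overlapping splits can be made concrete with the standard compatibility test: two splits are compatible (they can sit on the same tree) only if at least one of the four pairwise intersections of their sides is empty. The sketch below applies this to the two pink-group splits described above, using a reduced set of nine cultures named in the text rather than the full 32; the reduced taxon set and the function names are assumptions made for illustration.

```python
def make_split(taxa, side):
    """Represent a split as a pair of complementary frozensets."""
    side = frozenset(side)
    return side, frozenset(taxa) - side


def compatible(s1, s2):
    """True if at least one of the four side-by-side intersections is empty."""
    return any(not (x & y) for x in s1 for y in s2)


# Reduced taxon set (an illustration only, not the full dataset).
taxa = {"English", "Irish", "Scottish", "Danish", "Latvian", "German",
        "Armenian", "Turkish", "Greek"}

pink_1 = make_split(taxa, {"English", "Irish", "Scottish", "Danish"})
pink_2 = make_split(taxa, {"Danish", "Latvian", "German"})
green_1 = make_split(taxa, {"Armenian", "Turkish"})

print(compatible(pink_1, pink_2))   # False: Danish is on the small side of both
print(compatible(pink_1, green_1))  # True: the two small sides do not intersect
```

Incompatible splits are what produce the boxes in a splits graph, which is why a group that depends on two such splits cannot be read off as a single cluster.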

March 2, 2014


Few people had heard of phylogenetics before 1970. It was during that decade that explicit methods for constructing phylogenetic trees came to prominence, although such methods had first appeared in the late 1950s. These methods appeared first in systematics, based on parsimony (1970s), and then in genetics, based on likelihood (1980s). These days, phylogenetics is seen as ubiquitous in biology, but it is interesting to consider whether this idea can be quantified.

Joseph Hughes (2011. TreeRipper web application: towards a fully automated optical tree recognition software. BMC Bioinformatics 12:178) had a go at this by trying to extract information from the PubMed bibliographic database. Here, I have expanded on this approach.

I searched PubMed for the string phylogen*, thus including words like "phylogeny" and "phylogenetics", as well as unusual variations on these words. I searched the full bibliographic record (including the abstract), and I also restricted the search to the Title field. I did this for every calendar year from 1970–2012 inclusive (the 2013 data are currently still incomplete in the database).

The results are shown in the first graph, and the second graph shows the details of the title search alone. The data are expressed as a percentage of the total number of PubMed records for each year.
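
A tally of this kind could be automated. The sketch below is one possible approach, assuming the NCBI E-utilities ESearch service is used with its count-only return type; it is not necessarily how the searches above were run. It only builds the query URLs and computes the percentage, without contacting the server, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (use of this service is an assumption).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_url(year, term=None, field=None):
    """Build an ESearch URL that asks for the record count in one year.

    With no term, the URL counts all PubMed records published that year.
    """
    parts = []
    if term:
        parts.append(f"{term}[{field}]" if field else term)
    parts.append(f"{year}[pdat]")  # publication-date field tag
    query = " AND ".join(parts)
    params = {"db": "pubmed", "term": query, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)


def percentage(hits, total):
    """Express the hits as a percentage of all records for the year."""
    return 100.0 * hits / total


for year in (1970, 1990, 2012):
    print(count_url(year, "phylogen*"))           # all fields
    print(count_url(year, "phylogen*", "Title"))  # title only
    print(count_url(year))                        # yearly total
```

Each URL would be fetched once per year, and the two counts for a given year passed to `percentage` to give one point on the graph.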

So, less than 2% of the current papers in biology mention phylogenetics in their title or abstracts. This does not, of course, mean that the remaining papers do not mention the topic at all, as they could do so under some other name (eg. "evolutionary tree", "genealogy", etc), or do so in a way that does not make it into the abstract. Still, it seems to me that this is a rather low number.

The erratic nature of the data before 1975 is probably a by-product of the quality of the PubMed data for that time. However, the clear upper asymptote in the data this century is not artifactual, but real. The average maximum value for the "All" data is ~1.54%, reached in 2009, while the average for "Title only" is ~0.17%, reached in 2004. This seems to imply that phylogenetics has now saturated the market, and is as ubiquitous as it will be, unless something new comes along to change it.

The initial rise in usage of the phylogenetic methods coincided with the release of computer programs that implemented them. Wagner78 was released for mainframe computers in 1978, followed by Phylip in 1980. Phylip was the first to be ported to microcomputers; but it was the release of the PC version of PAUP (v. 2.4) in December 1985 that came to dominate the next 10 years. Hennig86, the successor to Wagner78, was released in 1988.

However, the rapid growth in usage coincided with the growth of molecular genetics. The patent applications for PCR were filed in 1985, and the first paper based on it was also published that year. The technology started to be used for human diagnostics during 1986, and PCR became a basic research tool in molecular biology from c.1989. (Science selected PCR as the major scientific development of 1989.) The journal Molecular Biology and Evolution was founded in 1983, and Molecular Phylogenetics and Evolution in 1992.

The inflection point in the graph is c.1999, which indicates where the slow-down in growth occurred. Coincidentally, it was in 1999 that the Journal of Molecular Evolution announced that it would henceforth exclude molecular phylogenetics (and research on the origin of life), except in cases that have "a special significance and impact." Phylogenetics was now seen as a tool of evolutionary analysis rather than an end in itself.

By this stage, bayesian methods were being proposed, and MrBayes was released in 2001, rapidly becoming the predominant program. However, this was simply a transformation of the existing methodology, rather than being a major new component of data analysis in the way the very first programs were. Furthermore, the rise in usage of genome data seems also to be a transformation, rather than a major addition to data collection the way sequence data were.

Thus, it took 30 years (c. 1978–2008) for the phylogenetics revolution to be complete. Mind you, it had already taken about 100 years from 1859 for quantitative methods to first be proposed.

February 26, 2014


A few weeks ago I discussed the phylogenetic analysis of the tale of Little Red Riding Hood (The phylogenetics of Little Red Riding Hood). In that case, I pointed out that historical reconstructions require a rooted tree, and I discussed various possible methods for rooting the unrooted trees produced by the data analyses.

This is not the only time that phylogenetics has been applied to myths or tales. For example, d'Huy (2013a) has studied the prehistoric Polyphemus tale belonging to the European and North Amerindian areas, and d'Huy (2013b) has studied the mythological motif of the Cosmic Hunt linked to the Big Dipper constellation (typical for northern and central Eurasia and for the Americas but unknown on other continents). In the first case a binary matrix of 98 characteristics for 44 versions of the tale was used, and in the latter 93 characteristics for 47 versions. Both of these studies have rooted trees.

In the latter case, a novel method of rooting the tree was used. The unrooted tree was successively rooted with each of the likely versions of the tale as outgroup. In each case the ancestral tale (the protomyth) was reconstructed and the ancestral states of the tale's characteristics (called mythemes) were determined. The author then "selected the version that holds the majority of the wide shared mythemes (>50%) as the better root."

Unfortunately, this produced an unexpected root, as shown in the tree below. The colors in the tree refer to various geographical groupings of the tale versions.

So, I re-analyzed the data using the rooting methods that I previously applied to the Red Riding Hood analysis:
  • For the bayesian analysis, I used MrBayes (2 runs, 4 chains, 1,000,000 generations, sampling frequency 1000, 25% burnin) with a relaxed clock (with independent gamma rates model for the variation of the clock rate across lineages).
  • For the neighbor-joining tree I used the BioNJ algorithm in PAUP*, and found the midpoint root.
  • For the parsimony analysis, I used a 200-replicate parsimony-ratchet search via PAUP*, calculated the branch lengths of the majority-rule consensus tree with ACCTRAN optimization, and found the midpoint root.
These three alternative roots are also shown on the tree. They seem more likely than the published root.
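
Midpoint rooting, as used for the NJ and parsimony trees, places the root halfway along the longest leaf-to-leaf path in the tree. The following is a minimal sketch of that idea on a hypothetical five-branch tree; it is not the PAUP* implementation, and the node names and branch lengths are invented for illustration.

```python
from collections import defaultdict


def build(edges):
    """Turn (node, node, length) triples into an adjacency map."""
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    return adj


def farthest(adj, start):
    """Return (node, distance, path) for the node farthest from start."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for nxt, length in adj[node].items():
            if nxt != parent:
                stack.append((nxt, node, dist + length, path + [nxt]))
    return best


def midpoint_root(edges):
    """Return the branch holding the midpoint and the offset from its first node."""
    adj = build(edges)
    end1, _, _ = farthest(adj, next(iter(adj)))
    _, longest, path = farthest(adj, end1)  # longest leaf-to-leaf path
    half, walked = longest / 2.0, 0.0
    for u, v in zip(path, path[1:]):
        if walked + adj[u][v] >= half:
            return (u, v), half - walked
        walked += adj[u][v]


# Hypothetical unrooted tree: leaves A-D, internal nodes X and Y.
edges = [("A", "X", 1.0), ("B", "X", 2.0), ("X", "Y", 1.0),
         ("C", "Y", 2.0), ("D", "Y", 6.0)]

branch, offset = midpoint_root(edges)
print(branch, offset)  # ('D', 'Y') 4.5
```

Here the longest path runs between B and D (length 9), so the root falls on the branch leading to D, 4.5 units from that leaf. The method assumes roughly clock-like rates, which is why a single long branch can pull the root towards it.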

Geographically, the root chosen by the author's method is within the red group (tales from Asia), based on the idea that "arguments in favour of localization of protypical Cosmic Hunt in Asia seem persuasive (Berezkin 2005)." Unfortunately, this a priori argument seems to have excluded any testing of the possibility that more than one version is the sister to the remaining tales — that is, only single outgroups were considered.

On the other hand, all three of the alternative roots group the tales into two major clades. For the bayesian-clock root the two clades have distinct animal motifs, a herbivore and a carnivore, respectively. These clades do not correspond to any of the three variants recognized by Berezkin (2005).

The bayesian-clock root puts the red-colored (Asia) versions of the tale into one of the two major clades, as it also does with the orange group (Africa), which makes this root more consistent with the geographical groupings — that is, all of the geographical groups are in only one of the two major clades, except for the purple group (American coast-plateau / British Columbia). Both the Parsimony and NJ roots do the same thing, but as well as the purple group they also split the pink group (northeastern America) between the two major clades, which reduces their geographical consistency compared to the bayesian-clock root.

The bayesian-clock root does not support the suggestion that the Cosmic Hunt myth originated in Asia. Indeed, the bayesian tree does not support any particular geographical location. Furthermore, the polyphyly of the purple group presents an intriguing aspect of the tale's history.


Yuri Berezkin (2005) The cosmic hunt: variants of a Siberian—North-American myth. Folklore 31: 79-100.

Julien d'Huy (2013a) Polyphemus (Aa. Th. 1137): a phylogenetic reconstruction of a prehistoric tale. Nouvelle Mythologie Comparée 1: 1-21.

Julien d'Huy (2013b) A cosmic hunt in the Berber sky: a phylogenetic reconstruction of a Palaeolithic mythology. Les Cahiers de l’AARS 16: 93-106.

February 24, 2014


Today is the second anniversary of starting this blog, and this is post number 222. Thanks to all of our visitors over the past two years — we hope that the next year will be as productive as this past one has been.

I have summarized here some of the accumulated data, in order to document at least some of the productivity.

As of this morning, there have been 104,211 pageviews, with a median of 129 per day. The blog has continued to grow in popularity, with a median of 70 pageviews per day in the first year and 189 per day in the second year. The range of pageviews was 69-812 per day during this past year, and 3-667 the previous year. The daily pattern for the two years is shown in the first graph.

Line graph of the number of pageviews through time, up to today.
The largest values are off the graph. The green line is the half-way mark.
The inset shows the mean (blue) and standard deviation of the daily number of pageviews.
The erratic nature of the daily variation is apparently all too typical of blogs, and there appears to be no good explanation for it.  So, we might take this as a good example of the stochastic nature of the web.

There are a few general patterns in the data, the most obvious one being the day of the week, as shown in the inset of the above graph. The posts have usually been on Mondays and Wednesdays, and these two days have had the greatest mean number of pageviews.

Some of the more obvious dips occur at times such as Christmas - New Year, and the biggest peaks are associated with mentions of particular blog posts on popular sites. There also continue to be a few instances of "rogue" visits. These tend to be visits from sites such as Referer and Vampirestat.

The posts themselves have varied greatly in popularity, as shown in the next graph. It is actually a bit tricky to assign pageviews to particular posts, because visits to the blog's homepage are not attributed by the counter to any specific post. Since the current two posts are the ones that appear on the homepage, these posts are under-counted until they move off the homepage (after which they can be accessed only by a direct visit to their own pages, and thus always get counted). On average, 30% of the blog's pageviews are to the homepage, rather than to a specific post page, and so there is considerable under-counting.

Scatterplot of post pageviews through time, up to last week; the line is the median.
Note the log scale, and that the values are under-counted (see the text).
It is good to note that the most popular posts were scattered throughout the two years. Keeping in mind the initial under-counting, the top collection of posts (with counted pageviews) have been:
8 The Music Genome Project is no such thing
Charles Darwin's unpublished tree sketches
Carnival of Evolution, Number 52
The acoustics of the Sydney Opera House
Why do we still use trees for the dog genealogy?
Faux phylogenies
Who published the first phylogenetic tree?
Evolutionary trees: old wine in new bottles?
Network analysis of scotch whiskies
Tattoo Monday IV
Metaphors for evolutionary relationships
Phylogenetics with SpongeBob
Tattoo Monday 4,552
1,051

This is quite a different list to the same time last year. Posts 129, 42 and 172 continue to receive visitors almost every day.

The audience for the blog continues to be firmly in the USA. Based on the number of pageviews, the visitor data are:
United States
United Kingdom
0.8%

You will note that this list is dominated by English-speaking countries. The blog does have a link to Google Translate to help other people, but it is clear that the audience is made up almost entirely of those people who are comfortable with English (or Australian, at any rate).

Finally, if anyone wants to contribute, then we welcome guest bloggers. This is a good forum to try out all of your half-baked ideas, in order to get some feedback, as well as to raise issues that have not yet received any discussion in the literature. If nothing else, it is a good place to be dogmatic without interference from a referee!

February 18, 2014


Over the past two years I have published a number of posts in which I have used a data-display network as a multivariate data summary, comparable to an ordination (eg. PCA) or a cluster analysis (eg. UPGMA). This is a form of exploratory data analysis.

Here, I wish to point out that a multivariate data summary is not always necessary, even when the data are multivariate in form.

As an example, I will use the official census data on retail book sales in the USA. The monthly data are provided by the United States Census Bureau for the years 1992-2013 at:
The data include census code 4512, which covers "Book Stores, General", "Specialty Book Stores" and "College Book Stores". The data notes say: "Estimates are shown in millions of dollars, and are based on data from the Monthly Retail Trade Survey, Annual Retail Trade Survey, and administrative records." I downloaded the data on 17 February 2014.

These data are multivariate. For example, if each year is taken as a sample object, then there are data for 12 variables for each sample (one for each month). Any multivariate data analysis can therefore be applied to this dataset.

In the usual manner, I have used the Manhattan distance and a neighbor-net network. Years that are closely connected in the network are similar to each other based on the 12 monthly sales figures, and those that are further apart are progressively more different from each other.
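For readers who have not met it, the Manhattan distance between two years is simply the sum of the absolute differences between their 12 monthly figures. Here is a minimal sketch in Python; the three "years" and their sales values are made-up numbers for illustration only (they are not the Census Bureau data), and the function names are my own. The resulting pairwise distances are the kind of input that a neighbor-net program would then take.

```python
from itertools import combinations

# Hypothetical monthly sales (millions of dollars) for three made-up years.
# These are NOT the Census Bureau figures; they only illustrate the arithmetic.
sales = {
    "year_A": [10, 8, 8, 8, 9, 9, 9, 14, 12, 9, 9, 15],
    "year_B": [11, 9, 8, 8, 9, 10, 9, 15, 13, 9, 10, 16],
    "year_C": [14, 11, 10, 10, 11, 11, 11, 18, 15, 11, 11, 18],
}


def manhattan(x, y):
    """Sum of absolute differences between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("both samples need the same number of variables")
    return sum(abs(a - b) for a, b in zip(x, y))


def distance_matrix(samples):
    """Pairwise Manhattan distances, stored symmetrically in a dict."""
    names = sorted(samples)
    dist = {(name, name): 0 for name in names}
    for a, b in combinations(names, 2):
        d = manhattan(samples[a], samples[b])
        dist[(a, b)] = d
        dist[(b, a)] = d
    return dist


dist = distance_matrix(sales)
for (a, b), d in sorted(dist.items()):
    print(a, b, d)
```

In this toy example year_A and year_B are close together while year_C is far from both, which is what one would expect to see reflected in the network.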

However, all that the data show is a gradient clockwise from the top: sales rose from 1992, reached a peak in 2007, and then declined again. In other words, the data form a simple time series, and all that is actually needed is to plot them that way.

So, this same pattern could be displayed more simply by graphing the yearly averages, as shown in the next graph. A network is complete overkill in this case. I presume that the recent decrease in retail book sales has something to do with the rise of e-book sales.
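The yearly summary is nothing more than the mean of the 12 monthly values for each year. As a rough sketch, again in Python, and again with invented numbers rather than the real census figures (the year labels 1 to 3 and the function names are hypothetical):

```python
from statistics import mean

# Hypothetical monthly sales for three made-up years, labelled 1 to 3.
# These are NOT the Census Bureau figures.
monthly_by_year = {
    1: [10] * 11 + [22],
    2: [12] * 11 + [24],
    3: [11] * 11 + [23],
}


def yearly_averages(data):
    """Mean of the monthly values for each year, in year order."""
    return {year: mean(values) for year, values in sorted(data.items())}


def peak_year(averages):
    """The year with the largest average."""
    return max(averages, key=averages.get)


averages = yearly_averages(monthly_by_year)
for year, value in averages.items():
    print(year, value)
print("peak:", peak_year(averages))
```

Plotting these averages against the year is then a one-line job in any graphing program, which is the point: a rise followed by a fall is visible without any multivariate machinery.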

We could also plot the monthly sales, while we are at it. The peaks in late summer and at Christmas are very distinct. Presumably people are buying books to read in summer, and to give away at Christmas.

Finally, note that not all time series can be plotted in a simple manner. If the time patterns are complex, then a multivariate analysis, such as a network, will probably be of some use as a data display.

February 16, 2014

In a previous blog post I noted that there are many images of phylogenetic trees on the internet but there are very few for phylogenetic networks, and so I provided a Network road sign. Here, I provide three more images, plus a tree (which is actually a commercial t-shirt design).
