The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis


July 29, 2014


This post is just to let everyone know that Dan Gusfield's long-awaited book on the interface between phylogenetics and population genetics is now available.

The book is targeted for mathematically inclined readers. It has a few contributions from Charles H. Langley, Yun S. Song and Yufeng Wu. The title is described as "a portmanteau word derived from the single-crossover recombination of the words 'recombination' and 'combinatorics'."

Hardcover 448 pp; ISBN: 9780262027526; $60.00 £30.95
More information is available from The MIT Press.

This new book joins these previous contributions to the genre:

Image from Celine Scornavacca.

July 27, 2014


On pages 72-73 of the book Guide to Urban Moonshining: How to Make and Drink Whiskey (written by Colin Spoelman and edited by David Haskell, 2013, published by Harry N. Abrams), there is an illustration of something called the "American whiskeys family tree". This is reproduced in the article The Bourbon Family Tree for GQ magazine, from where I sourced the copy here.

The author describes it as follows:
This chart shows the major distilleries operating in Kentucky, Tennessee, and Indiana, grouped horizontally by corporate owner, then subdivided by distillery. Each tree shows the type of whiskey made, and the various expressions of each style of whiskey or mash bill, in the case of bourbons. For instance, Basil Hayden's is a longer-aged version of Old Grand-Dad, and both are made at the Jim Beam Distillery.
So, while the vertical axis is indeed a time scale, the trees are only marginally family trees in the genealogical sense. This is much more an attempt to illustrate the corporate ownership of American whiskey, which is made principally from corn (and thus is generically called bourbon, although in Tennessee they seem to rarely use this word). The main distinctions among the brands are (i) whether the non-corn part is made from rye, a little bit more rye, or wheat, and (ii) the length of time it is aged between distillation and sale.

The reticulations among the trees apparently refer to blends. The ghost lineages at the right are described thus:
Willett, formerly only a bottler as Kentucky Bourbon Distillers, has been distilling its own product for about a year; I include the brands that it bottles from other sources for reference.

July 22, 2014


I have written before about the expected genetic problems associated with inbreeding, including consanguinity and incest (relationships between people who are first cousins or closer). Conventionally, the evolutionary advantage of sexual over non-sexual reproduction is considered to be the creation of genetic diversity through heterozygosity. Inbreeding, by reducing heterozygosity, then seems to negate the advantages of sexual reproduction — it leads to the propagation of deleterious recessive alleles and thus inbreeding depression. So, there is a clear evolutionary dimension to the fact that incest avoidance is nearly universal in humans.

The best known exceptions to this situation are among royalty, including the family "trees" of the ancient Egyptian 18th Dynasty (see Tutankhamun and extreme consanguinity) and the Egyptian Ptolemaic dynasty (see Cleopatra, ambition and family networks), which were hybridization networks rather than conventional trees. The presence of consanguinity and incest among royal families then requires a biological explanation. As noted by van den Berghe & Mesher (1980):
Royal incest is best explained in terms of the general sociobiological paradigm of inclusive fitness ... Royal incest (mostly brother-sister; less commonly father-daughter) represents the logical extreme of hypergyny. Women in stratified societies maximize fitness by marrying up; the higher the status of a woman, the narrower her range of prospective husbands. This leads to a direct association between high status and inbreeding.
The benefits of inclusive fitness refer to the increased number of offspring in future generations that result from increasing the reproductive success of close relatives. This is achieved via choice of mate. In other words, close relatives share genes, and the success of any relative in leaving offspring is a success for all relatives. Therefore, evolutionary fitness is a combination of individual fitness plus the fitness of close relatives. Inbreeding may reduce individual fitness but can increase inclusive fitness, as noted by Puurtinen (2011):
Theoretical work has shown that inclusive fitness benefits can favor close inbreeding even when this results in substantial reduction in offspring fitness. These models have identified the boundary level of inbreeding depression limiting the evolution of inbreeding among first-order relatives, that is, between full siblings, or between parents and offspring.
So, there is a stable level of inbreeding in those populations that practice mate choice for optimal inbreeding. For example, the genetic risks of close inbreeding can be more than accounted for by the production of a highly related heir who has access to a wide choice of mates. Nevertheless:
For a wide range of realistic inbreeding depression strengths, mating with intermediately related individuals maximizes inclusive fitness.
In other words, mating with very close relatives is unlikely to evolve via natural selection because it is not an optimal strategy; and we must thus look to a sociological component to incest (such as retaining wealth within the family), as well as a biological one.

In this context, it is interesting to note exceptions to the usual restriction of incest to the aristocracy. The society of Graeco-Roman Egypt (from c. 300 BCE to 300 CE) provides the best-documented case (e.g. see Hopkins 1980; Shaw 1992; Parker 1996; Scheidel 1997; Huebner 2007; Remijsen & Clarysse 2008). [This era starts with the Ptolemaic dynasty, which marks the collapse of Egyptian rule of Egypt.] During this time a significant proportion of all marriages noted in official Roman census declarations were between full brothers and sisters. That is, the Roman-era Egyptians did not limit this type of inbreeding to any small group, but spread it across several social classes (mainly Greek settlers rather than native Egyptians).

As noted by Scheidel (1997):
According to official census returns from Roman Egypt (first to third centuries CE) preserved on papyrus, 23·5% of all documented marriages in the Arsinoites district in the Fayum (n=102) were between brothers and sisters. In the second century CE, the rates were 37% in the city of Arsinoe and 18·9% in the surrounding villages. Documented pedigrees suggest a minimum mean level of inbreeding equivalent to a coefficient of inbreeding of 0·0975 in second century CE Arsinoe. Undocumented sources of inbreeding and an estimate based on the frequency of close-kin unions indicate a mean coefficient of inbreeding of F=0·15-0·20 in Arsinoe and of F=0·10-0·15 in the villages at the end of the second century CE. These values are several times as high as any other documented levels of inbreeding.
For comparison, the inbreeding F values for these family relationships are:
parent-offspring = siblings
uncle-niece = double first cousins
first cousins
first cousins once removed
second cousins 0.500
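
For reference, the standard pedigree value of F for the child of each type of union can be worked out with Wright's path-counting formula, which adds (1/2)^(n1+n2+1) for every common ancestor, where n1 and n2 are the numbers of generations from each parent up to that ancestor. The following is only a minimal sketch under the usual textbook assumptions (the common ancestors are themselves non-inbred, and the parents are related in no other way); the `PATHS` encoding and the function name are my own illustrative choices, not something taken from the cited papers.

```python
from fractions import Fraction

# For each kind of union, one (n1, n2) pair per common ancestor of the two
# parents: n1 and n2 are the generations from each parent up to that ancestor.
# (Hypothetical encoding, chosen for illustration only.)
PATHS = {
    "parent-offspring": [(0, 1)],               # the parent is the ancestor
    "full siblings": [(1, 1)] * 2,              # mother and father
    "uncle-niece": [(1, 2)] * 2,                # the uncle's two parents
    "double first cousins": [(2, 2)] * 4,       # all four grandparents
    "first cousins": [(2, 2)] * 2,              # one shared grandparent pair
    "first cousins once removed": [(2, 3)] * 2,
    "second cousins": [(3, 3)] * 2,             # one shared great-grandparent pair
}


def inbreeding_coefficient(paths):
    """Wright's F for a child, assuming non-inbred common ancestors."""
    return sum(Fraction(1, 2) ** (n1 + n2 + 1) for n1, n2 in paths)


if __name__ == "__main__":
    for name, paths in PATHS.items():
        f = inbreeding_coefficient(paths)
        print(f"{name:28s} F = {f} = {float(f):.6f}")
```

On these assumptions the values halve with each step down the list, from 1/4 for sibling or parent-offspring unions to 1/64 for second cousins, which is a useful yardstick for the population means of 0·10-0·20 quoted above.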
However, inbreeding depression seems not to have been a notable problem during this historical time. As noted by John Hawkes:
There is not a single mention in the evidence that links sibling marriage to negative genetic effects or unhappy marriages.
This does not mean that there were no problems, but merely that any problems were not documented, as noted by Scheidel (1997):
Even in the absence of explicit references to inbreeding depression from Roman Egypt, there is no compelling reason to assume that brother–sister marriage could have remained entirely without negative consequences for the Arsinoites. It is however possible that, due to a low incidence of lethal recessives, such effects were considerably weaker than in some western samples. The census returns do not suggest lower levels of fertility or smaller numbers of children among sibling couples ...
The practice seems to have stopped solely because it was contrary to Roman Law:
Before a.d. 212 the Romans had accepted discrepancies between their own legal practice and prevailing local customs and traditions in the Eastern provinces. Papyri from Roman Egypt, the Talmud, and the Romano-Syrian law book indeed reveal legal procedures which differed significantly from Roman law in matters such as marriage, guardianship, paternal authority, sales, and debts. The Constitutio Antoniniana, however, made all free men and women of the Roman Empire into Roman citizens, and so Roman law became applicable to all inhabitants of Egypt. Brother-sister marriages cease to be documented in our Roman census returns from the early third century on. Our last [incest] testimony dates to a.d. 229.

Hopkins K (1980) Brother-sister marriage in Roman Egypt. Comparative Studies in Society and History 22: 303-354.

Huebner SR (2007) "Brother-sister" marriage in Roman Egypt: a curiosity of humankind or a widespread family strategy? Journal of Roman Studies 97: 21-49.

Parker S (1996) Full brother-sister marriage in Roman Egypt: Another look. Cultural Anthropology 11: 362-376.

Puurtinen M (2011) Mate choice for optimal (k)inbreeding. Evolution 65: 1501-1505.

Remijsen S, Clarysse W (2008) Incest or adoption? Brother-sister marriage in Roman Egypt revisited. Journal of Roman Studies 98: 53-61.

Scheidel W (1997) Brother-sister marriage in Roman Egypt. Journal of Biosocial Science 29: 361-371.

Shaw BD (1992) Explaining incest: brother-sister marriage in Graeco-Roman Egypt. Man 27: 267-299.

July 20, 2014


I have commented before on the fact that the general public associates an inappropriate "March of Progress" image with the concept of "evolution" (see Haeckel and the March of Progress, and especially Tattoo Monday VIII - the March of Progress). It therefore seems worthwhile to gather a few examples together in the one place. Most of these are abbreviated versions of the image in the book Early Man by Francis C. Howell (1965. Time-Life International, New York). There were originally 14 images (see the version here), but the modern versions have half as many or fewer.

July 16, 2014

We all worked hard during the workshop. Here is our fearless leader, in deep thought:

While some of the younger participants enjoyed drawing on the walls:

Professor Whitfield has come up with a great new model of evolution: phylogenetic windmills:

There was not only work, but also time to relax and enjoy the beautiful Dutch summer weather:

And not to forget the delicious Dutch food:

But really, most of the time we were busy touching the data, which you can find on this website:

For more photos, see the Touching the Data website.

July 11, 2014


We have now completed the workshop.

Since the first report, we have had three more talks. First, Mukul Bansal outlined the relationship between phylogenetic networks and reconciliation analysis, and the way in which the latter can be used to construct the former. Starting from an estimated species tree, the tree for each locus is optimized for fit to the species tree, which helps locate any areas of extensive gene flow (i.e. reticulation). This can be done using a large number of loci and an even larger number of taxa.

Celine Scornavacca provided details of some of the fundamental limitations of network analysis. The most important of these is unidentifiability of network topologies -- there are classes of network topologies that cannot be distinguished based on the information that is currently used, so that we cannot guarantee that a unique optimal network will be found during an analysis. Branch lengths may help with this situation, but are not guaranteed to resolve it.

Jim Whitfield covered the advantages and potential problems of using genomic-scale data for phylogenetic analysis. The basic problem is the increased scope for error in moving to the genome data (genome assembly problems, gene homology issues, alignment difficulties), although the potential advantages are extensive.

Most importantly, we spent two days "touching" some data. The participants broke into smaller groups of continuously varying size, each of which focussed on a particular dataset (as supplied by some of the participants). These data were evaluated in many different ways, to assess the characteristics of the data as well as to evaluate the data-analysis methods. This not only allowed us to identify the current state of the art with respect to phylogenetic networks, but it also allowed computationalists to improve their understanding of biological data and how biologists proceed to analyze it, as well as allowing biologists to obtain immediate feedback with respect to their data-analysis issues.

Production of phylogenetic networks seems to have come a long way in the past few years, although there is still no single "one-stop shopping" software tool to use. Practical issues in getting programs to run on all computer types were identified, along with data-format issues. Nevertheless, all of the participants seemed to find that this was a very valuable exercise, as a means of focussing interactions among themselves.

Finally, we considered both European and U.S. funding for network research, in the latter case assisted by David Mindell (from the N.S.F.). In particular, we identified sources of funding for future workshops (either in the south of France or the north-eastern U.S.A.).

The canal-boat cruise turned out well, in spite of the somewhat uncooperative weather. The football, of course, has turned out to be rather disappointing for the hosts, although they have one more game to play.

July 8, 2014


We have now completed two days of the workshop. We have had a relaxed approach to progress, and are thus currently running behind the nominal schedule. Nevertheless, we are progressing splendidly.

We had three talks on the first day and one today. I tried to kick things off by asking a series of what I consider to be unanswered questions from observing practitioners and computationalists in action, although apparently several members of the audience already had their own answers to some of these. The bottom line is that phylogenetic analysis focuses on data patterns while interpretation focuses on processes / mechanisms, and this constitutes a large part of the apparent separation of practitioners and computationalists.

Steven Kelk and Luay Nakhleh introduced the diversity of computational approaches that we already have. These presentations neatly complemented each other, providing a valuable summary of the field as well as an overview of current limitations and future prospects. This topic was taken up later by various members of the audience, as one of the inherent problems for practitioners is how to navigate through the methods to choose a suitable one -- there are methods based on parsimony, likelihood and Bayesian analysis, and methods that tackle de novo network construction, gene tree / species tree reconciliation, gene tree scoring, and network presentation.

This topic was followed up today by presentations introducing some of the currently available software. Some of these have progressed significantly in recent years, notably PhyloNet and Dendroscope, and there are some relatively new ones, as well as even newer ones in the pipeline. Based on the literature, these programs are being dramatically under-used compared to their actual usefulness.

This morning Scot Kelchner introduced us to the application of Zen Buddhism to science in general and phylogenetics in particular. This went down much better than he seemed to be expecting -- there were apparently a lot of "Zen" people in the room. The basic idea is not to get trapped by preconceived expectations, especially arbitrary categorical notions, when interpreting the output of a phylogenetic analysis. You can consult The Nine-Headed Dragon River, by Peter Matthiessen, if you would like further information.

Finally, we got to the topic implied by the workshop's title: Touching the Data. We had a brief run-through of the pre-existing datasets stored with this blog (see the upper right-hand corner), which cover some of the diversity of what practitioners have provided to date in the way of usable datasets with "known" phylogenetic patterns.

By far the most interesting, however, was the presentation of some recent datasets made available by members of the workshop, notably Axel Janke (bear species), Scot Kelchner (bamboo species) and Mattis List (Indo-European languages); Jim Whitfield will present his datasets tomorrow morning. These datasets generated much interest, as they provide a diversity of different possible applications for phylogenetic networks. The idea from here on in the workshop is to address what can currently be done with these datasets and what we might like to do with them if the tools were available. This will help focus the participants on specific practical issues, which should lead to the progress that we hope to achieve.

It has rained most of the day, which is actually unusual -- intermittent rain is more common in this climate. We are currently waiting for the football to start: Germany versus Brazil. Tomorrow will be the Netherlands versus Argentina. It is risky being in this country this week! The current local betting is for an all-European final, an assessment that involves no cultural bias whatsoever.

July 6, 2014


This week we have returned to Leiden (in the Netherlands), for another workshop sponsored by the Lorentz Center. The previous workshop, in October 2012, is discussed in this prior blog post: Workshop: The Future of Phylogenetic Networks.

The full title of the new workshop is: Utilizing Genealogical Phylogenetic Networks in Evolutionary Biology: Touching the Data. As before, it has been organized by Steven Kelk, Leo van Iersel, Leen Stougie and myself. The program and abstracts can be found here. It runs for the whole week 7 July – 11 July 2014.

The workshop differs significantly from the previous workshop in two ways: it is intended to be a much smaller and more focused workshop, and it is intended to be practical rather than theoretical. The basic aim is to get biologists and computational people to sit down in small groups and actually talk about real phylogenetic data, so that each side of the phylogenetics "coin" gets to understand a bit better what is going on on the other side. To this end, we have gathered together some of the experts in the field specifically of evolutionary / genealogical networks (rather than data-display networks), as this is the area that needs the greatest future development. We have also gathered together some real-world datasets involving apparent reticulating evolution, which will be the focus of discussion. These datasets are available here and also here.

The weather is predicted to be changeable during the workshop, which is to be expected in northern Europe even in summer — that is why everyone else has gone to southern Europe.

I am hoping to add some blog posts based on what happens at the workshop, as it proceeds.

July 2, 2014


I recently wrote a manuscript comparing the tree-likeness of phylogenetic data in biology and anthropology (see Are phylogenetic patterns the same in anthropology and biology?). While doing so, I also made a comparison of genotype and phenotype data within biology.

The comparison is based on maximum-parsimony analyses of the data, using the (ensemble) Retention Index (RI) as the measure of tree-likeness. If RI = 1 then all of the characters are compatible with the same tree, whereas if RI = 0 then none of them are pairwise compatible. As the graph shows, the genotype data are considerably less tree-like than are the phenotype data (mean RI ≈ 0.5 versus 0.7, respectively).
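
For readers who have not met the Retention Index before, here is a small worked sketch of how the ensemble RI is usually defined: RI = (G - S) / (G - M), where, summed over all characters, S is the observed number of steps on the tree, M is the minimum conceivable number of steps, and G is the maximum number of steps (as on an unresolved bush). This is not the code used for the analyses (those were done with PAUP*); it assumes unordered characters, a rooted binary tree written as nested 2-tuples of leaf names, and Fitch's algorithm for counting steps. All function names are illustrative.

```python
from collections import Counter


def fitch_steps(tree, states):
    """Return (state set, step count) for one unordered character.

    `tree` is a leaf name (str) or a 2-tuple of subtrees; `states` maps
    each leaf name to its character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    # No state in common: one extra change is needed at this node.
    return left_set | right_set, left_steps + right_steps + 1


def ensemble_ri(tree, characters):
    """Ensemble Retention Index, (G - S) / (G - M), over all characters."""
    g_total = s_total = m_total = 0
    for states in characters:
        counts = Counter(states.values())
        m_total += len(counts) - 1                     # minimum steps
        g_total += len(states) - max(counts.values())  # steps on a bush
        s_total += fitch_steps(tree, states)[1]        # steps on this tree
    if g_total == m_total:
        raise ValueError("no parsimony-informative characters")
    return (g_total - s_total) / (g_total - m_total)


if __name__ == "__main__":
    tree = (("A", "B"), ("C", "D"))
    fits = {"A": 0, "B": 0, "C": 1, "D": 1}       # compatible with the tree
    conflicts = {"A": 0, "B": 1, "C": 0, "D": 1}  # needs two steps
    print(ensemble_ri(tree, [fits]))
    print(ensemble_ri(tree, [conflicts]))
    print(ensemble_ri(tree, [fits, conflicts]))
```

With one character that fits the tree and one that conflicts with it, the two extremes average out to an ensemble value halfway between them, which is roughly where the genotype datasets sit.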

It would be interesting to know whether other people have observed this pattern. If it is general, then what causes it? Are the phenotype characters being chosen (subconsciously or not) because they show nested grouping patterns (which lend themselves automatically to a tree representation)? Or do the genotype data inherently have more stochastic variation? Does this mean that we should always be using phylogenetic networks for the representation of genotype data?

You can read the manuscript if you want the details of the analyses. Briefly, the initial collections of datasets were taken from Collard et al. (Evolution and Human Behavior 27: 169-184; 2006) — the graphed data are taken from the paper as I never managed to get the original datasets from the authors. I then supplemented this information with phenotype datasets from TreeBase (total of n=31) and miscellaneous genotype datasets from the literature (n=15). All of the datasets refer to vertebrates and insects (with one phenotype dataset from spiders). My parsimony analyses used the parsimony ratchet and PAUP*.

June 28, 2014


It was 14 years ago that the Millennium started, so there are still 986 years left to solve the following seven phylogenetic network Millennium problems. These are not necessarily the most important problems to solve from a biological point of view, but are challenging computational problems that have (at least) some biological relevance. The problems are all about phylogenetic networks, except for Problems 2 and 7 which are about the closely related topic of agreement forests. Solving these problems will not be rewarded with $1,000,000 but only with eternal fame.

In each of these problems, a phylogenetic network on X is a directed acyclic graph with a single root and no vertices that have only one incoming and only one outgoing arc, and in which each leaf is labelled by an element of X and each element of X labels one leaf.
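
To make this definition concrete, the conditions can be checked mechanically. The sketch below is one possible reading, not a reference implementation: it assumes the network is given as a list of arcs (parent, child), that each leaf vertex is named by the element of X that labels it, and that there are no parallel arcs. The function name is hypothetical.

```python
from collections import defaultdict


def is_phylogenetic_network(arcs, taxa):
    """Check the definition above for a digraph given as (parent, child) arcs."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    vertices = set()
    for parent, child in arcs:
        children[parent].append(child)
        indegree[child] += 1
        vertices.update((parent, child))

    # Exactly one root (a vertex with no incoming arc).
    roots = [v for v in vertices if indegree[v] == 0]
    if len(roots) != 1:
        return False

    # Acyclic: a topological sort (Kahn's algorithm) must reach every vertex.
    remaining = {v: indegree[v] for v in vertices}
    stack, visited = list(roots), 0
    while stack:
        v = stack.pop()
        visited += 1
        for w in children[v]:
            remaining[w] -= 1
            if remaining[w] == 0:
                stack.append(w)
    if visited != len(vertices):
        return False

    # No vertex with exactly one incoming and exactly one outgoing arc.
    if any(indegree[v] == 1 and len(children[v]) == 1 for v in vertices):
        return False

    # The leaves must be exactly the elements of X, one leaf per element.
    leaves = {v for v in vertices if not children[v]}
    return leaves == set(taxa)


if __name__ == "__main__":
    arcs = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
            ("b", "h"), ("b", "z"), ("h", "y")]
    print(is_phylogenetic_network(arcs, {"x", "y", "z"}))
```

In the example, the vertex h is a reticulation (two incoming arcs), so the network has reticulation number 1.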

Problem 1. Is the Hybridization Number problem fixed-parameter tractable (FPT) if the input is an unrestricted set of nonbinary trees and the only parameter is the hybridization number? Hybridization Number is the following problem. Given a finite set X, a collection T of rooted (possibly nonbinary) phylogenetic trees on X and a natural number k, decide if there exists a rooted phylogenetic network on X that displays all trees from T and has reticulation number at most k. See e.g. (van Iersel, Kelk, 2013) for more detailed definitions.

Problem 2. Does there exist a polynomial-time 2-approximation algorithm for MAF on two binary trees? Maximum Agreement Forest (MAF) on two binary trees can be defined as follows. Given a finite set X and two rooted binary phylogenetic trees on X, what is the minimum number of components in a forest on X that can be obtained from each of the input trees by deleting vertices, deleting edges and suppressing indegree-1 outdegree-1 vertices? For a 2.5-approximation see (Shi, You, Feng, 2014).

Problem 3. Is there an FPT algorithm for finding a level-k phylogenetic network consistent with a given dense set of rooted triplets, if k is the parameter? A rooted triplet is a phylogenetic tree with three leaves. A set of rooted triplets is called dense if it contains at least one triplet for each combination of three leaves. A network is level-k if it can be turned into a tree by deleting at most k edges per biconnected component. This problem is known to be solvable in polynomial time if k is fixed, see (Habib and To 2012).
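
The density condition is easy to test directly, which may help when preparing input for this problem. The following is a minimal sketch, assuming that a rooted triplet ab|c is stored as the pair (("a", "b"), "c"); that representation and the function name are illustrative assumptions only.

```python
from itertools import combinations


def is_dense(triplets, leaves):
    """True if every combination of three leaves has at least one triplet."""
    covered = {frozenset((a, b, c)) for (a, b), c in triplets}
    return all(frozenset(three) in covered
               for three in combinations(leaves, 3))


if __name__ == "__main__":
    leaves = ["a", "b", "c", "d"]
    triplets = [(("a", "b"), "c"), (("a", "b"), "d"),
                (("c", "d"), "a"), (("c", "d"), "b")]
    print(is_dense(triplets, leaves))       # all four leaf combinations covered
    print(is_dense(triplets[:3], leaves))   # {b, c, d} has no triplet
```

A dense set on n leaves therefore contains at least n(n-1)(n-2)/6 triplets, and possibly more when several triplets are given for the same three leaves.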

Problem 4. Is Tree Containment polynomial-time solvable or NP-hard for reticulation visible networks? Tree Containment is the problem of deciding if a given phylogenetic network displays a given tree. A phylogenetic network is called reticulation visible if from each reticulation (vertex with indegree greater than one) there exists a path that does not pass through any other reticulations and ends in a leaf. Tree Containment is known to be NP-hard for general networks and for some restricted classes of networks; see (Kanj, Nakhleh, Than, Xia, 2008) and (van Iersel, Semple, Steel 2010).

Problem 5. Is there a constant-factor approximation algorithm for computing the softwired parsimony score of a binary tree-child network and a binary character? Given a network and a character state (0 or 1) for each leaf, the softwired parsimony score is the minimum number of state-changes in any tree (on all leaves) displayed by the network, over all possible assignments of states to the internal vertices. A phylogenetic network is called tree-child if each non-leaf vertex has at least one child that is not a reticulation. This problem does not have a constant factor approximation for general networks or for other (less severely) restricted classes of networks, unless P = NP (Fischer, van Iersel, Kelk, Scornavacca 2013).
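
The tree-child condition in Problem 5 can be checked with a few lines of code. This is a sketch only, assuming the network is supplied as a list of (parent, child) arcs without parallel arcs, and taking a reticulation to be any vertex with more than one incoming arc; the function name is hypothetical.

```python
from collections import defaultdict


def is_tree_child(arcs):
    """True if every non-leaf vertex has a child that is not a reticulation."""
    children = defaultdict(set)
    indegree = defaultdict(int)
    for parent, child in arcs:
        children[parent].add(child)
        indegree[child] += 1
    # Only vertices with outgoing arcs (non-leaves) appear as keys here.
    return all(any(indegree[c] <= 1 for c in kids)
               for kids in children.values())


if __name__ == "__main__":
    one_reticulation = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
                        ("b", "h"), ("b", "z"), ("h", "y")]
    print(is_tree_child(one_reticulation))

    # Here both children of a (and of b) are reticulations.
    two_reticulations = [("r", "a"), ("r", "b"), ("a", "h1"), ("a", "h2"),
                         ("b", "h1"), ("b", "h2"), ("h1", "x"), ("h2", "y")]
    print(is_tree_child(two_reticulations))
```

The first example passes because every internal vertex keeps at least one non-reticulate child; the second fails at vertices a and b.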

Problem 6. Given k > 1, what is the maximum value of p such that for any set of rooted triplets there exists some level-k phylogenetic network on n leaves that is consistent with at least a fraction p of the input triplets? For k = 0 the maximum is p = 1/3 and for k = 1 it is roughly 0.48, see (Byrka, Gawrychowski, Huber, Kelk 2009).

Problem 7. Is there an O(c^n) algorithm for Maximum Acyclic Agreement Forest (MAAF) on two binary phylogenetic trees with c < 2? An acyclic agreement forest is an agreement forest (see above) for which the following directed graph D is acyclic. D has a vertex for each component of the forest and there is an arc from component A to component B if in at least one of the input trees there is a directed path from the root of A to the root of B. It is known that there exists an O*(2^n) algorithm for this problem (van Iersel, Kelk, Lekic, Stougie, 2013).

June 24, 2014


I recently published a post on Evolution and timelines, in which I pointed out that presenting historical data as a timeline is a very poor way of representing an evolutionary history. Evolutionary history is much better presented as a phylogeny, which will be either a tree or a network. However, this does not mean that all histories that are presented as a tree, for example, necessarily represent a phylogeny.

I have encountered a few examples of history-as-tree that seem to have very little connection to a phylogeny. That is, the relationships among the objects are presented along the branches of a tree, but the relationships along the branches seem to be little more than a timeline. So, the whole structure is simply a series of interconnected timelines.

Consider this first example, which is a poster purporting to show for the USA:
the evolution of jazz in its more than one hundred year history. From Archaic to Avant Garde, from blues to bebop, from radio to fusion, from spirituals to swing, from Armstrong to Zawinul, the jazz pedigree presents the diverse history and development of jazz in a clear way.
Perhaps it is the strong central trunk that gives it away as a non-phylogeny. The side-branches do group the jazz performers roughly by genre, but that is all they do. The actual title is a bit more accurate about the content — it is a "Story" rather than a phylogeny.

This poster is accompanied by a European counterpart with an even stronger central trunk. It is labeled as a "Community", but it still claims to "display the history and development of European jazz".

As another example, in 1946, the magazine P.M. published a tree by Ad Reinhardt with a sardonic view of modern American art. [Thanks to Joachim Dagg for alerting me to this example.]

At least there is no central trunk this time, but the clustering of artists along the branches seems to have less to do with phylogenetic history than with artistic genre (and satire). There was a follow-up example 15 years later, in which the sardonic humor plays much the strongest role in the relationships represented.

Finally, here is an example of a timeline that really should be represented using a phylogenetic tree. It is difficult to believe that the group of professions illustrated form a transformational series, as implied by the timeline that is actually shown. Most of the entrepreneur groups depicted actually still exist to this day, rather than being extinct, and so we have here a history of variational evolution, instead of a transformation.

June 22, 2014


Phylogenetics plays no part in games like Trivial Pursuit, but the web offers more opportunities. The Fun Trivia web site, for example, offers a page on Phylogenetics. You should try it, and see how well you do.

The answers (and explanations) are quite good, but the wording of some of the questions leaves a lot to be desired.

June 17, 2014


One possible use of blog posts is as first drafts of ideas that might make their appearance in a refereed publication at a later date. Thus, many of my blog posts have appeared in one form or another in my recent publications. Here I have listed the ones that I can remember using, just in case anyone wants a citable reference for the information in these posts.

A. Morrison DA (2013) Phylogenetic networks are fundamentally different from other kinds of biological networks. In W.J. Zhang (ed.) Network Biology: Theories, Methods and Applications (Nova Science Publishers, New York) pp. 23-68.

    9 Biological versus phylogenetic networks
  13 Network measures and phylogenetic networks
  23 An explanation of graph types
  25 Networks and bootstraps as tree-support criteria
  34 Networks of affinity rather than genealogy
  36 Networks of genealogy
  53 Are mathematical constraints biologically realistic?
  54 Some odd network definitions and terms
  63 Human races, networks and fuzzy clusters
  69 Is this the first network from conflicting datasets?
  70 Why do we still use trees for the Neandertal genealogy?
  72 Networks and most recent common ancestors
  74 Open questions about evolutionary networks, part 1
  75 Open questions about evolutionary networks, part 2
  76 Open questions about evolutionary networks, part 3
  88 When is there support for a large phylogeny?
  90 Explanation of the names for phylogenetic networks
  94 Phylogenetic position of turtles: a network view
  99 How networks differ from bootstrapped trees
107 We should present bayesian phylogenetic analyses using networks
115 Is there a philosophy of phylogenetic networks?

B. Morrison DA (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: in press.

  29 Network analysis of scotch whiskies
  50 Phylogenetic network of the FIFA World Cup
  61 How to interpret splits graphs
101 Distortions and artifacts in Principal Components Analysis analysis of genome data
103 Networks can outperform PCA ordinations in phylogenetic analysis
114 Network analysis of Genesis 1:3
119 Network of ancient Thai bronze Buddha images
134 A network analysis of Simon and Garfunkel
159 Networks and human inter-population variation
172 The acoustics of the Sydney Opera House

C. Morrison DA (2014) Next generation sequencing and phylogenetic networks. EMBnet.journal: Bioinformatics in Action 20: e760.

191 Next Generation Sequencing and phylogenetic networks

D. Morrison DA (2014) Phylogenetic networks: a review of methods to display evolutionary history. Annual Research and Review in Biology 4: 1518-1543.

    2 The first phylogenetic network (1755)
  21 The second phylogenetic network (1766)
  34 Networks of affinity rather than genealogy
  36 Networks of genealogy
  67 Metaphors for evolutionary relationships
  89 Relationship trees drawn like real trees
168 Who first used the term "phylogenetic network"?
182 Affinity networks updated
183 Reticulation patterns and processes in phylogenetic networks
187 What are evolutionary networks currently used for?

E. Morrison DA (2014) Rooted phylogenetic networks for exploratory data analysis. Advances in Research 2: 145-152.

  43 Rooted networks for exploratory data analysis

F. Morrison DA (2014) Is the Tree of Life the best metaphor, model or heuristic for phylogenetics? Systematic Biology 63: 628-638.

  23 An explanation of graph types
  34 Networks of affinity rather than genealogy
  36 Networks of genealogy
  58 Who published the first phylogenetic tree?
  89 Relationship trees drawn like real trees
143 Resistance to network thinking
144 Destroying the Tree of Life?
147 Should phylogenetic modelling proceed from simple to complex or vice versa?
171 Conflicting placental roots: network or tree?
182 Affinity networks updated

June 15, 2014


In a previous blog post (Tattoo Monday VIII), I noted that the usual "March of Progress" image that the general public associates with the concept of "evolution" is originally based on the frontispiece to Thomas Henry Huxley's book Evidence as to Man's Place in Nature (1863. Williams & Norgate, London). A century later, this image was expanded and updated in the book Early Man by the anthropologist Francis C. Howell (1965. Time-Life International, New York) — this picture, with labels, can be viewed here.

What is perhaps less well known is that Ernst Haeckel also made a contribution to this genre. Shown here are the frontispiece and title page of Haeckel’s Natürliche Schöpfungsgeschichte (1868. Verlag von Georg Reimer, Berlin), usually translated as "The History of Creation". This book was Haeckel's attempt to introduce the idea of evolution to the German-speaking general public, after his detailed specialist two-volume book Generelle Morphologie der Organismen (1866. Verlag von Georg Reimer, Berlin). That earlier book was difficult to read, and was also full of invective against doubters and supposed opponents; so a more readable approach was needed (the original text itself was apparently derived from notes taken by one of his students during Haeckel's lectures!).

The frontispiece lithograph (by Gustav Müller) is labeled as "The family group of the Catarrhines". It was notoriously supposed to demonstrate (as explained on page 555 of the book) "the highly important fact" that the "lowest humans" stand "much nearer" to the "highest apes" than to the "highest human". The various images are labeled (from "highest" to "lowest"):
  1. "Indo-German"
  2. "Chinese"
  3. "Fuegian"
  4. "Australian Negro"
  5. "African Negro"
  6. "Tasmanian"
  7. gorilla
  8. chimpanzee
  9. orangutan
  10. gibbon
  11. proboscis monkey
  12. mandrill.
The book was a best seller, and remained in print until the 1920s. Fortunately, the frontispiece was quickly changed. For example, in the 4th edition (1873) the frontispiece was a collage of various calcareous sponges, and in the 8th edition (1889) it was a picture of Haeckel himself (as it also was for the 5th and all subsequent editions). The book actually went through 12 editions, with the number and composition of the figure plates changing several times, in addition to the changes to the frontispiece.

June 10, 2014


Any history can be represented as a timeline, but a timeline diagram does not necessarily show an evolutionary history. Unfortunately, this does not stop people from putting the word "evolution" on their timeline diagrams.

A timeline simply represents the timing of certain events. These events are presumably related in some way, but they do not necessarily refer to the history of a set of objects, or even concepts, as we might expect for an evolutionary history. Here is a classic example of a perfectly valid timeline that refers to a disparate set of objects / concepts.

Apparently we are expected to infer from this timeline that McDonald's attitude to providing the public with information about the nutritional value of their fast-food products has changed over the decades. But the idea that this changed attitude might involve some sort of evolutionary process is stretching an analogy a bit too far. The timeline certainly represents a journey, as claimed, but not an evolutionary one.

For most members of the general public, "evolution" is a story of the transformation of some object or idea through time, with each stage replacing the previous one. This is a simple story with a beginning, a middle and (possibly) an end. The story can usually be presented as a timeline, of course, with each stage of the transformation arranged in the correct time order. For a biologist, this is a transformation series, representing "transformational evolution", which follows the history of a single lineage through time (ie. a history chain).

There are plenty of examples of this use of a timeline to represent transformational evolution. For instance, consider corporate logos, such as those of these two well-known beverage manufacturers. Each new logo replaced the previous one, thus providing an analogy to evolution of a single object.

The word "evolution" as used here is not one that a biologist would use, but many other people would do so. Bank notes in the USA show a similar phenomenon — in this case, the people involved appear to get younger through time! [The same thing happens on the $100 bill, as well.]

We can even take the idea of transformational evolution and use it for prediction, as was done by Takeshi Fukuda in 2002:

However, biologists do not see the evolution of organisms in this way, at all. For them, evolution is a process of variation, with lots of new forms appearing and some old ones disappearing. So, rather than an ordered series of forms, each one replacing the previous one through time, biologists see an increasing diversity of forms that is counter-acted by loss of forms (ie. extinction). This is "variational evolution" rather than transformational evolution.

Variational evolution is usually represented using a phylogeny, which will be a network or a tree, depending on the particular history, rather than a timeline chain. A phylogeny shows the relationships among a wide variety of objects, many of which will exist (or have existed) at the same time. There may have been replacement of some objects by others, but in general it is the diversity of objects existing at the same time that is of principal interest.

The issue here is that a timeline is a poor way of representing variational evolution. A timeline enforces a linear ordering of relationships, solely because "time's arrow" has one direction only. But a linear temporal order cannot reflect the complex evolutionary relationships among the objects.

Consider this example from McDonald's in Canada. There is a clear timeline here but it does not refer to transformational evolution — instead, it refers to variational evolution. These breakfast items have not necessarily replaced each other, and thus their evolutionary relationships are more complex than can be represented by a timeline.

Indeed, many of these breakfast items are still on the menu today, including: Egg McMuffin, Scrambled Eggs, Hash Browns, Hot Cakes and Sausage, Sausage McMuffin, Sausage McMuffin with Egg, Breakfast Burritos (Sausage), Bagel (Bacon, Egg & Cheese; Steak, Egg & Cheese), and the Fruit 'N Yoghurt Parfait.

Here is another seemingly simple image from McDonald's but with the same complexity problem — it is variational not transformational.

And finally, here is a much more complex history from Apple computers:

A timeline shows the timing of certain events, which do not necessarily involve replacement. It might be a useful way to represent transformational evolution, but it is a poor way to represent variational evolution. A phylogeny is much more appropriate.
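
The distinction can be made concrete with a small check, sketched here under the assumption that each item on a timeline has a known first and last year of existence. If no two items ever coexist, a chain is an adequate picture; if they overlap, the timeline is hiding variational structure. The item names and years below are hypothetical placeholders, not data from any of the diagrams discussed.

```python
def coexisting_pairs(lifespans):
    """Return pairs of items whose (first_year, last_year) ranges overlap.

    lifespans maps an item name to (first_year, last_year); a last_year
    of None means the item still exists.
    """
    items = sorted(lifespans.items(), key=lambda kv: kv[1][0])
    pairs = []
    for i, (name_a, (_start_a, end_a)) in enumerate(items):
        for name_b, (start_b, _end_b) in items[i + 1:]:
            # Items are sorted by start year, so b starts no earlier than a;
            # they coexist if a is still around when b first appears.
            if end_a is None or start_b < end_a:
                pairs.append((name_a, name_b))
    return pairs


def is_transformational(lifespans):
    """True if every item had been replaced before the next one appeared."""
    return not coexisting_pairs(lifespans)


# Hypothetical examples: a logo series (each replaces the last) and a
# menu (items accumulate and mostly persist).
logos = {"logo 1": (1900, 1950), "logo 2": (1950, 1990), "logo 3": (1990, None)}
menu = {"item 1": (1972, None), "item 2": (1977, None), "item 3": (1985, 1995)}

print(is_transformational(logos))  # no overlaps, so a chain suffices
print(is_transformational(menu))   # overlaps, so a chain is misleading
```

A check like this only says whether a chain is adequate; it says nothing about which items are related to which, and that is the information a phylogeny adds.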

June 8, 2014


Some of you may have noticed that the Who is Who in Phylogenetic Networks database is currently down while it is being moved to a new server. It will soon be back, but in the meantime here is a graph showing the number of publications in the database in its most recent iteration. The data are grouped according to their authors, with the top 20 most prolific authors separated on the left. The next 9 authors are included solely to get myself onto the graph.

There are no real surprises here; and there are plenty of other authors in the database with fewer publications to their name. The database will be updated when the web site returns, at which point I might update this graph.

June 3, 2014


Over the past two years or so of blogging, I have presented a number of empirical examples in which I have used splits graphs as general multivariate data summaries, rather than using them for the analysis of what we might call strictly phylogenetic data. I have listed these analyses at the bottom of this post.

There have been two reasons for doing these analyses. First, I wish to emphasize that unrooted networks are a form of data display rather than being evolutionary diagrams. That is, they do not display evolutionary history, in the same manner as is intended for rooted phylogenetic trees, for example. Unrooted networks can be a valuable tool for exploring phylogenetic data, but they do not display a phylogeny. They are a form of exploratory data analysis.

Second, these networks form part of a much larger class of methods for the analysis of multivariate data. Indeed, I believe that they are a very valuable part of this class. One way to illustrate this has been to analyze a whole series of datasets that have little to do with phylogenetic analysis. That is, the data are not necessarily even related to a historical trend. This illustrates just what can be done with these methods.

I have now formalized this point of view in a peer-reviewed publication:
Morrison DA (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4(4): in press (Online Early). doi:10.1002/widm.1130

Abstract:
Exploratory data analysis (EDA) involves both graphical displays and numerical summaries of data, intended to evaluate the characteristics of the data as well as providing a form of data mining. For multivariate data, the best-known visual summaries include discriminant analysis, ordination and clustering, particularly metric ordinations such as Principal Components Analysis. However, these techniques have limiting mathematical assumptions that are not always realistic. Recently, network techniques have been developed in the biological field of phylogenetics that address some of these limitations. They are now widely used in biology under the name phylogenetic networks, but they are actually of general applicability to any multivariate dataset. Phylogenetic networks are fast and relatively easy to calculate, which makes them ideal as a tool for EDA. This review provides an overview of the field, with particular reference to the use of what are called splits graphs. There are several types of splits graph, which summarize the multivariate data in different ways. Example analyses are presented based on the neighbor-net graph, which seems to be the most generally useful of the available algorithms. This should encourage the more widespread use of these networks whenever a summary of a multivariate dataset is required.

If you don't have subscription access to the journal, you can contact me for a PDF copy.

Blog posts with multivariate data summaries:

Datasets involving temporal patterns

Network analysis of Genesis 1:3
Network of ancient Thai bronze Buddha images
Language history and language weirdness
Pacific rock art - ordinations and networks
The network history of the Carnival of Evolution
The rise and fall of "David"

Datasets with no phylogenetic pattern

Eurovision Song Contest 2006: a network analysis
Network analysis of scotch whiskies
Network analysis of Bordeaux wine critics
Network analysis of Bordeaux wine critics II
A network analysis of Médoc wines
Eurovision Song Contest 2012: a network analysis
Phylogenetic network of the FIFA World Cup
Astrocladistics: a network analysis
Network analysis of McDonald's fast-food
Is there good and bad fast-food?
The mysterious rankings in Forbes' Celebrity 100
Network analysis of Michelin starred restaurants
Network analysis of New York neighborhoods
A network analysis of Simon and Garfunkel
Network analysis of Manhattan apartment buildings
A network analysis of the Bundesliga
Networks of the "Sight & Sound" film polls
A network analysis of London's theatres in 1965
The acoustics of the Sydney Opera House
A network of New Zealand's livestock regions
A network analysis of airplane disasters
World ice hockey champions — a network
Fast-food maps — a network analysis
Single-malt scotch whiskies — a network
Which cars are good, really?
The Netherlands is more than just tulips and sea-dykes
Automated natural language processing
Cancer rates and diagnosis

Theoretical considerations

Distortions and artifacts in Principal Components Analysis analysis of genome data
Networks can outperform PCA ordinations in phylogenetic analysis
Multivariate data displays are not always necessary

June 1, 2014


Most people don't want to think about cancer, but everyone over the age of 40 should do so, and do so regularly. This is because early diagnosis dramatically increases your chance of survival, and early diagnosis is almost entirely up to you: you will be the first person to notice the symptoms. To put it bluntly, you can ignore the first signs of cancer and thus live for another five years or so, or you can go to a doctor and thus live for another twenty. The choice is yours, not Fate's.

Cancer is a disease of old age. Back in the dim dark past, people usually lived only until about 40 years of age, and so cancer was not a big problem. It is doubtful that it was a major cause of death amongst humans. But as we slowly but surely have increased our life expectancy, cancer has become more and more of an issue. For example, for people in the USA in 2010, cancer was the No. 1 cause of death for people aged between 40 and 80, as shown in this table. Indeed, you will note that for females cancer was in one of the top two spots for all age groups.

The incidence of cancer varies dramatically among different organs, and this variation has itself changed over time. In the USA, lung cancer has been the biggest cause of cancer deaths for males since the 1950s and for females since the 1980s, as shown in the next graph. In both cases the death rate has decreased since the recent active attempts to reduce the smoking epidemic. [Note: medically it is considered to be an epidemic, just as obesity is currently considered an epidemic in the Western world.]

The second biggest cause of cancer death for males is prostate cancer, and for females it is breast cancer, followed in both cases by cancer of the colon and/or rectum. In all of these cases the death rate is decreasing through time, apparently due partly to changes in risk factors but, most importantly, to the introduction of screening.

Screening is particularly important for cancer of the reproductive organs. There is only one "major" cancer-related organ for males (the prostate), but for females there are three: the breasts, the uterus and the ovaries. As you can see in the next graph, the stage at which the cancer is first diagnosed varies quite a lot among these organs.

To explain the stages: (i) localized means that the cancer is confined to one part of the organ concerned; (ii) regional means that the cancer has spread throughout the organ; and (iii) distant means that it has spread from there to other nearby organs. For effective treatment, and therefore maximum probability of survival, the cancer needs to be detected before it has reached the third stage. You will note that this is particularly problematic for ovarian cancer, which is usually not diagnosed until this stage.

Interestingly, death rates due to cancer are not randomly distributed in space, as shown in the next graph for the states of the USA. The data analyzed were the death rates (per 100,000 people) for 2006-2010 for the most common types of cancer (breast, colorectum, lung, non-Hodgkin lymphoma, pancreas, prostate). I used the Manhattan distance to evaluate the multivariate relationships in the data, and displayed the result using a NeighborNet.
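
As a rough illustration of the distance step only, the sketch below computes pairwise Manhattan distances (the sum of absolute differences across variables) between the rows of a small table. The region labels and rates are invented placeholders, not the values analyzed here, and the function names are my own. Building the NeighborNet from the resulting matrix would still need to be done in dedicated network software.

```python
from itertools import combinations


def manhattan(u, v):
    """Sum of absolute differences between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(table):
    """Return {(name1, name2): distance} for every unordered pair of rows."""
    return {
        (p, q): manhattan(table[p], table[q])
        for p, q in combinations(sorted(table), 2)
    }


# Hypothetical death rates per 100,000 for three cancer types in three
# placeholder regions; these numbers are invented for illustration only.
rates = {
    "Region A": [20.0, 15.0, 50.0],
    "Region B": [22.0, 16.0, 60.0],
    "Region C": [18.0, 10.0, 30.0],
}

dists = distance_matrix(rates)
for pair, d in dists.items():
    print(pair, d)
```

Because the Manhattan distance simply adds up the differences, a variable with a large numerical range (such as lung cancer) will contribute more to the total than one with a small range, unless the variables are rescaled first.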

The graph shows the relationships among the different states. States near each other in the network have similar death rates for the different cancers, while states further apart are progressively more different from each other.

The states mostly form a gradient of increasing cancer death rates from the top-left towards the bottom of the network. Utah stands out because it has by far the lowest death rates for colorectum, lung, and pancreas cancers amongst both men and women. DC stands out because its rate of prostate cancer deaths is nearly twice that of the other locations, presumably due to the distinctly older-male biased population of Washington city.

Finally, we can look at some international data. These data are solely for ovarian cancer, involving the 1-year survival rate after diagnosis. The data refer to survival in three different age classes of women (15-49, 50-69, 70-99 years old) for the three different stages at diagnosis. They were analyzed in the same way as above to produce the network.

On average, the Canadian survival is the highest, followed by the Norwegians, with the females from NSW (in Australia) faring the worst (particularly in the oldest age group). However, these data are rather mixed. For example, the Danish survival is worse than average for the oldest age group in the localized stage but better than average in the regional stage.

There is clearly a long way to go in the diagnosis and treatment of cancers.

Sources of data

Siegel R, Ma J, Zou Z, Jemal A (2014) Cancer Statistics, 2014. CA: A Cancer Journal for Clinicians 64: 9-29.

Maringe C, Walters S, Butler J, Coleman MP, Hacker N, Hanna L, Mosgaard BJ, Nordin A, Rosen B, Engholm G, Gjerstorff ML, Hatcher J, Johannesen TB, McGahan CE, Meechan D, Middleton R, Tracey E, Turner D, Richards MA, Rachet B, ICBP Module 1 Working Group (2012) Stage at diagnosis and ovarian cancer survival: evidence from the International Cancer Benchmarking Partnership. Gynecologic Oncology 127: 75-82.

[Declaration of interest: I have had skin cancer for nearly 30 years, which is a product of growing up in Australia, the country with the highest rate of skin cancer in the world; and recently three female members of my family have been diagnosed with cancer of their reproductive organs. So, I'm thinking about it even if you haven't been.]

May 27, 2014


Complex networks are found in all parts of biology, graphically representing biological patterns and, if they are directed networks, also their causal processes. Directed networks are currently used to model various aspects of biological systems, such as gene regulation, protein interactions, metabolic pathways, ecological interactions, and evolutionary histories.

Two types of networks can be distinguished, and this distinction seems to me to be very important. Most networks are what might be called observed networks, in the sense that the nodes and edges represent empirical observations. For example, a food web consists of nodes representing animals with connecting edges representing who eats whom. Similarly, in a gene regulation network the genes (nodes) are connected by edges showing which genes affect the functioning of which other genes. In all cases, the presence of the nodes and edges in the graph is based on experimental data. These are collectively called interaction networks or regulation networks.

However, when studying historical patterns and processes not all of the nodes and edges can be observed. So, instead, they are inferred as part of the data-analysis procedure. That is, we infer the patterns as well as the processes; and we can call these inferred networks. In this case, the empirical data may consist solely of the leaf nodes, and we infer the other nodes plus all of the edges. For example, every person has two parents, and even if we do not observe those parents we can infer their existence with confidence, as we also can for the grandparents, and so on back through time with a continuous series of ancestors. Alternatively, we may also observe some of the internal nodes of the network, such as when we do record the parents and grandparents because they are contemporaneous (ie. their generations overlap). This type of pattern can be represented as a genealogical network, when referring to individual organisms, or a phylogenetic network when referring to groups (populations, species, or larger taxonomic groups).

What, then, are the things often referred to as "evolutionary networks" but which are clearly not phylogenetic networks? They are of the first type, the interaction networks. In an evolutionary network the observed nodes are directly connected to each other to represent some aspect of evolution. This aspect may have some component of phylogeny to it, but there is more to the study of evolution than solely phylogenetic history.

For example, directed LGT (dLGT) networks connect nodes representing contemporary organisms with edges that represent inferred lateral gene transfer. That is, the evolutionary networks show gene sharing. This is obviously related to the phylogeny of the organisms, but the network does not display the phylogeny itself. This first example (from Ovidiu Popa, Einat Hazkani-Covo, Giddy Landan, William Martin, Tal Dagan. 2011. Directed networks reveal genomic barriers and DNA repair bypasses to lateral gene transfer among prokaryotes. Genome Research 21: 599-609) shows "32,028 polarized lateral recipient–donor protein-coding gene transfer events" inferred from "the completely sequenced genomes of 657 prokaryote species".

The concept of a gene-sharing network as an evolutionary network has also been applied to viruses and their relatives, for example, as shown by this next diagram (from Natalya Yutin, Didier Raoult, Eugene V Koonin. 2013. Virophages, polintons, and transpovirons: a complex evolutionary network of diverse selfish genetic elements with different reproduction strategies. Virology Journal 10: 158).

The question, then, is what to make of diagrams that combine both a phylogenetic tree and this type of evolutionary network, such as is done in the Minimal Lateral Network. This next example is from linguistics rather than biology (from Johann-Mattis List, Shijulal Nelson-Sathi, Hans Geisler, William Martin. 2013. Networks of lexical borrowing and lateral gene transfer in language and genome evolution. Bioessays 36: 141-150), and it superimposes the sharing network and the phylogenetic tree. (For a discussion in the context of LGT, see also Tal Dagan. 2011. Phylogenomic networks. Trends in Microbiology 19: 483-491).

In this diagram, the tree explicitly represents the phylogenetic history of the languages while the evolutionary network represents possible borrowings of words, with thicker lines representing more borrowed words. Clearly, the network also contains phylogenetic information of some sort. For example, the connection of the root of the Romance languages to English reflects the conquest of Britain by the French-speaking Normans, which modified the Old-German heritage of Old English. However, the diagram as a whole is a hybrid, rather than being a coherent phylogenetic network in the simplest sense (ie. a reticulation network).

To see this clearly, note that the phylogenetic tree is not fully resolved and that the evolutionary network does suggest possible resolutions for several of the polychotomies, such as the relationship of Armenian and Greek, the relationship of Albanian to the Romance languages, and the relationship of the Gaelic languages to the Romance languages. So, in some cases the evolutionary network helps resolve the phylogenetic tree rather than forming a reticulating network.

It would be possible to derive a phylogenetic network from this minimal lateral network, but as it stands it is a combination of a phylogenetic tree and a so-called evolutionary network.
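
The point about unresolved nodes can be sketched in code, assuming the diagram is stored as two separate pieces: a tree (each node pointing to its parent) and a list of weighted lateral edges laid over it. The sketch finds nodes with more than two children and lists the lateral edges that join two of those children, heaviest first, as hints for a possible resolution. This is my own toy representation with hypothetical labels and weights; it is not the method used to build the Minimal Lateral Network.

```python
from collections import defaultdict


def polytomies(parent):
    """Return {node: [children]} for nodes with more than two children."""
    children = defaultdict(list)
    for child, par in parent.items():
        if par is not None:
            children[par].append(child)
    return {n: sorted(c) for n, c in children.items() if len(c) > 2}


def suggested_resolutions(parent, lateral_edges):
    """For each polytomy, list the lateral edges joining two of its children.

    Edges are returned heaviest first; each one hints that the pair it
    joins might group together if the node were resolved.
    """
    hints = {}
    for node, kids in polytomies(parent).items():
        kid_set = set(kids)
        inside = [e for e in lateral_edges if e[0] in kid_set and e[1] in kid_set]
        hints[node] = sorted(inside, key=lambda e: -e[2])
    return hints


# Toy tree (child -> parent) with an unresolved root, plus lateral edges
# given as (taxon, taxon, number of shared items). All values are hypothetical.
parent = {"root": None, "A": "root", "B": "root", "C": "root", "D": "root"}
lateral = [("A", "B", 5), ("C", "D", 2), ("A", "D", 9)]

hints = suggested_resolutions(parent, lateral)
print(hints)
```

Keeping the tree and the lateral edges in separate structures mirrors the hybrid nature of the diagram: the lateral edges are an overlay, and nothing here turns them into reticulation nodes of a phylogenetic network.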

May 25, 2014


We haven't had any phylogenetic tree tattoos on this blog for a while, so here is a new collection of Charles Darwin's best-known sketch from his Notebooks (the "I think" tree) (for other examples, see Tattoo Monday III, Tattoo Monday V, and Tattoo Monday VI).