The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis



August 19, 2014


Phylogeneticists treat the tree image as having special meaning for themselves. Conceptually, the tree is used as a metaphor for phylogenetic relationships among taxa, and mathematically it is used as a model to analyze phenotypic and genotypic data to uncover those relationships. Irrespective of whether this metaphor / model is adequate or not, it has a long history as part of phylogenetics (Pietsch 2012). Of particular interest has been Charles Darwin's reference to the "Tree of Life" as a simile, since that is clearly the key to the understanding of phylogenetics by the general public.

The principle on which phylogenetic trees are based seems to be the same as that for human genealogies. That is, phylogenies are conceptually the between-species homolog of within-species genealogies. As far as Western thought is concerned, human genealogies make their first important appearance in the Bible, with a rather specific purpose. The Bible contains many genealogies, mostly presented as chains of fathers and sons. For example, Genesis 5 lists the descendants of Adam+Eve down to Noah and his sons, which can be illustrated as a pair of chains (as shown in the first figure); and the rest of Genesis gets from there down to Moses' family, for which the genealogy can be illustrated as a complex tree.

The genealogy as listed in Genesis 5.
Cain's lineage was terminated by the Flood.
However, the theologically most important genealogies are those of Jesus, as recorded in Matthew 1:2-16 and Luke 3:23-38. Matthew apparently presents the genealogy through Joseph, who was Jesus' legal father; and Luke apparently traces Jesus' bloodline through Mary's father, Eli. These two lineages coalesce in David+Bathsheba, and from there they have a shared lineage back to Abraham. Their importance lies in their attempt to substantiate that Jesus' ancestry fulfils the biblical prophecies that the Messiah would be descended from Abraham (Genesis 12:3) through Isaac (Genesis 17:21) and Jacob (Genesis 28:14), and that he would be from the tribe of Judah (Genesis 49:8), the family of Jesse (Isaiah 11:1) and the house of David (Jeremiah 23:5).

That is, these genealogies legitimize Jesus as the prophesied Messiah. Following this lead, subsequent use of genealogies has commonly been to legitimize someone as a monarch, so that royal genealogies have been of vital political and social importance throughout recorded history (see the example in the next figure). This importance was not lost on the rest of the nobility, either, so that documented genealogies of most aristocratic families allow us to identify the first-born son of the first-born son, etc, and thus legitimize claimants to noble titles — genealogies are a way for nobles to assert their nobility.

The genealogy of the current royal family of Sweden.
The lineage of the recent monarchs is highlighted as a chain, with an aborted side-branch dashed.
If we focus solely on the line of descent involved in legitimization, then genealogies can be represented as a chain (as shown in the genealogy above). If we include the rest of the paternal lines of descent, then family genealogies can be represented as a tree. If we also include some or all of the maternal lineages, then family genealogies can be represented as a network. For example, the biblical genealogies only rarely name women, but where females are specifically named the genealogies actually form a reticulated network. Jacob produced offspring with both Rachel and Leah, who were his first cousins; and Isaac and Rebekah were first cousins once removed. Even Moses was the offspring of parents who were, depending on the biblical source consulted, either nephew-aunt, cousins, or first cousins once removed. These relationships cannot be represented in a tree. (See also the genealogy of the Spanish branch of the Habsburgs, who were kings of Spain from 1516 to 1700.)
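The difference between a tree and a network can be made concrete by storing a genealogy as a map from each child to its recorded parents. The following is a minimal sketch in Python, not a standard tool: the function names are made up for illustration, and the small pedigree assumes the usual reading of Genesis in which Rebekah and Laban are both children of Bethuel, so that Jacob and Rachel are first cousins.

```python
# Child -> recorded parents, father listed first.
# Recording the mothers is what turns the tree into a network.
PARENTS = {
    "Rebekah": ["Bethuel"],
    "Laban": ["Bethuel"],
    "Jacob": ["Isaac", "Rebekah"],
    "Rachel": ["Laban"],
    "Joseph": ["Jacob", "Rachel"],
}


def count_paths(ancestor, person, parents=PARENTS):
    """Number of distinct lines of descent from ancestor down to person."""
    if person == ancestor:
        return 1
    return sum(count_paths(ancestor, p, parents) for p in parents.get(person, []))


def is_reticulate(person, parents=PARENTS):
    """True if some ancestor reaches person by more than one line of descent."""
    ancestors = set()
    stack = list(parents.get(person, []))
    while stack:
        a = stack.pop()
        if a not in ancestors:
            ancestors.add(a)
            stack.extend(parents.get(a, []))
    return any(count_paths(a, person, parents) > 1 for a in ancestors)


# Keep only the first listed parent (the father) for every child.
paternal_only = {child: ps[:1] for child, ps in PARENTS.items()}

print(count_paths("Bethuel", "Joseph"))        # 2 lines of descent
print(is_reticulate("Joseph"))                 # True: a network is needed
print(is_reticulate("Joseph", paternal_only))  # False: a tree suffices
```

With only the paternal lines recorded, every person has a single line of descent from each ancestor; once the mothers are added, Bethuel reaches Joseph twice, which is exactly the reticulation that a tree cannot show.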

This idea of genealogical networks was straightforward to transfer from humans to other species. Originally, biologists stuck pretty much to the idea of a chain of relationships among organisms, as presented in the early part of Genesis. Human genealogies were traced upwards to Adam and from there to God, and thus species relationships were traced upwards to God via humans. However, by the second half of the 1700s both trees and networks made their appearance as explicit suggestions for representing biological relationships. In particular, Buffon (1755) and Duchesne (1766) presented genealogical networks of dog breeds and strawberry cultivars, respectively.

However, these authors did not take the conceptual leap from within-species genealogies to between-species phylogenies. Indeed, they seem to have explicitly rejected the idea, confining themselves to relationships among "races". It was Charles Darwin and Alfred Russel Wallace, a century later, who first took this leap, apparently seeing the evolutionary continuum that connects genealogies to phylogenies. In this sense, they both took ideas that had been "in the air" for several decades, but previously applied only within species, and applied them to the origin of species themselves. [See note below.] Both of them, however, confined themselves to genealogical trees rather than using networks. It seems to me that it was Pax (1888) who first put the whole thing together, and produced inter-species phylogenetic networks (along with some intra-species ones).

In this sense, the biblical Tree of Life has only a peripheral relevance to phylogenetics. Darwin used it as a rhetorical device to arouse the interest of his audience, but it was actually the biblical genealogies that were of most practical importance to his evolutionary ideas. Apart from anything else, the original biblical tree was actually the lignum vitae (Tree of Eternal Life) not the arbor vitae (Tree of Life). Similarly, the tree from which Adam and Eve ate the forbidden fruit was the lignum scientiae boni et mali (Tree of Knowledge of Good and Evil), not the arbor scientiae (Tree of Knowledge) that was subsequently used as a metaphor for human knowledge.

Note. As with phylogenetic trees, Darwin and Wallace did not actually originate the idea of natural selection, which had previously been discussed by people such as James Hutton (1794), William Charles Wells (1818), Patrick Matthew (1831), Edward Blyth (1835) and Herbert Spencer (1852). However, this had been in relation to within-species diversity, whereas Wallace and Darwin applied the idea to the origin of between-species diversity (i.e. the origin of new species).


Buffon G-L de. 1755. Histoire naturelle générale et particulière, tome V. Paris: Imprimerie

Duchesne A.N. 1766. Histoire naturelle des fraisiers. Paris: Didot le Jeune & C.J. Panckoucke.

Pax F.A. 1888. Monographische übersicht über die arten der gattung Primula. Bot. Jahrb. Syst. Pflanzeng. Pflanzengeo. 10:75–241.

Pietsch T.W. 2012. Trees of life: a visual history of evolution. Baltimore: Johns Hopkins University Press.

August 17, 2014


These illustrations are from Alper Uzun's Biocomicals web site.

Bioinformaticians' dream

Bioinformaticians' reality

August 12, 2014


Sampling bias refers to a statistical sample that has been collected in such a way that some members of the intended statistical population are less likely to be included than are others. The resulting biased sample does not necessarily represent the population (which it should), because the population members were not all equally likely to have been selected for the sample.

This affects scientific work because all scientific questions are about the population not the sample (i.e. we infer from the sample to the population), and we can only answer these questions if the samples we have collected truly represent the populations we are interested in. That is, our results could be due to the method of sampling but erroneously be attributed to the phenomenon under study instead. Bias needs to be accounted for, but it cannot be assessed by looking at the sampled data alone. Bias can only be addressed via the sampling protocol itself.
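A small worked example may help to show why unequal inclusion probabilities matter. The sketch below is illustrative only: the population values and the weighting scheme are hypothetical, and the function names are not from any library. It compares the true population mean with the expected value of a sampled member when larger members are proportionally more likely to be picked.

```python
def population_mean(values):
    """Mean of the whole population."""
    return sum(values) / len(values)


def expected_sample_mean(values, inclusion_weights):
    """Expected value of one sampled member when member i is picked
    with probability proportional to inclusion_weights[i]."""
    total = sum(inclusion_weights)
    return sum(v * w for v, w in zip(values, inclusion_weights)) / total


lengths = [1, 2, 3, 4, 10]   # hypothetical population (e.g. fragment lengths)
equal = [1] * len(lengths)   # every member equally likely to be sampled
by_size = lengths            # longer members proportionally more likely

print(population_mean(lengths))                # 4.0
print(expected_sample_mean(lengths, equal))    # 4.0
print(expected_sample_mean(lengths, by_size))  # 6.5
```

Nothing in a sample drawn under the second scheme reveals that its average is inflated; the discrepancy only becomes visible when the inclusion probabilities (the sampling protocol) are known.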

In genome sequencing, sampling bias is often referred to as ascertainment bias, but clearly it is simply an example of the more general phenomenon. This is potentially a big problem for next generation sequencing (NGS) because there are multiple steps at which sampling is performed during genome sequencing and assembly. These include the initial collection of sequence reads, assembling sequence reads into contigs, and the assembly of these into orthologous loci across genomes. (NB: For NGS technologies, sequence reads are of short lengths, generally

August 10, 2014


In many games of chance the odds of winning or losing remain constant during play, such as in the street coin-game Two-Up and for the casino Roulette wheel. At the other extreme, the odds of winning are sometimes determined by the players to a much greater extent, such as in the card game Poker. This is why poker is such a popular form of gambling — all players are under the delusion that the advantage lies with them alone.

In between these extremes, there are games of chance where the odds of winning vary depending on the circumstances. If a player can identify these circumstances, then they can increase their wagers when the circumstances are favorable and decrease them when they are unfavorable, thus maximizing their chances of making a profit. This is called Advantage Gambling, and it is amenable to formal mathematical analysis. These analyses have kept a number of mathematicians gainfully employed over the centuries.

Some well-known examples of advantage gambling are the use of Arbitrage Bets in sports betting, and of Card Counting in card games. This blog post is about the latter, especially as applied to the casino card-game of Blackjack. [There are also many similar games played both inside and outside casinos, such as Twenty-one, Vingt-et-un, Spanish 21, Pontoon, etc.]

In blackjack the player is betting their card hand against that of the dealer (not any other player). The basic idea is to be dealt a hand of cards whose face values sum to a final score that is higher than that of the dealer's hand without exceeding a sum of 21. There are many variants throughout the world, although they tend to be minor variations on a single basic theme (as described by Wikipedia). In general, the dealer follows a strict set of rules specifying how many cards they can be dealt, while the player has a free choice regarding their own hand.

Clearly, the composition of the cards being dealt must change throughout a series of hands being dealt, because the deck of cards (or more usually several decks) gradually becomes exhausted. If the cards have been shuffled so that the random order of the cards is very even then there will be little change in composition through time, but if the random order is clustered (as it can be by random chance) then the composition of the cards remaining to be dealt may favor either the dealer or the player.

This favoritism happens because the dealer has to follow a fixed strategy, and certain cards favor that strategy. In particular, the dealer must always be dealt another card when their hand sums to a total in the range 12-16 (and sometimes 17). If the card dealt is a 10, J, Q or K (all of which have a value of ten) then the dealer's sum will exceed 21, and the player will win. Thus, if there is a high proportion of these cards remaining in the deck then the dealer is at a disadvantage relative to the player, who can choose not to take the extra card. On the other hand, if there is a high proportion of low cards remaining (especially 4s, 5s and 6s) then the dealer will not be disadvantaged.

In general play, the casino dealer will have an advantage of 0.5-1%, depending on the precise rules of play and how many decks of cards are in use simultaneously. So, in the long term the casino will make a profit, which is why they are in the gambling business in the first place. However, they make a smaller profit from blackjack than from any of their other games (for example, in roulette the casino's advantage is usually 5.3% in the USA and 2.7% in Europe), and this means that for blackjack the advantage gambler doesn't have to move the advantage very far for it to be in their favor instead of the casino's.

There is a Basic Strategy in blackjack, which stipulates what the gambler should do when their hand has any specified total against that of the dealer; that is, whether they should Stand, Hit, Double Down, or Split. This was first explained by Roger Baldwin, Wilbert E. Cantey, Herbert Maisel and James P. McDermott in 1956 (Optimum strategy in blackjack. Journal of the American Statistical Association 51: 429-439); and Wikipedia provides a simple exposition. For the gambler, this strategy will lose the least amount of money to the casino in the long term (i.e. lose only the 0.5-1% referred to above), as determined by mathematical analysis.

The advantage gambler wants to change these odds. The most common advantage play for blackjack is card counting, and it can change the advantage to be up to 2% in the gambler's favor. The essential idea is to keep a running track of whether the remaining undealt cards are biased towards small values (2, 3, 4, 5, 6) or large values (10, J, Q, K, A). To do this, a pre-specified value is added to the running total for each of the small cards that have already been dealt (and therefore can't still be in the deck), and a pre-specified value is subtracted for each of the large cards. The value of the running count will then indicate how much the advantage is in favor of the gambler. The gambler can then bet according to the size of their advantage.
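As an illustration, here is a minimal sketch of a running count in Python. It assumes the commonly published Hi-Lo point values (+1 for 2 to 6, 0 for 7 to 9, -1 for 10, J, Q, K and A); the names `HI_LO` and `running_count` are invented for this example, and other systems would simply swap in a different table of points.

```python
# Assumed Hi-Lo point values: low cards count +1, high cards count -1.
HI_LO = {
    **{rank: +1 for rank in ("2", "3", "4", "5", "6")},
    **{rank: 0 for rank in ("7", "8", "9")},
    **{rank: -1 for rank in ("10", "J", "Q", "K", "A")},
}


def running_count(dealt_cards, points=HI_LO):
    """Sum the points of every card seen so far.

    A positive total means more low cards than high cards have left the
    deck, so the undealt cards are relatively rich in tens and aces.
    """
    return sum(points[card] for card in dealt_cards)


seen = ["2", "5", "K", "6", "9", "3", "A", "4"]
print(running_count(seen))  # 3

# Counting a complete 52-card deck returns to zero (a "balanced" system).
full_deck = list(HI_LO) * 4
print(running_count(full_deck))  # 0
```

In practice the gambler does this sum in their head, one card at a time, and raises their wager as the count climbs.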

There is nothing unique about this: "anyone who aspires to play Bridge, Stud Poker, Rummy, Gin, Pinochle, or Go Fish knows that you must keep track of the played cards" (Norman Wattenberger. 2009. Modern Blackjack: an Illustrated Guide to Blackjack Advantage Play). It requires no especial mathematical ability, although you do have to pay attention, and not forget what your count currently is (this is far simpler than playing bridge, where to play well you need to keep precise information about the remaining cards). Blackjack has apparently increased in popularity over the last 40 years because it is one of the few casino games that can consistently be won using expert play (maybe also video poker). However, the casinos will, not unexpectedly, try to stop you from winning via card counting.

The idea of counting cards in blackjack has been around since at least the 1950s, but the first popular text on the subject was Edward O. Thorp's book Beat the Dealer: a Winning Strategy for the Game of Twenty-One (1962). Since then, oodles of card counting systems have been devised, which differ in how many points are to be added to or subtracted from the running total for each card that is dealt. They range from relatively easy to implement to unnecessarily difficult.

We can look at the relationship between the different counting systems using a phylogenetic network. The data for 24 of these systems are available at Norman Wattenberger's Card Counting page (see also Popular Card Counting Strategies). The above graph is a NeighborNet (based on the Manhattan distance) of these data. Systems near each other in the network have a similar assignment of points to cards, while systems further apart are progressively more different from each other. The network shows a simple trend of increasing complexity of the systems from the top-right to the bottom-left. [Note that some of the systems use the same points, and thus appear at the same place in the network, but these do differ in other ways.]
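The distance calculation behind such a network is simple enough to sketch. The Python below computes pairwise Manhattan distances between three systems, using point values as they are commonly published (these are assumptions for illustration and should be checked against Wattenberger's table; they are not the dataset used for the graph). The NeighborNet itself is not reproduced here, since that step needs dedicated software.

```python
from itertools import combinations

# One value per rank; "10" stands for 10, J, Q and K.
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "A"]

# Assumed point assignments, as commonly published for these systems.
SYSTEMS = {
    "Hi-Lo":    [1, 1, 1, 1, 1, 0, 0, 0, -1, -1],
    "K-O":      [1, 1, 1, 1, 1, 1, 0, 0, -1, -1],
    "Hi-Opt I": [0, 1, 1, 1, 1, 0, 0, 0, -1, 0],
}


def manhattan(a, b):
    """Sum of absolute differences between two point assignments."""
    return sum(abs(x - y) for x, y in zip(a, b))


distances = {
    (s, t): manhattan(SYSTEMS[s], SYSTEMS[t])
    for s, t in combinations(SYSTEMS, 2)
}

for pair, d in distances.items():
    print(pair, d)
# ('Hi-Lo', 'K-O') 1
# ('Hi-Lo', 'Hi-Opt I') 2
# ('K-O', 'Hi-Opt I') 3
```

Two systems with a distance of zero assign identical points to every card, which is why some systems sit at the same place in the network.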

This trend correlates quite well with the perceived ease of use of the systems, with the hardest ones to use being highlighted in red in the network and the medium ones in blue. The hardest ones do seem to be the most successful at predicting good betting situations. However, the consensus seems to be that the most complex systems are not that much better than some of the simpler ones — these are slightly less powerful but far easier to use. That is, the differences in difficulty are much greater than are the differences in performance, and so the complex ones are rarely recommended these days.

The powerful but simple systems include KISS III, K-O, REKO and Red Seven. Indeed, K-O appears to be becoming one of the most popular card counting systems. However, the older Hi-Lo is probably the most used counting strategy in existence.

Other games

Actually, consistently winning at blackjack is now old hat. What is far more interesting is trying to be an advantage gambler at games like lotto and the lotteries. Advantage gambling at lotto turns out sometimes to be an investment strategy rather than a gamble. For example, there have been times when the prize money has actually been greater than the cost of the betting tickets required to cover all of the needed number combinations (see The International Lotto Fund) and other times when the prize distribution has made each ticket worth more than it costs (see Massachusetts' Cash WinFall). My favorite, though, is trying to work out how to use advantage gambling for scratch lotteries, the gambling that usually has the worst chance of winning (see this article about Joan Ginther, who has clearly tried).
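The arithmetic behind the first of these strategies can be sketched in a few lines of Python. The game format (six numbers from 44) and the ticket price are hypothetical values chosen for illustration, and the sketch deliberately ignores the smaller prizes, taxes, the chance of sharing the jackpot, and the practical problem of buying millions of tickets.

```python
from math import comb


def cover_all_cost(pool, picks, ticket_price):
    """Cost of buying one ticket for every possible number combination."""
    return comb(pool, picks) * ticket_price


def is_sure_profit(jackpot, pool, picks, ticket_price):
    """True if a sole winner's jackpot exceeds the cost of covering everything."""
    return jackpot > cover_all_cost(pool, picks, ticket_price)


print(cover_all_cost(44, 6, 1))              # 7059052
print(is_sure_profit(10_000_000, 44, 6, 1))  # True
print(is_sure_profit(5_000_000, 44, 6, 1))   # False
```

Under these simplified assumptions the "bet" has no element of chance at all, which is why it looks more like an investment than a gamble.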

August 5, 2014


Data-display networks are a means of visualizing complex patterns in multivariate data. One particular use is for displaying the patterns in a set of trees. For example, Consensus Networks and SuperNetworks are splits graphs that display the patterns common to some specified subset of a collection of trees (e.g. a set of equally optimal trees, or a set of trees sampled by a Bayesian or bootstrap analysis). Alternatively, Parsimony Networks try to simultaneously display all of the trees in a collection of most-parsimonious trees for a single dataset.

Another display method for multiple trees is what has been called a Cloudogram (see the post Cloudograms and data-display networks). These superimpose the set of all trees arising from an analysis, so that dark areas in such a diagram will be those parts where many of the trees agree on the topology, while lighter areas will indicate disagreement.

Yet another method for combining trees into a graph while retaining all of the original information from the source trees is the Tree Alignment Graph (TAG), an idea introduced by Stephen A. Smith, Joseph W. Brown and Cody E. Hinchliff (2013. Analyzing and synthesizing phylogenies using tree alignment graphs. PLoS Computational Biology 9: e1003223).

The authors note:
These methods address the problem of identifying common nodes and edges across sets of phylogenetic trees and constructing a data structure that efficiently contains this information while retaining original source information ... Mapping trees into a TAG exploits the fact that rooted phylogenetic trees are in fact a specific type of graph: they are directed, acyclic, and require that each node has, at most, one parent. By relaxing these requirements, we can combine multiple trees into a common graph, while minimizing changes to the semantic interpretations of nodes and edges in the trees. Because they contain nodes and edges directly analogous to those from their source trees, TAGs have the desirable quality of retaining the full identifiability of the original source trees they contain. Additionally, because they are not restricted to the bifurcating model of evolution, TAGs may represent conflict among source trees as reticulations in the graph.
The basic principle is illustrated in the first figure (above). Internal nodes represent collections of terminal nodes, and arcs (directed edges) represent their relationships. Nodes and arcs are added to the growing TAG, each of which represents one relationship shown in one of the original trees. TAG A in the figure shows the result of combining the black, blue and orange trees, while TAG B shows the result of then adding the gray and green trees to TAG A (the arcs are colour-coded). The resulting TAG is thus a database of all of the original information, which can then be queried in any way to provide summaries of the data. In particular, standard network summaries can be used, such as node degree, which will highlight parts of the TAG with interesting characteristics.
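The idea of merging trees into one graph can be illustrated with a much-simplified sketch in Python. This is not the authors' implementation: it assumes that every tree has the same set of tips, identifies each node purely by the set of tips below it, and uses invented names (`add_tree`, `parents_of`). The published method handles overlapping taxon sets and much more.

```python
from collections import defaultdict


def add_tree(tag, tree, source):
    """Add one rooted tree (nested tuples of tip names) to the graph.

    Each node is identified by the frozenset of tips below it. Returns the
    node for the root of `tree`.
    """
    if isinstance(tree, str):
        return frozenset([tree])
    children = [add_tree(tag, subtree, source) for subtree in tree]
    node = frozenset().union(*children)
    for child in children:
        tag[(node, child)].add(source)  # remember which tree supplied the arc
    return node


def parents_of(tag, node):
    """All distinct parents of a node across the combined trees."""
    return {parent for (parent, child) in tag if child == node}


# (parent, child) arc -> names of the source trees that contain it
tag = defaultdict(set)
add_tree(tag, (("A", "B"), ("C", "D")), "tree1")
add_tree(tag, ((("A", "B"), "C"), "D"), "tree2")

ab = frozenset({"A", "B"})
print(sorted(tag[(ab, frozenset({"A"}))]))  # ['tree1', 'tree2']
print(len(parents_of(tag, ab)))             # 2
```

The arc from {A, B} to A is shared by both trees, so it is stored once with two sources. The node {A, B} ends up with two different parents, one from each tree, and that reticulation is how conflict between the source trees shows up in the graph.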

The authors provide two empirical examples of applications. The one shown here involves 100 bootstrap trees for 640 species representing the majority of known lineages from the Angiosperm Tree of Life dataset (chloroplast, mitochondrial, and ribosomal data). The TAG is shown lightly in the background. Superimposed on this, the nodes are coloured to represent the effective number of parent nodes, and their size represents node bootstrap support. Highly supported nodes with a low number of effective parents (large blue nodes) are frequently recovered and confidently placed in the source trees, while highly supported nodes with a high number of effective parents (large and pink or orange) are frequently resolved in the source trees but their placement varies among bootstrap replicates. So, the three largest problem areas as illustrated in the TAG correspond to the Malpighiales, Lamiales and Ericales.

For comparison, a NeighborNet analysis of the same data is shown in the blog post When is there support for a large phylogeny? This simply shows an unresolved blob.

August 3, 2014


Cheese making is about 8,000 years old, and there are now about 1,000 distinct types of cheese throughout the world. As with most ancient crafts, the art of making cheese is to get the microbes to do most of the work for you.

To this end, there has been much interest in the microbial communities that occur in cheese rinds (the bit around the outside). Different communities are expected to be associated with different styles of cheese, since the production process can be quite different. This is shown in the first figure, which emphasizes that much of the difference between cheeses is due to different maturation procedures.

From Wolfe et al. (2014).
Recently, Wolfe BE, Button JE, Santarelli M, and Dutton R (2014. Cheese rind communities provide tractable systems for in situ and in vitro studies of microbial diversity. Cell 158: 422-433) had a look at the dominant genera of bacteria and microfungi in the rind communities of 137 different types of cheese. They don't actually tell us much about which cheeses these were, merely claiming:
We attempted to evenly sample across rind type (24 bloomy rind cheeses, 52 washed rind cheeses, and 61 natural rind cheeses) and geographic regions (87 European cheeses across 9 countries; 50 American cheeses across 13 states from the West Coast to the east Coast). We also attempted to sample across different milk types (77 cow milk, 34 goat milk, 21 sheep milk, and 5 mixed milk) and milk treatments (99 raw milk, 38 pasteurized).
Based on sequencing the bacterial 16S and fungal ITS loci, the authors identified 14 bacterial and 10 fungal genera (moulds and yeasts) that occurred with an average abundance of >1%, as shown in the next figure.

The 137 rind samples with their bacterial (middle row) and fungal (bottom row) genera indicated
by different colours. The order of the samples was determined by UPGMA clustering (top row).
The authors also used shotgun metagenomic sequencing to identify a range of genes in the microorganisms. They present a phylogeny of one particular gene (shown in the next figure) that shows a close relationship between some of the cheese microbes and marine bacteria:
The widespread distribution and high abundance of marine-associated gamma-Proteobacteria, enriched in both washed and bloomy rind cheeses, was an unexpected finding in our survey of taxonomic diversity ... One possible source of these marine microbes is the sea salt used in cheese production.
[Note: the other cheese rind bacterium shown in the phylogeny, Brevibacterium linens, is the one responsible for the unbelievable smell of washed-rind cheeses such as Epoisses, Münster and Limburger. It is also responsible for personal-hygiene issues such as foot odour. You can imagine how it first got into cheese making!]

However, Ropars J, Cruaud C, Lacoste S, and Dupont J (2012. A taxonomic and ecological overview of cheese fungi. International Journal of Food Microbiology 155: 199-210), in a related study, have pointed out the usual problem with microbial phylogenies: gene trees are frequently incongruent. So, the gene phylogeny shown above is not likely to be the species phylogeny. It would thus be of great interest to investigate the full microbial network, rather than looking at a single tree.

July 29, 2014


This post is just to let everyone know that Dan Gusfield's long-awaited book on the interface between phylogenetics and population genetics is now available.

The book is aimed at mathematically inclined readers. It has a few contributions from Charles H. Langley, Yun S. Song and Yufeng Wu. The title, ReCombinatorics, is described as "a portmanteau word derived from the single-crossover recombination of the words 'recombination' and 'combinatorics'."

Hardcover 448 pp; ISBN: 9780262027526; $60.00 £30.95
More information is available from The MIT Press.

This new book joins these previous contributions to the genre:

Image from Celine Scornavacca.

July 27, 2014


On pages 72-73 of the book Guide to Urban Moonshining: How to Make and Drink Whiskey (written by Colin Spoelman and edited by David Haskell, 2013, published by Harry N. Abrams), there is an illustration of something called the "American whiskeys family tree". This is reproduced in the article The Bourbon Family Tree for GQ magazine, from where I sourced the copy here.

The author describes it as follows:
This chart shows the major distilleries operating in Kentucky, Tennessee, and Indiana, grouped horizontally by corporate owner, then subdivided by distillery. Each tree shows the type of whiskey made, and the various expressions of each style of whiskey or mash bill, in the case of bourbons. For instance, Basil Hayden's is a longer-aged version of Old Grand-Dad, and both are made at the Jim Beam Distillery.
So, while the vertical axis is indeed a time scale, the trees are only marginally family trees in the genealogical sense. This is much more an attempt to illustrate the corporate ownership of American whiskey, which is made principally from corn (and thus is generically called bourbon, although in Tennessee they seem to rarely use this word). The main distinctions among the brands are (i) whether the non-corn part is made from rye, a little bit more rye, or wheat, and (ii) the length of time it is aged between distillation and sale.

The reticulations among the trees apparently refer to blends. The ghost lineages at the right are described thus:
Willett, formerly only a bottler as Kentucky Bourbon Distillers, has been distilling its own product for about a year; I include the brands that it bottles from other sources for reference.

July 22, 2014


I have written before about the expected genetic problems associated with inbreeding, including consanguinity and incest (relationships between people who are first cousins or closer). Conventionally, the evolutionary advantage of sexual over non-sexual reproduction is considered to be the creation of genetic diversity through heterozygosity. Inbreeding, by reducing heterozygosity, then seems to negate the advantages of sexual reproduction — it leads to the propagation of deleterious recessive alleles and thus inbreeding depression. So, there is a clear evolutionary dimension to the fact that incest avoidance is nearly universal in humans.

The best known exceptions to this situation are among royalty, including the family "trees" of the ancient Egyptian 18th Dynasty (see Tutankhamun and extreme consanguinity) and the Egyptian Ptolemaic dynasty (see Cleopatra, ambition and family networks), which were hybridization networks rather than conventional trees. The presence of consanguinity and incest among royal families then requires a biological explanation. As noted by van den Berghe & Mesher (1980):
Royal incest is best explained in terms of the general sociobiological paradigm of inclusive fitness ... Royal incest (mostly brother-sister; less commonly father-daughter) represents the logical extreme of hypergyny. Women in stratified societies maximize fitness by marrying up; the higher the status of a woman, the narrower her range of prospective husbands. This leads to a direct association between high status and inbreeding.
The benefits of inclusive fitness refer to the increased number of offspring in future generations that result from increasing the reproductive success of close relatives. This is achieved via choice of mate. In other words, close relatives share genes, and the success of any relative in leaving offspring is a success for all relatives. Therefore, evolutionary fitness is a combination of individual fitness plus the fitness of close relatives. Inbreeding may reduce individual fitness but can increase inclusive fitness, as noted by Puurtinen (2011):
Theoretical work has shown that inclusive fitness benefits can favor close inbreeding even when this results in substantial reduction in offspring fitness. These models have identified the boundary level of inbreeding depression limiting the evolution of inbreeding among first-order relatives, that is, between full siblings, or between parents and offspring.
So, there is a stable level of inbreeding in those populations that practice mate choice for optimal inbreeding. For example, the genetic risks of close inbreeding can be more than accounted for by the production of a highly related heir who has access to a wide choice of mates. Nevertheless:
For a wide range of realistic inbreeding depression strengths, mating with intermediately related individuals maximizes inclusive fitness.
In other words, mating with very close relatives is unlikely to evolve via natural selection because it is not an optimal strategy; and we must thus look to a sociological component to incest (such as retaining wealth within the family), as well as a biological one.

In this context, it is interesting to note exceptions to the usual restriction of incest to the aristocracy. The society of Graeco-Roman Egypt (from c. 300 BCE to 300 CE) provides the best-documented case (e.g. see Hopkins 1980; Shaw 1992; Parker 1996; Scheidel 1997; Huebner 2007; Remijsen & Clarysse 2008). [This era starts with the Ptolemaic dynasty, which marks the collapse of Egyptian rule of Egypt.] During this time a significant proportion of all marriages noted in official Roman census declarations were between full brothers and sisters. That is, the Roman-era Egyptians did not limit this type of inbreeding to any small group, but spread it across several social classes (mainly Greek settlers rather than native Egyptians).

As noted by Scheidel (1997):
According to official census returns from Roman Egypt (first to third centuries CE) preserved on papyrus, 23·5% of all documented marriages in the Arsinoites district in the Fayum (n=102) were between brothers and sisters. In the second century CE, the rates were 37% in the city of Arsinoe and 18·9% in the surrounding villages. Documented pedigrees suggest a minimum mean level of inbreeding equivalent to a coefficient of inbreeding of 0·0975 in second century CE Arsinoe. Undocumented sources of inbreeding and an estimate based on the frequency of close-kin unions indicate a mean coefficient of inbreeding of F=0·15-0·20 in Arsinoe and of F=0·10-0·15 in the villages at the end of the second century CE. These values are several times as high as any other documented levels of inbreeding.
For comparison, the inbreeding F values for these family relationships are:
parent-offspring = siblings
uncle-niece = double first cousins
first cousins
first cousins once removed
second cousins 0.500
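The groupings in this list follow from Wright's path-counting formula, F = Σ (1/2)^n, where the sum runs over the common ancestors of the two mates and n is the number of individuals on the loop from one mate up to the ancestor and down to the other mate. The sketch below is only illustrative: it assumes that the common ancestors are not themselves inbred, and the `PATHS` table and the function name are hypothetical helpers, not taken from any of the cited papers.

```python
from fractions import Fraction

# For each type of union: one entry per common ancestor, giving the number of
# individuals on the loop mate -> common ancestor -> other mate (all included).
# Assumption: the common ancestors are not themselves inbred.
PATHS = {
    "parent-offspring": [2],
    "siblings": [3, 3],
    "uncle-niece": [4, 4],
    "double first cousins": [5, 5, 5, 5],
    "first cousins": [5, 5],
    "first cousins once removed": [6, 6],
    "second cousins": [7, 7],
}


def inbreeding_coefficient(path_lengths):
    """Wright's F for the offspring of a union, summed over all loops."""
    return sum(Fraction(1, 2) ** n for n in path_lengths)


for union, loops in PATHS.items():
    f = inbreeding_coefficient(loops)
    print(f"{union:28s} F = {f} = {float(f):.6f}")
```

Under these assumptions, parent-offspring and sibling unions both give F = 1/4, uncle-niece and double-first-cousin unions both give F = 1/8, and each further step down the list halves the value.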
However, inbreeding depression seems not to have been a notable problem during this historical time. As noted by John Hawkes:
There is not a single mention in the evidence that links sibling marriage to negative genetic effects or unhappy marriages.
This does not mean that there were no problems, but merely that any problems were not documented, as noted by Scheidel (1997):
Even in the absence of explicit references to inbreeding depression from Roman Egypt, there is no compelling reason to assume that brother–sister marriage could have remained entirely without negative consequences for the Arsinoites. It is however possible that, due to a low incidence of lethal recessives, such effects were considerably weaker than in some western samples. The census returns do not suggest lower levels of fertility or smaller numbers of children among sibling couples ...
The practice seems to have stopped solely because it was contrary to Roman Law:
Before a.d. 212 the Romans had accepted discrepancies between their own legal practice and prevailing local customs and traditions in the Eastern provinces. Papyri from Roman Egypt, the Talmud, and the Romano-Syrian law book indeed reveal legal procedures which differed significantly from Roman law in matters such as marriage, guardianship, paternal authority, sales, and debts. The Constitutio Antoniniana, however, made all free men and women of the Roman Empire into Roman citizens, and so Roman law became applicable to all inhabitants of Egypt. Brother-sister marriages cease to be documented in our Roman census returns from the early third century on. Our last [incest] testimony dates to a.d. 229.

Hopkins K (1980) Brother-sister marriage in Roman Egypt. Comparative Studies in Society and History 22: 303-354.

Huebner SR (2007) "Brother-sister" marriage in Roman Egypt: a curiosity of humankind or a widespread family strategy? Journal of Roman Studies 97: 21-49.

Parker S (1996) Full brother-sister marriage in Roman Egypt: Another look. Cultural Anthropology 11: 362-376.

Puurtinen M (2011) Mate choice for optimal (k)inbreeding. Evolution 65: 1501-1505.

Remijsen S, Clarysse W (2008) Incest or adoption? Brother-sister marriage in Roman Egypt revisited. Journal of Roman Studies 98: 53-61.

Scheidel W (1997) Brother-sister marriage in Roman Egypt. Journal of Biosocial Science 29: 361-371.

Shaw BD (1992) Explaining incest: brother-sister marriage in Graeco-Roman Egypt. Man 27: 267-299.

July 20, 2014


I have commented before on the fact that the general public associates an inappropriate "March of Progress" image with the concept of "evolution" (see Haeckel and the March of Progress, and especially Tattoo Monday VIII - the March of Progress). It therefore seems worthwhile to gather a few examples together in the one place. Most of these are abbreviated versions of the image in the book Early Man by Francis C. Howell (1965. Time-Life International, New York). There were originally 14 images (see the version here), but the modern versions have half or fewer of them.

July 16, 2014

We all worked hard during the workshop. Here is our fearless leader, in deep thought:

While some of the younger participants enjoyed drawing on the walls:

Professor Whitfield has come up with a great new model of evolution: phylogenetic windmills:

There was not only work, but also time to relax and enjoy the beautiful Dutch summer weather:

And not to forget the delicious Dutch food:

But really, most of the time we were busy touching the data, which you can find on this website:

For more photos, see the Touching the Data website.

July 11, 2014


We have now completed the workshop.

Since the first report, we have had three more talks. First, Mukul Bansal outlined the relationship between phylogenetic networks and reconciliation analysis, and the way in which the latter can be used to construct the former. Starting from an estimated species tree, the tree for each locus is optimized for fit to the species tree, which helps locate any areas of extensive gene flow (ie. reticulation). This can be done using a large number of loci and an even larger number of taxa.

Celine Scornavacca provided details of some of the fundamental limitations of network analysis. The most important of these is unidentifiability of network topologies -- there are classes of network topologies that cannot be distinguished based on the information that is currently used, so that we cannot guarantee that a unique optimal network will be found during an analysis. Branch lengths may help with this situation, but cannot guarantee to resolve it.

Jim Whitfield covered the advantages and potential problems of using genomic-scale data for phylogenetic analysis. The basic problem is the increased scope for error in moving to the genome data (genome assembly problems, gene homology issues, alignment difficulties), although the potential advantages are extensive.

Most importantly, we spent two days "touching" some data. The participants broke into smaller groups of continuously varying size, each of which focussed on a particular dataset (as supplied by some of the participants). These data were evaluated in many different ways, to assess the characteristics of the data as well as to evaluate the data-analysis methods. This not only allowed us to identify the current state of the art with respect to phylogenetic networks, but it also allowed computationalists to improve their understanding of biological data and how biologists proceed to analyze it, as well as allowing biologists to obtain immediate feedback with respect to their data-analysis issues.

Production of phylogenetic networks seems to have come a long way in the past few years, although there is still no single "one-stop shopping" software tool to use. Practical issues in getting programs to run on all computer types were identified, along with data-format issues. Nevertheless, all of the participants seemed to find that this was a very valuable exercise, as a means of focussing interactions among themselves.

Finally, we considered both European and U.S. funding for network research, in the latter case assisted by David Mindell (from the N.S.F.). In particular, we identified sources of funding for future workshops (either in the south of France or the north-eastern U.S.A.).

The canal-boat cruise turned out well, in spite of the somewhat uncooperative weather. The football, of course, has turned out to be rather disappointing for the hosts, although they have one more game to play.

July 8, 2014


We have now completed two days of the workshop. We have had a relaxed approach to progress, and are thus currently running behind the nominal schedule. Nevertheless, we are progressing splendidly.

We had three talks on the first day and one today. I tried to kick things off by asking a series of what I consider to be unanswered questions from observing practitioners and computationalists in action, although apparently several members of the audience already had their own answers to some of these. The bottom line is that phylogenetic analysis focuses on data patterns while interpretation focuses on processes / mechanisms, and this constitutes a large part of the apparent separation of practitioners and computationalists.

Steven Kelk and Luay Nakhleh introduced the diversity of computational approaches that we already have. These presentations neatly complemented each other, providing a valuable summary of the field as well as an overview of current limitations and future prospects. This topic was taken up later by various members of the audience, as one of the inherent problems for practitioners is how to navigate through the methods to choose a suitable one -- there are methods based on parsimony, likelihood and bayesian analysis, and methods that tackle de novo network construction, gene tree / species tree reconciliation, gene tree scoring, and network presentation.

This topic was followed up today by presentations introducing some of the currently available software. Some of these have progressed significantly in recent years, notably PhyloNet and Dendroscope, and there are some relatively new ones, as well as even newer ones in the pipeline. Based on the literature, these programs are being dramatically under-used compared to their actual usefulness.

This morning Scot Kelchner introduced us to the application of Zen Buddhism to science in general and phylogenetics in particular. This went down much better than he seemed to be expecting -- there were apparently a lot of "Zen" people in the room. The basic idea is not to get trapped by preconceived expectations, especially arbitrary categorical notions, when interpreting the output of a phylogenetic analysis. You can consult The Nine-Headed Dragon River, by Peter Matthiessen, if you would like further information.

Finally, we got to the topic implied by the workshop's title: Touching the Data. We had a brief run-through of the pre-existing datasets stored with this blog (see the upper right-hand corner), which cover some of the diversity of what practitioners have provided to date in the way of usable datasets with "known" phylogenetic patterns.

By far the most interesting, however, was the presentation of some recent datasets made available by members of the workshop, notably Axel Janke (bear species), Scot Kelchner (bamboo species) and Mattis List (Indo-European languages) (Jim Whitfield will present his datasets tomorrow morning). These datasets generated much interest, as they provide a diversity of different possible applications for phylogenetic networks. The idea from here on in the workshop is to address what can currently be done with these datasets and what we might like to do with them if the tools were available. This will help focus the participants on specific practical issues, which should lead to the progress that we hope to achieve.

It has rained most of the day, which is actually unusual -- intermittent rain is more common in this climate. We are currently waiting for the football to start: Germany versus Brazil. Tomorrow will be the Netherlands versus Argentina. It is risky being in this country this week! The current local betting is for an all-European final, an assessment that involves no cultural bias whatsoever.

July 6, 2014


This week we have returned to Leiden (in the Netherlands), for another workshop sponsored by the Lorentz Center. The previous workshop, in October 2012, is discussed in this prior blog post: Workshop: The Future of Phylogenetic Networks.

The full title of the new workshop is: Utilizing Genealogical Phylogenetic Networks in Evolutionary Biology: Touching the Data. As before, it has been organized by Steven Kelk, Leo van Iersel, Leen Stougie and myself. The program and abstracts can be found here. It runs for the whole week 7 July – 11 July 2014.

The workshop differs significantly from the previous workshop in two ways: it is intended to be a much smaller and more focused workshop, and it is intended to be practical rather than theoretical. The basic aim is to get biologists and computational people to sit down in small groups and actually talk about real phylogenetic data, so that each side of the phylogenetics "coin" gets to understand a bit better what is going on on the other side. To this end, we have gathered together some of the experts in the field specifically of evolutionary / genealogical networks (rather than data-display networks), as this is the area that needs the greatest future development. We have also gathered together some real-world datasets involving apparent reticulating evolution, which will be the focus of discussion. These datasets are available here and also here.

The weather is predicted to be changeable during the workshop, which is to be expected in northern Europe even in summer — that is why everyone else has gone to southern Europe.

I am hoping to add some blog posts based on what happens at the workshop, as it proceeds.

July 2, 2014


I recently wrote a manuscript comparing the tree-likeness of phylogenetic data in biology and anthropology (see Are phylogenetic patterns the same in anthropology and biology?). While doing so, I also made a comparison of genotype and phenotype data within biology.

The comparison is based on maximum-parsimony analyses of the data, using the (ensemble) Retention Index (RI) as the measure of tree-likeness. If RI = 1 then all of the characters are compatible with the same tree, whereas if RI = 0 then none of them are pairwise compatible. As the graph shows, the genotype data are considerably less tree-like than are the phenotype data (mean RI ≈ 0.5 versus 0.7, respectively).
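For readers unfamiliar with the measure, the ensemble Retention Index is conventionally computed as RI = (G - S) / (G - M), where for each character M is the minimum possible number of steps on any tree, G is the maximum possible number of steps, and S is the number of steps observed on the tree being evaluated, with each quantity summed over all characters. The sketch below simply illustrates that arithmetic; the function name and the step counts are made up for the example, and in a real analysis the step counts would come from the parsimony program.

```python
def ensemble_retention_index(characters):
    """Ensemble RI from per-character (min_steps, max_steps, observed_steps)."""
    total_min = sum(m for m, g, s in characters)
    total_max = sum(g for m, g, s in characters)
    total_obs = sum(s for m, g, s in characters)
    if total_max == total_min:
        # No character can retain any grouping information, so RI is undefined.
        raise ValueError("RI is undefined when max and min steps are equal")
    return (total_max - total_obs) / (total_max - total_min)


# Hypothetical step counts for three characters: (min, max, observed).
example = [(1, 3, 1), (1, 2, 2), (1, 4, 2)]
print(round(ensemble_retention_index(example), 3))  # 0.667
```

When every character fits the tree with its minimum number of steps the index is 1, and when every character needs its maximum number of steps the index is 0.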

It would be interesting to know whether other people have observed this pattern. If it is general, then what causes it? Are the phenotype characters being chosen (subconsciously or not) because they show nested grouping patterns (which lend themselves automatically to a tree representation)? Or do the genotype data inherently have more stochastic variation? Does this mean that we should always be using phylogenetic networks for the representation of genotype data?

You can read the manuscript if you want the details of the analyses. Briefly, the initial collections of datasets were taken from Collard et al. (Evolution and Human Behavior 27: 169-184; 2006) — the graphed data are taken from the paper as I never managed to get the original datasets from the authors. I then supplemented this information with phenotype datasets from TreeBase (total of n=31) and miscellaneous genotype datasets from the literature (n=15). All of the datasets refer to vertebrates and insects (with one phenotype dataset from spiders). My parsimony analyses used the parsimony ratchet and PAUP*.

June 28, 2014


The Millennium started 14 years ago, so there are still 986 years left to solve the following seven phylogenetic network Millennium problems. These are not necessarily the most important problems to solve from a biological point of view, but are challenging computational problems that have (at least) some biological relevance. The problems are all about phylogenetic networks, except for Problems 2 and 7 which are about the closely related topic of agreement forests. Solving these problems will not be rewarded with $1,000,000 but only with eternal fame.

In each of these problems, a phylogenetic network on X is a directed acyclic graph with a single root and no vertices that have only one incoming and only one outgoing arc, and in which each leaf is labelled by an element of X and each element of X labels one leaf.
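The definition above can be checked mechanically. The following is a minimal sketch, assuming a network is supplied as a list of (parent, child) arcs plus a dictionary mapping each leaf vertex to its label in X; this representation and the function names are illustrative choices, not part of the problem statements. It also computes the reticulation number used in Problem 1, taken here as the sum of (indegree - 1) over vertices with more than one incoming arc.

```python
from collections import defaultdict


def is_phylogenetic_network(arcs, labels, X):
    """Check the definition: rooted DAG, no indegree-1 outdegree-1 vertices,
    and leaves labelled one-to-one by the elements of X."""
    children, indeg, vertices = defaultdict(list), defaultdict(int), set()
    for u, v in arcs:
        children[u].append(v)
        indeg[v] += 1
        vertices.update((u, v))
    roots = [v for v in vertices if indeg[v] == 0]
    if len(roots) != 1:
        return False
    if any(indeg[v] == 1 and len(children[v]) == 1 for v in vertices):
        return False
    leaves = {v for v in vertices if not children[v]}
    if set(labels) != leaves:
        return False
    if len(labels) != len(X) or set(labels.values()) != set(X):
        return False
    # Acyclicity: Kahn's algorithm must be able to visit every vertex.
    remaining = {v: indeg[v] for v in vertices}
    queue, seen = list(roots), 0
    while queue:
        u = queue.pop()
        seen += 1
        for w in children[u]:
            remaining[w] -= 1
            if remaining[w] == 0:
                queue.append(w)
    return seen == len(vertices)


def reticulation_number(arcs):
    """Sum of (indegree - 1) over all vertices with indegree above one."""
    indeg = defaultdict(int)
    for _, v in arcs:
        indeg[v] += 1
    return sum(d - 1 for d in indeg.values() if d > 1)


# A small network on X = {a, b, c} with one reticulation (vertex h).
arcs = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
        ("v", "h"), ("v", "c"), ("h", "b")]
labels = {"a": "a", "b": "b", "c": "c"}
print(is_phylogenetic_network(arcs, labels, {"a", "b", "c"}))  # True
print(reticulation_number(arcs))  # 1
```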

Problem 1. Is the Hybridization Number problem fixed-parameter tractable (FPT) if the input is an unrestricted set of nonbinary trees and the only parameter is the hybridization number? Hybridization Number is the following problem. Given a finite set X, a collection T of rooted (possibly nonbinary) phylogenetic trees on X and a natural number k, decide if there exists a rooted phylogenetic network on X that displays all trees from T and has reticulation number at most k. See e.g. (van Iersel, Kelk, 2013) for more detailed definitions.

Problem 2. Does there exist a polynomial-time 2-approximation algorithm for MAF on two binary trees? Maximum Agreement Forest (MAF) on two binary trees can be defined as follows. Given a finite set X and two rooted binary phylogenetic trees on X, what is the minimum number of components in a forest on X that can be obtained from each of the input trees by deleting vertices, deleting edges and suppressing indegree-1 outdegree-1 vertices? For a 2.5-approximation see (Shi, You, Feng, 2014).

Problem 3. Is there an FPT algorithm for finding a level-k phylogenetic network consistent with a given dense set of rooted triplets, if k is the parameter? A rooted triplet is a phylogenetic tree with three leaves. A set of rooted triplets is called dense if it contains at least one triplet for each combination of three leaves. A network is level-k if it can be turned into a tree by deleting at most k edges per biconnected component. This problem is known to be solvable in polynomial time if k is fixed, see (Habib and To 2012).
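The density condition in Problem 3 is easy to state in code. As a small sketch, assume that a rooted triplet ab|c is written as the pair `(("a", "b"), "c")`; that encoding and the function name are assumptions made purely for illustration.

```python
from itertools import combinations


def is_dense(triplets, leaves):
    """True if every combination of three leaves has at least one triplet."""
    covered = {frozenset(pair) | {out} for pair, out in triplets}
    return all(frozenset(combo) in covered for combo in combinations(leaves, 3))


leaves = ["a", "b", "c", "d"]
triplets = [(("a", "b"), "c"), (("a", "b"), "d"),
            (("c", "d"), "a"), (("c", "d"), "b")]
print(is_dense(triplets, leaves))       # True
print(is_dense(triplets[:-1], leaves))  # False: nothing covers {b, c, d}
```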

Problem 4. Is Tree Containment polynomial-time solvable or NP-hard for reticulation visible networks? Tree Containment is the problem of deciding if a given phylogenetic network displays a given tree. A phylogenetic network is called reticulation visible if from each reticulation (vertex with indegree greater than one) there exists a path that does not pass through any other reticulations and ends in a leaf. Tree Containment is known to be NP-hard for general networks and for some restricted classes of networks; see (Kanj, Nakhleh, Than, Xia, 2008) and (van Iersel, Semple, Steel 2010).

Problem 5. Is there a constant-factor approximation algorithm for computing the softwired parsimony score of a binary tree-child network and a binary character? Given a network and a character state (0 or 1) for each leaf, the softwired parsimony score is the minimum number of state-changes in any tree (on all leaves) displayed by the network, over all possible assignments of states to the internal vertices. A phylogenetic network is called tree-child if each non-leaf vertex has at least one child that is not a reticulation. This problem does not have a constant factor approximation for general networks or for other (less severely) restricted classes of networks, unless P = NP (Fischer, van Iersel, Kelk, Scornavacca 2013).
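The tree-child condition in Problem 5 is a purely local property of the network, so it can be tested in one pass over the arcs. The sketch below assumes the network is given as a list of (parent, child) arcs; the function name and the two example networks are hypothetical. It checks only the tree-child property and does not compute the softwired parsimony score.

```python
from collections import defaultdict


def is_tree_child(arcs):
    """True if every non-leaf vertex has a child that is not a reticulation."""
    children, indeg = defaultdict(set), defaultdict(int)
    for u, v in arcs:
        children[u].add(v)
        indeg[v] += 1
    reticulations = {v for v, d in indeg.items() if d > 1}
    # children only has keys for vertices with outgoing arcs (non-leaves).
    return all(kids - reticulations for kids in children.values())


# One reticulation h; every internal vertex keeps a non-reticulation child.
tree_child = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
              ("v", "h"), ("v", "c"), ("h", "b")]
# Both children of u (and of v) are reticulations, so this one fails.
not_tree_child = [("r", "u"), ("r", "v"), ("u", "h1"), ("u", "h2"),
                  ("v", "h1"), ("v", "h2"), ("h1", "a"), ("h2", "b")]
print(is_tree_child(tree_child))      # True
print(is_tree_child(not_tree_child))  # False
```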

Problem 6. Given k > 1, what is the maximum value of p such that for any set of rooted triplets there exists some level-k phylogenetic network on n leaves that is consistent with at least a fraction p of the input triplets? For k = 0 the maximum is p = 1/3 and for k = 1 it is roughly 0.48, see (Byrka, Gawrychowski, Huber, Kelk 2009).

Problem 7. Is there an O(c^n) algorithm for Maximum Acyclic Agreement Forest (MAAF) on two binary phylogenetic trees with c < 2? An acyclic agreement forest is an agreement forest (see above) for which the following directed graph D is acyclic. D has a vertex for each component of the forest and there is an arc from component A to component B if in at least one of the input trees there is a directed path from the root of A to the root of B. It is known that there exists an O*(2^n) algorithm for this problem (van Iersel, Kelk, Lekic, Stougie, 2013).

June 24, 2014


I recently published a post on Evolution and timelines, in which I pointed out that presenting historical data as a timeline is a very poor way of representing an evolutionary history. Evolutionary history is much better presented as a phylogeny, which will be either a tree or a network. However, this does not mean that all histories that are presented as a tree, for example, necessarily represent a phylogeny.

I have encountered a few examples of history-as-tree that seem to have very little connection to a phylogeny. That is, the relationships among the objects are presented along the branches of a tree, but the relationships along the branches seem to be little more than a timeline. So, the whole structure is simply a series of interconnected timelines.

Consider this first example, which is a poster purporting to show for the USA:
the evolution of jazz in its more than one hundred year history. From Archaic to Avant Garde, from blues to bebop, from radio to fusion, from spirituals to swing, from Armstrong to Zawinul, the jazz pedigree presents the diverse history and development of jazz in a clear way.
Perhaps it is the strong central trunk that gives it away as a non-phylogeny. The side-branches do group the jazz performers roughly by genre, but that is all they do. The actual title is a bit more accurate about the content — it is a "Story" rather than a phylogeny.

This poster is accompanied by a European counterpart with an even stronger central trunk. It is labeled as a "Community", but it still claims to "display the history and development of European jazz".

As another example, in 1946, the magazine P.M. published a tree by Ad Reinhardt with a sardonic view of modern American art. [Thanks to Joachim Dagg for alerting me to this example.]

At least there is no central trunk this time, but the clustering of artists along the branches seems to have less to do with phylogenetic history than with artistic genre (and satire). There was a follow-up example 15 years later, in which the sardonic humor plays much the strongest role in the relationships represented.

Finally, here is an example of a timeline that really should be represented using a phylogenetic tree. It is difficult to believe that the group of professions illustrated form a transformational series, as implied by the timeline that is actually shown. Most of the entrepreneur groups depicted actually still exist to this day, rather than being extinct, and so we have here a history of variational evolution, instead of a transformation.

June 22, 2014


Phylogenetics plays no part in games like Trivial Pursuit, but the web offers more opportunities. The Fun Trivia web site, for example, offers a page on Phylogenetics. You should try it, and see how well you do.

The answers (and explanations) are quite good, but the wording of some of the questions leaves a lot to be desired.

June 17, 2014


One possible use of blog posts is as first drafts of ideas that might make their appearance in a refereed publication at a later date. Thus, many of my blog posts have appeared in one form or another in my recent publications. Here I have listed the ones that I can remember using, just in case anyone wants a citable reference for the information in these posts.

A. Morrison DA (2013) Phylogenetic networks are fundamentally different from other kinds of biological networks. In W.J. Zhang (ed.) Network Biology: Theories, Methods and Applications (Nova Science Publishers, New York) pp. 23-68.

    9 Biological versus phylogenetic networks
  13 Network measures and phylogenetic networks
  23 An explanation of graph types
  25 Networks and bootstraps as tree-support criteria
  34 Networks of affinity rather than genealogy
  36 Networks of genealogy
  53 Are mathematical constraints biologically realistic?
  54 Some odd network definitions and terms
  63 Human races, networks and fuzzy clusters
  69 Is this the first network from conflicting datasets?
  70 Why do we still use trees for the Neandertal genealogy?
  72 Networks and most recent common ancestors
  74 Open questions about evolutionary networks, part 1
  75 Open questions about evolutionary networks, part 2
  76 Open questions about evolutionary networks, part 3
  88 When is there support for a large phylogeny?
  90 Explanation of the names for phylogenetic networks
  94 Phylogenetic position of turtles: a network view
  99 How networks differ from bootstrapped trees
107 We should present bayesian phylogenetic analyses using networks
115 Is there a philosophy of phylogenetic networks?

B. Morrison DA (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: in press.

  29 Network analysis of scotch whiskies
  50 Phylogenetic network of the FIFA World Cup
  61 How to interpret splits graphs
101 Distortions and artifacts in Principal Components Analysis analysis of genome data
103 Networks can outperform PCA ordinations in phylogenetic analysis
114 Network analysis of Genesis 1:3
119 Network of ancient Thai bronze Buddha images
134 A network analysis of Simon and Garfunkel
159 Networks and human inter-population variation
172 The acoustics of the Sydney Opera House

C. Morrison DA (2014) Next generation sequencing and phylogenetic networks. EMBnet.journal: Bioinformatics in Action 20: e760.

191 Next Generation Sequencing and phylogenetic networks

D. Morrison DA (2014) Phylogenetic networks: a review of methods to display evolutionary history. Annual Research and Review in Biology 4: 1518-1543.

    2 The first phylogenetic network (1755)
  21 The second phylogenetic network (1766)
  34 Networks of affinity rather than genealogy
  36 Networks of genealogy
  67 Metaphors for evolutionary relationships
  89 Relationship trees drawn like real trees
168 Who first used the term "phylogenetic network"?
182 Affinity networks updated
183 Reticulation patterns and processes in phylogenetic networks
187 What are evolutionary networks currently used for?

E. Morrison DA (2014) Rooted phylogenetic networks for exploratory data analysis. Advances in Research 2: 145-152.

  43 Rooted networks for exploratory data analysis

F. Morrison DA (2014) Is the Tree of Life the best metaphor, model or heuristic for phylogenetics? Systematic Biology 63: 628-638.

  23 An explanation of graph types
  34 Networks of affinity rather than genealogy
  36 Networks of genealogy
  58 Who published the first phylogenetic tree?
  89 Relationship trees drawn like real trees
143 Resistance to network thinking
144 Destroying the Tree of Life?
147 Should phylogenetic modelling proceed from simple to complex or vice versa?
171 Conflicting placental roots: network or tree?
182 Affinity networks updated

June 15, 2014


In a previous blog post (Tattoo Monday VIII), I noted that the usual "March of Progress" image that the general public associates with the concept of "evolution" is originally based on the frontispiece to Thomas Henry Huxley's book Evidence as to Man's Place in Nature (1863. Williams & Norgate, London). A century later, this image was expanded and updated in the book Early Man by the anthropologist Francis C. Howell (1965. Time-Life International, New York) — this picture, with labels, can be viewed here.

What is perhaps less well known is that Ernst Haeckel also made a contribution to this genre. Shown here are the frontispiece and title page of Haeckel’s Natürliche Schöpfungsgeschichte (1868. Verlag von Georg Reimer, Berlin), usually translated as "The History of Creation". This book was Haeckel's attempt to introduce the idea of evolution to the German-speaking general public, after his detailed specialist two-volume book Generelle Morphologie der Organismen (1866. Verlag von Georg Reimer, Berlin). This previous book was difficult to read, and was also full of invective against doubters and supposed opponents; so a more readable approach was needed (the original text itself was apparently derived from one of his students' notes taken during Haeckel's lectures!).

The frontispiece lithograph (by Gustav Müller) is labeled as "The family group of the Catarrhines". It was notoriously supposed to demonstrate (as explained on page 555 of the book) "the highly important fact" that the "lowest humans" stand "much nearer" to the "highest apes" than to the "highest human". The various images are labeled (from "highest" to "lowest"):
  1. "Indo-German"
  2. "Chinese"
  3. "Fuegian"
  4. "Australian Negro"
  5. "African Negro"
  6. "Tasmanian"
  7. gorilla
  8. chimpanzee
  9. orangutang
  10. gibbon
  11. proboscis monkey
  12. mandrill.
The book was a best seller, and remained in print until the 1920s. Fortunately, the frontispiece was quickly changed. For example, in the 4th edition (1873) the frontispiece was a collage of various calcareous sponges, and in the 8th edition (1889) it was a picture of Haeckel himself (as it also was for the 5th and all subsequent editions). The book actually went through 12 editions, with the number and composition of the figure plates changing several times, in addition to the changes to the frontispiece.