The Genealogical World of Phylogenetic Networks

Biology, anthropology, computational science, and networks in phylogenetic analysis



August 30, 2015


Last week I blogged about Spinach and the iron fallacy. I analysed an early set of data by Thomas Richardson (1848), who calculated the amount of iron in combusted ash for various vegetables and fruits, and showed that spinach is not at all unusual in its constituents. The idea that spinach is rich in iron is untrue, and the story about a mis-placed decimal point seems to be nothing more than an urban myth.

In the meantime, Joachim Dagg, at the Natural History Apostilles blog, has reanalysed Richardson's data and revealed that the first source for the spinach-iron myth is likely to have been a somewhat inappropriate attempt to combine his data for the percent iron values in relation to the ash with the percent values of the ashes in relation to the fresh matter.

So, I have recalculated the phylogenetic network using these "adjusted" values. I used the percent values of the chemical constituents in relation to the pure ash (raw ash minus carbonic acid, charcoal and sand), and combined them with the percent values of the ashes. The issue here is that radish roots and leaves have the largest ash values, followed by cherry stems and spinach. This leads to an over-statement of the chemical contents. In particular, the iron content moves spinach from being ranked sixth to second (behind radish foliage, which is not usually eaten).

August 25, 2015


During one of the discussion sessions at the recent Phylogenetic Network Workshop, in Singapore, the need was re-iterated for "gold standard" empirical datasets, in order to aid the development and validation of algorithms for phylogenetic networks.

The current collection of such datasets is located on this blog. It is still quite a small database, as so far it has been based solely on my own ability to locate suitable datasets that are freely available (see the comments in Public availability of phylogenetic data).

I would therefore like to remind everyone that if you have, or know of, suitable empirical datasets then please contact me.

The database is currently hierarchically arranged as follows:

Datasets where the history is a tree
  Datasets where the history is known from experimentation
  Datasets where the history is known from retrospective observation
Datasets where the history is reticulated
  Datasets where the history is known from experimentation
  Datasets where the reticulation is inferred
    Lateral Gene Transfer

The basic requirement for a "gold standard" dataset that contains one or more reticulations (ie. there is gene flow) is that the evidence for the reticulation(s) is independent of the particular dataset. That is, there should be either experimental data, or at least another independent dataset, confirming the gene flow. This is quite a tough criterion, particularly for lateral gene transfer, but it is a necessary quality criterion.

Finally, the database requires the processed data (eg. a multiple sequence alignment), rather than the original raw data (see the comments in Releasing phylogenetic data).

August 23, 2015


A few weeks ago, the Natural History Apostilles blog ran a series of posts on the origins of the well-known spinach-is-rich-in-iron fallacy. This is more complex than expected. Spinach was originally alleged to have been incorrectly claimed to be rich in iron due to a mis-placed decimal point in a set of comparative data. In fact, this explanation itself seems to be untrue (read the posts).

In the blog posts, Joachim Dagg traced the origins of the alleged explanation, in detail, looking at (almost) all of the relevant historical data. One of the earliest sources of data on spinach turns out to be itself something of a mystery:
Thomas Richardson (1848) Beiträge zur chemischen Kenntnis der Vegetabilien. Annalen der Chemie und Pharmacie LXVII Bd. 3.

This was a single-page fold-out table (without page number) included at the end of volume 67 of the journal. In modern electronic copies, it has been erroneously attached to the last article in that issue.

The table contains values for a range of compounds in the ash produced from a variety of plants and their parts. These data are ripe for a visualization.

As usual, we can use a phylogenetic network as a form of exploratory data analysis, to compare all of the plants in a single diagram. I first normalized the data (since the compounds have very different ranges), and then used the Manhattan distance to calculate the dissimilarity of the plants based on their constituents. This was followed by a Neighbor-net analysis to display the between-plant similarities as a phylogenetic network. So, plants (or their parts) that are closely connected in the network are similar to each other based on their chemistry, and those that are further apart are progressively more different from each other.
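The first two steps can be sketched in a few lines of Python. This is only an illustration: the post does not say which normalization was used, so range scaling of each compound to the interval 0 to 1 is an assumption here, and the plant names and values are made up rather than taken from Richardson's table. The Neighbor-net step itself is not shown, since it would be run on the resulting matrix in a separate program.

```python
def normalize_columns(rows):
    """Scale every column (compound) to the range 0..1 (assumed normalization)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # A constant column has zero range; use 1.0 so that it contributes 0 everywhere.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)]
            for row in rows]


def manhattan(a, b):
    """Manhattan (city-block) distance: the sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(rows):
    """Pairwise Manhattan distances between the normalized rows."""
    scaled = normalize_columns(rows)
    return [[manhattan(a, b) for b in scaled] for a in scaled]


# Hypothetical data: three plants, two compounds on very different scales.
plants = ["plant_A", "plant_B", "plant_C"]
ash = [[10.0, 0.5], [20.0, 1.5], [30.0, 1.0]]
matrix = distance_matrix(ash)

for name, row in zip(plants, matrix):
    print(name, [round(d, 2) for d in row])
```

Without the normalization, the first compound would dominate every distance simply because its values are larger.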

As you can see, spinach is not particularly unusual in its chemical constituents. Indeed, it is radish, leek and asparagus that are the most unusual.

August 18, 2015


I thought that I should draw your attention to the current issue of the journal Systematic Biology, which contains more contributions about reticulate relationships than I have seen there before.

These include:

♦ Andrew R. Francis and Mike Steel (2015) Which phylogenetic networks are merely trees with additional arcs? Systematic Biology 64: 768-777. doi:10.1093/sysbio/syv037

A theoretical paper discussed by Leo in a previous blog post (Networks vs augmented trees).

♦ Jonathan Brassac and Frank R. Blattner (2015) Species-level phylogeny and polyploid relationships in Hordeum (Poaceae) inferred by next-generation sequencing and in silico cloning of multiple nuclear loci. Systematic Biology 64: 792-808. doi:10.1093/sysbio/syv035

Contains a tree of relationships among the diploid species, with the tetraploid and hexaploid species manually added as reticulations, to create a hybridization network.

♦ Noah W. M. Stenz, Bret Larget, David A. Baum, and Cécile Ané (2015) Exploring tree-like and non-tree-like patterns using genome sequences: an example using the inbreeding plant species Arabidopsis thaliana (L.) Heynh. Systematic Biology 64: 809-823. doi:10.1093/sysbio/syv039

Contains a series of trees but no network. Nevertheless, the authors' analyses "identify instances of introgression and detect one clear case of reticulation among ecotypes that have come into contact".

♦ David A. Morrison (2015) Aristotle's Ladder, Darwin's Tree: The Evolution of Visual Metaphors for Biological Order, by J. David Archibald. Systematic Biology 64: 892-895. doi:10.1093/sysbio/syv038

A book review that castigates the book's author for hardly mentioning networks when writing about phylogenetic metaphors. There is a table summarizing some of the relevant publication history.

There is also one paper that possibly should be about networks but doesn't actually mention them.

♦ Thomas C. Giarla and Jacob A. Esselstyn (2015) The challenges of resolving a rapid, recent radiation: empirical and simulated phylogenomics of Philippine shrews. Systematic Biology 64: 727-740. doi:10.1093/sysbio/syv029

The authors collected data on "hundreds of ultraconserved elements and whole mitochondrial genomes" from multiple individuals of several species of shrews (Crocidura). They conclude that "the low support we obtained for backbone relationships ... reflects a real and appropriate lack of certainty. Our results illuminate the challenges of estimating a bifurcating tree in a rapid and recent radiation, providing a rare empirical example of a nearly simultaneous series of speciation events".

A NeighborNet analysis of the provided mitochondrial data is shown in the first figure. Clearly, all it says is that the individuals group into species, but there is no information in the data about the relationships among the species.

A NeighborNet analysis of the SNPs from the ultraconserved elements is shown in the second figure. This network is not that different, in that it does little more than group the individuals into species, with little information about relationships.

However, note also that the largest reticulation involves sp_FMNH146788 and mindorus_FMNH221890. These two samples are not closely related in the mitochondrial network. This hints that the sp_FMNH146788 sample may be a genotypic mixture, due perhaps to hybridization or introgression. The authors treat the specimen as representing a "heretofore undescribed taxon that shares introgressed mitochondrial DNA with true C. ninoyi."

August 16, 2015

Bioinformatics lies at the nexus of the biological sciences and the computational sciences. Therefore it is sometimes worth comparing these two disciplines.

Marcus Beck at the R is My Friend blog has looked at doctoral dissertation lengths via the digital archives at the University of Minnesota. His data are shown in this box plot. You can search through it for your own favorite discipline (click on the image to make it larger).

He also has several other graphical views in his blog post, including data on masters theses.

August 11, 2015


Most computational approaches to historical linguistics, be it those producing networks or those producing trees, make use of lexical data. There are several reasons for this preference. Lexical data is much easier to handle than abstract grammatical data. Many linguists also think that lexical data is more representative of language evolution in general, and thus offers a much better starting point for inferences. Whether one likes the preference for lexical data or not, it seems to be worthwhile in this context to reflect a bit more about the nature of lexical data and the complexities of lexical change. This may help to get a clearer picture of the differences between language history and biological evolution.

What Makes a Word?

In a very simple language model, the lexicon of a language can be seen as a bag of words. A word, furthermore, is traditionally defined by two aspects: its form and its meaning. Thus, the French word arbre can be defined by its written form arbre or its phonetic form [ɑʁbʁə], and its meaning "tree". This is reflected in the famous sign model of Ferdinand de Saussure (Saussure 1916), which I have reproduced in [A] in the graphic below. In order to emphasize the importance of the two aspects, linguists often say that form and meaning of a word are like two sides of the same coin (see [B] in the graphic below). But we should not forget that a word is only a word if it belongs to a certain language! From the perspective of the German or the English language, for example, the sound chain [ɑʁbʁə] is just meaningless. So, instead of two major aspects of a word, we may better talk of three major aspects: form, meaning, and language. As a result, our bilateral sign model becomes a trilateral one, as I have tried to illustrate in [C] in the graphic below.

What is Lexical Change?

If there were no lexical change, the lexicon of a language would remain stable for all time. Words might change their forms by means of regular sound change, but there would always be an unbroken tradition of identical patterns of denotation. This is not the case: the lexicon of every language is constantly changing. Words are lost when the speakers cease to use them, and new words enter the lexicon when new concepts arise, either borrowed from other languages or created from native material via different morphological processes. Such processes of word loss and word gain are quite frequent and can sometimes even be observed directly by the speakers of a language when they compare their own speech with the speech of an older or a younger generation.

An even more important process of lexical change, especially in quantitative historical linguistics, is the process of lexical replacement. Lexical replacement refers to the process by which a given word A which is commonly used to express a certain meaning x ceases to express this meaning, while at the same time another word B, which was formerly used to express a meaning y, is now used to express the meaning x. The notion of lexical replacement is thus nothing else than a shift in the perspective on semantic change (as one major dimension of lexical change, see below). While semantic change is usually described from a semasiological perspective, i.e. from the perspective of the form, lexical replacement describes semantic change from an onomasiological perspective, i.e. from the perspective of the meaning.

Three Dimensions of Lexical Change

Gévaudan (2007) distinguishes three dimensions of lexical change: the morphological dimension, the semantic dimension, and the stratic dimension. The morphological dimension points to changes in the outer form of the words which are not due to regular sound change. As an example of this type of change, consider English birth and its ancestral form Proto-Germanic *ga-burdi "birth": while the meaning of the word did not change (or at least only slightly), the English word apparently lost the prefix ga-. This prefix is still present in the German Geburt "birth", but it was lost without leaving a trace in English.

The loss of prefixes is not the only way in which words can change during language evolution. We also find that prefixes or suffixes are added, as, for example, in French soleil "sun", which goes back to Latin soliculus "small sun, sunny" which is itself a derivation of Latin sol "sun". The semantic dimension is illustrated by changes like the one from Proto-Germanic *sælig "happy" to English silly.

The stratic dimension refers to changes involving the exchange of words between languages, that is, processes of borrowing, in which a word is transferred from one stratum of a language to another. An example of this type of change is English mountain, which was borrowed from Old French montaigne "mountain".

Note that these three dimensions of lexical change correspond directly to the three major aspects constituting a linguistic sign (or a word) that I mentioned above: The morphological dimension changes the form of a word, the semantic dimension changes its meaning, and the stratic dimension its language. Thus, the three dimensions of lexical change, as proposed by Gévaudan (2007), find their direct reflection in the major dimensions according to which words can vary.

During language evolution, lexical change processes interact in all three dimensions, and yield complex patterns which may be very hard to uncover for historical linguists. As an example of this complexity, consider the development of Proto-Indo-European *bʰreu̯Hg̑- "to use", as depicted in the graphic below, which was originally designed by Hans Geisler (Heinrich-Heine University, Düsseldorf), who kindly allowed me to reproduce it here. In the graphic, changes in the stratic dimension are illustrated with the help of dotted arcs (the legend labels this as "borrowed from"), and changes in the morphological dimension are indicated by double arcs (labelled as "derived from"). The semantic dimension is not specifically labelled as such, but one can easily detect it by comparing the meanings of the words.

Modeling Lexical Change

If we look at different historical relations from the perspective of the three dimensions of lexical change, it becomes obvious that the terminology we use in linguistics is rather fuzzy. I mentioned this in an earlier post, where I pointed to the different shades of cognacy, which were never really settled in a satisfying way in historical linguistics. If we look at this again from the perspective of the three dimensions, it is much easier to become clear about the origin of these different historical relations between words.

If we investigate the different uses of the term "cognacy", for example, it becomes obvious that the differences result from controlling for one or more of the three dimensions of lexical change. The traditional Indo-Europeanist notion of cognacy, for example, controls the stratic dimension by requiring stratic continuity (no borrowing), but at the same time it is indifferent regarding the other two dimensions. Cognacy à la Swadesh (especially Swadesh 1955), as we know it from the popular computational approaches which model lexical change as a process of cognate loss and gain, is indifferent regarding morphological continuity, but controls the semantic and the stratic dimensions by only considering words that have the same meaning and have not been borrowed (at least in theory).

In the table below, I have attempted to illustrate how the different terms, including the biological terms of homology, orthology, paralogy, and xenology, cover processes by each controlling for one or more of the three dimensions of lexical change (with "+" indicating that continuity is required, "-" indicating that change is required, and "+/-" indicating indifference). Contrasting the different dimensions of lexical change with the terminology used to refer to different relations between words shows not only the arbitrariness of the traditional linguistic terminology (why do we only cover two out of 3 * 3 * 3 = 27 different possible types? why do we only control by requiring continuity, not change? etc.), but also the fundamental difference between biological and linguistic terminology.
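The count of 27 follows from giving each of the three dimensions one of three settings. A small Python sketch makes this explicit; the two named profiles encode the descriptions of Indo-Europeanist and Swadesh cognacy given above, while the variable names are only illustrative.

```python
from itertools import product

DIMENSIONS = ("morphological", "semantic", "stratic")
# "+" = continuity required, "-" = change required, "+/-" = indifferent
SETTINGS = ("+", "-", "+/-")

# Every possible way of controlling the three dimensions: 3 * 3 * 3 combinations.
profiles = [dict(zip(DIMENSIONS, combo))
            for combo in product(SETTINGS, repeat=len(DIMENSIONS))]

# The two notions of cognacy described in the text.
indo_europeanist = {"morphological": "+/-", "semantic": "+/-", "stratic": "+"}
swadesh = {"morphological": "+/-", "semantic": "+", "stratic": "+"}

print(len(profiles))
for profile in profiles:
    print(profile)
```

Both named profiles occur in the enumerated list, and neither of them uses the "-" setting, which is the point made above about only ever requiring continuity.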

Concluding Remarks

So far, all computational methods that have been proposed for historical linguistics are based on the strict Swadesh type of wordlist encoding, which in the end controls for the semantic and stratic dimensions of lexical change and is indifferent regarding morphology. Such an encoding is per se inconsistent, since there is no reason to assume that morphological change would be less frequent or less indicative of language history than any of the other types.

The reason why linguists tend to control for meaning when creating their datasets is mostly due to problems of sampling: it is much easier to draw a set of words from a couple of languages by starting from a given set of meanings. However, it may be useful to relax this criterion, since the restricted sets of only about 200 meanings on average necessarily hide vivid and interesting processes of lexical change.

The reasons why linguists control for borrowing are only historical, and in many cases such control is also not feasible, since our evidence for borrowing may be limited, especially in cases where the majority of speakers are bilingual (which is more often the rule than the exception in the languages of the world). It seems much more fruitful to revive our network thinking in linguistics and to invest in the development of high quality datasets with a less arbitrary exclusion of certain dimensions of lexical change, and transparent computational methods which do not exclusively stick to the tree model.

  • Gévaudan, P. (2007) Typologie des lexikalischen Wandels [Typology of lexical change]. Tübingen: Stauffenburg.
  • Swadesh, M. (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics. Vol. 21(2), pp. 121-137.
  • Saussure, F. de (1916) Cours de linguistique générale [Course on general linguistics]. Lausanne: Payot.

August 9, 2015


The United States government likes to keep an eye on its populace, as we all know, and they keep track of numbers, as well as people. Sometimes, they release these numbers, so that we can have a look at them.

The National Center for Health Statistics, which is part of the Centers for Disease Control and Prevention, is an organization that regularly releases its data, particularly those compiled in the National Vital Statistics System. One such dataset that might be of interest is that on Marriages and Divorces.

This dataset has two tables (one for marriages and one for divorces), each provided with a convenient breakdown by state. It covers the years 1990, 1995, and 1999-2011 inclusive; and the data are rates, expressed as "per 1,000 total population residing in area."

If we simply average the data for the whole country, the graph looks like the following. Basically, the divorce rate has remained approximately constant, while the marriage rate has decreased during the current century. The actual number of marriages per year, across the country, decreased from 3.1 million in 1990 to 2.1 million in 2009-2011.

We can now look at whether the marriage trend is consistent across all of the states. As usual, we can use a phylogenetic network as a form of exploratory data analysis, to compare all of the states in a single diagram. I first used the Gower similarity to calculate the similarity of the states based on the marriage rates for all of the years. This was followed by a Neighbor-net analysis to display the between-state similarities as a phylogenetic network. So, states that are closely connected in the network are similar to each other based on their marriage rates, and those that are further apart are progressively more different from each other.
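For purely numerical data such as these rates, the Gower similarity of two states is the average, over the years, of one minus the absolute difference divided by the range of that year's values. A minimal Python sketch follows; the state names and rates are invented for illustration, and converting similarity to distance as one minus similarity is an assumption, since the post does not say how the network input was prepared.

```python
def column_ranges(rows):
    """Range (max - min) of each variable, here each year's marriage rate."""
    return [max(col) - min(col) for col in zip(*rows)]


def gower_similarity(a, b, ranges):
    """Gower similarity for numerical variables: mean of 1 - |x - y| / range."""
    parts = [1.0 - abs(x - y) / r for x, y, r in zip(a, b, ranges) if r > 0]
    # If no variable varies at all, the two items are treated as identical.
    return sum(parts) / len(parts) if parts else 1.0


# Hypothetical rates per 1,000 population for two years.
rates = {
    "state_A": [10.0, 8.0],
    "state_B": [6.0, 6.0],
    "state_C": [8.0, 4.0],
}
ranges = column_ranges(list(rates.values()))

similarity = {
    (s, t): gower_similarity(rates[s], rates[t], ranges)
    for s in rates for t in rates
}
# Assumed conversion to a distance for use in a network program.
distance = {pair: 1.0 - value for pair, value in similarity.items()}

print(similarity[("state_A", "state_B")], similarity[("state_B", "state_C")])
```

Dividing by the range plays the same role as normalizing the data first: every year contributes on the same 0 to 1 scale.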

The states are neatly arranged in the network in decreasing order of marriage rate from top to bottom-left. I have labeled only those states with the highest rates.

The result for Nevada surprises no-one who has seen the honeymoon behavior of Americans — the high rate refers to those visitors getting married in Las Vegas, the self-proclaimed "Entertainment Capital of the World". The claim itself may be doubtful (Paris, for example, gets more tourists per year), but the large number of non-residents getting married in Las Vegas is not in doubt. Similarly, Hawaii is a well-known holiday destination for honeymooners, some of whom don't get married until they get there; so this rate does not reflect the behavior of the locals alone.

However, for the other labeled states the rate does seem to reflect the behavior of the residents. It is an interesting mix of states from around the country, although several of the states are from the South, while others have a large Mormon population.

Finally, we can look at whether the decline in marriage rate is repeated across the states. I have plotted the data only for the five states with the highest rates. Note that the vertical axis is on a logarithmic scale.

You will note the steep reduction in the number of people traveling to Nevada to get married, but not so for Hawaii, which has actually increased somewhat. The other states reflect the fact that there has been a general decline in marriage rate throughout the USA since the turn of the century.

August 5, 2015

One of the most fundamental computational problems related to phylogenetic networks is the following Tree Containment problem. Given a phylogenetic network and a phylogenetic tree, does the network display the tree? (Basically meaning that the tree can be obtained from the network by deleting nodes and branches.)
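To make the problem statement concrete, here is a brute-force sketch in Python. It tries every way of keeping exactly one incoming branch per reticulation and compares the clusters of the resulting tree with those of the given tree. This takes exponential time in the number of reticulations, so it is only an illustration of the definition and not one of the algorithms from the papers discussed here; the function names and the tiny example network are hypothetical.

```python
from itertools import product


def to_children(edges):
    """Build a parent -> list-of-children map from (parent, child) pairs."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    return children


def clusters(children, root, leaves):
    """Set of non-empty leaf sets found below the nodes of a rooted tree."""
    found = set()

    def visit(node):
        if node in leaves:
            cluster = frozenset([node])
        else:
            cluster = frozenset().union(
                *(visit(c) for c in children.get(node, [])))
        if cluster:  # dead ends without labelled leaves are ignored
            found.add(cluster)
        return cluster

    visit(root)
    return found


def displays(network_edges, root, leaves, tree_edges, tree_root):
    """True if some choice of one parent per reticulation yields the tree."""
    parents = {}
    for u, v in network_edges:
        parents.setdefault(v, []).append(u)
    reticulations = [v for v, ps in parents.items() if len(ps) > 1]
    target = clusters(to_children(tree_edges), tree_root, leaves)
    for choice in product(*(parents[r] for r in reticulations)):
        kept = dict(zip(reticulations, choice))
        edges = [(u, v) for u, v in network_edges if kept.get(v, u) == u]
        if clusters(to_children(edges), root, leaves) == target:
            return True
    return False


# Hypothetical network: leaf b hangs below reticulation h with parents x and y.
network = [("root", "x"), ("root", "y"), ("x", "a"), ("x", "h"),
           ("y", "h"), ("y", "c"), ("h", "b")]
leaves = {"a", "b", "c"}
tree_ab = [("r", "p"), ("r", "c"), ("p", "a"), ("p", "b")]  # ((a,b),c)
tree_ac = [("r", "p"), ("r", "b"), ("p", "a"), ("p", "c")]  # ((a,c),b)

print(displays(network, "root", leaves, tree_ab, "r"))
print(displays(network, "root", leaves, tree_ac, "r"))
```

Comparing cluster sets works because a rooted leaf-labelled tree is determined by its clusters, so nodes with a single remaining child do not need to be suppressed explicitly.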

This problem was shown to be NP-hard in this paper in 2008. So, not only is it difficult to reconstruct phylogenetic networks, it is even difficult to check if a given network is consistent with certain gene trees or the estimated species tree.

In this paper in 2010, Charles Semple, Mike Steel and I studied for which classes of networks this problem remains hard and for which ones it becomes easy. In particular, we showed that the problem becomes polynomial-time solvable on so-called binary tree-child networks.

However, we were not able to extend our algorithm to a more general class of networks called reticulation visible networks, which were later called stable networks by others. A network is reticulation visible if, for each reticulation r, there exists a leaf x such that, if one would delete r, there would be no more directed path from the root to x. The idea behind this class of networks is that the leaf x gives us some information about the reticulation r. And how can we possibly expect to reconstruct reticulations if we don't have any information about them? Moreover, the class of reticulation visible networks seems to be much larger than the class of tree-child networks.
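The definition of reticulation visibility translates almost directly into code. The following Python sketch checks it by deleting each reticulation in turn and testing which leaves remain reachable from the root; the function names and both example networks are hypothetical and serve only to illustrate the definition.

```python
def reachable(edges, root, removed):
    """Nodes reachable from the root after deleting one node and its branches."""
    children = {}
    for u, v in edges:
        if u != removed and v != removed:
            children.setdefault(u, []).append(v)
    seen, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def is_reticulation_visible(edges, root, leaves):
    """True if every reticulation cuts off at least one leaf when deleted."""
    indegree = {}
    for _, v in edges:
        indegree[v] = indegree.get(v, 0) + 1
    reticulations = [v for v, d in indegree.items() if d > 1]
    return all(
        any(leaf not in reachable(edges, root, r) for leaf in leaves)
        for r in reticulations
    )


# Leaf b is only reachable through reticulation h, so h is visible.
visible = [("root", "x"), ("root", "y"), ("x", "a"), ("x", "h"),
           ("y", "h"), ("y", "c"), ("h", "b")]

# Here h1 feeds only into a second reticulation h2, which has another
# parent w, so deleting h1 cuts off no leaf at all.
hidden = [("root", "u"), ("root", "v"), ("u", "a"), ("u", "h1"),
          ("v", "h1"), ("v", "w"), ("w", "c"), ("w", "h2"),
          ("h1", "h2"), ("h2", "b")]

print(is_reticulation_visible(visible, "root", {"a", "b", "c"}))
print(is_reticulation_visible(hidden, "root", {"a", "b", "c"}))
```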

We advertised this open problem as Problem 4 in a list of seven important open computational problems related to phylogenetic networks in this blog post. Recently, there has been quite some interest in the problem, and two papers have presented algorithms for restricted subclasses. A solution for the whole class of binary stable networks has now been proposed in:

Andreas D.M. Gunawan, Bhaskar DasGupta, Louxin Zhang. Stability Implies Computational Tractability: Locating a Tree in a Stable Network is Easy. arXiv:1507.02119 [q-bio.PE]

The paper has not been published yet, but the proof seems correct to me, and is very clever and elegant. Hence, the first of the seven "phylogenetic network millennium problems" has been solved!

Below you see Louxin Zhang presenting the algorithm at the Phylogenetic Networks workshop in Singapore.

August 3, 2015

The distinction between networks and augmented trees is interesting from a biological, computational and mathematical point of view. An augmented tree is the result of adding cross-connecting branches to a tree, turning it into a network. So each augmented tree is a network (called a tree-based network). But is each network an augmented tree? In a previous blog post we showed that this is not the case. There exist networks that are inherently network-like and cannot be obtained by adding branches to a tree. (Here, we are allowed to create new nodes by subdividing branches of the tree, but we are not allowed to subdivide any of the previously added branches.)

The biological question here is as follows: is evolution a tree-like process augmented with horizontal events, or is evolution inherently network-like?

This concept is also relevant to phylogenetic network reconstruction approaches, because several such methods work by adding edges to an estimated species tree. Therefore, there exist networks that will always be missed by such methods.

Interestingly, it has turned out that it is easy to find out if a given network is tree-based or not. A polynomial-time algorithm was presented recently by Francis and Steel:

Andrew Francis and Mike Steel, Which Phylogenetic Networks are Merely Trees with Additional Arcs? Systematic Biology (2015).

They solve the problem by reducing it to a model called 2-SAT, which is interesting because it automatically leads to a very simple and fast algorithm solving the problem.
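For intuition about what is being decided, here is a brute-force Python sketch, which is not the 2-SAT algorithm of the paper. It assumes the following reading of tree-basedness for binary networks: we must be able to remove exactly one incoming branch from every reticulation (these are the "added" branches) without leaving any internal node with no outgoing branch. The sketch simply tries all such choices, so it takes exponential time; the function name and both example networks are hypothetical.

```python
from itertools import product


def is_tree_based(edges):
    """Brute force: can one incoming branch per reticulation be kept so
    that no internal node loses all of its outgoing branches?"""
    parents, out_degree = {}, {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
        out_degree[u] = out_degree.get(u, 0) + 1
    reticulations = [v for v, ps in parents.items() if len(ps) > 1]
    for kept in product(*(parents[r] for r in reticulations)):
        remaining = dict(out_degree)
        for r, keep in zip(reticulations, kept):
            for p in parents[r]:
                if p != keep:
                    remaining[p] -= 1  # the branch p -> r is an added branch
        if all(count > 0 for count in remaining.values()):
            return True
    return False


# One reticulation h: removing either incoming branch leaves a tree.
simple = [("root", "x"), ("root", "y"), ("x", "a"), ("x", "h"),
          ("y", "h"), ("y", "c"), ("h", "b")]

# Reticulations r0 and s each have a reticulation as their only child, so
# r0 -> r1 and s -> r2 must be kept; then t loses both outgoing branches.
tangled = [("root", "A"), ("root", "B"), ("A", "C"), ("A", "D"),
           ("B", "E"), ("B", "F"), ("C", "r0"), ("C", "a"),
           ("E", "r0"), ("E", "t"), ("D", "s"), ("D", "d"),
           ("F", "s"), ("F", "f"), ("t", "r1"), ("t", "r2"),
           ("r0", "r1"), ("s", "r2"), ("r1", "x"), ("r2", "y")]

print(is_tree_based(simple))
print(is_tree_based(tangled))
```

Each choice here is a yes/no decision per reticulation, and each constraint involves at most two of those decisions, which gives some feeling for why a formulation with two-variable clauses is natural.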

An interesting question that remains open is the following. Given a network and a tree, can we decide in polynomial time if the network can be obtained by adding edges to the given tree? Another question is whether there exists a clean graph-theoretic characterisation of tree-based networks.

Below you see Mike Steel presenting their recent paper at the Phylogenetic Networks Workshop in Singapore. He also discussed other recent results, concerning folding and unfolding phylogenetic trees and networks, as well as distance-based methods for detecting reticulation.

July 31, 2015

Today was a mixed bag of talks.

Louxin Zhang started with a couple of proofs about what he called "stable" networks; and Stefan Grünewald developed his thoughts on quartet algorithms for splits graphs. At the other extreme, Nadine Ziemert talked entirely biology, introducing the audience to the problem of trying to study the evolution of secondary metabolites. In between, Eric Tannier tried to use horizontal gene transfer to date the nodes of networks, assuming that HGT requires a temporally consistent network. Francois-Joseph Lapointe produced the only really statistical talk of the week, trying to produce p-values for patterns on sequence similarity networks.

Daniel Huson popped in for the last day, and presented us with some ideas for the future development of both SplitsTree (unrooted networks) and Dendroscope (rooted networks). Apparently, the need is for SplitsTree to handle larger sets of trees, while for Dendroscope it is to produce networks from pairs of input trees. He also noted that there are still more networks being produced using median joining rather than neighbor-net, due to the amount of work being done on human mitochondrial sequences.

An interest was expressed in continuing the series of meetings on phylogenetic networks (Leiden 2012, Leiden 2014) — I first met most of the people working on networks in phylogenetics in Uppsala in 2004 (Phylogenetic Combinatorics and Applications).

Today we also celebrated Dan Gusfield's 2^6 birthday, with a strawberry cream cake.

So, all in all, a very successful meeting.

After the sessions finished, I went down to the Gardens By The Bay to look at the Supertree Grove. As you can see, a "super" tree is by any definition actually a network.

July 30, 2015


There was more heavy maths today.

Charles Semple started by counting trees within specified types of network. In the process, he provided the first mathematical proof of the week (he actually provided two). He also raised the issue of what, exactly, is a phylogenetic network — we have had many mathematical restrictions placed on networks this week, and it is not always clear how any of them might relate to biological concepts.

Leo van Iersel tried constructing super-networks from incomplete sub-networks, sticking to algorithms rather than proofs. Yufeng Wu and Zhi-Zhong Chen later tried the same strategy for their networks, as did Lusheng Wang for pedigree comparison (he was the only person other than myself to even mention pedigrees).

Mike Steel considered under what circumstances a network can be viewed as a "tree with reticulations" rather than a non-tree network (ie. not every vertex is part of the same underlying tree); this led him to the interesting observation that whether a dataset can be represented by a tree can depend on the taxon sampling. He also looked at when a set of non-tree distances can appear to be tree-like, which is the sort of question that only a mathematician would ask.

Most of the audience interjections this week have come from Sagi Snir, and the rest of the speakers got to return the favor this afternoon, when he spoke about trying to reconstruct trees subject to large amounts of horizontal gene transfer. In the process, he also tried to "sketch" a mathematical proof, which turned into a full-sized painting, before moving on to his algorithm.

July 29, 2015


It was hard going this morning for the biologists, as there were three main computational talks. First, Vince Moulton further developed some of his ideas about split networks, including median networks, quasi-median networks and neighbor-nets, and what sorts of trees they might contain. Then, Céline Scornavacca expanded on her ideas for calculating the "hybrid number". Finally, Jens Lagergren outlined his work on fitting gene trees to known species trees and networks; this has come a long way in recent years.

We had the afternoon off, although many people took the opportunity to pretend that they were still in their offices at home. Myself, I sat by the pool waiting for the temperature to cool (this was the hottest day so far this week), and then went to the Singapore Botanic Gardens, where I circumnavigated the Evolution Garden, the National Orchid Garden, and the Rainforest Walk. I then briefly perused Orchard Road (one of the most ridiculous shopping meccas you will ever see) and the Raffles Hotel (an even more ridiculous hang-over from British imperialism), before returning to the pool side. It's a tough life.

July 28, 2015

The computational people were very patient today, as the three major talks focussed on biology, with only the shorter talks being computational.

In particular, Eric Bapteste and James McInerney were determined to tackle the true complexity of phylogenetics, rather than trying to see genealogical history as being a tree with reticulations. They have recently been championing sequence similarity networks (SSNs) as tools for exploring phylogenetic history, and Eric discussed them in relation to prokaryote evolution while James looked at gene families. Strictly speaking, SSNs are not phylogenetic networks, because they do not involve the inference of unobserved nodes connected to observed (labelled) nodes by inferred edges, but instead connect observed (labelled) nodes via observed edges. This does not mean that they have no role to play in phylogenetics, as the speakers made amply clear.

My own talk had little to do directly with empirical networks, but instead tried to look at an overview of the field, presenting some of my own ideas about where networks are heading, and what role they might play as phylogenetic tools. Not everyone was convinced.

Also of personal interest to me, Philippe Gambette unveiled the new, much more ambitious, version of the Who is Who in Phylogenetic Networks database. I claim no other role in this than encouraging Philippe to be as ambitious as possible. I think that people will be genuinely impressed by what can now be done to explore the people, literature and software associated with phylogenetic networks.

PDF copies of the talks have now started to appear on the workshop web page, which will give you a bit more idea of what our speakers have tried to say.

We also started the discussion about how to engender more effective development of computational tools for phylogenetic networks. Topics covered included the need for more gold-standard datasets that can be used to test new methods — to date, the ones available on this blog have been compiled by me alone, but in order to expand this other people will need to contribute. Also, improved communication and collaboration between biologists and computationalists would be very helpful, and several suggestions were canvassed, but no real way forward was found. One interesting point was made that many of the practical applications of networks were not likely to attract the professional interest of most computationalists — indeed, to date, most phylogenetics programs have been written by biologists rather than computational people.

In the evening, I had a very enjoyable dinner at the Singapore Seafood Republic, in the company of Louxin Zhang (our organizer) and some very nice Chinese visitors, all of whom politely failed to comment on my inability to use chopsticks. Pictures of dinner were taken, and may appear on Facebook, if I am not careful. We finished just in time to watch the Sentosa Crane Dance, which you should all check out.

July 27, 2015


I'm not sure how these reports are going to go, as I did not bring a laptop with me. Also, Blogger is not happy with me logging in from another country. However, I have managed to get a decent sized keyboard on the screen of my iPad Mini, so I can at least type somewhat normally. I will not, however, write about every talk (and my apologies to those speakers who do not get mentioned).

Singapore is as expected: hot and humid, except when one is indoors, where even 24 °C feels surprisingly cold. I have washed most of my shirts once already, to remove the perspiration.

Most people seem to have arrived; indeed, many have already been here for a few days. Myself, I spent Sunday afternoon touring Chinatown, and Little India. The food market at the latter location was unbelievably hot, although the locals did not seem to realize this.

We have now dealt with the first day of talks. No mercy was shown to the uninitiated, and we started with the heavy network stuff right from the start.

This took the form of Dan Gusfield explaining to us in no uncertain terms that Integer Linear Programming can be used to solve many computational problems that are too hard for Dynamic Programming, using Ancestral Recombination Graphs as his example. When asked about possible connections to actual biology, he patiently explained that this was another matter entirely. Kathi Huber later said the same thing when asked about the loss of biological information resulting from unrooting a rooted network. At the time, she was trying to "bridge the gap" between rooted and unrooted networks, and unrooting them is a surprisingly effective way to achieve this.

Luay Nakhleh's talk was my favorite of the day. He is one of the few people in this business who can successfully talk computations to a mathematician and biology to a biologist — most of the rest of us fail at one or the other (or both). Sadly, he pointed out that under the coalescent model any gene tree fits inside any species tree (or network), simply by having the gene coalescences occur after the species root is reached. He also noted that we can't distinguish among reticulation processes on a network, which took away one third of my talk!

We finished with Jesper Jansson decomposing networks into triangles, which is a neat change from the usual decomposition into triplets, clusters or trees. Along the way, he concluded that we need to keep using a lot of different measures for network to network distances, because none of the current ones are good under all conditions. That is another major difference between trees and networks.

July 21, 2015


Next week there will be a gathering in Singapore, for a Phylogenetic Network Workshop. This is being hosted by the Institute for Mathematical Sciences, at the National University of Singapore.

The workshop has been organized under the guidance of Louxin Zhang. The program and abstracts can be found here. It runs for the whole week, 27 – 31 July 2015.

The workshop is actually the final part of a much larger, 2-month programme, called Networks in Biological Sciences (1 June – 31 July 2015). This programme is focused on mathematics for network models in biology, including complex networks and systems biology. Network modeling is extremely challenging, and so it offers outstanding opportunities for mathematicians and statisticians. The phylogenetics workshop will focus on the mathematics needed to develop fast and robust computer programs for inferring evolutionary network models from biological sequence data.

The participants are principally from the computational sciences, of course, including many who have attended the previous network workshops in Leiden, in the Netherlands, in October 2012 and July 2014. There are, however, a few biologists to round out the field, including myself.

Singapore is hot and humid for most of the year, and July is no exception. So, I am expecting the unacclimatized participants to spend most of their time indoors, avoiding the daily thunderstorms.

I am hoping to add some blog posts based on what happens at the workshop, as it proceeds.

July 19, 2015


The following diagrams are taken from the book A History of Architecture on the Comparative Method for the Student, Craftsman, and Amateur. This book is considered to be "a canonical text that has played a formative role in the education of generations of architects" because it really does "cram everything into a single volume". The first edition of the book appeared in 1896, with the 20th edition appearing in 1996.

The first picture is from the 5th edition (1905), and the second one is from the 16th edition (1954).

As noted in the first figure, these trees purport to show the "evolution" of the various architectural styles. However, they do no such thing.

At the base of the tree trunk is a set of individual architectural styles that apparently led nowhere, while at the crown of the tree several styles are repeated. Each of the latter styles exists on two side-branches from the main trunk, each pair connected by vertical tendrils. So, this is a network, at least. However, the meaning of this network is not immediately obvious. Indeed, even a short perusal of the diagram should lead you to the idea that the meaning is contained more in cultural bias than in the actual history of architecture.

The history of the book itself is somewhat complex. The first edition was written by the father and son team of Banister Fletcher & Banister F. Fletcher. Subsequent editions were revised by Banister F. Fletcher (the son), with the 6th edition (1921) being rewritten by Fletcher and his first wife (who got no credit, even though the father's name was then dropped). After Fletcher's death in 1953, the 17th edition (1961) was revised by R.A. Cordingley, the 18th (1975) by James Palmes, the 19th (1987) by John Musgrove, and the 20th (1996) by Dan Cruickshank. The tone and arrangement of the book were changed with each edition.

The tree has been analyzed in detail by Gülsüm Baydar Nalbantoglu (1998. Toward postcolonial openings: rereading Sir Banister Fletcher's "History of Architecture". Assemblage 35: 6-17). She notes the following:
Until the fourth edition of 1901, A History of Architecture had been a relatively modest survey of European styles. The fourth edition, however, appeared with an important difference: this time the book was divided into two sections, "The Historical Styles", which covered all the material from earlier editions, and "The Non-Historical Styles", which included Indian, Chinese, Japanese, Central American, and Saracenic architecture. The "Tree of Architecture" has a very solid upright trunk that is inscribed with the names of European styles and that branches out to hold various cultural / geographical locations. The nonhistorical styles, which unlike others remain undated, are supported by the "Western" trunk of the tree with no room to grow beyond the seventh-century mark. European architecture is the visible support for nonhistorical styles. Nonhistorical styles, grouped together, are decorative additions, they supplement the proper history of architecture that is based on the logic of construction. In the posthumously published seventeenth edition of 1961, the two parts were renamed "Ancient Architecture and the Western Succession" and "Architecture in the East", respectively. The nineteenth edition of 1987, on the other hand, consisted of seven parts based on chronology and geographical location. Cultures outside of Europe included "The Architecture of the Pre-Colonial Cultures outside Europe" and "The Architecture of the Colonial and Post-Colonial Periods outside Europe".
That is, "architecture" for the Banisters was defined as being about a building's construction, not its decoration. European cultures focused on construction, and they developed their styles through time. Other cultures focused on decoration, and were therefore not a proper part of architecture, and had no historical development. This is what the tree attempts to show.

This cultural bigotry was corrected in the final few editions of the book (after the Fletchers were no longer involved), where all architectural styles were considered more-or-less equal.

July 14, 2015


Some years ago I came across this paper in the arXiv:
David Chavalarias and Jean-Philippe Cointet (2010) The reconstruction of science phylogeny. arXiv:0904.3154v3
I was intrigued by what they could possibly mean by "science phylogeny". The abstract contains this information:
We are facing a real challenge when coping with the continuous acceleration of scientific production and the increasingly changing nature of science. In this article, we extend the classical framework of co-word analysis to the study of scientific landscape evolution. Capitalizing on formerly introduced science mapping methods with overlapping clustering, we propose methods to reconstruct phylogenetic networks from successive science maps, and give insight into the various dynamics of scientific domains ... These results suggest that there exist regular patterns in the “life cycle” of scientific fields. The reconstruction of science phylogeny should improve our global understanding of science evolution and pave the way toward the development of innovative tools for our daily interactions with its productions. Over the long run, these methods should lead quantitative epistemology up to the point to corroborate or falsify theoretical models of science evolution based on large-scale phylogeny reconstruction from databases of scientific literature.
The only actual description of phylogenetic methods is this:
The core question is: How can we reconstruct science dynamics through automated bottom-up analysis of scientific publications? ... The reconstruction of these inheritance patterns will be very useful to get a global overview of the activity and evolution of large scientific domains. Moreover, contrary to what is often encountered in biology, we should expect some hybridization events between fields of research, which requires switching from phylogenetic trees to phylogenetic networks. Reconstructing the phylogenetic network of science consists in answering this simple question: given a scientific field CT' at period T' and a period T prior to T', from which fields at T does CT' derives its conceptual legacy? To achieve inter-temporal matching between fields, we have to find for each field at T the field or union of fields from which it inherits.
When the authors formally published their work, the literature had changed, and the reference to phylogenetic networks had been replaced:
David Chavalarias, Jean-Philippe Cointet (2013) Phylomemetic patterns in science evolution — the rise and fall of scientific fields. PLOS One 8: e54847.
The abstract contains this information:
We introduce an automated method for the bottom-up reconstruction of the cognitive evolution of science, based on big-data issued from digital libraries, and modeled as lineage relationships between scientific fields. We refer to these dynamic structures as phylomemetic networks or phylomemies, by analogy with biological evolution; and we show that they exhibit strong regularities, with clearly identifiable phylomemetic patterns.
The explanation of phylomemetics is this:
[The] evolution of science, featuring innovations, cross-fertilization and selection, is suggestive of an analogy with the evolution of living organisms. We propose an adaptation of the concept of the phylogenetic tree, and combine it with the Richard Dawkins intuition of meme, to refer to phylomemetic networks (or phylomemy), which describes the complex dynamic structure of transformation of relations between terms. The concept of "phylomemetic network" is used by analogy to biological phylogenetic trees, which account for evolutionary relationships between genes. We do not make any assumption concerning the type of dynamics underlying the evolution and diffusion of terms. As such, contrarily to previous works in line with the memetics theory [9], which have already coined the term, we do not claim that cultural entities (memes) evolve following the same laws of selection as biological replicators (genes) do.
The term "phylomemetics" was coined by:
Christopher J. Howe and Heather F. Windram (2011) Phylomemetics — evolutionary analysis beyond the gene. PLoS Biology 9: e1001069.
However, you should note that Chavalarias & Cointet explicitly distance themselves from Howe & Windram's claim that cultural entities (memes) evolve following the same laws of selection as biological replicators (genes) do. They also insist upon a network representation rather than Howe & Windram's use of a tree.

The resulting networks are rather odd looking things, with multiple roots occurring at different times. There is one network for each of the selected fields of science (defined by their use of specific terminology). This is the one for the term "Gap junctions":

July 12, 2015


When I was young, my siblings and I used to go to the movies regularly with my father. One of the movies I remember well, even after more than 40 years, was Jacques Tati's final cinema release, Trafic.

Tati made only five cinema features, plus several short movies, and one final made-for-TV movie (made in Sweden in 1974):
  • Jour de Fête (1949)
  • Les Vacances de Monsieur Hulot (1953)
  • Mon Oncle (1958)
  • Playtime (1967)
  • Trafic (1971)
All of them were originally in French, and were released internationally with subtitles. However, Tati came from the world of mime, and so his movies had very little dialogue anyway, relying instead on the "moving picture" aspect of film making.

In spite of his small output, Tati managed to have a large impact on world cinema. His movies won several awards, notably at the Cannes Film Festival and the Venice Film Festival, and Mon Oncle won the Academy Award for Best Foreign-Language Film (and Les Vacances de Monsieur Hulot received a nomination for Best Screenplay). His movies regularly appear in "Top 50" and "Top 100" lists. Many people have acknowledged his influence, and Rowan Atkinson's character Mr Bean is basically an updated English-language version of Tati's character M. Hulot (even to the extent of making Mr Bean's Holiday). There is even a small homage to Mon Oncle near the beginning of The Blues Brothers.

Not unexpectedly, then, Tati's movies continue to attract the attention of the critics. At the aggregator site Rotten Tomatoes, his movies have 100% positive reviews, except for Mon Oncle which unexpectedly has only 92%. There are a total of 89 critics listed as having written reviews about at least one of Tati's movies, although only 28 of them have reviewed more than one of the films. Indeed, only four critics have provided individual reviews of all five movies (and a few others have reviewed them collectively).

The fact that so few of the movie reviewers compare Tati's movies can be used to illustrate the dangers of assessing things in isolation. If we average the reviewers' scores for the movies, then we get this (standardized to a scale of 0-1):
[Table not recoverable: it gave the "No. reviewers" and the "Average score" for each film (Jour de Fête, Les Vacances de Monsieur Hulot, Mon Oncle, Trafic). The only surviving values are 18 and 0.782.]

Clearly, Playtime is the favorite, with Trafic trailing the field (although still with a good score).

On the other hand, this pattern is not quite repeated when we consider only those reviewers who provided scores for more than one movie. That is, we do not see quite the same pattern when we assess the pairwise preferences of those critics who scored at least two of the films.

Sometimes, the overall pattern is repeated. For example, you will note that Les Vacances de Monsieur Hulot scored higher than the other movies except for Playtime. Of the reviewers who also scored Les Vacances, 5 preferred Playtime, 4 scored them as equal, and 1 preferred Les Vacances, so that Playtime is clearly preferred. Similarly, 4 critics preferred Les Vacances to Trafic, 3 scored them as equal, and no-one preferred Trafic.

However, overall Les Vacances de Monsieur Hulot scored higher than Mon Oncle, which also reflects the latter's 92% "fresh" rating noted above, but this pattern is not repeated for the pairwise comparisons. If we look at the 10 reviewers who scored both of these movies, then 4 preferred Les Vacances to Mon Oncle, 3 scored them as equal, and 3 preferred Mon Oncle, thus showing little preference for one film over the other.

So, direct comparisons can be more important than independent assessments.
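
The pairwise tallies described above are simple to reproduce. The following is a minimal Python sketch, not the actual calculation used for this post: the reviewer names and scores are invented for illustration (they are not the Rotten Tomatoes data), and the function name `pairwise_preferences` is hypothetical.

```python
from itertools import combinations

# Hypothetical scores (0-1 scale), keyed by reviewer and then by film.
# These numbers are invented for illustration only.
scores = {
    "critic_a": {"Playtime": 1.0, "Les Vacances": 0.8, "Trafic": 0.8},
    "critic_b": {"Playtime": 0.9, "Les Vacances": 0.9},
    "critic_c": {"Les Vacances": 1.0, "Trafic": 0.75, "Mon Oncle": 1.0},
    "critic_d": {"Mon Oncle": 0.9},  # scored one film only, so never compared
}


def pairwise_preferences(scores):
    """For every pair of films, count the reviewers who preferred the first
    film, scored the two as equal, or preferred the second film. Only
    reviewers who scored both films of a pair contribute to that pair."""
    tallies = {}
    for films in scores.values():
        for a, b in combinations(sorted(films), 2):
            first, equal, second = tallies.get((a, b), (0, 0, 0))
            if films[a] > films[b]:
                first += 1
            elif films[a] < films[b]:
                second += 1
            else:
                equal += 1
            tallies[(a, b)] = (first, equal, second)
    return tallies


for (a, b), (first, equal, second) in sorted(pairwise_preferences(scores).items()):
    print(f"{a} vs {b}: {first} preferred {a}, {equal} equal, {second} preferred {b}")
```

A reviewer who scored only one film contributes to the per-film averages but to none of the pairwise counts, which is one reason why the two summaries can disagree.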

As usual, we can use a phylogenetic network as a form of exploratory data analysis, to compare all five movies in a single diagram. I first used the Gower similarity to calculate the similarity of the five movies based on those 20 reviewers who scored more than one movie. This was followed by a Neighbor-net analysis to display the between-film similarities as a phylogenetic network. So, films that are closely connected in the network are similar to each other based on their scores, and those that are further apart are progressively more different from each other.

Clearly, the movies are not really very different from each other in score, and there is little preference for one over another for these 20 critics. This contrasts with the scores from all 89 critics.
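
For readers who want to try the similarity step themselves, here is a rough Python sketch of the Gower similarity for numeric scores with missing values. It rests on several assumptions: the data are invented, reviewers are treated as the variables and films as the objects, a reviewer who did not score one of the two films is simply skipped for that pair, and a reviewer whose scores do not vary counts as full similarity. The exact settings used for the network above are not stated, and the Neighbor-net step itself is not shown (it would be run on the resulting distances in a program such as SplitsTree).

```python
def gower_similarity(x, y, ranges):
    """Gower similarity between two objects described by numeric variables.

    x, y   : lists of scores, with None marking a missing value
    ranges : range (max - min) of each variable over all objects
    Variables missing in either object are skipped (weight 0).
    Returns None when the two objects share no variable.
    """
    total, used = 0.0, 0
    for xi, yi, r in zip(x, y, ranges):
        if xi is None or yi is None:
            continue
        # A variable with zero range cannot separate objects: similarity 1.
        total += 1.0 if r == 0 else 1.0 - abs(xi - yi) / r
        used += 1
    return total / used if used else None


# Hypothetical data: rows are films, columns are reviewers (0-1 scale).
films = {
    "Playtime":     [1.0, 0.9, None],
    "Les Vacances": [0.8, 0.9, 1.0],
    "Trafic":       [0.8, None, 0.75],
}

# Range of each reviewer's scores across the films that reviewer scored.
ranges = []
for column in zip(*films.values()):
    present = [v for v in column if v is not None]
    ranges.append(max(present) - min(present))

names = list(films)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        s = gower_similarity(films[a], films[b], ranges)
        print(f"{a} / {b}: similarity {s:.3f}, distance {1 - s:.3f}")
```

The distances (one minus the similarity) are what a distance-based network method would take as input.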

The "audience score" at Rotten Tomatoes differs somewhat from the critics' scores. They score Playtime (90%) and Mon Oncle (89%) at the top, followed by Les Vacances de M. Hulot (86%) and Jour de Fête (85%), and finally Trafic (77%). In spite of this, I still have a soft spot for Trafic, although Mon Oncle is my personal favourite.

July 7, 2015


Genotypes or phenotypes?
In a blogpost from 2013, David investigated some of the popular analogies between anthropology (including linguistics) and biology. He rejected those analogies that compare the genotype with anthropological entities (like the common "words = genes" analogy). Instead, he proposed to draw the analogy between anthropological entities and the phenotype. I generally agree that we should be very careful about the analogies we draw between different disciplines, and I share the scepticism regarding those naive approaches in which genes are compared with words or sounds are compared with nucleotide bases. I am, however, sceptical whether the alternative analogy between phenotypes and anthropological entities offers a general solution for the study of language evolution.

Productive and unproductive analogies
My scepticism results from a general uncertainty about the transfer of models and methodologies among scientific disciplines. I am deeply convinced that such a transfer is useful and that it can be fruitful, but we seem to lack a proper understanding of how to carry out such a transfer. Apart from this general uncertainty as to how to do it properly, I think that for linguistics the analogy between phenotypes and linguistic entities is too broad to be successfully applied.

Instead of drawing general analogies between biology and linguistics, it would be more useful to carry out a fine-grained analysis of productive analogies between the two disciplines. By productive, I mean that the analogies should lead to an interdisciplinary transfer of models and methods that increases the insights about the entities in the discipline that imports them. If this is not the case for a given analogy, this does not mean that the analogy is wrong or false, but rather that it is simply unproductive, since an analogy is just a similarity between entities from different domains, and what we define as being "similar" crucially depends on our perspective. With enough fantasy, we can draw analogies between all kinds of objects, and we never really know the degree to which we construct rather than detect, as I have tried to illustrate in the graphic below.

Constructed or detected similarities?
Local productive analogies: alignment analyses
A productive analogy does not necessarily have to be global, offering a full-fledged account of shared similarities, as do the analogies that compare languages with organisms (Schleicher 1848) or languages with species (Mufwene 2001), or the analogy between phenotypes and anthropological entities proposed by David. It is likewise possible to find very useful local analogies, which only hold to a certain extent, but offer enough insights to get started.

Consider, for example, the problem of sequence alignment in biology and linguistics. It is clear that both biologists and linguists carry out alignment analyses of some of the entities they are dealing with in their disciplines. We use alignment analyses in biology and linguistics, since both disciplines have to deal with entities that are best modeled as sequences, be it sequences of DNA, RNA, or amino acids in biology, or sequences of sounds in linguistics. In both cases, we are dealing with entities in which a limited number of symbols is linearly ordered, and an alignment analysis is a very intuitive and fruitful way to show which of the symbols in two different sequences correspond.

In this very general point, the analogy between words as sequences of sounds and genes as sequences of nucleic acids holds, and it seems straightforward to think of transferring models and methods between the disciplines (in this case from biology to linguistics, since automatic sequence alignment has a longer tradition in biology).

In the details, however, we will detect differences between biological and linguistic sequences, with the main differences lying in the alphabets (the collections of symbols) from which our sequences are drawn (discussed in more detail in List 2014: 61-75):
  • Biological alphabets are universal, that is, they are basically the same for all living creatures, while the alphabets of languages are specific for each and every language or dialect.
  • Biological alphabets are limited and small regarding the number of symbols, while linguistic alphabets are widely varying and can be very large in size.
  • Biological alphabets are stable over time, with sequences changing by the replacement of symbols with other symbols drawn from the same pool of symbols, while linguistic alphabets are mutable: not only can they acquire new sounds or lose existing ones, but also the sounds themselves can change.

How similar are words and genes in the end?
What are the consequences of these differences in the word-gene analogy? Can we still profit from the long tradition of automatic alignment methods when dealing with phonetic alignment (the alignment of sound sequences, like words or morphemes) in linguistics? Yes, we can! But within limits!

Linguists can profit from the general frameworks for sequence alignment developed in biology, but we need to make sure that we adapt them according to our linguistic needs. For alignment methods, this means, for example, that we can use the traditional frameworks of dynamic programming for pairwise alignment, which were developed back in the seventies and early eighties (Needleman and Wunsch 1970, Smith and Waterman 1981). We can also use some of the frameworks for multiple sequence alignment, which were developed a bit later, starting from the end of the eighties, be it progressive (Feng and Doolittle 1987, Thompson et al. 1994, Notredame et al. 1998), iterative (Barton and Sternberg 1987, Edgar 2004), or probabilistic (Do et al. 2005). But we can only import the overall frameworks, not their details.
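
To make the idea of importing the framework but not the details more concrete, here is a minimal Python sketch of global pairwise alignment by dynamic programming, in the style of Needleman and Wunsch (1970). The dynamic programming part is the generic framework; the scoring function is the part that would need linguistic adaptation. The function `toy_sound_score`, its vowel list, the gap penalty and the simplified transcriptions are all hypothetical choices made for this illustration, not a recommended model of sound correspondences.

```python
def needleman_wunsch(seq_a, seq_b, score, gap=-1):
    """Global pairwise alignment by dynamic programming.

    seq_a, seq_b : sequences of symbols (here, lists of sound segments)
    score        : function(symbol, symbol) -> integer score
    gap          : score for aligning a symbol with a gap
    Returns (alignment score, aligned seq_a, aligned seq_b), with "-" for gaps.
    """
    n, m = len(seq_a), len(seq_b)
    # matrix[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    matrix = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        matrix[i][0] = i * gap
    for j in range(1, m + 1):
        matrix[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            matrix[i][j] = max(
                matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                matrix[i - 1][j] + gap,
                matrix[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                matrix[i][j] == matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and matrix[i][j] == matrix[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return matrix[n][m], out_a[::-1], out_b[::-1]


# Toy scoring scheme (an assumption for this example): identical sounds
# score 2, sounds of the same broad class (vowel or consonant) score 0,
# and a vowel matched with a consonant scores -2.
VOWELS = set("aeiouɔə")


def toy_sound_score(a, b):
    if a == b:
        return 2
    if (a in VOWELS) == (b in VOWELS):
        return 0
    return -2


# Simplified transcriptions of English "daughter" and German "Tochter".
total, row_a, row_b = needleman_wunsch(list("dɔtər"), list("tɔxtər"), toy_sound_score)
print(total)
print(" ".join(row_a))
print(" ".join(row_b))
```

Swapping in a different `score` function changes the linguistic assumptions without touching the alignment routine, which is the sense in which the framework can be reused while the details cannot.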

All algorithms for phonetic alignment that are supposed to be applicable to a wide range of data (and not serve as a mere proof of concept that handles but a limited range of test datasets) need to address the specific characteristics of sound sequences. Apart from the differences in alphabet size and the mutable character of sound systems mentioned above, these differences also include the important role that context plays in sound change (List 2014: 26-33), the problem of secondary sequence structures (List 2012a), the problem of metathesis (List 2012: 51f), but also the problem of unalignable parts resulting from cases of partial and oblique homology in language evolution (see my recent blog post on this issue).

Concluding remarks
Drawing analogies between the research objects of different disciplines is not a bad idea, and it can be very inspiring, as multiple cases in the history of science show. When transferring models and methods from one discipline to another, however, we need to make sure that the analogies we use are productive, adding value to our research and understanding. We should never expect that analogies hold in all details. Instead we need to be aware of their specific limits, and we need to be willing to adapt those models and methods we transfer to the needs of the target discipline. Only then can we make sure that the analogies we use are really productive in the end.

  • Barton, G. J. and M. J. E. Sternberg (1987). “A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons”. J. Mol. Biol. 198.2, 327–337.
  • Do, C. B., M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou (2005). “ProbCons. Probabilistic consistency-based multiple sequence alignment”. Genome Res. 15, 330–340.
  • Edgar, R. C. (2004). “MUSCLE. Multiple sequence alignment with high accuracy and high throughput”. Nucleic Acids Res. 32.5, 1792–1797.
  • Feng, D. F. and R. F. Doolittle (1987). “Progressive sequence alignment as a prerequisite to correct phylogenetic trees”. J. Mol. Evol. 25.4, 351–360.
  • List, J.-M. (2014). Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.  
  • List, J.-M. (2012a). "Improving phonetic alignment by handling secondary sequence structures". In: Hinrichs, E. and Jäger, G.: Computational approaches to the study of dialectal and typological variation. Working papers submitted for the workshop organized as part of the ESSLLI 2012. 
  • List, J.-M. (2012b). “Multiple sequence alignment in historical linguistics. A sound class based approach”. In: Proceedings of ConSOLE XIX. “The 19th Conference of the Student Organization of Linguistics in Europe” (Groningen, 01/05–01/08/2011). Ed. by E. Boone, K. Linke, and M. Schulpen, 241–260.
  • Mufwene, S. S. (2001). The ecology of language evolution. Cambridge: Cambridge University Press.
  • Needleman, S. B. and C. D. Wunsch (1970). “A general method applicable to the search for similarities in the amino acid sequence of two proteins”. J. Mol. Biol. 48, 443–453.
  • Notredame, C., L. Holm, and D. G. Higgins (1998). “COFFEE. An objective function for multiple sequence alignment”. Bioinformatics 14.5, 407–422.
  • Schleicher, A. (1848). Zur vergleichenden Sprachengeschichte [On comparative language history]. Bonn: König.
  • Smith, T. F. and M. S. Waterman (1981). “Identification of common molecular subsequences”. J. Mol. Biol. 147.1, 195–197.
  • Thompson, J. D., D. G. Higgins, and T. J. Gibson (1994). “CLUSTAL W. Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”. Nucleic Acids Res. 22.22, 4673–4680.

July 5, 2015


In an earlier blog post, I discussed some of the evocative Metaphors for evolutionary relationships, particularly reticulating ones.

In that post I listed the concept of a "braided river", and mentioned a 1994 paper by John Moore as my earliest source for the image. However, the metaphor actually goes back more than 100 years earlier. It occurs as the central metaphor in this quite remarkable book on comparative religion:
Forlong, J.G.R. (1883) Rivers of Life: or Sources and Streams of the Faiths of Man in All Lands, Showing the Evolution of Faiths from the Rudest Symbolisms to the Latest Spiritual Developments. 2 vols. Bernard Quaritch: London.
James George Roche Forlong was a Scottish engineer serving in the British army that occupied India during the 19th century. He apparently had a life-long interest in comparative religion, and his book arose from his personal experience of non-Christian religions (facilitated by his knowledge of several languages). The book involves a serious re-interpretation of the evolutionary history of world religions, as a series of six inter-connecting rivers running from ancient times into the modern world, each river representing a different type of worship.

The illustrative chart that accompanies the book can be viewed here. A low-resolution copy is shown below.