The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis



January 27, 2015


We don't normally discuss individual papers in this blog (except as example datasets), but today I am simply drawing your attention to what appears to be a little-known paper on phylogenetic networks.

Naruya Saitou has not contributed much to the theory of networks, being instead best known for the development of the neighbor-joining method for phylogenetic trees. (The 20th most cited paper ever; see Massive citations of bioinformatics in biology papers.) However, this recent paper is of interest:
Naruya Saitou, Takashi Kitano (2013) The PNarec method for detection of ancient recombinations through phylogenetic network analysis. Molecular Phylogenetics and Evolution 66: 507-514.

The paper presents a new method for detecting ancient recombinations through phylogenetic network analysis. Recent recombinations are easily detectable using alternative methods, although splits graphs can also be used, but older recombinations are more tricky.

Importantly, I particularly like the opening paragraph of the paper:
The good old days of constructing phylogenetic trees from relatively short sequences are over. Reticulated or "non-tree" structures are omnipresent in genome sequences, and the construction of phylogenetic networks is now the default for describing these complex realities. Recombinations, gene conversions, and gene fusions are biological mechanisms to produce non-tree structures to gene phylogenies, while gene flow is a well known factor for creating reticulations within population phylogenies.

These are heart-warming words from the developer of the most commonly used tree-building method!

January 25, 2015


It might be nice to live in a world where the mere fact that you are male or female does not attract attention to you within your profession. But while we are waiting for that day, you might like to ask yourself about women in systematics. David Archibald suggests that the tree produced by Anna Maria Redfield is "the first tree – creationist or evolutionary – by a woman and may well be the only such tree by a woman until well into the twentieth century."

Anna Maria Redfield (1800-1888, née Treadwell) is described in these terms by Michon Scott's Strange Science web site:
Born at the dawn of the 19th century, Anna Maria Redfield earned the equivalent of a master's degree from the first U.S. institution of higher learning devoted to female students: Ingham University, and became perhaps the first woman to design a tree-like diagram of animal life. Although tree-like, her diagram didn't show common ancestry but instead showed the "embranchements" established by Georges Cuvier: vertebrates, arthropods, mollusks, and "radiata" (today classified as cnidarian and echinoderm phyla). To be fair, this diagram was published before Darwin's Origin of Species but later editions of her work made no mention of evolution either. Instead, she wrote about our simian cousins, "The teeth, bones and muscles of the monkey decisively forbid the conclusion that he could by any ordinary natural process, ever be expanded into a Man." Still, her elegant work is great fun to behold even now.

The tree-like diagram (shown in miniature above) was a wall chart (1.56 x 1.56 m) called A General View of the Animal Kingdom, published in 1857 by E.B. and E.C. Kellogg, New York. It is heavily illustrated with images of the taxa, their names, and brief notes: eg. "Man alone can articulate sounds, and is capable of improving his faculties or advancing his condition". Only three lithograph copies of the original tree are now known, one of which was sold at auction by Christie's in 2005 for £7,200.

The following year the same publishers produced a companion volume to the chart, called Zoölogical Science, or Nature in Living Forms: Adapted to Elucidate the Chart of the Animal Kingdom, and designed for the higher seminaries, common schools, libraries, and the family circle (1858, reprinted 1860, 1865, 1874). A copy is available in the Biodiversity Heritage Library. Only 57 original copies of the book are now known.

This book of 743 pages is richly illustrated, the artist being unacknowledged in the first edition but credited as E.D. Maltbie from then on. (He is presumably responsible for the chart as well.) The book has the frontispiece shown below, which is an edited version of the base of the tree.

Redfield and her chart have recently been discussed by Susan Butts (2011. Conservation of the Anna Maria Redfield wall chart: A General View of the Animal Kingdom. Society for the Preservation of Natural History Collections Newsletter 25(1): 18-19). She notes:
The wall chart is a masterpiece, with intricate and accurate illustrations of representatives of the animal kingdom portrayed as a Tree of Life, which illuminates the relationships of the major groups of organisms. It is an important document in the study of biology and in the pioneering work of women in science. The wall chart has eloquent phrases, which express a Victorian humanistic view of nature (often intermingled with anthropomorphism, biblical overtones, and the biological superiority of humans).

Redfield's views on evolution are clear from her book, indicating that the relationships shown represent affinity not evolution:
There is no evidence whatever that one species has succeeded, or been the result of transmutation of a former species.

Butts notes that unfortunately Redfield "remains a relatively minor and poorly recorded figure in the history of women in science, let alone biological and evolution studies in general."

January 20, 2015


Charles Darwin's metaphor of the Tree of Life was not a tree, even in The Origin of Species. As noted by Franz Hilgendorf (see The dilemma of evolutionary networks and Darwinian trees) "the branches of a tree do not fuse again", and yet in his book Darwin discusses at least one circumstance when they do precisely that — hybridization.

Darwin's discussion of hybridization occupies all of chapter 8 of the Origin. His stated motivation is to address what many people might see as a fatal objection to his theory of species origins by means of natural selection. One of Darwin's main arguments in the book is that "descent with modification" is continuous, and therefore the distinction between species and varieties (and subspecies, etc) is an arbitrary cut in a continuum of biodiversity. However, it was conventionally accepted that varieties within the same species could cross-breed freely, but any attempt to hybridize distinct species would always fail. Darwin opposes this view by citing extensive evidence showing that varying degrees of sterility are encountered in efforts to cross-breed different species of plants (and a few birds) — if the species are closely related then often there will be a small degree of fertility in the hybrid offspring. So, as two related forms diverge from one another in the course of evolution, their ability to inter-breed gradually diminishes and eventually falls to zero (absolute sterility).

It is important to note that his motivation for writing about hybridization was independent of his ideas about phylogeny. So, he seems not to have noticed the consequence of hybridization for phylogenetic patterns.

This is similar to the situation regarding his so-called "tree diagram", in chapter 4. His motivation for the diagram (the only figure in his book) was a discussion of descent with modification, and particularly the continuity of evolutionary processes. He was expressing his idea about uninterrupted historical connections. In particular, this was part of his concern that there is no fundamental distinction between varieties and species, because evolutionary divergence is continuous — it is all a matter of degree, without sharp boundaries. His Tree of Life image expressed the continuity of evolutionary connections, not phylogenetic patterns. This is clear from his poetic invocation of the biblical Tree of Life, which is about the inter-connectedness of all living things along tree branches, not about patterns of biodiversity.

Implicit in this world view is the idea that the Tree of Life is still a tree in spite of hybridization. That is, Darwin failed to see that his "tree simile" (chapter 4) had to ignore hybridization (chapter 8) in order to work. His figure does not show any evidence of hybridization, only divergence. It was not intended to be what we would now call a phylogeny, but merely an idealized view of divergence and continuity of descent. When introducing the Tree of Life, he was using religious imagery to stimulate the imagination of his readers, and in so doing presented a contradictory argument — there is continuity along the branches as well as continuity of inter-connections.

The alternative conception is that Darwin's Tree of Life was never a tree — it was a network. From this world view, Hilgendorf's dilemma was actually irrelevant. He commented:
An observation which, as far as I know, contradicts these previously discussed views, [would be], that formerly separate species approach each other and finally merge with each other. This would not fit the beautiful image that Darwin presented about the connection of species in a branch-rich tree; the branches of a tree do not fuse again.

Well, they do, even in a Darwinian tree.

January 18, 2015


The Tree of Life and the Tree of Knowledge are images that have appeared in many cultures throughout the world. They are often combined as a cosmic or world tree, with the tree of knowledge supporting the heavens and earth and the tree of life connecting all living beings. However, the word "tree" is obviously rather nebulous in these images, and it can take many forms.

In the christian Bible these trees appear in the garden of Eden in a more restricted form as the Tree of Eternal Life and the Tree of Knowledge of Good and Evil. Even here, though, it is not clear whether they are one and the same tree. For example, only one tree is mentioned in the book of Revelation, when promising a new Eden.

The Tree of Knowledge was co-opted in Medieval times as a symbol of learning, and a metaphor for arranging all human knowledge, the Arbor Scientiae (see Relationship trees drawn like real trees). This idea was adopted by biology in the 1700s, where trees were used as metaphors for the relationships among biological species. In modern parlance, these depicted affinity or phenetic relationships, and so they represented knowledge (not life). In the mid 1800s Charles Darwin (in the Origin of Species) took this pre-existing tree idea and instead made it represent evolutionary relationships among species. In the process he re-named it the Tree of Life, thus once again uniting the Tree of Life and the Tree of Knowledge. We have been stuck with the ToL name ever since.

At about the same time as the rise of the Arbor Scientiae, a combined Tree of Life and Tree of Knowledge also appeared as the central mystical symbol of the Kabbalah of esoteric Judaism, consisting of the 10 Sephirot (enumerations). It is shown above in its full modern form. This is a reinterpretation of the Hebrew Bible, conceptually representing a list of the attributes of God (how God emanates).

In the Kabbalist view, both of the trees in the biblical garden of Eden were alternative perspectives of the Sephirot. The 10 Sephirot are arranged into three columns, with 22 Paths of Connection. As a tree, it has roots above and branches below. To quote Wikipedia:
Its diagrammatic representation, arranged in 3 columns/pillars, derives from Christian and esoteric sources and is not known to the earlier Jewish tradition. The tree, visually or conceptually, represents as a series of divine emanations God's creation itself ex nihilo, the nature of revealed divinity, the human soul, and the spiritual path of ascent by man. In this way, Kabbalists developed the symbol into a full model of reality, using the tree to depict a map of Creation.

My main point here is that by combining two conceptual trees this icon is clearly a network, unlike most other conceptual trees such as the dichotomous Tree of Knowledge.

The Kabbalah started without an image, being described solely in words. The diagram of the Tree used by modern Jewish Kabbalists is usually based on the diagram published in the print edition of Rabbi Moses Cordovero's Pardes Rimonim from 1591 [composed 1548], and sometimes called the "Safed Tree". It is shown in the next figure.

One of the earliest illustrations comes from the 1516 Portae Lucis of Paolo Riccio, a Latin translation of Joseph ben Abraham Gikatilla's most influential kabbalistic work, Sha'are Orah (Gates of Light) from the 1300s. It is shown in the next figure.

There are actually two modern versions of the Kabbalah. The one shown here in the first illustration has the crossing diagonals lower down than does the one shown in the second illustration. The one with two diagonals at the bottom is an earlier version that is still favoured by Hermetic Kabbalists. Both made their first public appearance in the Pardes Rimonim.

January 13, 2015


BLAST is a computer program that searches a database for similarity matches to a given query sequence, either DNA or amino acid. It is most commonly used to search the GenBank database for matches to any new sequence that we might happen to have, in the hope that we will find one or more homologous sequences.

To most of us BLAST is a black box, in the sense that we have little idea about the details of how it does what it does. So, maybe we should at least look at what it does, just in case we ever need to know.

About 10 years ago I was working with some EST data. For those of you not old enough to know, ESTs consist of short DNA reads from arbitrary primers. In the hope of identifying the coding gene represented by each EST, BLASTX is used to search the GenBank protein database using each translated nucleotide query (in all six possible reading frames). BLASTX produces an E-value for each matching sequence, representing the strength of the match to the query. An E-value is not a probability (it can vary from 0 to infinity), but at p=0.050 the expected E-value happens to be E=0.051. There is no consensus on what E-value should be taken to indicate a "significant" match.
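The p=0.050 and E=0.051 pairing can be checked with a short calculation. This is only a sketch, assuming the usual BLAST statistical model in which the number of chance hits is Poisson-distributed, so that P = 1 - exp(-E); the function names are my own and are not part of BLAST.

```python
import math


def evalue_to_pvalue(e_value: float) -> float:
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    # 1 - exp(-E), written with expm1 so that very small E-values stay accurate
    return -math.expm1(-e_value)


def pvalue_to_evalue(p_value: float) -> float:
    """Inverse of the relation above: E = -ln(1 - P)."""
    if not 0.0 <= p_value < 1.0:
        raise ValueError("P-value must be in [0, 1)")
    return -math.log1p(-p_value)


if __name__ == "__main__":
    print(round(pvalue_to_evalue(0.050), 3))  # 0.051
```

For small values the two measures are nearly identical, which is why they are so often confused; they part company only as E approaches 1 and beyond.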

I decided to find out what happens if a DNA query sequence varies in either length or GC content. I used both random sequences (which were thus not in GenBank) as well as real sequences (which were in GenBank). The short answer is that the BLASTX results vary a lot. I never published these results because I figured the first thing a referee would do is ask me to explain BLASTX's behaviour, and I did not have an explanation (and still don't).

I present the results here for what they are still worth. Obviously, the results are not restricted to EST data, but apply any time that we use BLASTX.


The content of GenBank is quite different today from what it was back in late 2003, and so maybe the results would vary if the work were to be repeated. For reference, the first graph shows the GC content of the GenBank protein-coding sequences at the time of my work. Also, it is possible that BLASTX is different as well; I used v. 2.2.6 with default parameters (BLOSUM62, edge correction, length correction, SEG filtering, universal genetic code, gap penalty 11+k). Maybe some intrepid soul will be inspired to find out what happens nowadays.

Random sequences

I generated sets of 1,000 replicate "ESTs" using the perl script Randseq by M. Raymer (5/27/2003). These sets varied in DNA length (100–1,000 nt) and in GC content (0–94%), but were otherwise random sequences of nucleotides. These sequences are not expected to be homologous to anything already in GenBank, and should thus form BLASTX matches only by random chance.
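For anyone wanting to repeat the exercise, random queries of this kind are easy to produce. The following is a stand-in sketch only, not the Randseq script; the function names, the seed, and the choice to split the GC fraction equally between G and C (and the remainder equally between A and T) are my assumptions.

```python
import random


def random_dna(length, gc_content, rng=None):
    """Return a random DNA string whose expected GC fraction is gc_content."""
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    rng = rng or random.Random()
    # Assumption: G and C are equally likely, as are A and T
    gc_weight = gc_content / 2.0
    at_weight = (1.0 - gc_content) / 2.0
    weights = [gc_weight, gc_weight, at_weight, at_weight]
    return "".join(rng.choices("GCAT", weights=weights, k=length))


def gc_fraction(seq):
    """Observed proportion of G and C in a sequence of upper-case bases."""
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    rng = random.Random(2003)  # arbitrary seed, so that a run can be repeated
    replicates = [random_dna(450, 0.40, rng) for _ in range(1000)]
    mean_gc = sum(gc_fraction(s) for s in replicates) / len(replicates)
    print(f"mean observed GC fraction: {mean_gc:.3f}")
```

Note that the GC content of any single replicate will scatter around the target value, more so for short sequences, so it is the set as a whole that has the nominated GC content.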

The results for varying the sequence length are shown in the next graph, with each point representing the mean E-value observed. The lines represent four somewhat different GC contents; and the anticipated E-value for random data (0.051) is also shown. Clearly, very few points are near the expected value. The lines all show the same shape, with a minimum E-value near 450 nt, and rising slowly with longer lengths and rising rapidly with shorter lengths.

A more detailed assessment of the results for varying the GC content is shown in the third graph. The lines represent two somewhat different sequence lengths; and the anticipated E-value for random data (0.051) is also shown. It is clear that the E-value is capable of varying by up to seven orders of magnitude in response to variation in the GC content of the sequence.

Real sequences

I used the sequences contained in the Poxvirus Orthologous Clusters database (POCs), which has since been replaced by the Viral Orthologous Clusters database (VOCs). These virus protein sequences are expected to already be in GenBank, and they should thus form good BLASTX matches.

The POCs database could be queried by both sequence length and GC content, and it was the only such database that I could find at the time. For each combination of length (in 50-nt bands) and GC-value (in 10% bands) I gathered a minimum of 20 sequences. There were few sequences for the shortest lengths, so I chopped up the longest sequences (longer than needed) to increase the sample size. There were also few sequences at the greatest GC values, so I used sequence AE004437.1 from GenBank (a Halobacterium sp.) to increase the sample size.
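The banding can be sketched as follows, for sequences that are already in hand rather than queried from a database. This is an illustration only: the exact band edges are an assumption (each band is labelled by its lower bound, and 100% GC is folded into the top band), and all of the names are hypothetical.

```python
from collections import defaultdict


def band_key(seq, length_step=50, gc_step=10):
    """Return (length band, GC band) for an upper-case DNA sequence."""
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
    length_band = (len(seq) // length_step) * length_step
    # Assumption: a sequence with 100% GC belongs to the highest band
    gc_band = min(int(gc_percent // gc_step) * gc_step, 100 - gc_step)
    return length_band, gc_band


def group_by_band(sequences):
    """Collect sequences into a dictionary keyed by (length band, GC band)."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[band_key(seq)].append(seq)
    return groups


def underfilled(groups, minimum=20):
    """List the bands that hold fewer than the minimum number of sequences."""
    return sorted(key for key, seqs in groups.items() if len(seqs) < minimum)
```

Any band reported by `underfilled` is one that would need topping up, for example by chopping up longer sequences as described above.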

The results are shown in the final graph, with each point representing the mean E-value observed. The E-values are all small, since they represent actual database matches. Clearly, variation in sequence length can lead to orders of magnitude variation in E-value, while variation in GC content has an effect only at longer sequence lengths.


For a program that is supposed to produce comparable results, no matter what the sequence, these BLASTX results are disquieting. After all, BLAST is one of the most cited programs ever (see Massive citations of bioinformatics in biology papers), and yet I suspect that most people do not realize that it behaves like this.

The random sequences assess the effect of false positives. That they vary so much in E-value is amazing. Clearly, BLASTX E-values are not comparable between sequences. It is interesting that GC content seems to have a bigger effect than sequence length — for any given GC content the effect of length is relatively small for sequences longer than c. 600 nt. However, variation in GC content can produce orders of magnitude of effect at any given sequence length.

The real sequences assess the effect of true positives. That they vary in E-value is also not good — the E-values all represent true database matches (and presumably exact ones). Nevertheless, the effect of variation in sequence length and GC content is repeated for these real sequences. However, variation in GC content only has a large effect for the longer sequences, and instead it is the sequence length that produces the orders of magnitude variation in E-value.

You can make of this what you will.

January 11, 2015


To a modern phylogeneticist the answer to this question is obviously "no". Phylogenetic trees occur in the literature with their root at the top, the left or the bottom, and more rarely on the right. The graph has the same interpretation no matter where the root is placed, as all of the edges are implicitly directed away from the root. The tree can even be circular, with the root in the centre and the tree radiating outwards.

However, this was not always so for genealogies, and indeed this freedom seems to be a product of the past 200 years or so. The history of tree orientation has been discussed in detail by Christiane Klapisch-Zuber (1991. The genesis of the family tree. I Tatti Studies in the Italian Renaissance 4: 105-129).

Originally, genealogies were drawn with the root at the top, as shown in previous blog posts: The first royal pedigree, and The first known pedigree of a non-noble family. These pedigree trees (ie. genealogies of individuals) have a particular ancestor at the root of the "tree", so that the tree expands forwards in time down the page, to increasing numbers of descendants at the leaves (ie. a "descent tree"). This made linguistic sense, because people "descended" from the ancestor down the page. In European languages pages are read top to bottom, and so the natural reading order was the same as the time sequence.

However, this arrangement makes no sense if one refers to the graph as a "tree". Trees have their root at the bottom, not the top. Trying to draw the pedigree as a tree while retaining the original orientation could lead to unusual results, as shown in the first figure, from the end of the 1300s CE (from Universitätsbibliothek, Innsbruck, ms. 590, folio 116r). This is actually an Arbor Consanguinitatis rather than an empirical pedigree: it shows the various relatives of a nominated individual (the man pictured in the center) and their degree of relationship to that person. These diagrams have been used to compute which relatives can marry without committing incest, or which can inherit if a person dies intestate. Jean-Baptiste Piggin, at his web site Macro-Typography, has noted that the earliest known examples are from the 400s CE.

In order to match a real tree, the genealogy has to be read from bottom to top. This implies an ascent through time, instead, with a spreading out of the family upwards through time.

The first known empirical pedigree in which the ancestor is at the base is the Genealogia Welforum, the pedigree of a dynasty of German nobles and rulers (Dukes of Bavaria, and Holy Roman Emperors, successors of the Carolingians). The earliest known example, drawn as part of the Historia Welforum [Welf Chronicle], is shown in the second figure (from Hessische Landesbibliothek, Fulda, ms. D.11 folio 13v). The original text version of the pedigree is dated 1167-1184 CE, with the miniatures added sometime from 1185-1191 CE.

Clearly, this diagram is only sketchily like a tree, with many of the people along the main trunk, and medallions hanging off for other relatives. This seems to arise from the pedigree's origin as prose, and the subsequent literal illustration of that prose.

The ancestor is labeled "Welf Primus", and he apparently lived in the time of Charlemagne (the best known of the Carolingian dynasty). The empty space at the top of the chart was apparently intended for a picture of Emperor Frederick I Barbarossa, of the House of Hohenstaufen. The woman at the top right is Henry the Black's daughter Judith, who was the mother of Barbarossa. Intriguingly, the final bend of the Welf trunk to the left, combined with Barbarossa at the top, seems to imply that it is the descendants of Barbarossa who continue the Welf lineage, rather than those bearing the Welf name.

Historically, it seems to have been the proliferation, after about 1200 CE, of illustrations of the Tree of Jesse that popularized the idea of "pedigrees as trees". The next figure shows such a tree from c. 1320 CE (from a Speculum Humanae Salvationis manuscript, Kremsen ms. 243/55). Jesse lies at the base of the tree, and the tree actually arises from him. His descendants then ascend to Jesus, shown at the crucifixion, with Heaven illustrated at the top. The tree thus uses Christ's pedigree to symbolize the ascent of humans to heaven (via his crucifixion), rather than simply the descent of humans through time. That is, the tree correctly represents ascent (as well as descent).

This leaves us contemplating just when we added the final twist to the iconography, by putting a single descendant at the base of the tree, and having the ancestors branching out above as leaves (ie. an "ascent tree"). This means that time flows from the top to bottom of the figure, even though the tree is oriented from bottom to top. This is quite illogical as an analogy, given that the base of a real tree is the origin of its growth (see Goofy genealogies). This particular iconography is not used for phylogenies but is very commonly used for pedigrees.

I have no idea when this first occurred. However, David Archibald (2014. Aristotle's Ladder, Darwin's Tree: The Evolution of Visual Metaphors for Biological Order. Columbia Uni Press) draws attention to a very tree-like pedigree of Ludwig (Louis III), fifth Duke of Württemberg, from the late 1500s, shown here as the final figure (from Württembergisches Landesmuseum, Stuttgart). Ludwig is at the base of the tree, and ironically he had no descendants (although he married twice). His parents are above him in the tree (Christoph, Duke of Württemberg, to the left, and Anna Maria von Brandenburg-Ansbach, to the right), followed by four more ancestral generations. Note the leaves and hanging fruits, which highlight the tree metaphor.

January 6, 2015


Sometimes there has been discussion about the structural complexity of phylogenetic networks. At one extreme, species phylogenies are seen as trees with occasional reticulations, and at the other end there is a whole cobweb of reticulations with no visible tree. In this context, comments are sometimes made about the likeliness of those outputs from network programs that show extensive gene flow. If a biologist does not believe that the history of "their" organisms involves extensive reticulation, then the algorithmic outputs might be dismissed as unrealistic.

Here I present one well-known example of extensive hybridization, in which the computer programs seem to agree on the same complex solution — the history of common bread wheat.

The data and analyses are from:
Marcussen T, Sandve SR, Heier L, Spannagl M, Pfeifer M, International Wheat Genome Sequencing Consortium, Jakobsen KS, Wulff BB, Steuernagel B, Mayer KF, Olsen OA (2014) Ancient hybridizations among the ancestral genomes of bread wheat. Science 345: 1250092.

The hybridization network shown above is a montage of two different phylogenies from the original paper. It shows four splits, one homoploid hybridization, and two polyploid hybridizations. The time is shown in the circles in units of millions of years (note that the scale is not linear).

The first split (6.5 million years ago) is between the genera Triticum (wheat) and Aegilops (goatgrasses), which are morphologically highly distinct, with Aegilops having rounded glumes rather than keeled glumes. There are currently c.20 recognized species in both Aegilops and Triticum, so only a small part of the diversity is shown in the network.

Domesticated bread wheat (T. aestivum) is a hexaploid species, with the three diploid genomes being known as A, B and D. Their lineages are labeled and colored in the network diagram. The genome D lineage is the result of a homoploid hybridization (which has been taxonomically treated as part of Aegilops). Bread wheat is then the recent result of two successive allopolyploid hybridizations, with a tetraploid lineage as the intermediate.

Of the other species shown in the network, all of the goatgrasses are wild diploid species, as is T. urartu. T. monococcum is also diploid, with domesticated Einkorn wheat being derived from the wild ancestor. T. turgidum is a tetraploid species, with domesticated Emmer wheat being derived from the wild ancestor; it has recently diversified into many modern wheat species.

This is one of the most complex phylogenetic networks known, although that complexity is at least partly the result of leaving out most of the other diploid species in the Triticum and Aegilops clades. Program outputs that are more complex than this are unlikely to be realistic.

January 3, 2015


Networks are visually more complicated than trees, because there are extra edges representing reticulate relationships. Technically this means that some of the nodes have in-degree >1, and that there are one-to-many connections among these nodes. This can create visual clutter. I recently presented one simple way that might alleviate this (Circular phylograms for phylogenetic networks).

Another possibility is to add to the network what are called meta-nodes. These meta-nodes represent groups of nodes, so that the edges between the meta-nodes and the other nodes can represent different types of relationship. This reduces the one-to-many connections in the graph.

As pointed out by Elijah Meeks at the Digital Humanities Specialist blog, pedigrees represent a neat example of this concept. In this example, there are several types of traditional relationship that can be represented: husband, wife and child. Since these relationships are explicitly shown (ie. the direction of the relationship is explicitly shown), the figure can be drawn unrooted.

The example shown here (reproduced from Meeks' post) has the meta-nodes in grey, each representing a family. These nodes are unlabeled, while the person-nodes are labeled with the person's name and noble title. Females have pink nodes, and males blue ones. The edges connecting them to the grey nodes are colour-coded as: blue = husband, pink = wife, orange = child.

So, for example, the right-hand family node indicates that Charles I and Henrietta Maria were husband and wife, and that they had three children: Mary Henrietta, James II and Charles II.
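The reduction in connections can be made concrete with a small sketch that uses the family just described. The edge labels follow the colour-coding above, but the family identifier and the function names are hypothetical, and the "direct" version is simply my assumption of what one would otherwise draw: one edge between the spouses plus one from each parent to each child.

```python
def direct_edges(husband, wife, children):
    """Edges without a meta-node: a spouse link plus every parent-child link."""
    edges = [(husband, wife, "spouse")]
    for parent in (husband, wife):
        for child in children:
            edges.append((parent, child, "parent"))
    return edges


def metanode_edges(family_id, husband, wife, children):
    """Edges with a family meta-node: each person connects to the family once."""
    edges = [(husband, family_id, "husband"), (wife, family_id, "wife")]
    edges.extend((family_id, child, "child") for child in children)
    return edges


if __name__ == "__main__":
    children = ["Mary Henrietta", "James II", "Charles II"]
    direct = direct_edges("Charles I", "Henrietta Maria", children)
    # "family-1" is a made-up label; the meta-nodes are unlabeled in the figure
    meta = metanode_edges("family-1", "Charles I", "Henrietta Maria", children)
    print(len(direct), "edges without the meta-node")  # 7
    print(len(meta), "edges with the meta-node")       # 5
```

With n children the direct version needs 2n + 1 edges and the meta-node version n + 2, at the cost of one extra node per family.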

In this case, the reduction in one-to-many connections does make the relationships more clear, so that interpretation is easy. However, it potentially makes the network more complicated (as Meeks notes) because of "just how tangled up certain families can be" — adding the extra meta-nodes exacerbates the tangling. Meeks provides another example in his blog post.

December 29, 2014


This end-of-year post has nothing to do with networks, or even phylogenetics, although the general principle involved might apply to both. My point here is simply that experts sometimes look foolish when they comment on fields outside their own area of expertise.

As an introductory example, I remember reading a paper in a physics journal that tried to convince the readers that humans could potentially live forever. Unfortunately, the authors confused the concepts of lifespan and longevity, which is pretty basic stuff in population biology. Lifespan is the length of time for which humans normally live. We have more than doubled this over the past millennium, due to changes in sanitation, medication, surgery and safety. Longevity is the length of time for which humans are capable of living. We have not changed this by even one year, as it seems to be related to phenomena like programmed cell death. Changes in lifespan do not therefore entail changes in longevity; all that has happened is that our expected lifespan is now closer to our observed longevity than it previously has been.

More recently, an electrical engineer drifted into the field of literature while claiming to be a scientist — Mikhail Simkin (2013) Scientific evaluation of Charles Dickens. Journal of Quantitative Linguistics 20: 68-73. Sadly, his article displays neither of the characteristics of science (replication and control), nor does it appear to contribute anything much to literature.

As noted on his web page, the author had trouble publishing this article, and he has subsequently received "a flood of criticism", which he naively seems to believe he has rebutted at the Significance blog.

His intention was a simple one: a comparison of the writing style of Charles Dickens and that of Edward Bulwer (later known as Edward Bulwer-Lytton). His premise was: "Edward Bulwer-Lytton is the worst writer in history of letters ... In contrast, Charles Dickens is one of the best writers ever." He put online a quiz with "a dozen representative literary passages, written either by Bulwer-Lytton or by Dickens." The takers had to nominate the author of each quote. Simkin discovered that on average the votes were "about 50%, which is on the level of random guessing. This suggests that the quality of Dickens's prose is the same as that of Bulwer-Lytton." The results are shown in the graph above.

Simkin's intention seems to have been to demonstrate that currently revered and non-revered authors do not differ much in style, which is a contention that I see no reason to disagree with, but if so he has gone about showing this in a remarkably unscientific manner.

Let us take the premise first, for which the author provides no personal justification nor any reference to a published one. It seems patently true that the current fashion is for Dickens to be widely read but Bulwer not. This on its own means little, however, as even the Shakespearean works have had a century or so of being out of fashion, although not in the past couple of hundred years (to the dismay of anyone who has had an English-language education).

Was Bulwer a bad writer? Well, first, the results of Simkin's poll imply "no", at least in comparison to Dickens. But more importantly, many other sources say "no", as well. Indeed, Wikipedia makes a strong case both for his popularity in his own time, and for considerable influence on literature since then. Indeed, he is so 'obscure' that towns as far apart as Canada and Australia are named after him. His works are so 'poorly known' that we continue to use his expressions "pursuit of the almighty dollar" and "the pen is mightier than the sword". His works have been so 'derided' that several operas are based on his books, including one by Richard Wagner; and authors such as Edgar Allan Poe have paraphrased his words. His books are such 'poor examples' of English that people have felt compelled to translate them into Serbian, German, Russian, Norwegian, Swedish, French, Finnish, Spanish and Japanese, among other languages.

Clearly, the premise that Bulwer represents the nadir of English-language literature holds no water. He is currently obscure, but as John S. Moore has noted, the fact that he is not read does not mean that he is not worth reading.

Indeed, a scientist would immediately note the lack of replication here. Why are "best" and "worst" writers not replicated in the experiment? This would immediately address any possible mis-judgements about potential literary worth. It is repeated patterns that provide convincing evidence in science, not isolated pairwise comparisons. This poll is hardly a "scientific evaluation", as claimed by the author.

Now let us consider the experimental procedure. This consisted of choosing "representative literary passages", without any explanation for how this was done or what were the criteria for choice. Clearly, this choice is the key to the experiment. After all, all the experiment does is show that one can find passages by both Dickens and Bulwer that are hard to distinguish. That could very well be true of almost any pair of writers from the same culture (ie. country and century). The experimental comparison has thus not been controlled, as it would be in science.

What would experimental control look like in this case? Clearly, the issue is one of style, since authors vary their writing style depending on the book, the plot situation, and even the character involved. (One of Bulwer's passages is actually taken from the dialog of one of his characters, which hardly represents the author's own writing style!) The objective, then, must be to find passages that represent the range of styles present in the corpus of each writer. One might try grouping the passages into topics or styles, for example, or whether they describe actions or locations, etc.

Without either replication or control, this literary evaluation cannot be considered to be scientific. Sadly, on his website Simkin has several other so-called scientific comparisons within the arts, designed in exactly the same inadequate way.

As a final note, we can ask why Bulwer was chosen for this comparison in the first place. The choice seems to be almost solely due to various extant parodies of the opening of one of his books, Paul Clifford (1830): "It was a dark and stormy night; the rain fell in torrents ..." For example, this was chosen by Charles Schulz in his Peanuts cartoons, as the opening of one of Snoopy's failed attempts to be a world-renowned author. The full sentence does not actually seem bad, although it tries to cram a bit much information into the number of words available. Thomas Hardy later tried the same thing, but with more success, in The Return of the Native (1878): "A Saturday afternoon in November was approaching the time of twilight ..."

However, the award for sheer bravado surely goes to D.H. Lawrence, in his short story Tickets, Please! (1919), which starts with a paragraph consisting of a sentence of 118 words, followed by sentences of 15 words, 27 words and finally 113 words.** A plethora of commas, colons, semi-colons and dashes are needed to keep the meaning coherent in this page-long paragraph. You and I could not get away with this, which is why Lawrence is considered to be one of the great English literary stylists. Apparently, Bulwer did not get away with it, either.

** My count is based on the original publication in The Strand magazine, which is slightly different to subsequent versions.

December 24, 2014


Season's greetings.

For your Christmas reading, this blog usually provides a seasonally appropriate post on fast-food, including to date: nutrition (McDonald's fast-food) and geography (Fast-food maps). This year, we will focus on the effects of fast-food on people.

Defining fast-food is a bit of a trick. The U.S. Census of Retail Trade defines a fast-food establishment merely as one that does not offer table service. However, legislation recently passed in Los Angeles defines fast-food establishments as those that have a limited menu, items prepared in advance or heated quickly, no table service, and disposable wrappings or containers. Some people feel that these definitions should include all pizza restaurants, even those that do offer table service in addition to take-away (or take-out). The latter are sometimes distinguished as fast-casual restaurants rather than fast-food restaurants.

About 90% of Americans say they eat fast-food, including those who visit an establishment on average once per day. The main concern about the effect of fast-food, then, is on people's diet. By "diet" I mean the combination of foodstuffs consumed each day, which may or may not match what is known to be required for a healthy human. Fast-food rarely matches this diet, and so there must be some effect of eating the stuff.

In particular, fast-food has been implicated in what is now known within medicine as the "obesity epidemic" — the observation that an increasing proportion of the people in the developed world are formally classified as obese. The usual symptom of obesity is a body mass index (BMI) > 30 (overweight is 25-30, normal is 18.5-25). BMI is an approximate measure of body fat.
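As a small illustration of the thresholds just quoted, here is a minimal Python sketch. It assumes the standard definition of BMI (weight in kilograms divided by the square of height in metres), which is not spelled out above; the function names and the "underweight" label for values below 18.5 are my own additions for the example.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Classify a BMI value using the cut-offs quoted in the text."""
    if value < 18.5:
        return "underweight"  # assumed label; the text gives no name for this range
    if value < 25:
        return "normal"
    if value <= 30:
        return "overweight"
    return "obese"  # BMI > 30
```

For example, a person weighing 70 kg at a height of 1.75 m falls in the normal range, while 95 kg at the same height is classified as obese.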

Obesity has risen rapidly in recent decades, but there is some evidence that the levels are now beginning to stabilize (Obesity Rates & Trends Overview). The main risk with obesity is its strong association with potentially fatal health problems, notably heart disease, stroke, high blood pressure, and diabetes. Indeed, it has been suggested that obesity may be the greatest cause of preventable death in the United States.

Demonstrating a relationship between fast-food and obesity is not hard, given the high sugar, carbohydrate, fat, and salt content of most of the food items. This results in the intake of more energy than the body uses, and this excess is stored as fat. This pattern shows up clearly in large-scale samples of prevalence, such as this one collated on the DataMasher site, where each point represents a state of the USA.

An obvious issue concerning fast-food is our ability, or lack of it, to understand just how many calories (or joules) there are in fast-food meals. The marketing people seem to have a clear idea about how different fast-food chains are presented in terms of their food quality, as shown in this Perceptual Map.

However, this perception is clearly not accurate in terms of calories, especially for Subway. An article in the British Medical Journal evaluated the ability of people to estimate the calorie content of the fast-food meal they had just purchased. As shown in the next graph, clearly in most cases there was a major under-estimate, and this was worst for the highest-calorie meals. The under-estimation of calorie content was largest among Subway diners. Diners at both Subway and Burger King showed greater under-estimation of meal calorie content than those at McDonald's, whereas diners at Dunkin' Donuts had less under-estimation. In other words, Subway is not as healthy for you as you think it is, but you already know how bad those Donuts are.

One response to this situation has been to insist that fast-food places advertise the calorie content of their food on the menu board itself. For example, it has been suggested that nutrition experts can compose apparently healthy meals based on the nutritional information provided in the menus of fast-food restaurant chains.

This will only have an effect, however, if people actually use this information when choosing their meal. An article in the Journal of Public Health suggested that most young people don't actually do so, and that people who eat fast-food most often are least likely to do so. Indeed, a report from Sandelman Associates showed that the only people who are likely to use calorie information regularly are those with a specific "calorie target" for their personal diet, as shown in this next chart.

Nevertheless, an article published in the British Medical Journal has reported a decrease in the energy content of fast-food purchases after the introduction of calorie information on the menu boards, except at Subway, where there was an increase. (Before the labeling, the Subway meals chosen had fewer calories than those at the other chains, but afterwards they had more!)

Another important feature of fast-food is the usually large portion sizes, which exacerbate the energy imbalance. An article in the Journal of the American Dietetic Association has shown that not only does modern fast-food exceed dietary standard serving sizes by at least a factor of 2, and sometimes by as much as 8, but these serving sizes have also increased dramatically over the past 50 years.

What is perhaps most surprising is the truly vast difference that can occur between servings of what is allegedly the same fast-food product, not only between countries but within a single country. The following graph is from an article in the International Journal of Obesity. It shows, for the named locations, the amounts of total fat in a meal consisting of 171 g McDonald's french fries and 160 g KFC chicken nuggets. The darker colour indicates the added amounts of industrially produced trans fat. The values in parentheses are the amount of trans fat as a percentage of total fat.

On a somewhat different note, one of the main characteristics of fast-food is the focus on a sweet taste, rather than on a diversity of tastes. In contrast, traditional cooking in many cultures has focussed on mixing together a diversity of complementary ingredients. Indeed, this was the impetus for the formation of the Slow Food movement, founded "to prevent the disappearance of local food cultures and traditions ... and combat people's dwindling interest in the food they eat, where it comes from and how our food choices affect the world around us." (It was organized after a public demonstration at the intended site of a McDonald's franchise at the historic Spanish Steps, in Rome.)

This topic was investigated in detail in an article published in Nature Scientific Reports. The authors produced the following network of food flavours.

Interestingly, they conclude that:
We introduce a flavor network that captures the flavor compounds shared by culinary ingredients. Western cuisines show a tendency to use ingredient pairs that share many flavor compounds, supporting the so-called food-pairing hypothesis. By contrast, East Asian cuisines tend to avoid compound-sharing ingredients.

There is diversity even in the amount of diversity.

December 21, 2014


Here are five more tattoos in our compilation of evolutionary tree tattoos from around the internet. For more examples of this circular design for a phylogenetic tree, in a variety of body locations, see Tattoo Monday, Tattoo Monday V, and Tattoo Monday VII.

December 16, 2014


It has been noted before that we have a wide range of mathematical techniques available for producing data-display networks, most notably the many variants of splits graphs (see Huson & Scornavacca 2011). For example, NeighborNets and Consensus networks are commonly encountered in the phylogenetics literature, and Reduced median networks and Median-joining networks are commonly used for haplotype networks in population biology.

However, there are few techniques used to produce evolutionary networks. Studies of reticulate evolutionary histories, which include recombination networks, hybridization networks, introgression networks and HGT networks, have no unifying theme as yet. So, the biological literature has many papers in which biologists struggle with reticulate evolutionary histories using ad hoc collections of techniques, which often boil down to simply presenting incongruent phylogenetic trees from different datasets (see Morrison 2014a).

So, maybe a brief look at the current state of play with evolutionary networks would be useful. There are enough worthwhile techniques out there for people to be using them more often than they are.


Almost all current phylogenetic methods assume that the basic building unit is a non-recombining sequence block, for which the evolutionary history is strictly tree-like. We tend to call these blocks "genes" and their history "gene trees", but this is just for semantic convenience. In practice, we first collect data for various loci, and we then simply make the assumption that there is recombination between the loci but not within them. This is basically the assumption of independence between loci. At the limit, each nucleotide along a chromosome has a tree-like history, but for aggregations of nucleotides it is all assumptions.

Furthermore, we assume that there are no data errors that will confound any reconstruction of the phylogenetic trees. Possible sources of error include: incorrect data (e.g. contamination), inappropriate sampling (taxa or characters), and model mis-specification. Any of these errors will lead to stochastic variation at best and to bias at worst.

Gene-tree incongruence

Reticulate evolutionary processes lead to gene trees that are not all congruent. However, there are two other processes that have been widely recognized as also producing gene-tree incongruence, but which do not involve reticulation in the strict sense: incomplete lineage sorting (deep coalescence; ancestral polymorphism), and gene duplication-loss.

Many studies have now shown that stochastic variation due to ILS can be very large (see Degnan & Rosenberg 2009), and that this varies in relation to both the population sizes of the taxa and the times between divergence events. The expectation of completely congruent gene trees is thus very naive, even when the evolutionary history of the taxa has been strictly tree-like. A number of methods have been developed to reconstruct species trees in the face of ILS (Nakhleh 2013).
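To make the dependence on population size and divergence time concrete, here is a minimal Python sketch of the simplest case: three taxa and one gene copy sampled per taxon. It uses the standard coalescent result that the gene tree matches the species tree with probability 1 - (2/3)exp(-T), where T is the length of the internal branch in coalescent units. I assume a diploid population, so that T is the number of generations divided by 2N; the function and parameter names are hypothetical.

```python
import math


def p_congruent(generations: float, effective_pop_size: float) -> float:
    """Probability that a three-taxon gene tree matches the species tree.

    Assumes a diploid population, so the internal branch length in
    coalescent units is T = generations / (2 * Ne).
    """
    if generations < 0 or effective_pop_size <= 0:
        raise ValueError("need generations >= 0 and effective_pop_size > 0")
    t = generations / (2.0 * effective_pop_size)
    return 1.0 - (2.0 / 3.0) * math.exp(-t)


def p_discordant(generations: float, effective_pop_size: float) -> float:
    """Combined probability of the two gene-tree topologies that disagree."""
    return 1.0 - p_congruent(generations, effective_pop_size)
```

With an internal branch of zero length, all three topologies are equally likely (probability 1/3 each). Short times between divergence events or large populations both shrink T, and so push the expected proportion of incongruent gene trees towards two-thirds.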

DL involves gene duplication (which can be repeated to create gene families) followed by selective gene loss. The phylogenetic history of the genes is usually presented as an unfolded species tree, where each gene copy has its own part of the tree. A number of methods have been developed to reconstruct gene DL histories given a "known" species tree, which is called gene-tree reconciliation (Szöllősi et al 2015). However, our interest here is in the reverse process, in which reconstructed but incongruent gene trees are combined into a single species tree, given a model of duplication and selective loss, which is called species-tree inference (which is the same as cophylogeny reconstruction; Drinkwater & Charleston 2014).


Known biological processes such as recombination, reassortment, hybridization, introgression and horizontal gene transfer all create reticulate phylogenetic histories. However, it is a moot point as to whether these processes can be distinguished from each other solely in the context of an evolutionary network (Holder et al. 2001; Morrison 2015). These evolutionary processes operate by distinct biological mechanisms, but the evolutionary patterns that they create can all be rather similar. The processes all result in gene flow among contemporaneous organisms (usually called horizontal flow or transfer), whereas other evolutionary processes involve gene flow from parent to offspring (usually called vertical inheritance), including ILS and DL. These gene flows create incongruent gene histories, which we may detect directly in the data or via reconstructed gene trees. The patterns of incongruence do not necessarily allow us to infer the causal process.

There are a number of differences in pattern, but the consistency of these is doubtful. Polyploid hybridization produces the most distinctive pattern, because there is duplication of the genome in the hybrid. However, subsequent aneuploidy will serve to obscure this pattern. Homoploid hybridization nominally involves 50% of the genome coming from different sources, while introgression ultimately involves a smaller percentage. However, in practice, genome mixtures vary continuously from 0 to 50%. HGT also involves a small percentage of the genome, but in theory it also can vary from 0 to 50%. Reassortment produces mixtures of viral genes, which can occur in such a great number that reconstructing the history is severely problematic.

So, in the absence of independent experimental evidence, distinguishing one form of evolutionary network from another is almost a matter of definition. This has become increasingly obvious in the methodological literature, where semantic confusion abounds.

For example, a network produced directly from a set of characters has usually been called a "recombination network", while one produced from a set of trees has usually been called a "hybridization network", irrespective of what processes the gene trees represent. Furthermore, models that add reticulation events to DL trees have usually referred to the horizontal gene flow as "HGT", whereas models that add reticulation events to ILS trees have usually referred to the horizontal gene flow as "hybridization" (Morrison 2014a). Studies of horizontal gene flow during human evolution have usually referred to "admixture", which is a more process-neutral term.

In many, if not most, cases we might all be better off if network methods simply distinguish gene flow among contemporaries (horizontal) from gene inheritance between generations (vertical), rather than trying to infer a process — process inference can often best take place after network construction. This does not help anthropologists, of course, who are dealing with evolutionary networks where oblique gene flow is possible (so that they do not have Time inconsistency in evolutionary networks).


There seems to be a dichotomy of purposes in current method development, which is neatly summarized by the contrasting theoretical views of Mindell (2013) and Morrison (2014b). These views each recognize that evolutionary history involves both vertical and horizontal processes, but they reconstruct the resulting evolutionary patterns as a species tree and a species network, respectively. Obviously, this blog is dedicated to the latter point of view, but it is the former one (the so-called Tree of Life) that seems to currently dominate the literature.

Focussing on gene-tree inference, Szöllősi et al (2015) provide a comprehensive review of the various models that have been used to describe the dependence between gene trees and species trees. Essentially, gene trees are contained within the species tree, and they may differ from it in relative branch lengths and/or topology. The differences between genes and species are the result of population-level processes, often modeled using the coalescent. These authors recognize four current classes of probabilistic model that combine different evolutionary processes:
  • the DLCoal model, which combines coalescence and DL
  • the DTLSR model and the ODT model, both of which combine gene transfer and DL
  • models that combine hybridization and ILS
  • models of allopolyploidization.
When inferring species trees from gene trees (species-tree inference), we basically combine the scores for all of the gene trees, and then search for the species tree with the best overall score. This involves adding the scores in parsimony analyses, or multiplying the conditional probabilities in likelihood analyses (ie. maximum-likelihood or Bayesian context). Many methods have been developed for inferring a species tree based on multi-locus data. These differ in whether the gene and species trees are estimated simultaneously or sequentially, and in how the gene trees are used to infer the species tree. Nakhleh (2013) and Szöllősi et al (2015) discuss both parsimony and likelihood methods for species-tree inference based on either ILS or DL models.
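The score-combining step can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not any particular published method: the trees are represented as plain strings, and the per-gene-tree functions `cost` and `prob` are hypothetical stand-ins for whatever ILS or DL model supplies the individual scores. Multiplying probabilities is done here by summing their logarithms, which gives the same ranking and avoids numerical underflow.

```python
import math
from typing import Callable, Sequence


def parsimony_score(species_tree: str, gene_trees: Sequence[str],
                    cost: Callable[[str, str], float]) -> float:
    """Add the per-gene-tree costs for one candidate species tree."""
    return sum(cost(gene_tree, species_tree) for gene_tree in gene_trees)


def log_likelihood(species_tree: str, gene_trees: Sequence[str],
                   prob: Callable[[str, str], float]) -> float:
    """Multiply the conditional probabilities, working on the log scale.

    Raises ValueError if any gene tree has probability zero.
    """
    return sum(math.log(prob(gene_tree, species_tree)) for gene_tree in gene_trees)


def best_by_parsimony(candidates: Sequence[str], gene_trees: Sequence[str],
                      cost: Callable[[str, str], float]) -> str:
    """Exhaustive search: the candidate with the smallest total cost."""
    return min(candidates, key=lambda s: parsimony_score(s, gene_trees, cost))


def best_by_likelihood(candidates: Sequence[str], gene_trees: Sequence[str],
                       prob: Callable[[str, str], float]) -> str:
    """Exhaustive search: the candidate with the largest log-likelihood."""
    return max(candidates, key=lambda s: log_likelihood(s, gene_trees, prob))
```

Real methods differ mainly in how `cost` or `prob` is defined and in how the space of candidate species trees is searched, since exhaustive enumeration is feasible only for a handful of taxa.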

Extending these ideas to infer networks (rather than species trees) is a bit more tricky, and most of the work to date has involved combining hybridization and ILS. There has been no recent summary of the ideas. However, calculating the parsimony score of a network, given a set of gene-tree topologies, has been addressed by Yu et al (2011); and Yu et al (2013a) have extended these ideas to heuristically search the network space for the optimal network (the one that minimizes the number of extra reticulation lineages in a species tree). Furthermore, methods for computing the likelihood of a phylogenetic network, given a set of gene-tree topologies, have been devised by Yu et al (2012, 2013b); and Yu et al (2014) have extended these ideas to heuristically search for the maximum-likelihood network for limited cases of introgression or hybridization (since they differ only in degree).

There are also several methods that simply use gene-tree incongruence to infer reticulation events in a species network (Huson et al. 2010). Basically, these methods combine gene trees into "hybridization networks" by minimizing the number of reticulations required for reconciliation, measured either by counting the reticulations or calculating the network level. The combinatorial optimization can be based on trees, triplets or clusters, using parsimony as the optimality criterion. These methods model homoploid hybridization by assuming that reticulation is the sole cause of all gene-tree incongruence. This means that they are likely to overestimate the amount of reticulation in a dataset when other processes are co-occurring.
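The quantity being minimized when "counting the reticulations" can be illustrated with a short Python sketch. It assumes the network is given as a list of directed (parent, child) edges with no duplicated edges, and it counts each node with more than one parent as contributing (number of parents - 1) reticulations; the function name and the edge-list representation are my own choices for the example. The harder part of these methods, which is searching for the network that displays all of the gene trees, is not shown.

```python
from collections import Counter
from typing import Iterable, Tuple


def reticulation_number(edges: Iterable[Tuple[str, str]]) -> int:
    """Count reticulations in a rooted network given as (parent, child) edges.

    A node with k > 1 parents contributes k - 1 reticulations, so a strict
    tree scores 0. Assumes no edge is listed twice.
    """
    indegree = Counter(child for _parent, child in edges)
    return sum(d - 1 for d in indegree.values() if d > 1)
```

For example, a network in which taxon B descends from a hybrid node with two parents has a reticulation number of 1, whereas any tree on the same taxa has 0.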

The most completely developed network methods involve data for allopolyploid hybrids. Here, there are multiple copies of each gene, one in each copy of the genome, so that allopolyploid hybrids have more copies than do their diploid parent taxa. To construct a hybridization network topology, Huber et al (2006) developed a parsimony method based on first estimating a multi-labeled gene tree, and then searching for the single-labeled network that best accommodates the multiple gene patterns. The model has been extended to heuristically include ILS (Marcussen et al 2012), as well as dates for the internal nodes (Marcussen et al 2015). Jones et al. (2013) have also developed models that incorporate ILS in a Bayesian context, but only for the case of a single hybridization event between two diploid species (an allotetraploid).

Species-tree inference for a pair of gene phylogenies that may be networks, not trees, has been considered in terms of parsimony by Drinkwater & Charleston (2014).

This brings us to the matter of introgression. The massive recent influx of genome-scale data for hominids has led to the development of methods explicitly for the analysis of what is termed admixture among the lineages. These methods basically work by constructing a phylogenetic tree that includes admixture events, the topology inference being based on allele frequencies. There has been no formal comparison of the methods, and not much application to non-humans. Three such methods have been produced so far (Patterson et al 2012; Pickrell & Pritchard 2012; Lipson et al 2013).

Recombination has been something of a poor cousin to the other causes of reticulation, as most network methods assume it to be absent. Nevertheless, Gusfield (2014) has recently provided an ample survey of the study methods available to date.


Degnan JH, Rosenberg NA (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends in Ecology & Evolution 24: 332-340.

Drinkwater B, Charleston MA (2014) An improved node mapping algorithm for the cophylogeny reconstruction problem. Coevolution 2: 1-17.

Gusfield D (2014) ReCombinatorics: the Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks. MIT Press, Cambridge.

Holder MT, Anderson JA, Holloway AK (2001) Difficulties in detecting hybridization. Systematic Biology 50: 978-982.

Huber KT, Oxelman B, Lott M, Moulton V (2006) Reconstructing the evolutionary history of polyploids from multilabeled trees. Molecular Biology & Evolution 23: 1784-1791.

Huson D, Rupp R, Scornavacca C (2010) Phylogenetic Networks: Concepts, Algorithms, and Applications. Cambridge University Press, Cambridge.

Huson DH, Scornavacca C (2011) A survey of combinatorial methods for phylogenetic networks. Genome Biology & Evolution 3: 23-35.

Jones G, Sagitov S, Oxelman B (2013) Statistical inference of allopolyploid species networks in the presence of incomplete lineage sorting. Systematic Biology 62: 467-478.

Lipson M, Loh P-R, Levin A, Reich D, Patterson N, Berger B (2013) Efficient moment-based inference of population admixture parameters and sources of gene flow. Molecular Biology & Evolution 30: 1788-1802.

Marcussen T, Heier L, Brysting AK, Oxelman B, Jakobsen KS (2015) From gene trees to a dated allopolyploid network: insights from the angiosperm genus Viola (Violaceae). Systematic Biology 64: 84-101.

Marcussen T, Jakobsen KS, Danihelka J, Ballard HE, Blaxland K, Brysting AK, Oxelman B (2012) Inferring species networks from gene trees in high-polyploid north American and Hawaiian violets (Viola, Violaceae). Systematic Biology 61: 107-126.

Mindell DP (2013) The Tree of Life: metaphor, model, and heuristic device. Systematic Biology 62: 479-489.

Morrison DA (2014a) Phylogenetic networks: a review of methods to display evolutionary history. Annual Research and Review in Biology 4: 1518-1543.

Morrison DA (2014b) Is the Tree of Life the best metaphor, model or heuristic for phylogenetics? Systematic Biology 63: 628-638.

Morrison DA (2015, in press) Pattern recognition in phylogenetics: trees and networks. In: Elloumi M, Iliopoulos CS, Wang JTL, Zomaya AY (eds) Pattern Recognition in Computational Molecular Biology: Techniques and Approaches. Wiley, New York.

Nakhleh L (2013) Computational approaches to species phylogeny inference and gene tree reconciliation. Trends in Ecology & Evolution 28: 719-728.

Patterson NJ, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D (2012) Ancient admixture in human history. Genetics 192: 1065-1093.

Pickrell JK, Pritchard JK (2012) Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genetics 8: e1002967.

Szöllősi GJ, Tannier E, Daubin V, Boussau B (2015) The inference of gene trees with species trees. Systematic Biology 64: e42-e62.

Yu Y, Barnett RM, Nakhleh L (2013a) Parsimonious inference of hybridization in the presence of incomplete lineage sorting. Systematic Biology 62: 738-751.

Yu Y, Degnan JH, Nakhleh L (2012) The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection. PLoS Genetics 8:

Yu Y, Dong J, Liu KJ, Nakhleh L (2014) Maximum likelihood inference of reticulate evolutionary histories. Proceedings of the National Academy of Sciences of the USA 111: 16448-16453.

Yu Y, Ristic N, Nakhleh L (2013b) Fast algorithms and heuristics for phylogenomics under ILS and hybridization. BMC Bioinformatics 14: S6.

Yu Y, Than C, Degnan JH, Nakhleh L (2011) Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology 60: 138-149.

December 14, 2014


This blog has previously reproduced some of the unpublished sketches by Charles Darwin that involve tree-like relationships:
  • Part 1 — collected notebooks and notes
  • Part 2 — a letter to Charles Lyell
  • Part 3 — a reconstruction from one of his books
Recently, the first two of these posts have been updated.

Part 1 was updated to include three new sketches. I had previously encountered references to them but had not located them amongst the online Darwin documentation.

Part 2 was updated to include information from a paper on the same topic that was published several months after the blog post itself.

December 9, 2014


Phylogenetic trees have been drawn in many formats, including what are known as vertical, horizontal, multidirectional, radial, hyperbolic (restricted to interactive trees) and figurative (ie. looking like an actual tree). Radial, or circular, trees are used when there are many taxa — the root is placed at the centre, and the increasing length of the circumference is used to display the increasing number of nodes. An example is shown in the earlier blog post Why do we still use trees for the dog genealogy?
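The geometry of the radial format can be sketched in Python as follows. This is a minimal illustration under my own assumptions: the input is a list of generations (earliest first), each generation is spread evenly around a circle whose radius equals its generation number, and the function name is hypothetical. A real drawing program would also order the nodes so as to minimize crossing lines, which is not attempted here.

```python
import math
from typing import Dict, List, Tuple


def radial_positions(generations: List[List[str]]) -> Dict[str, Tuple[float, float]]:
    """Place each generation on its own circle, earliest generation at the centre.

    generations[0] is the earliest generation (radius 0), generations[1] lies
    on a circle of radius 1, and so on. Nodes are spaced evenly by angle.
    """
    positions: Dict[str, Tuple[float, float]] = {}
    for depth, nodes in enumerate(generations):
        radius = float(depth)
        for i, label in enumerate(nodes):
            angle = 2.0 * math.pi * i / len(nodes)
            positions[label] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions
```

Because the circumference of each circle grows in proportion to its radius, later generations automatically get more room, which is the property that makes this format suitable for large numbers of taxa.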

Here, I point out that the radial format also makes it much easier to display reticulations in an evolutionary network. My example comes from The Nam Family: a Study in Cacogenics (Arthur H. Estabrook and Charles B. Davenport. 1912. Eugenics Record Office Memoir No. 2. Cold Spring Harbor, NY). This book involves, among other things, a pedigree study of an extended family in New York state, with a large amount of inbreeding. Two large pedigrees are presented, representing the genealogies of two different parts of the extended family in a place called "Nam Hollow".

One of these pedigrees is drawn in the vertical format, with the earliest generations at the top. The other pedigree is drawn in the radial format, with the earliest generations in the centre.

The difference in choice of format seems to be a result of the fact that in the second case there is extensive reticulation within the earlier generations, and this is obviously much easier to display in the centre of a circle, with increasing circumference for the large number of descendants. Nevertheless, the first pedigree would also be easier to read in the radial format. It is surprising that this format is not used more often.


The study under discussion was one of several projects that arose from the eugenics movement in the USA. The reports include Hill Folk: Report on a Rural Community of Hereditary Defectives (Davenport. 1912), The Kallikak Family: a Study in the Heredity of Feeblemindedness (Henry Herbert Goddard. 1912), and The Jukes (Estabrook. 1916). Eugenics arose in the wake of research on Mendelian inheritance, applying it to the study of human societies. This was thus the initial phase of what we now call the study of human genetics, and large amounts of detailed data were collected in many parts of the world.

Unfortunately, the researchers greatly over-estimated the role of genetics in human behavior, attributing many of the by-products of poverty to "constitutional" characteristics. In particular, many of what we now consider to be environmental aspects of poverty were attributed to inbreeding (which is another feature common in poor communities). This is in contrast to previous studies of the same US families, such as that of Richard L. Dugdale (1874-1877. The Jukes: a Study in Crime, Pauperism, Disease and Heredity), which placed more emphasis on the environment as a factor in criminality, disease and poverty.

So, the eugenics researchers tended to collect data that we would now consider to be seriously biased, where the observations are inextricably confounded with interpretations. For example:
V-166 [person #166 in generation V] is a temperate, sociable, and licentious man, who married his cousin, V-183, a Nam-like, stolid shy, reticent, suspicious harlot. They had eight children ... All have the characteristic slowness in movement, and indolence and lack of ambition of the Nams. They vary little except that some are more reticent and shy than others, and there is some licentiousness. All are illiterate, and probably without the capacity for learning from books. VI-257, who is especially careless, disorderly, and shy, had an illegitimate son, who died of infantile diarrhea. Here again we see the uniformity resulting from inbreeding.

What was worse, the eugenics movement did not stop at mere scientific enquiry. They indulged, with governmental support, in what they politely called "social prophylaxis". For example:
Although our primary aim is to present the bare facts [!] we cannot altogether neglect the natural inquiry as to the proper treatment of such condition as we have described. Various possible modes of treatment will be considered.

First there is the method of laissez faire. The Nam community takes care of itself to a large extent; why do anything? Unfortunately, the community is not wholly isolated. From it families have gone to Minnesota and other points in the West and there formed new centers of degeneration. Harlots go forth from here and become prostitutes in our cities. The tendency to larceny, burglary, arson, assault, and murder have gone, with the wandering bodies in which they are incorporated, throughout the State and to great cities like New York. Nam Hollow is a social pest spot whose virus cannot be confined to its own limits. No state can afford to neglect such a breeding center of feeble-mindedness, alcoholism, sex-immorality, and infanticide as we have here. A rotten apple can infect the whole barrel of fruit. Unless we abandon the ideal of social progress throughout the State we must attempt an improvement here.

The authors seem to be almost foaming at the mouth by the end of their spiel. Option two, "improving the conditions of the persons in the Hollow" is dismissed as "supplying a veneer of good manners to a punky social body." Option three, "scattering the people" is seen as "fraught with danger". Nevertheless, this was the option preferred by the British government in the late 1700s and early 1800s, when they founded penal colonies in Australia for crimes like "stealing five cheeses". The assumption that poverty is hereditary certainly has a long history, and a wide geographical spread.

Option four, preventing the people from breeding, by isolating them, is the recommended one. The final note is: "Of course, asexualization would produce the same result; but it is doubtful if public sentiment would favor such treatment, quite within the province of the State though it be." We now know this to be a very naive conclusion. By the 1930s many western countries had active compulsory sterilization programs (see Wikipedia); and many still do, including states of the USA.

However, eugenics did have positive outcomes, among the obvious negative ones. For example, the first demonstration of simple Mendelian inheritance of a human medical condition concerned Unverricht-Lundborg disease, a form of epilepsy. This was first reported in 1891 by Heinrich Unverricht, in Estonia. However, it was Herman Lundborg, a Swedish physician, who first identified its genetic component (1903. Die progressive Myoclonus-Epilepsie (Unverricht’s Myoclonie). Almqvist and Wiksell, Uppsala).

He traced the ancestry of 17 affected people in one family from southern Sweden, showing that they were all descended from the same ancestors. The pedigree showed the pattern of disease occurrence expected from Mendelian inheritance of a single recessive locus. This study was facilitated by frequent inbreeding within the family (20% of households had first-cousin parents), which Lundborg referred to as "unwise marriages". We now know that the disease results from a mutation in the CCC-CGC-CCC-GCG repeat region of the cystatin B gene — unaffected people have 3-4 repeats while affected people have 40+ repeats.
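To make the repeat counts concrete, here is a minimal sketch that counts the longest uninterrupted run of the 12-base unit in a DNA string. The unit is the one quoted above with the hyphens removed. The function name and the toy sequence are hypothetical, and this is only an illustration of what "number of repeats" means, not a description of how the repeat is genotyped in practice.

```python
import re

# The dodecamer quoted in the text (CCC-CGC-CCC-GCG) with hyphens removed.
REPEAT_UNIT = "CCCCGCCCCGCG"

def longest_tandem_run(sequence, unit=REPEAT_UNIT):
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit)
            for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)

# Toy sequence (made up): three copies, a spacer, then two more copies.
toy = "ATG" + REPEAT_UNIT * 3 + "TTT" + REPEAT_UNIT * 2
print(longest_tandem_run(toy))
```

For the toy string the longest run is three copies, which falls in the range the text gives for unaffected people.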

Lundborg himself was an active member of the eugenics movement in Sweden (which was referred to as 'race biology'), and most of his writings about the epileptic family were as bad as those quoted above (their "degeneration" was attributed to the fact that "they distilled their own alcohol, and thus became drunkards"). He eventually became Professor for Racial Hygiene; and he was influential in the implementation of forced sterilization programs in Sweden, believing that "The future belongs to the racially fine people", which obviously included himself.

December 7, 2014


I noted in an earlier post (The first royal pedigree) that interest in genealogy dates back to at least Roman times, where the so-called stemmata were displayed in homes, to distinguish between the patrician class (those with proven noble ancestry) and plebeians (commoners). We are not quite so ostentatious today, but the nobility are still just as snooty about their ancestry.

I also noted that the first known illustration of a noble pedigree is the Tabula Genealogica Carolingorum (c.1000 CE), which traces Cunigunde of Luxembourg's ancestry in a tree-like manner back to Charlemagne, and thence to the origin of the Carolingian dynasty in the mid 500s. This raises the question of the first known written pedigree not involving the nobility.

This appears to be a diagram labelled Genealogia Ouduini et Heimerici Decani Filii Sui, which dates from c. 1121 CE. This type of pedigree may have been relatively common among certain families at the time, but this seems to be the only surviving exemplar that has come down to us.

This diagram appears towards the end of the book Liber Floridus, composed by Lambert of Saint-Omer, who was canon of the city Church of Our Lady in Saint-Omer, in north-eastern France. The Universiteitsbibliotheek at Ghent University owns the autograph of this work (ms. 92), i.e. the actual copy penned by the author himself; and it is in this copy that the author has inscribed his family pedigree (on folio 154r).

This may recall to many of you the trend to keep hand-written records of pedigrees in the fly-leaves of family Bibles during the 1800s and early 1900s, particularly in English-speaking parts of the world. It does, however, seem to go a bit beyond this. Lambert repeatedly identifies himself in the text as the author of the book, and he also includes a portrait of himself writing his book, although this is apparently usual in medieval iconography.

The Liber Floridus (Book of Flowers) is literally an illustrated encyclopedia, rather than an encyclopedia with pictures. You will find copies of the illustrations all over the Internet, because Lambert was an imaginative and colorful illustrator. He was apparently concerned that uneducated people would lose access to important knowledge, and so (unlike his predecessors) he deliberately created a book that was accessible to almost everyone. It contains a curate's egg of information, including mythical biology (ie. a bestiary), selected history, and particularly biblical knowledge. It also contains an account of the genealogy of the Counts of Flanders, Lambert's local nobility, which may have inspired his personal account.

So, in his personal copy Lambert included a tree of his maternal ancestors going back to his great-great-grandfather Odwin, as shown in the first figure. It is rather scrappy and unclear, and so Jean-Baptiste Piggin has digitized a copy, as shown below.

There are c.80 names crammed into the compact space. As with other early pedigrees of which we have a record (eg. The first royal pedigree), the tree is rooted at the top and the family ramifies downwards. Like the Great Stemma (see How confusing were the first written genealogies?), siblings are grouped in short vertical lists, so that groups of first-names form family blocks that have only one connection to their parent.

Lambert is at the bottom centre, labelled as "qui librum fecit Lambertus filius Onulfi; Eva" [Lambert who produced the book, son of Onulph and Eva]. His lineage is traced back to Eva and her siblings, so that these are Lambert's maternal relatives. Why his mother and not his father is not directly explained, but the genealogy is listed as being that of Odwin and Heimericus the Dean, so that Heimericus is presumably the important progenitor (his family dominates the tree). Lambert does refer elsewhere in the book to his father, Onulph, who had been canon of the Church of Our Lady before him. Just in case you are left in any doubt about the purpose of the pedigree, the text at the top left of the figure specifies Lambert's direct lineage from Odwin to Heimericus the Dean to Baduif to Eva and thence himself.

How accurate this genealogy is is anyone's guess. Presumably it represents an oral tradition, even if many of the relatives continued to live close to each other. It was not until much later that formal records were kept. In Britain, for example, from 1538 King Henry VIII required that church ministers keep records of christenings, baptisms, marriages and burials; and civil registration did not become law until 1837. The Germanic lands began to keep similar sacramental records at roughly the same time as the British; and the Scandinavian countries followed suit. Thus, in most European countries it is the church parish registers that pre-date any civil record keeping. Otherwise, for commoners there have been only personal records.

December 2, 2014

Network diagrams have become rather commonplace in the modern world. Most of them are constructed along the same lines — observed entities (objects or concepts, or groups of them) are connected by lines showing observed relationships. Such visualizations are relatively easy to create using computers, and so they represent a relatively new form of visual data analysis. The complexity of the diagrams can be both seen and quantitatively analyzed, thus forming part of what is now grandiosely called "data mining and knowledge discovery".

The Visual Complexity project has been compiling an interesting set of online network visualizations. While the author (Manuel Lima) intends this to be "a unified resource space for anyone interested in the visualization of complex networks", at the moment it is simply a magpie collection of references to web pages. There are currently nearly 800 visualizations referenced, grouped into:
  • Art
  • Music
  • Biology
  • Food Webs
  • Transportation Networks
  • Business Networks
  • Social Networks
  • Political Networks
  • Computer Systems
  • Internet
  • World Wide Web
  • Pattern Recognition
  • Semantic Networks
  • Knowledge Networks
  • Multi-Domain Representation
  • Others
Our interest is in the Biology group, of course, where we have long known about networks, including food webs, which you will notice are grouped separately. There are currently 52 networks (plus 8 in the Food Web group), covering a wide range of topics, such as:
  • Gene interaction networks
  • Protein-protein interaction networks
  • Protein "homology" networks
  • Neuron networks
  • Haplotype blocks
  • Metabolic pathways
  • Genome maps
  • Physiology maps
  • Disease maps
  • Visualizing the aging process

This is all very well. However, we are specifically interested in phylogenetic networks, which are as old-fashioned as food webs. They differ significantly from these other biological networks. Phylogenies connect observed entities (objects, or groups of them) only indirectly, via unobserved nodes, with the lines representing inferred affinity or genealogical relationships. Only at the population level is it likely that all internal nodes, representing individuals, will be observed, and that their relationships might also be observed.
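The structural difference can be sketched in a few lines. In the toy example below, a phylogenetic network is stored as a list of directed parent-to-child edges; the observed entities are the leaves, the inferred ancestors are the internal nodes, and a reticulation is any node with more than one parent. All node names and the function name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def classify_nodes(edges):
    """Split the nodes of a rooted network, given as (parent, child) pairs,
    into leaves, internal nodes, and reticulation nodes (more than one parent)."""
    parents = defaultdict(set)
    children = defaultdict(set)
    nodes = set()
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)
        nodes.update((parent, child))
    leaves = {n for n in nodes if not children[n]}
    internal = nodes - leaves
    reticulations = {n for n in nodes if len(parents[n]) > 1}
    return leaves, internal, reticulations

# Toy network: A, B and C are observed; root, u1, u2 and H are inferred.
# H has two parents (u1 and u2), so it represents a reticulation event.
edges = [
    ("root", "u1"), ("root", "u2"),
    ("u1", "A"), ("u1", "H"),
    ("u2", "H"), ("u2", "C"),
    ("H", "B"),
]

leaves, internal, reticulations = classify_nodes(edges)
print("observed (leaves):   ", sorted(leaves))
print("inferred (internal): ", sorted(internal))
print("reticulations:       ", sorted(reticulations))
```

In a typical interaction network every node would be an observed entity, whereas here more than half of the nodes are inferred, and no observed node is connected directly to another.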

There are currently three phylogenies referenced by Visual Complexity:
Only the last of these is a network, the other two being trees. Sadly, the first one also contains a dead link, which is a problem common for most multi-year internet projects.

Unfortunately, the uniqueness of phylogenies among networks is not acknowledged by the Visual Complexity site. This is not unusual amongst network researchers, most of whom have never even heard of phylogenies. Moreover, many of the people who do seem to have heard of them often fail to understand them and their interpretation, so that they do not notice the fundamental difference. Nevertheless, phylogenetic networks are among the oldest types of recorded network, and there are certainly complex versions of them dating back to the 1700s (see those by Herman and by Batsch in Affinity networks updated).

Finally, the Visual Complexity site does not yet have much from anthropology (as distinct from the social sciences in general) or anything from linguistics (other than programming languages!). These are promising areas for studies of visual complexity.

November 30, 2014


I mentioned in a previous post that genealogies first appeared as human pedigrees, initially based on biblical histories (The role of biblical genealogies in phylogenetics). However, such ideas were also adopted by the Roman nobility as stemmata (literally, garlands connecting portraits of ancestors) to be displayed in their homes. The latter pedigrees were used to assert the nobility of the nobles by right of family descent — stemmata distinguished between the patrician class (those with noble ancestry) and plebeians (commoners). This usage continues to this day, in most parts of the world.

However, there are no extant pedigrees (of real people) from the earliest times. The first preserved written records appear towards the end of the first millennium CE, when family chronicles began to be written by clerics in the courts or monasteries of northern France. For example, the Genealogia Arnulfi Comitis [Genealogy of Count Arnould] was compiled between 951 and 959 CE by the Benedictine monk Witger, listing the pedigree of the counts of Flanders. It was preserved at the abbey of Saint Bertin, and is reproduced in Monumenta Germaniae Historica, Tomus IX (1851) pp. 302-304.

This seems to have been as much a response to the feudal inheritance system (automatic consanguineous inheritance of fiefs) as it was a concern for familial prestige or preserving the memory of ancestors. Legitimacy of succession was the key motif, not history. It might have been this motivation that led to the use of diagrams, as these illustrate the succession in unambiguous terms.

The first known illustration of a pedigree is the Tabula Genealogica Carolingorum from c.1000 CE. Here, Cunigunde of Luxembourg's ancestry is traced in a tree-like manner to include Charlemagne, thus legitimizing her claim to being of royal descent. Cunigunde (c.975-1040) married Henry, Duke of Bavaria, in 999. He became King Henry II of Germany ("Rex Romanorum") in 1002, at which point she became Queen consort of Germany (1002-1024); and when he was crowned Holy Roman Emperor ("Romanorum Imperator") in 1014, which was the tradition for the King of Germany, she became Empress consort of the Holy Roman Empire (1014-1024). Henry died in 1024, and Conrad II was elected to succeed him.

Cunigunde's ancestry is thus of some practical importance. Being able to trace that ancestry to Charlemagne ("Charles the Great") is of especial interest, as it made her a descendant of the Carolingian dynasty. Charlemagne (c.742-814 CE) was the last great ruler of a united Western Europe. When his son, Louis the Pious (778–840), died, his own sons fought over the succession. The resulting Treaty of Verdun (843) divided the Carolingian Empire into three kingdoms, without any consideration for linguistic or cultural groupings. Europe has been arguing over national boundaries ever since; and the European Union is thus the first serious attempt to return to Carolingian times for more than 1,100 years.

The oldest copy of the Tabula Genealogica Carolingorum is shown in the first figure. It is held by the Bayerische Staatsbibliothek, in Munich, under the shelfmark BSB Clm 29880(6. Since it is almost unreadable, Jean-Baptiste Piggin has digitized a copy, as shown above.

The pedigree is drawn very like an upside-down tree. (Actually, it looks like a chandelier hanging from the ceiling.) The ancestors of Charlemagne form a trunk at the top, and his descendants fan out as tree branches at the bottom. Cunigunde herself is at the bottom-left, labelled "Cynigund imperatrix" [empress]. She is thus part of the seventh generation from Charlemagne (labelled "Karolus rex" and also "imperator in Frantia"). Her connection is through Louis the Pious' second son, who became "Karolus rex Francie et Hispaniae". Her ten siblings are not shown.

Charlemagne's ancestors are traced back 200 years, to the mid 500s CE. The ancestry as shown is via the male lineage back to Arnulf of Metz (c.582-640). However, the person listed at the root of the pedigree, Arnoald of Metz (c.540/560-c.611), is disputed — he may have been the father of Arnulf's wife (Doda), rather than of Arnulf himself.

Cunigunde's husband is shown in a separate pedigree of seven people at the bottom right. He is labelled "Heinricus dux Baioariae" — the rest is unreadable but Piggin transcribes it as "postea imperator" [later emperor].

There is also an annotated transcription in the Monumenta Germaniae Historica, Tomus II (1829) p.314, as shown above. This is taken from the copy in the Codicum Manuscriptorum Bibliothecae Regiae Monacensis. It is displayed in a much more conventional modern form; and it lists Henry as "Romanorum imperator".

Piggin notes that another version of the pedigree was drawn between 1101 and 1111 CE at the monastery of Prüm and bound into the Liber Aureus, a book of important Prüm documents. Finally, there is also a version of the pedigree that tries to hint at a divine origin for the nobles, as shown in the figure below. This is from the Chronicon Universale at the Thüringer Universitäts- und Landesbibliothek, in Jena, Codex Bose quarto 19 fol. 152v. Several editions of this book were produced between 1100 and 1125 CE.

In noble pedigrees, the presence of sacred progenitors who sanctify the lineage is not uncommon, as this legitimizes the nobility in religious as well as secular terms. Interestingly, this idea seems to trace all the way back to the Ancient Greeks, who employed genealogy to prove descent from a god or goddess.

November 25, 2014


This is the 300th post on this blog, and so I thought we might have a bit of a summary. Here is the early history of phylogenetic trees and networks as we currently know it. There may, of course, be as yet undetected sources. Details of each of these historical notes (including illustrations) can be found elsewhere in this blog — you can use the search feature in the right side-bar to find them.


Genealogies as pedigrees (the history of individuals) have a long history. For example, they appear in inscriptions concerning the pharaohs of Ancient Egypt, although these are very imprecise and have caused many headaches for modern scholars. They appear as chains of ancestors and descendants in the Old Testament of the Christian Bible, often contradicting each other and claiming impossible lifespans. Most importantly for modern usage, they were employed in the New Testament to legitimize Jesus as the messiah foretold in the Old Testament. The first known illustration of this appeared in c.400 AD, and it was actually a network, as there were two lineages leading to Jesus (via both Joseph and Mary).

The apparent success of this application (later called the Tree of Jesse, pictures of which started appearing in the 10th century) has meant that both royalty and the nobility have subsequently used pedigrees to assert their own right to be regal and noble. The first known illustration of this is from c.1000 AD, in which Cunigunde of Luxembourg's ancestry was traced in a tree-like manner to include Charlemagne, thus legitimizing her claim to being royal.

Also, up until 1215 AD marriage within seven degrees of separation was not allowed by the Christian church, and intestate inheritance applied the same relationship limit. So, a record of blood ties among relatives was often needed; and these started appearing in family bibles, for example. The first recorded tree-like illustrated pedigree was for Lambert of Saint-Omer, which appeared in 1122 AD in his personal copy of his book Liber Floridus.

It seems obvious, then, to also construct genealogies for groups of organisms, which we now call phylogenies (a word coined by Ernst Haeckel in 1866). The Great Chain of Being was for a long time the most popular iconography for relationships, mainly because it neatly tied in with the Christian philosophy of a chain of intellectual ideas, leading from pragmatic earthly concerns and culminating in the idealistic heavens. Humans were, of course, at the head of the chain of earthly beings, and capable of ascending to the heavens.

However, this did not work from a purely observational point of view. Observed pedigrees were not linear, but branched with each generation and often fused again via marriage. Furthermore, biodiversity (the patterns among groups of organisms) also seemed to have multiple relationships. This led Vitaliano Donati in 1750 (Della Storia Naturale Marina dell' Adriatico) to suggest that:
In addition, the links of the chain are joined in such a way within the links of another chain, that the natural progressions should have to be compared more to a net than to a chain, that net being, so to speak, woven with various threads which show, between them, changing communications, connections, and unions. [from the original Italian]

He was not alone in this thought, although others chose different metaphors. For example, Carl von Linné in 1751 (Philosophia Botanica) wrote this:
All plants show affinities on either side, like territories in a geographical map. [from the original Latin]

Neither author published a reticulating diagram to illustrate their thoughts, although one of Linné's students subsequently produced a version of his ideas in 1792 (Caroli a Linné, Praelectiones in Ordines Naturales Plantarum).

So, it was Georges-Louis Leclerc, Comte de Buffon, who produced the first empirical phylogeny in 1755 (Histoire Naturelle Générale et Particulière, Tome V). This was a network showing the evolutionary origin of domesticated dog breeds. This was followed by Antoine Nicolas Duchesne in 1766 (Histoire Naturelle des Fraisiers), who produced a network showing the evolutionary origin of strawberry cultivars. In both cases the evolutionary process illustrated by the reticulations in the network was hybridization. Note that both of these diagrams refer to within-species genealogies, rather than to relationships between species; and neither author seems to have contemplated the idea of among-species phylogenies.

Thus, in both theory and practice modern phylogenetic metaphors started as networks, not trees. It was Peter Simon Pallas in 1776 (Elenchus Zoophytorum) who first suggested using a tree as a simplified metaphor:
As Donati has already judiciously observed, the works of Nature are not connected in series in a Scale, but cohere in a Net. On the other hand, the whole system of organic bodies may be well represented by the likeness of a tree that immediately from the root divides both the simplest plants and animals, [but they remain] variously contiguous as they advance up the trunk, Animals and Vegetables; [from the original Latin]

Again, no diagram was forthcoming to illustrate this. It was Jean-Baptiste Pierre Antoine de Monet, Chevalier de Lamarck, who finally produced an empirical phylogeny in 1809 (Philosophie Zoologique). This was a small tree showing the evolutionary relationships among the major groups of animals. However, it represented what we would now call transformational evolution, as Lamarck did not believe in extinction, and thus he showed one group transforming into another. This differed from both Buffon and Duchesne, who were illustrating a process of increasing diversity of groups. It also differed by referring to supra-species relationships.

For the next 50 years, diagrams showing biodiversity relationships illustrated what we now call patterns of affinity, rather than showing historical relationships. These affinity diagrams showed apparent similarities among groups of organisms, without any implication that the relationships were the result of evolutionary history. The majority of these diagrams were networks rather than trees, indicating that groups of organisms had observed similarities with several other groups.

It is Charles Darwin and Alfred Russel Wallace who are credited with introducing, in 1858, the idea that natural selection could be the important process by which new species arise, although the idea of natural selection itself had been "in the air" for more than half a century with respect to within-species variation. (In the case of Patrick Matthew, he had also suggested a role in the origin of new species; 1831, On Naval Timber and Arboriculture; with Critical Notes on Authors who have Recently Treated the Subject of Planting).

As was by now becoming a tradition, neither Darwin nor Wallace (nor Matthew) produced a diagram to illustrate their thoughts. Darwin did draw a theoretical diagram in his subsequent 1859 book (On the Origin of Species by Means of Natural Selection), but he used it to illustrate continuity of evolutionary descent and the processes of extinction and diversification, rather than strictly as representing a phylogeny. His famous "Tree of Life" metaphor had nothing to do with the diagram (it was a Biblical metaphor, to stimulate the imagination of his readers).

The first person to get into print what we could call an empirical diagram representing Darwin's idea was Johann Friedrich Theodor Müller in 1864 (Für Darwin), who drew a small (three-species) tree of amphipods. This was followed by St George Jackson Mivart in 1865 (Contributions towards a more complete knowledge of the axial skeleton in the primates. Proceedings of the Zoological Society of London 33: 545-592). This was a much more extensive diagram illustrating possible evolutionary relationships among primate species (including humans) based solely on their body skeleton.

Confusion between trees and networks reappeared at this time. In particular, Franz Martin Hilgendorf had produced an unpublished PhD thesis in 1863 (Beiträge zur Kenntniß des Süßwasserkalkes von Steinheim) during which he constructed an empirical network of relationships among extinct snail species; but he rejected this because it did not match the Darwinian idea of an evolutionary tree. He later collected more data, and instead published a phylogenetic tree in 1866 (Planorbis multiformis im Steinheimer Süßwasserkalk: ein Beispiel von Gestaltveränderung im Laufe der Zeit).

Thus, we last saw an explicit evolutionary network in 1766, referring to within-species variation. The first person to publish an evolutionary network showing relationships among species was apparently Ferdinand Albin Pax in 1888 (Monographische Übersicht über die Arten der Gattung Primula. Botanische Jahrbücher für Systematik, Pflanzengeschichte und Pflanzengeographie 10: 75-241). He produced 14 networks of various primula species, apparently showing affinity relationships, but three of these also illustrate hybridization, which is strictly an evolutionary process.


Genealogies appear in anthropology as well as in biology. Any human creation can be considered to have a history of "descent with modification" if copies are passed from generation to generation (eg. languages, books, tales). For our purposes here, the most important historical developments were in linguistics (language studies) and in stemmatology (manuscript studies).

Georg Stiernhielm appears to have been the first linguist to draw a genealogy, when he produced a small network of Germanic languages in 1671 (De Linguarum Origine Præfatio, the preface to his edition of Evangelia ab Ulfila Gothorum). This was followed by Félix Gallet in c.1800 (Arbre Généalogique des Langues Mortes et Vivantes), who produced a single broadsheet with a network of Indo-European languages.

Note that, as for biology, the modern metaphors started as networks, not trees. More importantly, note that Stiernhielm's diagram pre-dated Buffon's dog network by more than 80 years — evolutionary ideas were less revolutionary in linguistics than they were in biology.

Darwin explicitly noted a connection between language genealogies and biology genealogies in 1859. However, the first people to get into print what we could call empirical diagrams representing Darwin's idea did so before Darwin published anything on the subject. In 1853 František Ladislav Čelakovský published a tree depicting a history of the Slavic languages (Čtení o Srovnávací Mluvnici Slovanské na Universitě Pražskě), and Auguste Schleicher published one on the development of the Indo-Germanic language family (Die ersten Spaltungen des Indogermanischen Urvolkes. Allgemeine Monatsschrift für Wissenschaft und Literatur 1853: 786-787).

Stemmatology differs from linguistics and biology in first producing a tree rather than a network. Hans Samuel Collin and Carl Johan Schlyter produced this in 1827 (first volume of Corpus Iuris Sueo-Gotorum Antiqui), with a tree of relationships among hand-written copies of documents containing the Medieval laws of Sweden. This was also a tree that represented Darwin's genealogical idea, and so it may be considered to be the first one of that type to be published (ie. 25 years before Čelakovský and Schleicher, and 30 years before Darwin).

This early lead was followed by the first network in 1832, when Friedrich Wilhelm Ritschl's stemma of a book by Thomas Magister (Thomae Magistri sive Theoduli Monachi Ecloga vocum Atticarum) explicitly showed sources of contamination among the manuscript copies — that is, different parts of a manuscript were copied from different sources, rather than strict ancestor-descendant copying.

Interestingly, the tree metaphor didn’t endure in anthropology as well as it did in biology. It was quickly replaced by alternative metaphors, such as wave, web, warp & weft, lattice and other continuously reticulating images. Horizontal flow of information has always been seen as a dominant force in anthropological histories.



1671 Georg Stiernhielm — small language network
1750 Vitaliano Donati — biology network suggestion
1751 Carl von Linné — biology map suggestion
1755 Georges-Louis Leclerc, Comte de Buffon — intra-species network
1766 Antoine Nicolas Duchesne — intra-species network
1792 Carl von Linné — map
1800 Félix Gallet — language network
1832 Friedrich Wilhelm Ritschl — small manuscript network
1863 Franz Martin Hilgendorf — unpublished inter-species network
1888 Ferdinand Albin Pax — inter-species network


1776 Peter Simon Pallas — biology tree suggestion
1809 Jean-Baptiste Pierre Antoine de Monet, Chevalier de Lamarck — small inter-species tree
1827 Hans Samuel Collin and Carl Johan Schlyter — manuscript tree
1853 František Ladislav Čelakovský — language tree
1853 Auguste Schleicher — language tree
1859 Charles Robert Darwin — generalized tree
1864 Johann Friedrich Theodor Müller — small inter-species tree
1865 St George Jackson Mivart — large inter-species tree
1866 Franz Martin Hilgendorf — large inter-species tree

November 23, 2014


Infographics have become very popular in recent decades, with the advent of computer graphics packages. Infographics combine data and pictures, trying to produce an aesthetically pleasing but still informative presentation of numeric information. Recently, the following book appeared:
The Infographic History of the World (2013)
by Valentina D’Efilippo & James Ball
HarperCollins (UK) / Firefly Books (US)

A selection of the infographics can be perused at the senior author's web page. On the blog, the author also explains her intentions:
The Infographic History of the World is a new book that continues to push the field of infographics forward. Our task required research, organization and the selection of topics. Then, we needed to decide how to display data in order to tell a coherent and compelling story. We have never considered this to be an alternative to tons of books of history, but hopefully a refreshing interpretation of what history is about. With this book, we hope to lead readers on a journey, to interpret the data and find the implications that resonate with them. We don’t pretend that every set of data presents an unquestionable truth. And, rather than looking to define the world’s history, we were looking to present readers with an unconventional interpretation of the subject.

Sadly, these good intentions have not always been achieved. As noted by a review at Amazon:
the book showcases *clever* ways of displaying data, not *clear* ways of displaying it ... Far too often I had to pore over the graphic to figure out what it was trying to say.

What is worse for the readers of this blog, the information is not always correct. Consider this version of the Tree of Life, which has a long-standing tradition in systematics as one of the world's first examples of an infographic:

Quite a number of the taxonomic labels are misplaced. You can check them for yourselves, but here is a selection of some of the surprising information contained in this infographic:
  • Amphibians are not Tetrapods
  • Humans are not Mammals
  • Mammals are not Amniotes
  • Turtles are not Reptiles
  • Lobe-finned fishes are not Sarcopterygians
  • Ray-finned fishes are not Bony Vertebrates
  • Charophytes are Land Plants
  • Hornworts are Vascular Plants
  • Ferns and Horsetails are not only Seed Plants, they are Gymnosperms
  • Conifers, Gnetophytes, Ginkgo and Cycads are not Gymnosperms
 Clearly, little has been done to check the veracity of the information in this infographic, which completely defeats its purpose.
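Checking labels like these amounts to asking whether one group is nested inside another. Here is a minimal sketch of that check in Python, assuming a hand-entered and heavily simplified hierarchy. The `PARENT` table, the function names and the choice of groups are illustrative assumptions, not anything taken from the book or from a taxonomic database.

```python
# Simplified, hand-entered hierarchy: each group maps to the group that contains it.
# The selection of groups is an illustrative assumption, not a complete classification.
PARENT = {
    "Humans": "Mammals",
    "Mammals": "Amniotes",
    "Turtles": "Reptiles",
    "Reptiles": "Amniotes",
    "Amniotes": "Tetrapods",
    "Amphibians": "Tetrapods",
    "Tetrapods": "Sarcopterygians",
    "Sarcopterygians": "Bony Vertebrates",
    "Ray-finned fishes": "Bony Vertebrates",
    "Conifers": "Gymnosperms",
    "Cycads": "Gymnosperms",
    "Gymnosperms": "Seed Plants",
    "Seed Plants": "Vascular Plants",
    "Ferns": "Vascular Plants",
    "Vascular Plants": "Land Plants",
    "Hornworts": "Land Plants",
}


def is_within(taxon, group):
    """Return True if `taxon` is `group` or is nested anywhere inside it."""
    current = taxon
    while current is not None:
        if current == group:
            return True
        current = PARENT.get(current)  # None once we run off the top of the table
    return False


def misplaced(claims):
    """Return the (taxon, group, stated) claims that disagree with PARENT."""
    return [c for c in claims if is_within(c[0], c[1]) != c[2]]


# A few statements as the infographic presents them, plus one correct control.
claims = [
    ("Amphibians", "Tetrapods", False),
    ("Humans", "Mammals", False),
    ("Hornworts", "Vascular Plants", True),
    ("Ferns", "Gymnosperms", True),
    ("Conifers", "Seed Plants", True),  # control: this one is right
]

for taxon, group, stated in misplaced(claims):
    print(f"Check the label: {taxon} / {group} (shown as {stated})")
```

Only the first four claims are reported; the control passes because Conifers sit inside Gymnosperms, which sit inside Seed Plants.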

November 18, 2014


In a previous post I introduced the Great Stemma as the earliest known pedigree, being a genealogical view of biblical history (The first infographic was a genealogy). In it I noted that people were enclosed in circles, which were connected by lines showing relationships, much as we still do today. However, the lines combined marriage, parent-offspring and brotherly relationships without distinction. So, while it is a good first attempt, the Great Stemma leaves room for informational confusion, and this was not corrected at any time during its centuries of being copied. (In fact, confusion was increased through embellishments, deletions and modifications; but that is another story.)

To illustrate the potential problem of interpreting this early type of genealogy, I have included here a specific example.

The above excerpt from the Stemma shows the children of Jacob by his wife Leah (who is shown at the top centre), and their subsequent children (ie. Leah's grandchildren). I have annotated the diagram to show parent-offspring (P), brother (B) and half-brother (HB) relationships. Note that all relationships are between males unless specified otherwise (so, half-brothers have the same father).

Leah is at the top [generation 1], with her six sons in a row below her (in birth order left to right), and her daughter to the side [generation 2]. Below this is the first-born son of each of the sons [generation 3], followed in columns down the page by their later sons, in birth order. Sons by later relationships are shown as half-brothers. At the bottom are two of Leah's great-grandchildren [generation 4].

Thus, the genealogical diagram does not effectively separate the generations visually, and parental and fraternal relationships are depicted in the same way. These days we solve this, of course, by keeping each generation as a single row and linking each child directly to the parent. It is easy to get used to the Stemma way of doing it, because it is fairly consistent about the arrangement. If there is confusion, then each circle does specify the relationship in words.
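The modern convention of one row per generation can be sketched in a few lines of Python, assuming that only parent-to-child links are stored and that each person appears under a single parent. The `CHILDREN` table and the function name are illustrative assumptions. The names of Leah's children are taken from Genesis rather than from the diagram, and the third generation is limited to Judah's first three sons for brevity.

```python
from collections import deque

# Parent -> children links, with children listed in birth order.
# Only part of the family is entered; this is an illustrative subset.
CHILDREN = {
    "Leah": ["Reuben", "Simeon", "Levi", "Judah", "Issachar", "Zebulun", "Dinah"],
    "Judah": ["Er", "Onan", "Shelah"],
}


def generation_rows(root, children):
    """Group people into rows by their number of generations from `root`."""
    rows = {}
    queue = deque([(root, 1)])
    while queue:
        person, generation = queue.popleft()
        rows.setdefault(generation, []).append(person)
        for child in children.get(person, []):  # birth order is preserved
            queue.append((child, generation + 1))
    return [rows[g] for g in sorted(rows)]


for number, row in enumerate(generation_rows("Leah", CHILDREN), start=1):
    print(f"generation {number}: {', '.join(row)}")
```

Each row holds one generation only, so a brother can never be mistaken for a son, which is exactly the ambiguity that the column layout of the Stemma allows.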

So, as I noted, this is a good first attempt, but some of the things that we now feel need distinguishing were not distinguished by the (unknown) original author.

However, the 24 extant copies of the Stemma are not identical, and two of them try to fit more information into Leah's family tree than is shown above. This information concerns the origin of the fourth generation, which is accurately depicted as far as it goes, but the above figure leaves out a lot. Some of the extra information is shown in the Stemma version below, which adds two extra people, both of them wives. I have annotated this version in the same way as the previous one, except that this pedigree adds one more relationship to the mix: marriage (M).

The extra details come from Genesis 38, which describes a set of relationships that would make a modern television soap-opera scriptwriter jealous. The story goes something like this (I have indicated the named people with letters in the diagram above, with Leah as L):
Judah (J) marries the [unnamed] daughter (W) of Shua. Judah and his wife have three children, Er (E), Onan (O), and Shelah (S). Er marries Tamar (T), but God kills him because he "was wicked in the sight of the Lord" (Gen. 38:7). Tamar becomes Onan's wife in accordance with the custom of the time, but he too is killed by God after he refuses to father children for his older brother's childless widow, and "spills his seed on the ground" instead (Gen. 38:8-10). Although Tamar should marry Shelah, the remaining brother, Judah does not consent, for fear of his son's life (Gen. 38:11). In response, after Judah's wife has died, Tamar deceives Judah into having intercourse with her, by pretending to be a prostitute (Gen. 38:12-23). When Judah discovers that Tamar is pregnant he prepares to have her killed, but recants and confesses when he finds out that he is the father (Gen. 38:24-26). The result is twin boys, Zerah (Z) and Perez (P) (Gen. 38:27), who are accepted as Judah's sons.

Biblically, this story is important because Judah became the founder of the Tribe of Judah, one of the twelve Tribes of Israel. Their land encompassed most of the southern portion of the Land of Israel, including Jerusalem. Both the Book of Ruth and the Gospel of Matthew identify Tamar's son Perez as an ancestor of King David, which makes Judah and Tamar also ancestors of Jesus.

For our purposes here, though, the interesting thing is the confusion caused by trying to add the two marriage relationships to the pedigree. These are in no way distinguished visually from the paternal and fraternal relationships, although the circled text does specify the relationship in words. Today, we solve this potential confusion by using horizontal lines for marriage relationships and vertical lines for parent-offspring relationships.

Equally importantly, note that Tamar's (legal) relationship supplants the (biological) parent-offspring relationship between Judah and her sons — you would never conclude from the diagram that Perez was Judah's son, for example, rather than Er's. However, note the neat attempt to keep Tamar's children in a single column by putting one twin above her and one below (perhaps also signifying simultaneous birth).
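The same point can be made with a small data structure. Here is a minimal Python sketch in which every relationship from the Genesis 38 story carries an explicit type, so that marriage and parentage cannot be confused. The `RELATIONSHIPS` list, the two type labels and the function names are illustrative assumptions, and "Shua's daughter" simply stands in for Judah's unnamed wife (W).

```python
# Each entry is (person, relationship type, other person).
# For "parent" the first person is the parent; "marriage" has no direction.
RELATIONSHIPS = [
    ("Leah", "parent", "Judah"),
    ("Judah", "marriage", "Shua's daughter"),
    ("Judah", "parent", "Er"),
    ("Judah", "parent", "Onan"),
    ("Judah", "parent", "Shelah"),
    ("Shua's daughter", "parent", "Er"),
    ("Shua's daughter", "parent", "Onan"),
    ("Shua's daughter", "parent", "Shelah"),
    ("Er", "marriage", "Tamar"),
    ("Onan", "marriage", "Tamar"),
    ("Judah", "parent", "Perez"),
    ("Judah", "parent", "Zerah"),
    ("Tamar", "parent", "Perez"),
    ("Tamar", "parent", "Zerah"),
]


def parents_of(person):
    """Return the recorded parents of `person`, in alphabetical order."""
    return sorted(a for a, kind, b in RELATIONSHIPS
                  if kind == "parent" and b == person)


def spouses_of(person):
    """Return everyone recorded as married to `person`, in alphabetical order."""
    found = set()
    for a, kind, b in RELATIONSHIPS:
        if kind == "marriage" and person in (a, b):
            found.add(b if a == person else a)
    return sorted(found)


print("Parents of Perez:", parents_of("Perez"))
print("Spouses of Tamar:", spouses_of("Tamar"))
```

With the types kept separate, Perez is recorded as the son of Judah and Tamar, while Er and Onan appear only as Tamar's husbands. That is the distinction modern pedigrees draw with horizontal and vertical lines.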

The above part of this post was inspired by a blog post from Jean-Baptiste Piggin (The Tamar Storyboard). The first picture above is from an unnamed manuscript in the Biblioteca Medicea Laurenziana, Florence, Plut.20.54, dated c. 1050 AD. The second picture is from an unnamed manuscript in the Pierpont Morgan Library, New York, M.644, dated 940-945 AD.

Moving on, the scribes of that time tried to go even further in complicating simple genealogies, as shown in the next figure. This is drawn by Stephanus Garsia Placidus, and is taken from the Saint-Sever Beatus in the Bibliothèque Nationale de France, Paris, ms. lat 8878, dated c. 1060 AD.

It shows the non-Semitic (ie. polytheistic) part of Noah's family. Noah is at the top right (sacrificing two doves), with his son Japheth (J) to the left and son Ham (H) below. Their wives (W) are indicated by intersecting circles, rather than by lines, which is a more successful approach than in the Stemma. Their descendants are shown in roughly the same style as above, with the first-born son followed by the later ones in order (so that the P and B relationships are not clearly distinguished) — Japheth has seven sons and Ham has four.

However, the illustrator has also tried to include a lot of history in this genealogy. For example, the sons of Ham's son Cush end with Nimrod (N), who has a small essay attached to his name. Among other things, he founded Babel, the city that plays an important role later in the Bible. Moreover, the sons of Ham's son Canaan (C) are shown as a reticulating network rather than as a simple chain. This apparently represents their roles as founders of the 11 tribes who originally occupied the ancient Land of Canaan, and who were later driven out and enslaved by the Israelites. These lines thus represent later history rather than parental or fraternal relationships.

This diagram is thus not a simple pedigree, as we would usually leave it today.