The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis

URL

XML feed
http://phylonetworks.blogspot.com/


July 22, 2014

22:30

I have written before about the expected genetic problems associated with inbreeding, including consanguinity and incest (relationships between people who are first cousins or closer). Conventionally, the evolutionary advantage of sexual over non-sexual reproduction is considered to be the creation of genetic diversity through heterozygosity. Inbreeding, by reducing heterozygosity, then seems to negate the advantages of sexual reproduction — it leads to the propagation of deleterious recessive alleles and thus inbreeding depression. So, there is a clear evolutionary dimension to the fact that incest avoidance is nearly universal in humans.

The best known exceptions to this situation are among royalty, including the family "trees" of the ancient Egyptian 18th Dynasty (see Tutankhamun and extreme consanguinity) and the Egyptian Ptolemaic dynasty (see Cleopatra, ambition and family networks), which were hybridization networks rather than conventional trees. The presence of consanguinity and incest among royal families then requires a biological explanation. As noted by van den Berghe & Mesher (1980):
Royal incest is best explained in terms of the general sociobiological paradigm of inclusive fitness ... Royal incest (mostly brother-sister; less commonly father-daughter) represents the logical extreme of hypergyny. Women in stratified societies maximize fitness by marrying up; the higher the status of a woman, the narrower her range of prospective husbands. This leads to a direct association between high status and inbreeding.

The benefits of inclusive fitness refer to the increased number of offspring in future generations that result from increasing the reproductive success of close relatives. This is achieved via choice of mate. In other words, close relatives share genes, and the success of any relative in leaving offspring is a success for all relatives. Therefore, evolutionary fitness is a combination of individual fitness plus the fitness of close relatives. Inbreeding may reduce individual fitness but can increase inclusive fitness, as noted by Puurtinen (2011):
Theoretical work has shown that inclusive fitness benefits can favor close inbreeding even when this results in substantial reduction in offspring fitness. These models have identified the boundary level of inbreeding depression limiting the evolution of inbreeding among first-order relatives, that is, between full siblings, or between parents and offspring.

So, there is a stable level of inbreeding in those populations that practice mate choice for optimal inbreeding. For example, the genetic risks of close inbreeding can be more than accounted for by the production of a highly related heir who has access to a wide choice of mates. Nevertheless:
For a wide range of realistic inbreeding depression strengths, mating with intermediately related individuals maximizes inclusive fitness.

In other words, mating with very close relatives is unlikely to evolve via natural selection because it is not an optimal strategy; and we must thus look to a sociological component to incest (such as retaining wealth within the family), as well as a biological one.


In this context, it is interesting to note exceptions to the usual restriction of incest to the aristocracy. The society of Graeco-Roman Egypt (from c. 300 BCE to 300 CE) provides the best-documented case (e.g. see Hopkins 1980; Shaw 1992; Parker 1996; Scheidel 1997; Huebner 2007; Remijsen & Clarysse 2008). [This era starts with the Ptolemaic dynasty, which marks the collapse of Egyptian rule of Egypt.] During this time a significant proportion of all marriages noted in official Roman census declarations were between full brothers and sisters. That is, the Roman-era Egyptians did not limit this type of inbreeding to any small group, but spread it across several social classes (mainly Greek settlers rather than native Egyptians).

As noted by Scheidel (1997):
According to official census returns from Roman Egypt (first to third centuries CE) preserved on papyrus, 23·5% of all documented marriages in the Arsinoites district in the Fayum (n=102) were between brothers and sisters. In the second century CE, the rates were 37% in the city of Arsinoe and 18·9% in the surrounding villages. Documented pedigrees suggest a minimum mean level of inbreeding equivalent to a coefficient of inbreeding of 0·0975 in second century CE Arsinoe. Undocumented sources of inbreeding and an estimate based on the frequency of close-kin unions indicate a mean coefficient of inbreeding of F=0·15-0·20 in Arsinoe and of F=0·10-0·15 in the villages at the end of the second century CE. These values are several times as high as any other documented levels of inbreeding.

For comparison, the inbreeding F values for these family relationships are:
Relationship                          F
self                                  0.500
parent-offspring = siblings           0.250
uncle-niece = double first cousins    0.125
first cousins                         0.063
first cousins once removed            0.031
second cousins                        0.016
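
The F values listed above can be reproduced with Wright's path-counting formula, F = sum over common ancestors of (1/2)^(n1 + n2 + 1), where n1 and n2 are the numbers of generations from each parent back to the shared ancestor. The following is only a minimal sketch: it assumes that the common ancestors are themselves non-inbred and unrelated, and the function name and the pedigree encoding are hypothetical rather than taken from any of the cited papers.

```python
from fractions import Fraction

def inbreeding_coefficient(paths):
    """Wright's path formula for the offspring of two related parents.

    `paths` lists one (n1, n2) pair per common ancestor: the number of
    generations from each parent back to that ancestor. Assumes the
    common ancestors are themselves non-inbred and unrelated.
    """
    return sum(Fraction(1, 2) ** (n1 + n2 + 1) for n1, n2 in paths)

# (n1, n2) pairs for the relationships listed above (hypothetical encoding).
RELATIONSHIPS = {
    "self": [(0, 0)],                          # both gametes from one individual
    "parent-offspring": [(0, 1)],              # the parent is the common ancestor
    "siblings": [(1, 1)] * 2,                  # two shared parents
    "uncle-niece": [(1, 2)] * 2,               # two shared (grand)parents
    "double first cousins": [(2, 2)] * 4,      # four shared grandparents
    "first cousins": [(2, 2)] * 2,             # two shared grandparents
    "first cousins once removed": [(2, 3)] * 2,
    "second cousins": [(3, 3)] * 2,            # two shared great-grandparents
}

for name, paths in RELATIONSHIPS.items():
    print(f"{name:28s} {float(inbreeding_coefficient(paths)):.4f}")
```

Exact fractions are used so that the listed values (for example 1/16 for first cousins, shown above rounded to 0.063) are not affected by floating-point rounding.
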
However, inbreeding depression seems not to have been a notable problem during this historical time. As noted by John Hawkes:
There is not a single mention in the evidence that links sibling marriage to negative genetic effects or unhappy marriages.

This does not mean that there were no problems, but merely that any problems were not documented, as noted by Scheidel (1997):
Even in the absence of explicit references to inbreeding depression from Roman Egypt, there is no compelling reason to assume that brother–sister marriage could have remained entirely without negative consequences for the Arsinoites. It is however possible that, due to a low incidence of lethal recessives, such effects were considerably weaker than in some western samples. The census returns do not suggest lower levels of fertility or smaller numbers of children among sibling couples ...

The practice seems to have stopped solely because it was contrary to Roman Law:
Before a.d. 212 the Romans had accepted discrepancies between their own legal practice and prevailing local customs and traditions in the Eastern provinces. Papyri from Roman Egypt, the Talmud, and the Romano-Syrian law book indeed reveal legal procedures which differed significantly from Roman law in matters such as marriage, guardianship, paternal authority, sales, and debts. The Constitutio Antoniniana, however, made all free men and women of the Roman Empire into Roman citizens, and so Roman law became applicable to all inhabitants of Egypt. Brother-sister marriages cease to be documented in our Roman census returns from the early third century on. Our last [incest] testimony dates to a.d. 229.
References

Hopkins K (1980) Brother-sister marriage in Roman Egypt. Comparative Studies in Society and History 22: 303-354.

Huebner SR (2007) "Brother-sister" marriage in Roman Egypt: a curiosity of humankind or a widespread family strategy? Journal of Roman Studies 97: 21-49.

Parker S (1996) Full brother-sister marriage in Roman Egypt: Another look. Cultural Anthropology 11: 362-376.

Puurtinen M (2011) Mate choice for optimal (k)inbreeding. Evolution 65: 1501-1505.

Remijsen S, Clarysse W (2008) Incest or adoption? Brother-sister marriage in Roman Egypt revisited. Journal of Roman Studies 98: 53-61.

Scheidel W (1997) Brother-sister marriage in Roman Egypt. Journal of Biosocial Science 29: 361-371.

Shaw BD (1992) Explaining incest: brother-sister marriage in Graeco-Roman Egypt. Man 27: 267-299.

July 20, 2014

16:30

I have commented before on the fact that the general public associates an inappropriate "March of Progress" image with the concept of "evolution" (see Haeckel and the March of Progress, and especially Tattoo Monday VIII - the March of Progress). It therefore seems worthwhile to gather a few examples together in one place. Most of these are abbreviated versions of the image in the book Early Man by Francis C. Howell (1965. Time-Life International, New York). There were originally 14 images (see the version here), but the modern versions have half or fewer.


July 16, 2014

01:14
We all worked hard during the workshop. Here is our fearless leader, in deep thought:

While some of the younger participants enjoyed drawing on the walls:

Professor Whitfield has come up with a great new model of evolution: phylogenetic windmills:


There was not only work, but also time to relax and enjoy the beautiful Dutch summer weather:

And not to forget the delicious Dutch food:

But really, most of the time we were busy touching the data, which you can find on this website:


For more photos, see the Touching the Data website.

July 11, 2014

14:33

We have now completed the workshop.

Since the first report, we have had three more talks. First, Mukul Bansal outlined the relationship between phylogenetic networks and reconciliation analysis, and the way in which the latter can be used to construct the former. Starting from an estimated species tree, the tree for each locus is optimized for fit to the species tree, which helps locate any areas of extensive gene flow (i.e. reticulation). This can be done using a large number of loci and an even larger number of taxa.

Celine Scornavacca provided details of some of the fundamental limitations of network analysis. The most important of these is unidentifiability of network topologies -- there are classes of network topologies that cannot be distinguished based on the information that is currently used, so that we cannot guarantee that a unique optimal network will be found during an analysis. Branch lengths may help with this situation, but cannot guarantee to resolve it.

Jim Whitfield covered the advantages and potential problems of using genomic-scale data for phylogenetic analysis. The basic problem is the increased scope for error in moving to the genome data (genome assembly problems, gene homology issues, alignment difficulties), although the potential advantages are extensive.

Most importantly, we spent two days "touching" some data. The participants broke into smaller groups of continuously varying size, each of which focussed on a particular dataset (as supplied by some of the participants). These data were evaluated in many different ways, to assess the characteristics of the data as well as to evaluate the data-analysis methods. This not only allowed us to identify the current state of the art with respect to phylogenetic networks, but it also allowed computationalists to improve their understanding of biological data and how biologists proceed to analyze it, as well as allowing biologists to obtain immediate feedback with respect to their data-analysis issues.

Production of phylogenetic networks seems to have come a long way in the past few years, although there is still no single "one-stop shopping" software tool to use. Practical issues with getting programs to run on all computer types were identified, along with data-format issues. Nevertheless, all of the participants seemed to find that this was a very valuable exercise, as a means of focussing interactions among themselves.

Finally, we considered both European and U.S. funding for network research, in the latter case assisted by David Mindell (from the N.S.F.). In particular, we identified sources of funding for future workshops (either in the south of France or the north-eastern U.S.A.).

The canal-boat cruise turned out well, in spite of the somewhat uncooperative weather. The football, of course, has turned out to be rather disappointing for the hosts, although they have one more game to play.

July 8, 2014

13:15

We have now completed two days of the workshop. We have had a relaxed approach to progress, and are thus currently running behind the nominal schedule. Nevertheless, we are progressing splendidly.

We had three talks on the first day and one today. I tried to kick things off by asking a series of what I consider to be unanswered questions from observing practitioners and computationalists in action, although apparently several members of the audience already had their own answers to some of these. The bottom line is that phylogenetic analysis focuses on data patterns while interpretation focuses on processes / mechanisms, and this constitutes a large part of the apparent separation of practitioners and computationalists.

Steven Kelk and Luay Nakhleh introduced the diversity of computational approaches that we already have. These presentations neatly complemented each other, providing a valuable summary of the field as well as an overview of current limitations and future prospects. This topic was taken up later by various members of the audience, as one of the inherent problems for practitioners is how to navigate through the methods to choose a suitable one -- there are methods based on parsimony, likelihood and Bayesian analysis, and methods that tackle de novo network construction, gene tree / species tree reconciliation, gene tree scoring, and network presentation.

This topic was followed up today by presentations introducing some of the currently available software. Some of these have progressed significantly in recent years, notably PhyloNet and Dendroscope, and there are some relatively new ones, as well as even newer ones in the pipeline. Based on the literature, these programs are being dramatically under-used compared to their actual usefulness.

This morning Scot Kelchner introduced us to the application of Zen Buddhism to science in general and phylogenetics in particular. This went down much better than he seemed to be expecting -- there were apparently a lot of "Zen" people in the room. The basic idea is not to get trapped by preconceived expectations, especially arbitrary categorical notions, when interpreting the output of a phylogenetic analysis. You can consult The Nine-Headed Dragon River, by Peter Matthiessen, if you would like further information.

Finally, we got to the topic implied by the workshop's title: Touching the Data. We had a brief run-through of the pre-existing datasets stored with this blog (see the upper right-hand corner), which cover some of the diversity of what practitioners have provided to date in the way of usable datasets with "known" phylogenetic patterns.

By far the most interesting, however, was the presentation of some recent datasets made available by members of the workshop, notably Axel Janke (bear species), Scot Kelchner (bamboo species) and Mattis List (Indo-European languages) (Jim Whitfield will present his datasets tomorrow morning). These datasets generated much interest, as they provide a diversity of different possible applications for phylogenetic networks. The idea from here on in the workshop is to address what can currently be done with these datasets and what we might like to do with them if the tools were available. This will help focus the participants on specific practical issues, which should lead to the progress that we hope to achieve.

It has rained most of the day, which is actually unusual -- intermittent rain is more common in this climate. We are currently waiting for the football to start: Germany versus Brazil. Tomorrow will be the Netherlands versus Argentina. It is risky being in this country this week! The current local betting is for an all-European final, an assessment that involves no cultural bias whatsoever.

July 6, 2014

16:30

This week we have returned to Leiden (in the Netherlands), for another workshop sponsored by the Lorentz Center. The previous workshop, in October 2012, is discussed in this prior blog post: Workshop: The Future of Phylogenetic Networks.


The full title of the new workshop is: Utilizing Genealogical Phylogenetic Networks in Evolutionary Biology: Touching the Data. As before, it has been organized by Steven Kelk, Leo van Iersel, Leen Stougie and myself. The program and abstracts can be found here. It runs for the whole week 7 July – 11 July 2014.

The workshop differs significantly from the previous workshop in two ways: it is intended to be a much smaller and more focused workshop, and it is intended to be practical rather than theoretical. The basic aim is to get biologists and computational people to sit down in small groups and actually talk about real phylogenetic data, so that each side of the phylogenetics "coin" gets to understand a bit better what is going on on the other side. To this end, we have gathered together some of the experts in the field specifically of evolutionary / genealogical networks (rather than data-display networks), as this is the area that needs the greatest future development. We have also gathered together some real-world datasets involving apparent reticulating evolution, which will be the focus of discussion. These datasets are available here and also here.

The weather is predicted to be changeable during the workshop, which is to be expected in northern Europe even in summer — that is why everyone else has gone to southern Europe.

I am hoping to add some blog posts based on what happens at the workshop, as it proceeds.

July 2, 2014

16:30

I recently wrote a manuscript comparing the tree-likeness of phylogenetic data in biology and anthropology (see Are phylogenetic patterns the same in anthropology and biology?). While doing so, I also made a comparison of genotype and phenotype data within biology.

The comparison is based on maximum-parsimony analyses of the data, using the (ensemble) Retention Index (RI) as the measure of tree-likeness. If RI = 1 then all of the characters are compatible with the same tree, whereas if RI = 0 then none of them are pairwise compatible. As the graph shows, the genotype data are considerably less tree-like than are the phenotype data (mean RI ≈ 0.5 versus 0.7, respectively).
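
For readers unfamiliar with the measure, the ensemble Retention Index is usually computed as RI = (G - S) / (G - M), where M, S and G are the sums over all characters of the minimum possible, observed, and maximum possible numbers of steps. The following is only an illustrative sketch: the function name is hypothetical and the step counts are invented, and it is not the analysis used for the graph (which was done with PAUP*).

```python
def ensemble_retention_index(characters):
    """Ensemble RI = (G - S) / (G - M), summed over all characters.

    Each character is an (m, s, g) tuple: the minimum possible number of
    steps, the steps observed on the tree, and the maximum possible steps.
    """
    M = sum(m for m, s, g in characters)
    S = sum(s for m, s, g in characters)
    G = sum(g for m, s, g in characters)
    if G == M:
        raise ValueError("RI is undefined when no character is informative")
    return (G - S) / (G - M)

# Invented step counts for three binary characters (illustration only).
perfect_fit = [(1, 1, 3), (1, 1, 2), (1, 1, 4)]   # every character fits the tree
worst_fit = [(1, 3, 3), (1, 2, 2), (1, 4, 4)]     # every character at its maximum
mixed = [(1, 1, 3), (1, 2, 2), (1, 2, 4)]

for label, data in [("perfect", perfect_fit), ("worst", worst_fit), ("mixed", mixed)]:
    print(label, round(ensemble_retention_index(data), 3))
```

Summing before dividing is what makes this the ensemble index: characters with more possible homoplasy carry more weight than they would in a simple average of per-character values.
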

It would be interesting to know whether other people have observed this pattern. If it is general, then what causes it? Are the phenotype characters being chosen (subconsciously or not) because they show nested grouping patterns (which lend themselves automatically to a tree representation)? Or do the genotype data inherently have more stochastic variation? Does this mean that we should always be using phylogenetic networks for the representation of genotype data?


You can read the manuscript if you want the details of the analyses. Briefly, the initial collections of datasets were taken from Collard et al. (Evolution and Human Behavior 27: 169-184; 2006) — the graphed data are taken from the paper as I never managed to get the original datasets from the authors. I then supplemented this information with phenotype datasets from TreeBase (total of n=31) and miscellaneous genotype datasets from the literature (n=15). All of the datasets refer to vertebrates and insects (with one phenotype dataset from spiders). My parsimony analyses used the parsimony ratchet and PAUP*.

June 28, 2014

02:00

It was 14 years ago that the Millennium started, so there are still 986 years left to solve the following seven phylogenetic network Millennium problems. These are not necessarily the most important problems to solve from a biological point of view, but are challenging computational problems that have (at least) some biological relevance. The problems are all about phylogenetic networks, except for Problems 2 and 7 which are about the closely related topic of agreement forests. Solving these problems will not be rewarded with $1,000,000 but only with eternal fame.

In each of these problems, a phylogenetic network on X is a directed acyclic graph with a single root and no vertices that have only one incoming and only one outgoing arc, and in which each leaf is labelled by an element of X and each element of X labels one leaf.
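
The definition above can be checked mechanically. The following is a minimal sketch, assuming the network is given as a list of (tail, head) arcs plus a mapping from leaf vertices to labels; the function name and this input encoding are hypothetical choices made for illustration.

```python
from collections import defaultdict

def is_phylogenetic_network(arcs, leaf_labels, X):
    """Check the definition above for a digraph given as (tail, head) arcs.

    `leaf_labels` maps each leaf vertex to its label; `X` is the taxon set.
    """
    children, indeg = defaultdict(list), defaultdict(int)
    vertices = set()
    for u, v in arcs:
        children[u].append(v)
        indeg[v] += 1
        vertices.update((u, v))

    # Exactly one root (vertex with no incoming arc).
    roots = [v for v in vertices if indeg[v] == 0]
    if len(roots) != 1:
        return False

    # No vertex with exactly one incoming and one outgoing arc.
    if any(indeg[v] == 1 and len(children[v]) == 1 for v in vertices):
        return False

    # Leaves and elements of X must be in one-to-one correspondence.
    leaves = {v for v in vertices if not children[v]}
    labels = list(leaf_labels.values())
    if set(leaf_labels) != leaves:
        return False
    if len(labels) != len(set(labels)) or set(labels) != set(X):
        return False

    # Acyclic: Kahn's algorithm must be able to remove every vertex.
    remaining = {v: indeg[v] for v in vertices}
    queue, removed = list(roots), 0
    while queue:
        u = queue.pop()
        removed += 1
        for v in children[u]:
            remaining[v] -= 1
            if remaining[v] == 0:
                queue.append(v)
    return removed == len(vertices)

# A small network with one reticulation (vertex "h" has two parents).
good_arcs = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
             ("b", "h"), ("b", "z"), ("h", "y")]
good_labels = {"x": "x", "y": "y", "z": "z"}

# Vertex "a" has one incoming and one outgoing arc, so this is rejected.
bad_arcs = [("r", "a"), ("r", "y"), ("a", "x")]
bad_labels = {"x": "x", "y": "y"}

print(is_phylogenetic_network(good_arcs, good_labels, {"x", "y", "z"}))
print(is_phylogenetic_network(bad_arcs, bad_labels, {"x", "y"}))
```
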

Problem 1. Is the Hybridization Number problem fixed-parameter tractable (FPT) if the input is an unrestricted set of nonbinary trees and the only parameter is the hybridization number? Hybridization Number is the following problem. Given a finite set X, a collection T of rooted (possibly nonbinary) phylogenetic trees on X and a natural number k, decide if there exists a rooted phylogenetic network on X that displays all trees from T and has reticulation number at most k. See e.g. (van Iersel, Kelk, 2013) for more detailed definitions.

Problem 2. Does there exist a polynomial-time 2-approximation algorithm for MAF on two binary trees? Maximum Agreement Forest (MAF) on two binary trees can be defined as follows. Given a finite set X and two rooted binary phylogenetic trees on X, what is the minimum number of components in a forest on X that can be obtained from each of the input trees by deleting vertices, deleting edges and suppressing indegree-1 outdegree-1 vertices? For a 2.5-approximation see (Shi, You, Feng, 2014).

Problem 3. Is there an FPT algorithm for finding a level-k phylogenetic network consistent with a given dense set of rooted triplets, if k is the parameter? A rooted triplet is a phylogenetic tree with three leaves. A set of rooted triplets is called dense if it contains at least one triplet for each combination of three leaves. A network is level-k if it can be turned into a tree by deleting at most k edges per biconnected component. This problem is known to be solvable in polynomial time if k is fixed, see (Habib and To 2012).

Problem 4. Is Tree Containment polynomial-time solvable or NP-hard for reticulation visible networks? Tree Containment is the problem of deciding if a given phylogenetic network displays a given tree. A phylogenetic network is called reticulation visible if from each reticulation (vertex with indegree greater than one) there exists a path that does not pass through any other reticulations and ends in a leaf. Tree Containment is known to be NP-hard for general networks and for some restricted classes of networks; see (Kanj, Nakhleh, Than, Xia, 2008) and (van Iersel, Semple, Steel 2010).

Problem 5. Is there a constant-factor approximation algorithm for computing the softwired parsimony score of a binary tree-child network and a binary character? Given a network and a character state (0 or 1) for each leaf, the softwired parsimony score is the minimum number of state-changes in any tree (on all leaves) displayed by the network, over all possible assignments of states to the internal vertices. A phylogenetic network is called tree-child if each non-leaf vertex has at least one child that is not a reticulation. This problem does not have a constant factor approximation for general networks or for other (less severely) restricted classes of networks, unless P = NP (Fischer, van Iersel, Kelk, Scornavacca 2013).

Problem 6. Given k > 1, what is the maximum value of p such that for any set of rooted triplets there exists some level-k phylogenetic network on n leaves that is consistent with at least a fraction p of the input triplets? For k = 0 the maximum is p = 1/3 and for k = 1 it is roughly 0.48, see (Byrka, Gawrychowski, Huber, Kelk 2009).

Problem 7. Is there an O(c^n) algorithm for Maximum Acyclic Agreement Forest (MAAF) on two binary phylogenetic trees with c < 2? An acyclic agreement forest is an agreement forest (see above) for which the following directed graph D is acyclic. D has a vertex for each component of the forest and there is an arc from component A to component B if in at least one of the input trees there is a directed path from the root of A to the root of B. It is known that there exists an O*(2^n) algorithm for this problem (van Iersel, Kelk, Lekic, Stougie, 2013).
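
Two of the definitions used in the problems above are simple enough to state as code: a dense triplet set (Problem 3) and a tree-child network (Problem 5). The following sketch only checks these definitions and does not address the open problems themselves; the function names, the (a, b, c) encoding of a triplet ab|c, and the arc-list encoding of a network are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def is_dense(triplets, leaves):
    """True if every combination of three leaves has at least one triplet.

    A rooted triplet ab|c is encoded as the tuple (a, b, c), where a and b
    form the cherry and c is the remaining leaf.
    """
    covered = {frozenset(t) for t in triplets}
    return all(frozenset(c) in covered for c in combinations(leaves, 3))

def is_tree_child(arcs):
    """True if each non-leaf vertex has a child that is not a reticulation.

    The network is a list of (tail, head) arcs; a reticulation is a vertex
    with indegree greater than one.
    """
    children, indeg = defaultdict(list), defaultdict(int)
    for u, v in arcs:
        children[u].append(v)
        indeg[v] += 1
    # Only vertices with outgoing arcs (non-leaves) appear as keys.
    return all(any(indeg[c] <= 1 for c in kids) for kids in children.values())

leaves = ["a", "b", "c", "d"]
dense_set = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("b", "c", "d")]
sparse_set = dense_set[:3]          # nothing on the leaves b, c, d

# One reticulation "h"; every internal vertex keeps a non-reticulation child.
tc_arcs = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
           ("b", "h"), ("b", "z"), ("h", "y")]
# Both children of "a" (and of "b") are reticulations.
not_tc_arcs = [("r", "a"), ("r", "b"), ("a", "h1"), ("a", "h2"),
               ("b", "h1"), ("b", "h2"), ("h1", "x"), ("h2", "y")]

print(is_dense(dense_set, leaves), is_dense(sparse_set, leaves))
print(is_tree_child(tc_arcs), is_tree_child(not_tc_arcs))
```
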

June 24, 2014

22:30

I recently published a post on Evolution and timelines, in which I pointed out that presenting historical data as a timeline is a very poor way of representing an evolutionary history. Evolutionary history is much better presented as a phylogeny, which will be either a tree or a network. However, this does not mean that all histories that are presented as a tree, for example, necessarily represent a phylogeny.

I have encountered a few examples of history-as-tree that seem to have very little connection to a phylogeny. That is, the relationships among the objects are presented along the branches of a tree, but the relationships along the branches seem to be little more than a timeline. So, the whole structure is simply a series of interconnected timelines.

Consider this first example, which is a poster purporting to show for the USA:
the evolution of jazz in its more than one hundred year history. From Archaic to Avant Garde, from blues to bebop, from radio to fusion, from spirituals to swing, from Armstrong to Zawinul, the jazz pedigree presents the diverse history and development of jazz in a clear way.
Perhaps it is the strong central trunk that gives it away as a non-phylogeny. The side-branches do group the jazz performers roughly by genre, but that is all they do. The actual title is a bit more accurate about the content — it is a "Story" rather than a phylogeny.

This poster is accompanied by a European counterpart with an even stronger central trunk. It is labeled as a "Community", but it still claims to "display the history and development of European jazz".


As another example, in 1946, the magazine P.M. published a tree by Ad Reinhardt with a sardonic view of modern American art. [Thanks to Joachim Dagg for alerting me to this example.]


At least there is no central trunk this time, but the clustering of artists along the branches seems to have less to do with phylogenetic history than with artistic genre (and satire). There was a follow-up example 15 years later, in which the sardonic humor plays much the strongest role in the relationships represented.


Finally, here is an example of a timeline that really should be represented using a phylogenetic tree. It is difficult to believe that the group of professions illustrated form a transformational series, as implied by the timeline that is actually shown. Most of the entrepreneur groups depicted actually still exist to this day, rather than being extinct, and so we have here a history of variational evolution, instead of a transformation.


June 22, 2014

16:30

Phylogenetics plays no part in games like Trivial Pursuit, but the web offers more opportunities. The Fun Trivia web site, for example, offers a page on Phylogenetics. You should try it, and see how well you do.


The answers (and explanations) are quite good, but the wording of some of the questions leaves a lot to be desired.

June 17, 2014

22:30

One possible use of blog posts is as first drafts of ideas that might make their appearance in a refereed publication at a later date. Thus, many of my blog posts have appeared in one form or another in my recent publications. Here I have listed the ones that I can remember using, just in case anyone wants a citable reference for the information in these posts.

A. Morrison DA (2013) Phylogenetic networks are fundamentally different from other kinds of biological networks. In W.J. Zhang (ed.) Network Biology: Theories, Methods and Applications (Nova Science Publishers, New York) pp. 23-68.

    9 Biological versus phylogenetic networks
  13 Network measures and phylogenetic networks
  23 An explanation of graph types
  25 Networks and bootstraps as tree-support criteria
  34 Networks of affinity rather than genealogy
  36 Networks of genealogy
  53 Are mathematical constraints biologically realistic?
  54 Some odd network definitions and terms
  63 Human races, networks and fuzzy clusters
  69 Is this the first network from conflicting datasets?
  70 Why do we still use trees for the Neandertal genealogy?
  72 Networks and most recent common ancestors
  74 Open questions about evolutionary networks, part 1
  75 Open questions about evolutionary networks, part 2
  76 Open questions about evolutionary networks, part 3
  88 When is there support for a large phylogeny?
  90 Explanation of the names for phylogenetic networks
  94 Phylogenetic position of turtles: a network view
  99 How networks differ from bootstrapped trees
107 We should present bayesian phylogenetic analyses using networks
115 Is there a philosophy of phylogenetic networks?

B. Morrison DA (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: in press.

  29 Network analysis of scotch whiskies
  50 Phylogenetic network of the FIFA World Cup
  61 How to interpret splits graphs
101 Distortions and artifacts in Principal Components Analysis analysis of genome data
103 Networks can outperform PCA ordinations in phylogenetic analysis
114 Network analysis of Genesis 1:3
119 Network of ancient Thai bronze Buddha images
134 A network analysis of Simon and Garfunkel
159 Networks and human inter-population variation
172 The acoustics of the Sydney Opera House

C. Morrison DA (2014) Next generation sequencing and phylogenetic networks. EMBnet.journal: Bioinformatics in Action 20: e760.

191 Next Generation Sequencing and phylogenetic networks

D. Morrison DA (2014) Phylogenetic networks: a review of methods to display evolutionary history. Annual Research and Review in Biology 4: 1518-1543.

    2 The first phylogenetic network (1755)
  21 The second phylogenetic network (1766)
  34 Networks of affinity rather than genealogy
  36 Networks of genealogy
  67 Metaphors for evolutionary relationships
  89 Relationship trees drawn like real trees
168 Who first used the term "phylogenetic network"?
182 Affinity networks updated
183 Reticulation patterns and processes in phylogenetic networks
187 What are evolutionary networks currently used for?

E. Morrison DA (2014) Rooted phylogenetic networks for exploratory data analysis. Advances in Research 2: 145-152.

  43 Rooted networks for exploratory data analysis

F. Morrison DA (2014) Is the Tree of Life the best metaphor, model or heuristic for phylogenetics? Systematic Biology 63: 628-638.

  23 An explanation of graph types
  34 Networks of affinity rather than genealogy
  36 Networks of genealogy
  58 Who published the first phylogenetic tree?
  89 Relationship trees drawn like real trees
143 Resistance to network thinking
144 Destroying the Tree of Life?
147 Should phylogenetic modelling proceed from simple to complex or vice versa?
171 Conflicting placental roots: network or tree?
182 Affinity networks updated

June 15, 2014

16:30

In a previous blog post (Tattoo Monday VIII), I noted that the usual "March of Progress" image that the general public associates with the concept of "evolution" is originally based on the frontispiece to Thomas Henry Huxley's book Evidence as to Man's Place in Nature (1863. Williams & Norgate, London). A century later, this image was expanded and updated in the book Early Man by the anthropologist Francis C. Howell (1965. Time-Life International, New York) — this picture, with labels, can be viewed here.

What is perhaps less well known is that Ernst Haeckel also made a contribution to this genre. Shown here are the frontispiece and title page of Haeckel’s Natürliche Schöpfungsgeschichte (1868. Verlag von Georg Reimer, Berlin), usually translated as "The History of Creation". This book was Haeckel's attempt to introduce the idea of evolution to the German-speaking general public, after his detailed specialist two-volume book Generelle Morphologie der Organismen (1866. Verlag von Georg Reimer, Berlin). This previous book was difficult to read, and was also full of invective against doubters and supposed opponents; so a more readable approach was needed (the original text itself was apparently derived from one of his students' notes taken during Haeckel's lectures!).


The frontispiece lithograph (by Gustav Müller) is labeled as "The family group of the Catarrhines". It was notoriously supposed to demonstrate (as explained on page 555 of the book) "the highly important fact" that the "lowest humans" stand "much nearer" to the "highest apes" than to the "highest human". The various images are labeled (from "highest" to "lowest"):
  1. "Indo-German"
  2. "Chinese"
  3. "Fuegian"
  4. "Australian Negro"
  5. "African Negro"
  6. "Tasmanian"
  7. gorilla
  8. chimpanzee
  9. orangutang
  10. gibbon
  11. proboscis monkey
  12. mandrill.
The book was a best seller, and remained in print until the 1920s. Fortunately, the frontispiece was quickly changed. For example, in the 4th edition (1873) the frontispiece was a collage of various calcareous sponges, and in the 8th edition (1889) it was a picture of Haeckel himself (as it also was for the 5th and all subsequent editions). The book actually went through 12 editions, with the number and composition of the figure plates changing several times, in addition to the changes to the frontispiece.

June 10, 2014

22:30

Any history can be represented as a timeline, but a timeline diagram does not necessarily show an evolutionary history. Unfortunately, this does not stop people from putting the word "evolution" on their timeline diagrams.

A timeline simply represents the timing of certain events. These events are presumably related in some way, but they do not necessarily refer to the history of a set of objects, or even concepts, as we might expect for an evolutionary history. Here is a classic example of a perfectly valid timeline that refers to a disparate set of objects / concepts.


Apparently we are expected to infer from this timeline that McDonald's attitude to providing the public with information about the nutritional value of their fast-food products has changed over the decades. But the idea that this changed attitude might involve some sort of evolutionary process is stretching an analogy a bit too far. The timeline certainly represents a journey, as claimed, but not an evolutionary one.

For most members of the general public, "evolution" is a story of the transformation of some object or idea through time, with each stage replacing the previous one. This is a simple story with a beginning, a middle and (possibly) an end. The story can usually be presented as a timeline, of course, with each stage of the transformation arranged in the correct time order. For a biologist, this is a transformation series, representing "transformational evolution", which follows the history of a single lineage through time (ie. a history chain).

There are plenty of examples of this use of a timeline to represent transformational evolution. For instance, consider corporate logos, such as those of these two well-known beverage manufacturers. Each new logo replaced the previous one, thus providing an analogy to evolution of a single object.



The word "evolution" as used here is not one that a biologist would use, but many other people would do so. Bank notes in the USA show a similar phenomenon — in this case, the people involved appear to get younger through time! [The same thing happens on the $100 bill, as well.]


We can even take the idea of transformational evolution and use it for prediction, as was done by Takeshi Fukuda in 2002:


However, biologists do not see the evolution of organisms in this way, at all. For them, evolution is a process of variation, with lots of new forms appearing and some old ones disappearing. So, rather than an ordered series of forms, each one replacing the previous one through time, biologists see an increasing diversity of forms that is counter-acted by loss of forms (ie. extinction). This is "variational evolution" rather than transformational evolution.

Variational evolution is usually represented using a phylogeny, which will be a network or a tree, depending on the particular history, rather than a timeline chain. A phylogeny shows the relationships among a wide variety of objects, many of which will exist (or have existed) at the same time. There may have been replacement of some objects by others, but in general it is the diversity of objects existing at the same time that is of principal interest.

The issue here is that a timeline is a poor way of representing variational evolution. A timeline enforces a linear ordering of relationships, solely because "time's arrow" has one direction only. But a linear temporal order cannot reflect the complex evolutionary relationships among the objects.
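
The contrast can be made concrete with a minimal Python sketch. Everything in it is hypothetical: the four forms, their years and their ancestry are invented purely for illustration. Sorting by first appearance gives one linear chain, whereas the ancestry records show that two forms share an ancestor and that several forms exist at the same time.

```python
# Hypothetical forms: name -> (year first seen, year lost or None, ancestor).
# All names, years and ancestry are invented for illustration only.
forms = {
    "A": (1970, None, None),
    "B": (1975, 1990, "A"),
    "C": (1980, None, "A"),
    "D": (1985, None, "C"),
}


def timeline(forms):
    """Order the forms by first appearance: a single linear chain."""
    return sorted(forms, key=lambda name: forms[name][0])


def coexisting(forms, year):
    """Forms that exist in a given year (already appeared, not yet lost)."""
    return sorted(
        name
        for name, (start, end, _) in forms.items()
        if start <= year and (end is None or year < end)
    )


def descendants(forms):
    """Map each form to its direct descendants (the branching pattern)."""
    children = {name: [] for name in forms}
    for name, (_, _, ancestor) in forms.items():
        if ancestor is not None:
            children[ancestor].append(name)
    return children


order = timeline(forms)               # linear order by first appearance
alive_1986 = coexisting(forms, 1986)  # forms present together in 1986
branching = descendants(forms)        # who is derived from whom
print(order, alive_1986, branching)
```

The sorted list places B between A and C, which could be misread as C replacing B, although in this invented example both are derived from A and all four forms coexist in 1986.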

Consider this example from McDonald's in Canada. There is a clear timeline here but it does not refer to transformational evolution — instead, it refers to variational evolution. These breakfast items have not necessarily replaced each other, and thus their evolutionary relationships are more complex than can be represented by a timeline.


Indeed, many of these breakfast items are still on the menu today, including: Egg McMuffin, Scrambled Eggs, Hash Browns, Hot Cakes and Sausage, Sausage McMuffin, Sausage McMuffin with Egg, Breakfast Burritos (Sausage), Bagel (Bacon, Egg & Cheese; Steak, Egg & Cheese), and the Fruit 'N Yoghurt Parfait.

Here is another seemingly simple image from McDonald's but with the same complexity problem — it is variational not transformational.


And finally, here is a much more complex history from Apple computers:


A timeline shows the timing of certain events, which do not necessarily involve replacement. It might be a useful way to represent transformational evolution, but it is a poor way to represent variational evolution. A phylogeny is much more appropriate.

June 8, 2014

16:30

Some of you may have noticed that the Who is Who in Phylogenetic Networks database is currently down while it is being moved to a new server. It will soon be back, but in the meantime here is a graph showing the number of publications in the database in its most recent iteration. The data are grouped according to their authors, with the top 20 most prolific authors separated on the left. The next 9 authors are included solely to get myself onto the graph.


There are no real surprises here; and there are plenty of other authors in the database with fewer publications to their name. The database data will be updated when the web site returns, in which case I might update this graph.

June 3, 2014

22:30

Over the past two years or so of blogging, I have presented a number of empirical examples in which I have used splits graphs as general multivariate data summaries, rather than using them for the analysis of what we might call strictly phylogenetic data. I have listed these analyses at the bottom of this post.

There have been two reasons for doing these analyses. First, I wish to emphasize that unrooted networks are a form of data display rather than being evolutionary diagrams. That is, they do not display evolutionary history, in the same manner as is intended for rooted phylogenetic trees, for example. Unrooted networks can be a valuable tool for exploring phylogenetic data, but they do not display a phylogeny. They are a form of exploratory data analysis.

Second, these networks form part of a much larger class of methods for the analysis of multivariate data. Indeed, I believe that they are a very valuable part of this class. One way to illustrate this has been to analyze a whole series of datasets that have little to do with phylogenetic analysis. That is, the data are not necessarily even related to a historical trend. This illustrates just what can be done with these methods.


I have now formalized this point of view in a peer-reviewed publication:
Morrison DA (2014) Phylogenetic networks: a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4(4): in press (Online Early). doi:10.1002/widm.1130
Abstract:
Exploratory data analysis (EDA) involves both graphical displays and numerical summaries of data, intended to evaluate the characteristics of the data as well as providing a form of data mining. For multivariate data, the best-known visual summaries include discriminant analysis, ordination and clustering, particularly metric ordinations such as Principal Components Analysis. However, these techniques have limiting mathematical assumptions that are not always realistic. Recently, network techniques have been developed in the biological field of phylogenetics that address some of these limitations. They are now widely used in biology under the name phylogenetic networks, but they are actually of general applicability to any multivariate dataset. Phylogenetic networks are fast and relatively easy to calculate, which makes them ideal as a tool for EDA. This review provides an overview of the field, with particular reference to the use of what are called splits graphs. There are several types of splits graph, which summarize the multivariate data in different ways. Example analyses are presented based on the neighbor-net graph, which seems to be the most generally useful of the available algorithms. This should encourage the more widespread use of these networks whenever a summary of a multivariate dataset is required.

If you don't have subscription access to the journal, you can contact me for a PDF copy.

Blog posts with multivariate data summaries:

Datasets involving temporal patterns

Network analysis of Genesis 1:3
Network of ancient Thai bronze Buddha images
Language history and language weirdness
Pacific rock art - ordinations and networks
The network history of the Carnival of Evolution
The rise and fall of "David"

Datasets with no phylogenetic pattern

Eurovision Song Contest 2006: a network analysis
Network analysis of scotch whiskies
Network analysis of Bordeaux wine critics
Network analysis of Bordeaux wine critics II
A network analysis of Médoc wines
Eurovision Song Contest 2012: a network analysis
Phylogenetic network of the FIFA World Cup
Astrocladistics: a network analysis
Network analysis of McDonald's fast-food
Is there good and bad fast-food?
The mysterious rankings in Forbes' Celebrity 100
Network analysis of Michelin starred restaurants
Network analysis of New York neighborhoods
A network analysis of Simon and Garfunkel
Network analysis of Manhattan apartment buildings
A network analysis of the Bundesliga
Networks of the "Sight & Sound" film polls
A network analysis of London's theatres in 1965
The acoustics of the Sydney Opera House
A network of New Zealand's livestock regions
A network analysis of airplane disasters
World ice hockey champions — a network
Fast-food maps — a network analysis
Single-malt scotch whiskies — a network
Which cars are good, really?
The Netherlands is more than just tulips and sea-dykes
Automated natural language processing
Cancer rates and diagnosis

Theoretical considerations

Distortions and artifacts in Principal Components Analysis analysis of genome data
Networks can outperform PCA ordinations in phylogenetic analysis
Multivariate data displays are not always necessary

June 1, 2014

16:30

Most people don't want to think about cancer, but everyone over the age of 40 should do so, and do so regularly. This is because early diagnosis dramatically increases your chance of survival, and early diagnosis is almost entirely up to you: you will be the first person to notice the symptoms. To put it bluntly, you can ignore the first signs of cancer and thus live for another five years or so, or you can go to a doctor and thus live for another twenty. The choice is yours, not Fate's.

Cancer is a disease of old age. Back in the dim dark past, people usually lived only until about 40 years of age, and so cancer was not a big problem. It is doubtful that it was a major cause of death amongst humans. But as we slowly but surely have increased our life expectancy, cancer has become more and more of an issue. For example, for people in the USA in 2010, cancer was the No. 1 cause of death for people aged between 40 and 80, as shown in this table. Indeed, you will note that for females cancer was in one of the top two spots for all age groups.


The incidence of cancer varies dramatically among different organs, and this variation has itself changed over time. In the USA, lung cancer has been the biggest cause of cancer deaths for males since the 1950s and for females since the 1980s, as shown in the next graph. In both cases the death rate has decreased since the recent active attempts to reduce the smoking epidemic. [Note: medically it is considered to be an epidemic, just as obesity is currently considered an epidemic in the Western world.]


The second biggest cause of death for males is prostate cancer, and for females it is breast cancer, followed in both cases by cancer of the colon and/or rectum. In all of these cases the death rate is decreasing through time, ostensibly due to changes in risk factors but most importantly the introduction of screening.

Screening is particularly important for cancer of the reproductive organs. There is only one "major" cancer-related organ for males (the prostate), but for females there are three: the breasts, the uterus and the ovaries. As you can see in the next graph, the stage at which the cancer is first diagnosed varies quite a lot among these organs.


To explain the stages: (i) localized means that the cancer is confined to one part of the organ concerned; (ii) regional means that the cancer has spread throughout the organ; and (iii) distant means that it has spread from there to other nearby organs. For effective treatment, and therefore maximum probability of survival, the cancer needs to be detected before it has reached the third stage. You will note that this is particularly problematic for ovarian cancer, which is usually not diagnosed until this stage.

Interestingly, death rates due to cancer are not randomly distributed in space, as shown in the next graph for the states of the USA. The data analyzed were the death rates (per 100,000 people) for 2006-2010 for the most common types of cancer (breast, colorectum, lung, non-Hodgkin lymphoma, pancreas, prostate). I used the Manhattan distance to evaluate the multivariate relationships in the data, and displayed this using a NeighborNet.

The graph shows the relationships among the different states. States near each other in the network have similar death rates for the different cancers, while states further apart are progressively more different from each other.
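
For readers unfamiliar with the Manhattan distance, here is a minimal Python sketch of the calculation. The three "states" and their death rates are made-up numbers, not the data analyzed above, and the function names are illustrative only. The sketch stops at the pairwise distances; building the NeighborNet itself is left to a program that implements that algorithm (for example SplitsTree).

```python
from itertools import combinations


def manhattan(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(profiles):
    """Return {(name_i, name_j): distance} for every unordered pair."""
    return {
        (p, q): manhattan(profiles[p], profiles[q])
        for p, q in combinations(sorted(profiles), 2)
    }


# Made-up death rates per 100,000 for three cancer types (NOT the real data).
profiles = {
    "State A": [20.0, 15.0, 45.0],
    "State B": [22.0, 16.0, 50.0],
    "State C": [18.0, 11.0, 25.0],
}

dists = distance_matrix(profiles)
for pair, d in dists.items():
    print(pair, d)
```

In this invented example, States A and B have the most similar profiles, so they would be expected to sit closest together in the resulting network.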


The states mostly form a gradient of increasing cancer death rates from the top-left towards the bottom of the network. Utah stands out because it has by far the lowest death rates for colorectum, lung, and pancreas cancers amongst both men and women. DC stands out because its rate of prostate cancer deaths is nearly twice that of the other locations, presumably due to the distinctly older-male biased population of Washington city.

Finally, we can look at some international data. These data are solely for ovarian cancer, involving the 1-year survival rate after diagnosis. The data refer to survival in three different age classes of women (15-49, 50-69, 70-99 years old) for the three different stages at diagnosis. They were analyzed in the same way as above to produce the network.


On average, the Canadian survival is the highest, followed by the Norwegians, with the females from NSW (in Australia) faring the worst (particularly in the oldest age group). However, these data are rather mixed. For example, the Danish survival is worse than average for the oldest age group in the localized stage but better than average in the regional stage.

There is clearly a long way to go in the diagnosis and treatment of cancers.

Sources of data

Siegel R, Ma J, Zou Z, Jemal A (2014) Cancer Statistics, 2014. CA: A Cancer Journal for Clinicians 64: 9-29.

Maringe C, Walters S, Butler J, Coleman MP, Hacker N, Hanna L, Mosgaard BJ, Nordin A, Rosen B, Engholm G, Gjerstorff ML, Hatcher J, Johannesen TB, McGahan CE, Meechan D, Middleton R, Tracey E, Turner D, Richards MA, Rachet B, ICBP Module 1 Working Group (2012) Stage at diagnosis and ovarian cancer survival: evidence from the International Cancer Benchmarking Partnership. Gynecologic Oncology 127: 75-82.

[Declaration of interest: I have had skin cancer for nearly 30 years, which is a product of growing up in Australia, the country with the highest rate of skin cancer in the world; and recently three female members of my family have been diagnosed with cancer of their reproductive organs. So, I'm thinking about it even if you haven't been.]

May 27, 2014

22:30

Complex networks are found in all parts of biology, graphically representing biological patterns and, if they are directed networks, also their causal processes. Directed networks are currently used to model various aspects of biological systems, such as gene regulation, protein interactions, metabolic pathways, ecological interactions, and evolutionary histories.

Two types of networks can be distinguished, and this distinction seems to me to be very important. Most networks are what might be called observed networks, in the sense that the nodes and edges represent empirical observations. For example, a food web consists of nodes representing animals with connecting edges representing who eats whom. Similarly, in a gene regulation network the genes (nodes) are connected by edges showing which genes affect the functioning of which other genes. In all cases, the presence of the nodes and edges in the graph is based on experimental data. These are collectively called interaction networks or regulation networks.

However, when studying historical patterns and processes not all of the nodes and edges can be observed. So, instead, they are inferred as part of the data-analysis procedure. That is, we infer the patterns as well as the processes; and we can call these inferred networks. In this case, the empirical data may consist solely of the leaf nodes, and we infer the other nodes plus all of the edges. For example, every person has two parents, and even if we do not observe those parents we can infer their existence with confidence, as we also can for the grandparents, and so on back through time with a continuous series of ancestors. Alternatively, we may also observe some of the internal nodes of the network, such as when we do record the parents and grandparents because they are contemporaneous (ie. their generations overlap). This type of pattern can be represented as a genealogical network, when referring to individual organisms, or a phylogenetic network when referring to groups (populations, species, or larger taxonomic groups).

What, then, are the things often referred to as "evolutionary networks" but which are clearly not phylogenetic networks? They are of the first type, the interaction networks. In an evolutionary network the observed nodes are directly connected to each other to represent some aspect of evolution. This aspect may have some component of phylogeny to it, but there is more to the study of evolution than solely phylogenetic history.

For example, directed LGT (dLGT) networks connect nodes representing contemporary organisms with edges that represent inferred lateral gene transfer. That is, the evolutionary networks show gene sharing. This is obviously related to the phylogeny of the organisms, but the network does not display the phylogeny itself. This first example (from Ovidiu Popa, Einat Hazkani-Covo, Giddy Landan, William Martin, Tal Dagan. 2011. Directed networks reveal genomic barriers and DNA repair bypasses to lateral gene transfer among prokaryotes. Genome Research 21: 599-609) shows "32,028 polarized lateral recipient–donor protein-coding gene transfer events" inferred from "the completely sequenced genomes of 657 prokaryote species".


The concept of a gene-sharing network as an evolutionary network has also been applied to viruses and their relatives, for example, as shown by this next diagram (from Natalya Yutin, Didier Raoult, Eugene V Koonin. 2013. Virophages, polintons, and transpovirons: a complex evolutionary network of diverse selfish genetic elements with different reproduction strategies. Virology Journal 10: 158).


The question, then, is what to make of diagrams that combine both a phylogenetic tree and this type of evolutionary network, such as is done in the Minimal Lateral Network. This next example is from linguistics rather than biology (from Johann-Mattis List, Shijulal Nelson-Sathi, Hans Geisler, William Martin. 2013. Networks of lexical borrowing and lateral gene transfer in language and genome evolution. Bioessays 36: 141-150), and it superimposes the sharing network and the phylogenetic tree. (For a discussion in the context of LGT, see also Tal Dagan. 2011. Phylogenomic networks. Trends in Microbiology 19: 483-491).


In this diagram, the tree explicitly represents the phylogenetic history of the languages while the evolutionary network represents possible borrowings of words, with thicker lines representing more borrowed words. Clearly, the network also contains phylogenetic information of some sort. For example, the connection of the root of the Romance languages to English reflects the conquest of Britain by the French-speaking Normans, which modified the Old-German heritage of Old English. However, the diagram as a whole is a hybrid, rather than being a coherent phylogenetic network in the simplest sense (ie. a reticulation network).

To see this clearly, note that the phylogenetic tree is not fully resolved and that the evolutionary network does suggest possible resolutions for several of the polychotomies, such as the relationship of Armenian and Greek, the relationship of Albanian to the Romance languages, and the relationship of the Gaelic languages to the Romance languages. So, in some cases the evolutionary network helps resolve the phylogenetic tree rather than forming a reticulating network.

It would be possible to derive a phylogenetic network from this minimal lateral network, but as it stands it is a combination of a phylogenetic tree and a so-called evolutionary network.

May 25, 2014

16:30

We haven't had any phylogenetic tree tattoos on this blog for a while, so here is a new collection of Charles Darwin's best-known sketch from his Notebooks (the "I think" tree) (for other examples, see Tattoo Monday III, Tattoo Monday V, and Tattoo Monday VI).


May 20, 2014

22:30

There is a difference between phylogenetics and clustering or classification. The latter processes put objects into groups based on some intrinsic features, but the former uses their intrinsic features to express their evolutionary history. Not all objects have an evolutionary history, even though they can all be put into groups. Furthermore, even objects that do have a history do not necessarily have an evolutionary history. Evolution involves ancestor-descendant relationships (as well as sister-group relationships), and not all of history involves ancestors and descendants.

This distinction is important for the use of phylogenetics as a metaphor for the history of non-biological objects. Outside of biology, many things are claimed to have a "phylogenetic history", including languages and most human artifacts. As I have noted before, one has to be careful when applying this metaphor (see False analogies between anthropology and biology).

One particular example that I have encountered involves the development of computer viruses and other malware (Iliopoulos et al. 2008). Metaphorically, such viruses can be seen to be phylogenetically related, because new viruses are often based on previous ones — that is, one virus "begets" another virus due to changes in its intrinsic attributes. In this sense the metaphor is helpful, although there is no actual copying of anything resembling a genome — this is phenotype evolution not genotype evolution.

Sorkin (1994) seems to have been the first to discuss the possibility of computer virus evolution, but the first empirical attempt to reconstruct a digital phylogeny appears to have been by Hull (1995b), who studied the Stoned computer virus (a virus that infected the boot sector of PCs between 1990 and 1995), as shown below.


Since then, phylogenetics has been a popular topic in the study of computer malware (eg. Goldberg et al. 1996; Carrera & Erdélyi 2004; Karim et al. 2005a,b; Ma et al. 2006; Wehner 2007; Walenstein et al. 2007; Hayes et al. 2009; Khoo & Lió 2011). As noted by Webster & Malcolm (2007) these studies "present classifications of malware based on phylogenetic trees, in which the lineage of computer viruses can be traced and a 'family tree' of viruses constructed based on similar behaviors." (Note that there are other possible uses of phylogenetics related to computer programming, as exemplified by Ji et al. 2008.)

In all cases the phylogeny was produced using a distance-based clustering algorithm (ie. the tree is a phenetic one). Some of the distances are well motivated in terms of historical changes in the intrinsic attributes of the malware, but that does not necessarily make the resulting phenogram a phylogeny. So, the methods use basic clustering techniques to produce a tree, thus treating classification as phylogenetics (or phenetics as phylogenetics). This certainly clusters the objects, but there is no necessary reason for the clustering to reflect phylogeny.
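
The point that clustering always yields a tree, whatever the data, can be illustrated with a minimal Python sketch. It uses average-linkage clustering as a stand-in for distance-based methods in general; this is an assumption, not the algorithm of any of the papers cited above. The four objects and their distances are invented.

```python
from itertools import combinations


def cluster_distance(dist, members_a, members_b):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))


def average_linkage_tree(labels, dist):
    """Repeatedly merge the two closest clusters; return nested tuples."""
    # Each cluster is stored as (subtree, tuple of leaf labels).
    clusters = [(label, (label,)) for label in labels]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(
                dist, clusters[ij[0]][1], clusters[ij[1]][1]
            ),
        )
        (tree_i, mem_i), (tree_j, mem_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), mem_i + mem_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Invented distances among four objects that need not share any history.
labels = ["w", "x", "y", "z"]
dist = {
    frozenset(("w", "x")): 2.0,
    frozenset(("w", "y")): 6.0,
    frozenset(("w", "z")): 7.0,
    frozenset(("x", "y")): 5.0,
    frozenset(("x", "z")): 8.0,
    frozenset(("y", "z")): 3.0,
}

tree = average_linkage_tree(labels, dist)
print(tree)
```

The procedure returns a fully nested tree for any complete set of distances, so the existence of the tree says nothing about whether the objects are related by descent.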

Thus, a simple concept of clustering is inadequate, even though it can be used to construct a tree. A phylogenetic tree expresses the nested hierarchy formed by the shared derived character states, but not by anything else. A tree expresses nested clusters, but it is a form of "special nesting" that expresses a phylogeny. Only one form of tree is relevant to phylogenetics, and trees formed in other ways are likely to be suitable only under the simplest circumstances.

Furthermore, the general conception in these papers of a virus phylogeny as tree-like is clear. As noted by Hull (1995a):
Computer viruses evolve in complex ways not usually encountered in nature. The transplantation of large segments of computer code from one virus to another need not represent evolutionary relationship, for example. A newer virus may just represent a debugged or patched earlier version. The virus author may have deliberately incorporated parts of other viruses as a short cut, or because the plagiarized code is useful. If the virus incorporates code generating 'engines', similar code may appear in viruses with no other similarities. Structural similarities deriving from functional similarities likewise derive from several sources.

As we now know, these sorts of evolutionary events are usually found in nature, but they create reticulate histories, involving horizontal as well as vertical evolution. That is, computer virus phylogenetics involves reticulation: new computer code takes bits and pieces from various previous viruses. In particular, there is also what is called "oblique" evolution, in which there is horizontal evolution between generations. This is a characteristic of many histories involving human artifacts (see Time inconsistency in evolutionary networks), and it allows information to "time travel", so that the information available for horizontal transmission can come from the distant past as well as from the present.
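
As a minimal sketch of what reticulation means for the data structure, the following Python code records a history as a map from each node to its sources. The virus names and their sources are hypothetical, and the history is assumed to be acyclic. A node with more than one source is a reticulation, and a history that contains any such node cannot be drawn as a tree.

```python
def reticulation_nodes(parents):
    """Return the nodes that have more than one parent (reticulations)."""
    return sorted(node for node, ps in parents.items() if len(ps) > 1)


def is_tree(parents):
    """True if there is one root and no reticulations (assumes no cycles)."""
    roots = [node for node, ps in parents.items() if not ps]
    return len(roots) == 1 and not reticulation_nodes(parents)


# Hypothetical malware history: node -> list of sources it was built from.
network = {
    "V1": [],
    "V2": ["V1"],
    "V3": ["V1"],
    "V4": ["V2", "V3"],  # combines code from two contemporaries
    "V5": ["V4", "V1"],  # also reuses code from a much older version
}

# The same nodes with only one source each: a conventional tree.
tree_only = {
    "V1": [],
    "V2": ["V1"],
    "V3": ["V1"],
    "V4": ["V2"],
    "V5": ["V4"],
}

print(reticulation_nodes(network), is_tree(network), is_tree(tree_only))
```

In this invented example V5 illustrates the "oblique" case, because one of its sources (V1) lies several generations back.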

So, malware evolution is not tree-like. Only two of the papers cited above seem to acknowledge this fact. Khoo & Lió (2011) were quite conventional in using splits graphs rather than unrooted trees to display their data, although they do not specify the algorithm for producing the networks. They do, however, claim that "networks were more useful for visualising short nop-equivalent code metamorphism than trees".

Goldberg et al. (1996) were more innovative, and analyzed their data using what they called a "phyloDAG", which is a directed network that can have multiple roots (it appears to be a type of minimum-spanning network). Interestingly, they note that "Beyond the computer virus realm for which it was conceived, the phyloDAG is also a plausible model for evolution of bacterial populations." Indeed, the possibility of multiple roots has been explicitly suggested for prokaryote phylogenetics (see Can networks have multiple roots?). I wouldn't doubt that it is also feasible for language history.

References

Carrera E, Erdélyi G (2004) Digital genome mapping – advanced binary malware analysis. Virus Bulletin Conference 2004.

Goldberg LA, Goldberg PW, Phillips CA, Sorkin GB (1996) Constructing computer virus phylogenies. Lecture Notes in Computer Science 1075: 253-270. [also Journal of Algorithms (1998) 26: 188-208]

Hayes M, Walenstein A, Lakhotia A (2009) Evaluation of malware phylogeny modelling systems using automated variant generation. Journal in Computer Virology 5: 335-343.

Hull DB (1995a) Computer viruses: naming and classification. Virus Bulletin Sept: 15-17.

Hull DB (1995b) Computer viruses: naming and classification, part II. Virus Bulletin Oct: 16-17.

Iliopoulos D, Adami C, Ször P (2008) Darwin inside the machines: malware evolution and the consequences for computer security. Virus Bulletin Conference 2008.

Ji J-H, Park S-H, Woo G, Cho H-G (2008) Generating pylogenetic tree of homogeneous source code in a plagiarism detection system. International Journal of Control, Automation, and Systems 6: 809-817.

Karim ME, Walenstein A, Lakhotia A (2005a) Malware phylogeny using maximal pi-patterns. Proceedings of the EICAR 2005 Conference, pp 156-174.

Karim ME, Walenstein A, Lakhotia A, Parida L (2005b) Malware phylogeny generation using permutations of code. Journal in Computer Virology 1: 13-23.

Khoo WM, Lió P (2011) Unity in diversity: phylogenetic-inspired techniques for reverse engineering and detection of malware families. Proceedings of the 2011 First Systems Security Workshop (SysSec'11), pp 3-10. IEEE Computer Society Washington, DC.

Ma J, Dunagan J, Wang HJ, Savage S, Voelker GM (2006) Finding diversity in remote code injection exploits. Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, pp 53-64. ACM, New York.

Sorkin GB (1994) Grouping related computer viruses into families. Proceedings of the IBM Security ITS 1994.

Walenstein A, Hayes M, Lakhotia A (2007) Phylogenetic comparisons of malware. Virus Bulletin Conference 2007.

Webster M, Malcolm G (2007) Classification of computer viruses using the theory of affordances. Second International Workshop on the Theory of Computer Viruses.

Wehner S (2007) Analyzing worms and network traffic using compression. Journal of Computer Security 15: 303-320. (arXiv:cs/0504045v1, 2007)

May 18, 2014

16:30

There is an old saying in English that "Behind every great man there is a woman ... telling him to be great". This is intended to indicate that even in patrilineal societies women have influenced history, even if history has chosen not to formally recognize them (or historians have, anyway). However, every so often a woman has also stepped into the spotlight for herself, and recognizably influenced events in a way that has brought her name down through history.

The most famous of these is probably Cleopatra (or more properly Kleopatra), the last ruler of Ancient Egypt (as Cleopatra VII). Sadly, her ambition to become Empress of the known world seems to have destroyed two successive Roman rulers (Julius Caesar and Marc Antony) as well as her own two brothers (who would have ruled in her place); and her failure seems to have lost the country of which she was queen, so that Egypt became a Roman dependency. She ruled from 51-30 BCE, and modern Egypt did not regain its independence until 1953. This was one seriously influential woman.

As noted by Schiff (2010):
She lost her kingdom once; regained it; nearly lost it again; amassed an empire; lost it all. At the height of her power she controlled virtually the entire eastern Mediterranean coast, the last great kingdom of any Egyptian ruler. For a fleeting moment she held the fate of the Western world in her hands ... Catastrophe reliably cements a reputation, and Cleopatra's end was sudden and sensational.

Her interest to us, however, is her role in a dynasty that favored incest, and thus had a "family tree" that was a hybridization network, as shown in the figure. This particular family history is rather complex. Note that Cleopatra herself had at least four liaisons, two with her brothers (who successively ruled jointly with her as Ptolemy XIII and Ptolemy XIV, respectively) and two with Romans (Julius Caesar and Marc Antony). Later, she also ruled jointly with her son by Julius Caesar (as Ptolemy XV).


Adapted from the Too Much Information blog, based on the information at Ian Mladjov's Genealogical Tables
The Ptolemaic dynasty was founded after the death of Alexander the Great (aka Alexander III of Macedon), when his empire was divided up among his Greek generals, and in 323 BCE Egypt ended up in the hands of Ptolemy, who subsequently ruled as the pharaoh Ptolemy I from 305-282 BCE. As Dray (2012) has noted:
His daughter, Arsinoe II, would start the tradition of incest. Married off to an old King of Thrace when she was still a teenager, she was the ultimate survivor. Her life was frequently in danger and she made many narrow escapes ... At some point, Arsinoe seems to have decided that if she wanted to be safe, she couldn’t trust anyone outside her immediate family. So, she returned to Egypt and married her full brother, Ptolemy II.
Now, the Greeks didn’t have a tradition of incest in their ruling families … but the pharaohs of Egypt did. By marrying her brother, Arsinoe was able to help create a link between the new Ptolemaic dynasty and the very old traditions of the native Egyptians. It served her extremely well as she became the first female pharaoh of the Ptolemaic dynasty, ruling not just as the wife of the king, but as a king in her own right.

Meeg (2009) suggests that:
According to tradition, incestuous marriages between the pharaohs and their sisters were common. If this was the case, it could have been done to emulate the god Osiris and his sister / wife the goddess Isis (the product of that union was Horus, the alleged ancestor of the Pharaoh), and/or to keep the sacred bloodline pure. When Alexander the Great's general Ptolemy seized control of Egypt around 323 BC, his descendants would continue the local custom of pharaonic brother-sister marriages. This practice was unknown among Greeks and Macedonians.

Indeed, Wikipedia notes:
In ancient Egypt, royal women carried the bloodlines and so it was advantageous for a pharaoh to marry his sister or half-sister; in such cases a special combination between endogamy and polygamy is found. Normally the old ruler's eldest son and daughter (who could be either siblings or half-siblings) became the new rulers. All rulers of the Ptolemaic dynasty from Ptolemy II were married to their brothers and sisters, so as to keep the Ptolemaic blood "pure" and to strengthen the line of succession. Cleopatra VII (also called Cleopatra VI) and Ptolemy XIII, who married and became co-rulers of ancient Egypt following their father's death, are the most widely known example.

Bevan (1927) continues the story [Note: he uses one number less for the Cleopatras and Ptolemies]:
Cleopatra VI found herself queen of Egypt at the age of seventeen or eighteen. By the custom of the house, and according to the will and testament of Ptolemy Auletes, the elder of her two brothers, then only nine or ten, was associated with her, as king (Ptolemy XII). They probably had, as a pair, the style of "Father-loving Gods" (Theoi Philopatores), though neither during the reign of Cleopatra with Ptolemy XII, nor during her reign, later on, with the younger brother, Ptolemy XIII (then about twelve), do the coins bear any head or name but that of the queen, and in Egyptian sepulchral inscriptions put up during the reign of Cleopatra with her younger brother (regnal years 5, 6, and 7 of Cleopatra) the regnal year of the boy-king is ignored. Ptolemy XIV was the acknowledged son of Julius Caesar and Cleopatra, and ruled as child king with his mother.

The involvement of royalty in consanguinity and incest is widespread. As noted by Dobbs (2010):
While virtually every culture in recorded history has held sibling or parent-child couplings taboo, royalty have been exempted in many societies, including ancient Egypt, Inca Peru, and, at times, Central Africa, Mexico, and Thailand [and also Hawaii].

I have already discussed incest in the family "trees" of the Egyptian 18th Dynasty, in Tutankhamun and extreme consanguinity (the other set of pharaohs where this was common); and I have covered the persistent inbreeding in the downfall of the modern Spanish branch of the Habsburgs, in Family trees, pedigrees and hybridization networks.

Not unexpectedly, this phenomenon has received attention from modern evolutionary biologists. Conventionally, the evolutionary advantage of sexual over non-sexual reproduction is considered to be the creation of genetic diversity through heterozygosity. Inbreeding, by reducing heterozygosity, then seems to negate the advantages of sexual reproduction. So, the near universality of incest avoidance in humans has a clear genetic dimension. Indeed, as I have noted in previous blog posts this is easily demonstrated in well-known families — (i) Charles Darwin's family pedigree network, (ii) Toulouse-Lautrec: family trees and networks.
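
To put a rough number on how inbreeding reduces heterozygosity, here is a minimal sketch in Python. It assumes the standard population-genetics relation H = H0 × (1 − F), where F is Wright's inbreeding coefficient, and it uses the textbook values of F for the offspring of unrelated parents (0), first cousins (1/16), half siblings (1/8) and full siblings (1/4), all on the assumption that the shared ancestors were not themselves inbred. The function name and the baseline heterozygosity of 1/2 are illustrative choices of mine, not values taken from any of the sources cited in this post.

```python
from fractions import Fraction

# Textbook inbreeding coefficients (F) for a child of each kind of mating,
# assuming the shared ancestors are themselves non-inbred (an assumption).
INBREEDING_COEFFICIENT = {
    "unrelated parents": Fraction(0),
    "first cousins": Fraction(1, 16),
    "half siblings": Fraction(1, 8),
    "full siblings": Fraction(1, 4),
}


def expected_heterozygosity(h0, f):
    """Expected heterozygosity under inbreeding: H = H0 * (1 - F)."""
    if not 0 <= f <= 1:
        raise ValueError("F must lie between 0 and 1")
    return h0 * (1 - f)


if __name__ == "__main__":
    # Hypothetical heterozygosity in an outbred population.
    baseline = Fraction(1, 2)
    for mating, f in INBREEDING_COEFFICIENT.items():
        h = expected_heterozygosity(baseline, f)
        print(f"{mating}: F = {f}, expected heterozygosity = {h}")
```

On these assumptions, a single generation of full-sibling marriage removes a quarter of the expected heterozygosity. Repeated sibling marriages, as in the Ptolemaic network, would raise F further, which this one-generation sketch does not attempt to model.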

The presence of incest among royal families then requires a biological explanation. Indeed, van den Berghe & Mesher (1980) have provided one:
Royal incest (mostly brother-sister; less commonly father-daughter) represents the logical extreme of hypergyny. Women in stratified societies maximize [evolutionary] fitness by marrying up; the higher the status of a woman, the narrower her range of prospective husbands. This leads to a direct association between high status and inbreeding. Royal incest is a fitness maximizing strategy if the following conditions are met: polygyny, patrilineal succession, and parental control of royal succession. Under those conditions, the genetic risks of close inbreeding are more than accounted for by the production of a highly related male heir who has, himself, access to a large harem. Data from Ancient Egypt, Inca Peru, Hawaii, Thailand, Monomotapa, Bunyoro, Ankole, Buganda, Shilluk, Zande, Nyanga and Dahomey confirm hypotheses derived from the sociobiological paradigm of inclusive fitness.

Finally, to return to Cleopatra, she is usually credited with being fatally attractive due to her great beauty. However, there is no evidence that this was actually the case. Her attractiveness to men seems to have come much more from a strong personality, including determined diplomacy and an easy facility with languages. Also, her ancestors were Macedonian Greeks, rather than native Egyptians, giving her a stronger genetic and cultural tie to Europe than to Africa, which must have helped when trying to woo the rulers of the Roman Empire. It was this ancestry that the dynasty's consanguinity and incest were intended to protect. The Egyptian populace certainly didn't benefit from it.

Indeed, Cleopatra seems simply to have been the ultimate expression of her dynasty's heritage, as noted by Ager (2006):
royal incest, as practised by the Ptolemies, was only one of a larger set of behaviours, all of which were symbolic of power, and all of which were characterized by lavishness, immoderation, excess and the breaching of limits in general.

Interestingly, the potentially negative aspects of inbreeding seem not to have affected this dynasty: there is no convincing evidence of infertility, infant mortality or genetic defects, for example (Ager 2006). Instead, their main historical legacy has been their bizarre juxtaposition of either marrying each other or murdering each other, and sometimes both. Cleopatra's activities in this regard were no different to those of her ancestors.

References

Ager SL (2006) The power of excess: royal incest and the Ptolemaic dynasty. Anthropologica 48: 165-186.

Bevan ER (1927) The House of Ptolemy. Methuen Publishing, London.

Dobbs D (2010) The risks and rewards of royal incest. National Geographic Magazine.

Dray S (2012) Keeping it in the (Ptolemaic) family: when incest is best.

Meeg (2009) Royal inbreeding in Ancient Egypt.

Ian Mladjov's Genealogical Tables — The Ptolemies, kings of Egypt.

Schiff S (2010) Cleopatra: a biography. Little, Brown and Co, New York. [excerpted in Smithsonian Magazine]

van den Berghe PL, Mesher GM (1980) Royal incest and inclusive fitness. American Ethnologist 7: 300-317.

Wikipedia. Inbreeding.