The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis

URL

XML feed
http://phylonetworks.blogspot.com/


September 14, 2014

16:30

I have noted before that the evolutionary history of musical instruments is likely to be a reticulating network rather than being tree-like (Cornets: from a tree to a network). As another illustration of the pattern, we can consider the evolution over the past few centuries of the Spanish or flamenco guitar (taken from the Origem do nome Violão blog post).


This genealogy (with time proceeding from left to right) shows three basic characteristics that seem to be common in anthropological histories. First, there are multiple roots: in this case, three different instruments from the 16th century have provided input into the modern acoustic guitar. Second, there is an early history of reticulation, with ideas for new instrumentation being taken freely from among the existing instruments, in this case presumably in the search for better sound reproduction. Third, there is simple transformational evolution, with new models replacing the previous ones in popularity; for example, over the past 100 years the Spanish guitar has simply gotten larger (this is Cope's Rule).

September 9, 2014

22:30

I noted in my previous blog post (Charles Darwin and the coalescent) that the multispecies coalescent needs to be based on a network model not a tree model. This is because reticulation processes occur both within species and between species — there is gene flow within genealogies and within phylogenies.

Reticulate genealogies are nothing new, and I have blogged about some of the best-known human genealogies with reticulations due to consanguinity (marriage between close relatives):
King Charles II of Spain
Charles Darwin
Henri Toulouse-Lautrec
Albert Einstein
Pharaoh Tutankhamun
Pharaoh Cleopatra

Importantly, in the modern world there are quite a few genealogical datasets available for study. For example, the Kinsources repository has c. 100 datasets from around the world, covering multi-generational histories for nearly 350,000 individuals. These data are actively used for research (eg. Bailey et al. 2014).

However, the best documented human genealogies are those for the various Anabaptist populations, who moved from Europe to North America during the 18th and 19th centuries. Anabaptists have mostly closed populations (ie. marriages occur solely within a population), and they are thus inbred, and most importantly they maintain detailed written genealogies. These populations include the Mennonites, Hutterites and Amish, the latter being the best known.

As noted by Agarwala et al. (2001):
The term "Anabaptist" literally means "rebaptizer" and is used to refer to a Christian movement that arose in central Europe in the first half of the 16th century. Adherents support adult baptism, pacifism, and separation of church and state. Among the large Anabaptist groups existing today are Mennonites (who were originally followers of Menno Simons), Amish (originally followers of Jakob Ammann who split away from the Mennonites at the end of the 17th century), and Hutterites (originally followers of Jakob Hutter). Amish and Mennonites emigrated to North America in multiple waves in the 18th and 19th centuries. The Hutterites began emigrating to the northern and western parts of North America in the late 1800s.

Distribution of Amish settlements in North America
Note the rapid expansion over the past 25 years.

The Mennonites originated in the Swiss Alps, and diffused northward into Germany and the Netherlands. The Dutch/North German Mennonites began the migration to America in the 1680s, followed by a much larger migration of Swiss/South German Mennonites beginning in 1707. The Amish are an early split from the Swiss/South German group that occurred in 1693. There are now at least 200,000 Amish in the eastern United States and eastern Canada (see the map above, taken from here), with the numbers apparently growing rapidly with recently increasing movement westward. There are various subgroups (eg. Old Order Amish, New Order Amish). There are about 1.7 million Mennonites worldwide, with c. 150,000 in the eastern United States and eastern Canada. The genealogies of 295,000 Mennonite and Amish individuals from the eastern USA have been databased (Agarwala et al. 2001).

The Hutterites originated as an Anabaptist offshoot in the Tyrolean Alps in the 1500s, but now there are c. 135,000 Hutterites living on 1,350 communal farms in the northern United States (principally South Dakota) and western Canada. Genealogical records trace all extant Hutterites to 90 ancestors who lived during the early 1700s to the early 1800s (see Ober et al. 1999).

These Anabaptist groups are frequently used in medical studies, because it is possible to relate disease occurrences to the recorded genealogy, and thus to assess the genetic component of the disease (eg. Dorsten et al. 1999, Hou et al. 2013). So, the literature is replete with figures showing the distribution of different diseases plotted onto the genealogy. I have included some of the Amish ones here, to illustrate the extreme reticulation that results when inbreeding is ongoing over many generations.

This first one is from Georgi et al. (2014). The diseased people are marked in red.


The next one is from Garner et al. (2001).


This one is from Lee et al. (2008).


The final one is from Racette et al. (2002).


Here is one small part of this genealogy, which emphasizes that between-generation marriages are an important component of the consanguinity.


References

Agarwala R, Schaffer A, Tomlin J (2001) Towards a complete North American Anabaptist genealogy II: analysis of inbreeding. Human Biology 73: 533-545.

Bailey DH, Hill KR, Walker RS (2014) Fitness consequences of spousal relatedness in 46 small-scale societies. Biology Letters 10: 20140160.

Dorsten L, Hotchkiss L, King T (1999) The effect of inbreeding on early childhood mortality: twelve generations of an Amish settlement. Demography 36: 263-271.

Garner C, McInnes LA, Service SK, Spesny M, Fournier E, Leon P, Freimer NB (2001) Linkage analysis of a complex pedigree with severe bipolar disorder, using a Markov chain Monte Carlo method. American Journal of Human Genetics 68: 1061-1064.

Georgi B, Craig D, Kember RL, Liu W, Lindquist I, Nasser S, Brown C, Egeland JA, Paul SM, Bućan M (2014) Genomic view of bipolar disorder revealed by whole genome sequencing in a genetic isolate. PLoS Genetics 10: e1004229.

Hou L, Faraci G, Chen DT, Kassem L, Schulze TG, Shugart YY, McMahon FJ (2013) Amish revisited: next-generation sequencing studies of psychiatric disorders among the Plain people. Trends in Genetics 29: 412-418.

Lee SL, Murdock DG, McCauley JL, Bradford Y, Crunk A, McFarland L, Jiang L, Wang T, Schnetz-Boutaud N, Haines JL (2008) A genome-wide scan in an Amish pedigree with parkinsonism. Annals of Human Genetics 72: 621-629.

Ober C, Hyslop T, Hauck WW (1999) Inbreeding effects on fertility in humans: evidence for reproductive compensation. American Journal of Human Genetics 64: 225-231.

Racette BA, Rundle M, Wang JC, Goate A, Saccone NL, Farrer M, Lincoln S, Hussey J, Smemo S, Lin J, Suarez B, Parsian A, Perlmutter JS (2002) A multi-incident, Old-Order Amish family with PD. Neurology 58: 568-574.

September 7, 2014

16:30

In an earlier blog post (The ultimate phylogenetic network?) I reproduced the lattice network from the anthropologist Franz Weidenreich. This comes close to being as complex as a network can get when applied to groups of organisms. However, when we study the genealogy of individuals, the network can get much more complex. This will be most true when there are marriages between close relatives (consanguinity), which creates inbreeding.

The family pedigree (or family tree!) shown here is for a group of people in a recently isolated population from the southwestern area of The Netherlands. There are 4,645 people involved, covering 18 generations (one row each). The average number of consanguineous loops for the 103 study individuals is 71.7, which is what is creating all of the cross-connections that make the network look so horrendous. (Consanguineous or inbreeding loops are illustrated here.)


The genealogy is from:
Liu F, Arias-Vásquez A, Sleegers K, Aulchenko YS, Kayser M, Sanchez-Juan P, Feng BJ, Bertoli-Avella AM, van Swieten J, Axenovich TI, Heutink P, van Broeckhoven C, Oostra BA, van Duijn CM (2007) A genomewide screen for late-onset Alzheimer disease in a genetically isolated Dutch population. American Journal of Human Genetics 81: 17-31.

September 2, 2014

22:30

The full title of Charles Darwin's most famous book was On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. It is important to note that this title juxtaposes the concepts of between-species variation and within-species variation (Darwin usually referred to "races" rather than to "breeds", "subspecies", etc). This was one of his major insights: the idea that there is a continuum of variation in biology through time (or, as he put it, that it is arbitrary whether variants are treated as different races or as different species).

As I recently noted, this paved the way for between-species phylogenies to be seen as directly analogous to within-species genealogies (The role of biblical genealogies in phylogenetics); previous applications of genealogies to non-humans (such as those of Buffon and Duchesne) had been explicitly restricted to within-species relationships.

This conceptual integration of within-species and between-species relationships has become explicit in modern biology by using multispecies coalescent models to integrate population genetics and phylogenetics. As noted by Reid et al. (2014):
These models treat populations, rather than alleles sampled from a single individual, as the focal units in phylogenetic trees. The multispecies coalescent model connects traditional phylogenetic inference, which seeks primarily to infer patterns of divergence between species, and population genetic inference, which has typically focused on intraspecific evolutionary processes. The development of these models was motivated by the common empirical observation that genealogies estimated from different genes are often discordant and the discovery that, if ignored, this discordance can bias parameters of direct interest to systematists, such as the relationships and divergence times among species.

However, as specifically emphasized by Reid et al.:
In order to reconcile discordance among gene trees and uncover true species relationships, the first gene tree/species tree models assumed that discordance is solely the result of stochastic coalescence of gene lineages within a species phylogeny ... Coalescent stochasticity, however, is not the only source of gene tree discordance. Selection, hybridization, horizontal gene transfer, gene duplication/extinction, recombination, and phylogenetic estimation error can also result in discordance.

They examined this situation by studying the fit of the multispecies coalescent model:
to 25 published data sets. We show that poor model fit is detectable in the majority of data sets; that this poor fit can mislead phylogenetic estimation; and that in some cases it stems from processes of inherent interest to systematists ...
Our analyses suggest that poor fit to the multispecies coalescent model can mislead inference in empirical studies. In the case of recent hybridization, the consequences may be severe, as species divergences are forced to post-date gene divergences ... When topological conflict among coalescent genealogies is the result of ancient hybridization, balancing selection, or gene duplication and extinction, the consequences may be less severe.

In other words, tree-based phylogenetics is inadequate in practice because of gene flow. Within-species genealogies and between-species phylogenies intersect in the concept of a network, not a tree. That is, the multispecies coalescent needs to be based on a network model not a tree model:
The biological processes that generate variation in gene tree topologies should be explicitly modeled, as should relevant dynamics of molecular evolution. Increasingly complex multispecies coalescent models are being implemented, but there are tradeoffs. Some examine gene duplication and extinction or migration but cannot estimate divergence times.

So, current models are inadequate. It will be interesting to see how these approaches develop to incorporate gene flow (reticulation) into what has heretofore been a tree model (modeling only ancestor-descendant relationships), as we are still in need of methods for estimating rooted evolutionary networks.

Reference

Reid NM, Hird SM, Brown JM, Pelletier TA, McVay JD, Satler JD, Carstens BC (2014) Poor fit to the multispecies coalescent is widely detectable in empirical data. Systematic Biology 63: 322-333.

August 31, 2014

August 26, 2014

22:30

I have previously noted that the first empirical phylogenetic tree apparently was published by St George Jackson Mivart in late 1865, a full 6 years after Charles Darwin released On the Origin of Species (Who published the first phylogenetic tree?). Mivart was not necessarily the first to start producing such a tree, but he got into print first. For example, Franz Martin Hilgendorf wrote a PhD thesis in 1863 for which he produced a hand-drawn tree, but he did not actually include the tree itself in the thesis (The dilemma of evolutionary networks and Darwinian trees). Also, Ernst Heinrich Philipp August Haeckel claimed to have started work on his series of phylogenetic trees in 1864, but the resulting book, Generelle Morphologie der Organismen, was not published until 1866 (Who published the first phylogenetic tree?).

Another actor in this series of events was Fritz Müller, who can also be considered to have published a tree first, in 1864, albeit a very small one.


Johann Friedrich Theodor Müller (1822–1897)

Müller was born in Germany, but in the 1850s he emigrated to southern Brazil with his brother and their wives. As a naturalist in the Atlantic forest, he studied the insects, crustaceans and plants, and he is chiefly remembered today as the describer of what we now call Müllerian mimicry (the phenotypic resemblance between two or more unpalatable species).
Heinrich Bronn's German translation of the Origin appeared in 1860, and Müller read it and agreed with its central thesis (as did Hilgendorf and Haeckel). Indeed, in 1864 he published a book discussing some of the empirical evidence that he adduced with regard to the Crustacea:
Für Darwin
Verlag von Wilhelm Engelmann, Leipzig.

The book has 91 pages and 67 figures, and the Foreword is dated 7th September 1863. Several copies are available in Google Books (here, here, here).

In this book Müller described the development of Crustacea, illustrating that crustaceans and their larvae could be affected by adaptations and natural selection at any growth stage. He discussed in detail how living forms diverged from ancestral ones, based on his study of aerial respiration, larvae morphology, sexual dimorphism, and polymorphism.

Darwin read the book, and began a life-long correspondence with Müller (ultimately some 60 letters having been exchanged between them). Subsequently, Darwin commissioned an English translation of the book, and in 1869 published it with John Murray on commission (ie. taking the risk himself). Darwin printed 1000 copies but it apparently was not a great success:
Facts and Arguments for Darwin
Translated from the German by W.S. Dallas
John Murray, London.

The book has 144 pages and 67 figures, and the Translator's Preface is dated 15th February 1869. A copy is available in the Biodiversity Heritage Library (here).

The following quotes are from this English translation [Note that Müller's unnecessarily convoluted sentences exist in the original German — this writing style is one reason why the book is not as well known as the works of Darwin and Wallace]:
It is not the purpose of the following pages to discuss once more the arguments deduced for and against Darwin's theory of the origin of species, or to weigh them one against the other. Their object is simply to indicate a few facts favourable to this theory ...
When I had read Charles Darwin's book 'On the Origin of Species,' it seemed to me that there was one mode, and that perhaps the most certain, of testing the correctness of the views developed in it, namely, to attempt to apply them as specially as possible to some particular group of animals ...
When I thus began to study our Crustacea more closely from this new stand-point of the Darwinian theory,—when I attempted to bring their arrangements into the form of a genealogical tree, and to form some idea of the probable structure of their ancestors,—I speedily saw (as indeed I expected) that it would require years of preliminary work before the essential problem could be seriously handled ...
But although the satisfactory completion of the "Genealogical tree of the Crustacea" appeared to be an undertaking for which the strength and life of an individual would hardly suffice, even under more favourable circumstances than could be presented by a distant island, far removed from the great market of scientific life, far from libraries and museums—nevertheless its practicability became daily less doubtful in my eyes, and fresh observations daily made me more favourably inclined towards the Darwinian theory.
In determining to state the arguments which I derived from the consideration of our Crustacea in favour of Darwin's views, and which (together with more general considerations and observations in other departments), essentially aided in making the correctness of those views seem more and more palpable to me, I am chiefly influenced by an expression of Darwin's: "Whoever," says he ('Origin of Species' p. 482), "is led to believe that species are mutable, will do a good service by conscientiously expressing his conviction."

So, for the reason stated, Müller did not produce a complete phylogeny in the book. However, of particular interest to us is the figure on page 6 of the original German edition (page 9 of the translation). It turns out to be a pair of three-taxon statements concerning species of Melita (amphipods), as shown in the figure above (original) and below (translation). Müller has this to say:
[There are five] species of Melita ... in which the second pair of feet bears upon one side a small hand of the usual structure, and on the other an enormous clasp-forceps. This want of symmetry is something so unusual among the Amphipoda, and the structure of the clasp-forceps differs so much from what is seen elsewhere in this order, and agrees so closely in the five species, that one must unhesitatingly regard them as having sprung from common ancestors belonging to them alone among known species.

This is as clear a statement of synapomorphy, and its relationship to constructing a phylogeny, as you could get; and so we could credit Müller with having produced an empirical phylogenetic tree (the one on the left in the figures).


Equally interestingly, Müller then goes on to consider a potentially contradictory character: the secondary flagellum of the anterior antennae, which is missing in one species. This would produce a different three-taxon statement (shown on the right in the figures). He resolves the issue by suggesting that the flagellum might be similar to the situation in other species, where it is "reduced to a scarcely perceptible rudiment—nay, that it is sometimes present in youth and disappears at maturity". This is a clear example of the character conflict that arises when trying to construct an empirical phylogeny; and it was also encountered by Mivart in his studies of primate skeletons (Is this the first network from conflicting datasets?).

Conclusion

Müller did not publish a complete phylogeny, but instead discussed how to produce one, and illustrated the practicality (and necessity) of doing so. In the process, he produced a simple three-taxon statement (which is not even numbered as a figure). Nevertheless, this cladogram is technically the first in print, pre-dating Mivart by a year. Darwin was right to recognize its importance, although he seemed to take a while to bring it to the attention of the English-speaking public. Furthermore, Müller was apparently the first to encounter the empirical difficulty of how to deal with conflicting data, which would produce different phylogenetic trees. This is an issue that is just as important today as it was then.

August 24, 2014

16:30


For those of you who do not understand the notation:
Homo apriorius ponders the probability of a specified hypothesis, while Homo pragmaticus is interested in the probability of observing particular data. Homo frequentistus wishes to estimate the probability of observing the data given the specified hypothesis, whereas Homo sapiens is interested in the joint probability of both the data and the hypothesis. Homo bayesianis estimates the probability of the specified hypothesis given the observed data.
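For readers who prefer symbols, the five quantities described above can be written out in standard probability notation. This is only a sketch: the symbols H (the hypothesis) and D (the data) are my own labels, and the last line is the usual statement of Bayes' theorem, which ties the other quantities together.

```latex
\begin{align*}
\text{probability of the hypothesis:} \quad & P(H) \\
\text{probability of the data:} \quad & P(D) \\
\text{probability of the data given the hypothesis:} \quad & P(D \mid H) \\
\text{joint probability of data and hypothesis:} \quad & P(H, D) = P(D \mid H)\, P(H) \\
\text{probability of the hypothesis given the data:} \quad & P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
\end{align*}
```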

August 19, 2014

22:30

Phylogeneticists treat the tree image as having special meaning for themselves. Conceptually, the tree is used as a metaphor for phylogenetic relationships among taxa, and mathematically it is used as a model to analyze phenotypic and genotypic data to uncover those relationships. Irrespective of whether this metaphor / model is adequate or not, it has a long history as part of phylogenetics (Pietsch 2012). Of particular interest has been Charles Darwin's reference to the "Tree of Life" as a simile, since that is clearly the key to the understanding of phylogenetics by the general public.

The principle on which phylogenetic trees are based seems to be the same as that for human genealogies. That is, phylogenies are conceptually the between-species homolog of within-species genealogies. As far as Western thought is concerned, human genealogies make their first important appearance in the Bible, with a rather specific purpose. The Bible contains many genealogies, mostly presented as chains of fathers and sons. For example, Genesis 5 lists the descendants of Adam+Eve down to Noah and his sons, which can be illustrated as a pair of chains (as shown in the first figure); and the rest of Genesis gets from there down to Moses' family, for which the genealogy can be illustrated as a complex tree.

The genealogy as listed in Genesis 5.
Cain's lineage was terminated by the Flood.
However, the theologically most important genealogies are those of Jesus, as recorded in Matthew 1:2-16 and Luke 3:23-38. Matthew apparently presents the genealogy through Joseph, who was Jesus' legal father; and Luke apparently traces Jesus' bloodline through Mary's father, Eli. These two lineages coalesce in David+Bathsheba, and from there they have a shared lineage back to Abraham. Their importance lies in the attempt to substantiate that Jesus' ancestry fulfils the biblical prophecies that the Messiah would be descended from Abraham (Genesis 12:3) through Isaac (Genesis 17:21) and Jacob (Genesis 28:14), and that he would be from the tribe of Judah (Genesis 49:8), the family of Jesse (Isaiah 11:1) and the house of David (Jeremiah 23:5).

That is, these genealogies legitimize Jesus as the prophesied Messiah. Following this lead, subsequent use of genealogies has commonly been to legitimize someone as a monarch, so that royal genealogies have been of vital political and social importance throughout recorded history (see the example in the next figure). This importance was not lost on the rest of the nobility, either, so that documented genealogies of most aristocratic families allow us to identify the first-born son of the first-born son, etc, and thus legitimize claimants to noble titles — genealogies are a way for nobles to assert their nobility.

The genealogy of the current royal family of Sweden. [Note: most children are not shown]
The lineage of the recent monarchs is highlighted as a chain, with an aborted side-branch dashed.
If we focus solely on the line of descent involved in legitimization, then genealogies can be represented as a chain (as shown in the genealogy above). If we also include the rest of the paternal lines of descent, then family genealogies can be represented as a tree; and if we include some or all of the maternal lineages as well, then family genealogies can be represented as a network. For example, the biblical genealogies only rarely name women, but where females are specifically named the genealogies actually form a reticulated network. Jacob produced offspring with both Rachel and Leah, who were his first cousins; and Isaac and Rebekah were first cousins once removed. Even Moses was the offspring of parents who were, depending on the biblical source consulted, either nephew-aunt, first cousins, or first cousins once removed. These relationships cannot be represented in a tree. (See also the complex genealogy of the Spanish branch of the Habsburgs, who were kings of Spain from 1516 to 1700.)
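To make the point about reticulation concrete, here is a minimal Python sketch that checks whether two spouses share a recorded ancestor. The parent links are my own reading of Genesis, reduced to the handful of people needed for the two marriages mentioned above; the dictionary and function names are purely illustrative. If the shared set is non-empty, the couple's lines of descent merge, so their children sit on a loop and the genealogy cannot be drawn as a tree.

```python
# child -> recorded parents (an assumed, heavily pruned extract from Genesis)
PARENTS = {
    "Abraham": ["Terah"],
    "Nahor": ["Terah"],
    "Isaac": ["Abraham"],
    "Bethuel": ["Nahor"],
    "Rebekah": ["Bethuel"],
    "Laban": ["Bethuel"],
    "Jacob": ["Isaac", "Rebekah"],
    "Rachel": ["Laban"],
    "Leah": ["Laban"],
}


def ancestors(person):
    """Return all recorded ancestors of `person` by walking the parent links."""
    found = set()
    stack = list(PARENTS.get(person, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(PARENTS.get(current, []))
    return found


def shared_ancestors(a, b):
    """Ancestors common to both people; non-empty means a consanguineous pair."""
    return ancestors(a) & ancestors(b)


print(sorted(shared_ancestors("Isaac", "Rebekah")))  # ['Terah']
print(sorted(shared_ancestors("Jacob", "Rachel")))   # ['Bethuel', 'Nahor', 'Terah']
```

The closer the nearest shared ancestor, the closer the relationship: Jacob and Rachel meet at a grandparent (first cousins), whereas Isaac and Rebekah meet only at Terah, one generation apart on the two sides (first cousins once removed).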

This idea of genealogical chains, trees and networks was straightforward to transfer from humans to other species. Originally, biologists stuck pretty much to the idea of a chain of relationships among organisms, as presented in the early part of Genesis. Human genealogies were traced upwards to Adam and from there to God, and thus species relationships were traced upwards to God via humans. However, by the second half of the 1700s both trees and networks made their appearance as explicit suggestions for representing biological relationships. In particular, Buffon (1755) and Duchesne (1766) presented genealogical networks of dog breeds and strawberry cultivars, respectively.

However, these authors did not take the conceptual leap from within-species genealogies to between-species phylogenies. Indeed, they seem to have explicitly rejected the idea, confining themselves to relationships among "races". It was Charles Darwin and Alfred Russel Wallace, a century later, who first took this leap, apparently seeing the evolutionary continuum that connects genealogies to phylogenies. In this sense, they both took ideas that had been "in the air" for several decades, but previously applied only within species, and applied them to the origin of species themselves. [See the Note below.] Both of them, however, confined themselves to genealogical trees rather than using networks. It seems to me that it was Pax (1888) who first put the whole thing together, and produced inter-species phylogenetic networks (along with some intra-species ones).

In this sense, the biblical Tree of Life has only a peripheral relevance to phylogenetics. Darwin used it as a rhetorical device to arouse the interest of his audience (Hellström 2011), but it was actually the biblical genealogies that were of most practical importance to his evolutionary ideas. Apart from anything else, the original biblical tree was actually the lignum vitae (Tree of Eternal Life) not the arbor vitae (Tree of Life). Similarly, the tree from which Adam and Eve ate the forbidden fruit was the lignum scientiae boni et mali (Tree of Knowledge of Good and Evil), not the arbor scientiae (Tree of Knowledge) that was subsequently used as a metaphor for human knowledge.

Note. As with phylogenetic trees, Darwin and Wallace did not actually originate the idea of natural selection, which had previously been discussed by people such as James Hutton (1794), William Charles Wells (1818), Patrick Matthew (1831), Edward Blyth (1835) and Herbert Spencer (1852). However, this discussion had been in relation to within-species diversity, whereas Wallace and Darwin applied the idea to the origin of between-species diversity (i.e. the origin of new species).

References

Buffon G-L de. 1755. Histoire naturelle générale et particulière, tome V. Paris: Imprimerie Royale.

Duchesne A.N. 1766. Histoire naturelle des fraisiers. Paris: Didot le Jeune & C.J. Panckoucke.

Hellström N.P. 2011. The tree as evolutionary icon: TREE in the Natural History Museum, London. Archives of Natural History 38: 1-17.

Pax F.A. 1888. Monographische übersicht über die arten der gattung Primula. Bot. Jahrb. Syst. Pflanzeng. Pflanzengeo. 10:75-241.

Pietsch T.W. 2012. Trees of life: a visual history of evolution. Baltimore: Johns Hopkins University Press.

August 17, 2014

16:30

These illustrations are from Alper Uzun's Biocomicals web site.






Bioinformaticians' dream


Bioinformaticians' reality

August 12, 2014

22:30

Sampling bias refers to a statistical sample that has been collected in such a way that some members of the intended statistical population are less likely to be included than are others. The resulting biased sample does not necessarily represent the population (which it should), because the population members were not all equally likely to have been selected for the sample.

This affects scientific work because all scientific questions are about the population not the sample (ie. we infer from the sample to the population), and we can only answer these questions if the samples we have collected truly represent the populations we are interested in. That is, our results could be due to the method of sampling but erroneously be attributed to the phenomenon under study instead. Bias needs to be accounted for, but it cannot be assessed by looking at the sampled data alone. Bias can only be addressed via the sampling protocol itself.
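A small simulation can illustrate how unequal inclusion probabilities distort an estimate. The following Python sketch uses an entirely hypothetical population (10,000 values drawn uniformly between 0 and 100) and two sampling protocols of my own invention: one in which every member is equally likely to be chosen, and one in which larger values are more likely to be chosen. None of the numbers come from real data.

```python
import random


def sample_mean(population, weights, k, rng):
    """Mean of k draws (with replacement), each member chosen in proportion to its weight."""
    draws = rng.choices(population, weights=weights, k=k)
    return sum(draws) / k


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 measurements.
population = [rng.uniform(0, 100) for _ in range(10_000)]
true_mean = sum(population) / len(population)

# Unbiased protocol: every member has the same chance of inclusion.
equal_weights = [1.0] * len(population)
# Biased protocol: the chance of inclusion grows with the value itself.
biased_weights = [x + 1.0 for x in population]

unbiased = sample_mean(population, equal_weights, 2_000, rng)
biased = sample_mean(population, biased_weights, 2_000, rng)

print(f"population mean:      {true_mean:.1f}")
print(f"unbiased sample mean: {unbiased:.1f}")
print(f"biased sample mean:   {biased:.1f}")
```

The biased sample mean sits well above the population mean, yet nothing in the biased sample itself reveals this. Only knowledge of the sampling weights (that is, the protocol) exposes the problem.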

In genome sequencing, sampling bias is often referred to as ascertainment bias, but clearly it is simply an example of the more general phenomenon. This is potentially a big problem for next generation sequencing (NGS) because there are multiple steps at which sampling is performed during genome sequencing and assembly. These include the initial collection of sequence reads, assembling sequence reads into contigs, and the assembly of these into orthologous loci across genomes. (NB: For NGS technologies, sequence reads are of short lengths, generally

August 10, 2014

17:29

In many games of chance the odds of winning or losing remain constant during play, such as in the street coin-game Two-Up and for the casino Roulette wheel. At the other extreme, the odds of winning are sometimes determined by the players to a much greater extent, such as in the card game Poker. This is why poker is such a popular form of gambling — all players are under the delusion that the advantage lies with them alone.

In between these extremes, there are games of chance where the odds of winning vary depending on the circumstances. If a player can identify these circumstances, then they can increase their wagers when the circumstances are favorable and decrease them when they are unfavorable, thus maximizing their chances of making a profit. This is called Advantage Gambling, and it is amenable to formal mathematical analysis. These analyses have kept a number of mathematicians gainfully employed over the centuries.

Some well-known examples of advantage gambling are the use of Arbitrage Bets in sports betting, and of Card Counting in card games. This blog post is about the latter, especially as applied to the casino card-game of Blackjack. [There are also many similar games played both inside and outside casinos, such as Twenty-one, Vingt-et-un, Spanish 21, Pontoon, etc.]


In blackjack the player is betting their card hand against that of the dealer (not any other player). The basic idea is to be dealt a hand of cards whose face values sum to a final score that is higher than that of the dealer's hand without exceeding a sum of 21. There are many variants throughout the world, although they tend to be minor variations on a single basic theme (as described by Wikipedia). In general, the dealer follows a strict set of rules specifying how many cards they can be dealt, while the player has a free choice regarding their own hand.

Clearly, the composition of the cards being dealt must change throughout a series of hands being dealt, because the deck of cards (or more usually several decks) gradually becomes exhausted. If the cards have been shuffled so that the random order of the cards is very even then there will be little change in composition through time, but if the random order is clustered (as it can be by random chance) then the composition of the cards remaining to be dealt may favor either the dealer or the player.

This favoritism happens because the dealer has to follow a fixed strategy, and certain cards favor that strategy. In particular, the dealer must always be dealt another card when their hand sums to a total in the range 12-16 (and sometimes 17). If the card dealt is a 10, J, Q or K (all of which have a value of ten) then the dealer's sum will exceed 21, and the player will win. Thus, if there is a high proportion of these cards remaining in the deck then the dealer is at a disadvantage relative to the player, who can choose not to take the extra card. On the other hand, if there is a high proportion of low cards remaining (especially 4s, 5s and 6s) then the dealer will not be disadvantaged.

In general play, the casino dealer will have an advantage of 0.5-1%, depending on the precise rules of play and how many decks of cards are in use simultaneously. So, in the long term the casino will make a profit, which is why they are in the gambling business in the first place. However, they make a smaller profit from blackjack than from any of their other games (for example, in roulette the casino's advantage is usually 5.3% in the USA and 2.7% in Europe), and this means that for blackjack the advantage gambler doesn't have to move the advantage very far for it to be in their favor instead of the casino's.
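The roulette figures quoted above can be checked with a little arithmetic. The following is a minimal sketch, assuming a single-number bet that pays 35 to 1 on a wheel with 38 pockets (USA) or 37 pockets (Europe); the function name is purely illustrative.

```python
from fractions import Fraction


def straight_up_house_edge(pockets: int, payout: int = 35) -> Fraction:
    """Casino's expected profit per unit staked on a single-number bet."""
    win_prob = Fraction(1, pockets)
    # The player wins `payout` units with probability win_prob,
    # and loses the one-unit stake otherwise.
    player_expectation = win_prob * payout - (1 - win_prob)
    return -player_expectation


american = straight_up_house_edge(38)  # wheel with 0 and 00
european = straight_up_house_edge(37)  # wheel with a single 0
print(f"USA: {float(american):.1%}, Europe: {float(european):.1%}")
```

The exact edges are 1/19 and 1/37, which round to the 5.3% and 2.7% given in the text.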

There is a Basic Strategy in blackjack, which stipulates what the gambler should do when their hand has any specified total against that of the dealer's — that is, whether they should Stand, Hit, Double Down, or Split. This was first explained by Roger Baldwin, Wilbert E. Cantey, Herbert Maisel and James P. McDermott in 1956 (Optimum strategy in blackjack. Journal of the American Statistical Association 51: 429-439); and Wikipedia provides a simple exposition. For the gambler, this strategy will lose the least amount of money to the casino in the long term (ie. lose only the 0.5-1% referred to above), as determined by mathematical analysis.

The advantage gambler wants to change these odds. The most common advantage play for blackjack is card counting, and it can change the advantage to be up to 2% in the gambler's favor. The essential idea is to keep a running track of whether the remaining undealt cards are biased towards small values (2, 3, 4, 5, 6) or large values (10, J, Q, K, A). To do this, a pre-specified value is added to the running total for each of the small cards that have already been dealt (and therefore can't still be in the deck), and a pre-specified value is subtracted for each of the large cards. The value of the running count will then indicate how much the advantage is in favor of the gambler. The gambler can then bet according to the size of their advantage.
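As a rough illustration of the running count just described, here is a minimal sketch. It assumes the widely published Hi-Lo point values (+1 for 2 to 6, 0 for 7 to 9, -1 for tens, court cards and aces), which are not listed in this post; the card labels and function name are illustrative only.

```python
# Assumed Hi-Lo point values; "T" stands for a ten.
HI_LO = {
    **{rank: +1 for rank in "23456"},
    **{rank: 0 for rank in "789"},
    **{rank: -1 for rank in ("T", "J", "Q", "K", "A")},
}


def running_count(dealt_cards, tags=HI_LO):
    """Sum the point values of every card that has already been dealt."""
    return sum(tags[card] for card in dealt_cards)


dealt = ["2", "5", "K", "9", "6", "A", "3", "T"]
count = running_count(dealt)
print(count)
```

A positive count means that more small cards than large cards have left the deck, so the undealt cards are relatively rich in tens and aces, which is the situation that favors the gambler.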

There is nothing unique about this: "anyone who aspires to play Bridge, Stud Poker, Rummy, Gin, Pinochle, or Go Fish knows that you must keep track of the played cards" (Norman Wattenberger. 2009. Modern Blackjack: an Illustrated Guide to Blackjack Advantage Play). It requires no special mathematical ability, although you do have to pay attention and not forget what your count currently is (this is far simpler than playing bridge, where to play well you need to keep precise information about the remaining cards). Blackjack has apparently increased in popularity over the last 40 years because it is one of the few casino games that can consistently be won using expert play (video poker may be another). However, unsurprisingly, the casinos will try to stop you from winning via card counting.

The idea of counting cards in blackjack has been around since at least the 1950s, but the first popular text on the subject was Edward O. Thorp's book Beat the Dealer: a Winning Strategy for the Game of Twenty-One (1962). Since then, oodles of card counting systems have been devised, which differ in how many points are to be added to or subtracted from the running total for each card that is dealt. They range from relatively easy to implement to unnecessarily difficult.


We can look at the relationship between the different counting systems using a phylogenetic network. The data for 24 of these systems are available at Norman Wattenberger's Card Counting page (see also Popular Card Counting Strategies). The above graph is a NeighborNet (based on the Manhattan distance) of these data. Systems near each other in the network have a similar assignment of points to cards, while systems further apart are progressively more different from each other. The network shows a simple trend of increasing complexity of the systems from the top-right to the bottom-left. [Note that some of the systems use the same points, and thus appear at the same place in the network, but these do differ in other ways.]
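To make the distance calculation concrete, here is a minimal sketch of the Manhattan distance between counting systems. It uses the commonly published point values for three systems (Hi-Lo, K-O and Hi-Opt I) rather than the actual 24-system dataset, and it assumes one column per card rank; how the original analysis coded the ten-valued cards is not stated.

```python
from itertools import combinations

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "A"]

# Assumed point values per rank, in the order given by RANKS.
SYSTEMS = {
    "Hi-Lo":    [1, 1, 1, 1, 1, 0, 0, 0, -1, -1],
    "K-O":      [1, 1, 1, 1, 1, 1, 0, 0, -1, -1],
    "Hi-Opt I": [0, 1, 1, 1, 1, 0, 0, 0, -1, 0],
}


def manhattan(a, b):
    """Sum of absolute differences between two point assignments."""
    return sum(abs(x - y) for x, y in zip(a, b))


distances = {
    (p, q): manhattan(SYSTEMS[p], SYSTEMS[q])
    for p, q in combinations(SYSTEMS, 2)
}
for pair, d in distances.items():
    print(pair, d)
```

A pairwise distance matrix of this kind is the input that a NeighborNet analysis needs. Systems with identical point values have a distance of zero, which is why they land on the same spot in the network.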

This trend correlates quite well with the perceived ease of use of the systems, with the hardest ones to use being highlighted in red in the network and the medium ones in blue. The hardest ones do seem to be the most successful at predicting good betting situations. However, the consensus seems to be that the most complex systems are not that much better than some of the simpler ones — these are slightly less powerful but far easier to use. That is, the differences in difficulty are much greater than are the differences in performance, and so the complex ones are rarely recommended these days.

The powerful but simple systems include KISS III, K-O, REKO and Red Seven. Indeed, K-O appears to be becoming one of the most popular card counting systems. However, the older Hi-Lo is probably the most used counting strategy in existence.

Other games

Actually, consistently winning at blackjack is now old hat. What is far more interesting is trying to be an advantage gambler at games like lotto and the lotteries. Advantage gambling at lotto turns out sometimes to be an investment strategy rather than a gamble. For example, there have been times when the prize money has actually been greater than the cost of the betting tickets required to cover all of the needed number combinations (see The International Lotto Fund) and other times when the prize distribution has made each ticket worth more than it costs (see Massachusetts' Cash WinFall). My favorite, though, is trying to work out how to use advantage gambling for scratch lotteries, the gambling that usually has the worst chance of winning (see this article about Joan Ginther, who has clearly tried).

August 5, 2014

22:30

Data-display networks are a means of visualizing complex patterns in multivariate data. One particular use is for displaying the patterns in a set of trees. For example, Consensus Networks and SuperNetworks are splits graphs that display the patterns common to some specified subset of a collection of trees (eg. a set of equally optimal trees, or a set of trees sampled by a Bayesian or bootstrap analysis). Alternatively, Parsimony Networks try to simultaneously display all of the trees in a collection of most-parsimonious trees for a single dataset.

Another display method for multiple trees is what has been called a Cloudogram (see the post Cloudograms and data-display networks). These superimpose the set of all trees arising from an analysis, so that dark areas in such a diagram will be those parts where many of the trees agree on the topology, while lighter areas will indicate disagreement.

Yet another method for combining trees into a graph while retaining all of the original information from the source trees is the Tree Alignment Graph (TAG), an idea introduced by Stephen A. Smith, Joseph W. Brown and Cody E. Hinchliff (2013. Analyzing and synthesizing phylogenies using tree alignment graphs. PLoS Computational Biology 9: e1003223).


The authors note:
These methods address the problem of identifying common nodes and edges across sets of phylogenetic trees and constructing a data structure that efficiently contains this information while retaining original source information ... Mapping trees into a TAG exploits the fact that rooted phylogenetic trees are in fact a specific type of graph: they are directed, acyclic, and require that each node has, at most, one parent. By relaxing these requirements, we can combine multiple trees into a common graph, while minimizing changes to the semantic interpretations of nodes and edges in the trees. Because they contain nodes and edges directly analogous to those from their source trees, TAGs have the desirable quality of retaining the full identifiability of the original source trees they contain. Additionally, because they are not restricted to the bifurcating model of evolution, TAGs may represent conflict among source trees as reticulations in the graph.

The basic principle is illustrated in the first figure (above). Internal nodes represent collections of terminal nodes, and arcs (directed edges) represent their relationships. Nodes and arcs are added to the growing TAG, each of which represents one relationship shown in one of the original trees. TAG A in the figure shows the result of combining the black, blue and orange trees, while TAG B shows the result of then adding the gray and green trees to TAG A (the arcs are colour-coded). The resulting TAG is thus a database of all of the original information, which can then be queried in any way to provide summaries of the data. In particular, standard network summaries can be used, such as node degree, which will highlight parts of the TAG with interesting characteristics.
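The general idea can be sketched in a few lines of code. This is a toy illustration only, not the authors' implementation: it assumes that rooted trees are written as nested tuples of tip names, that a node is identified simply by the set of tips below it, and that each arc records which source trees contain it. All of the function names are hypothetical.

```python
from collections import defaultdict


def leaf_set(tree):
    """Return the set of tip names below a node (a tip is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def add_tree(tag, tree, source):
    """Add every parent-to-child arc of one rooted tree to the graph."""
    if isinstance(tree, str):
        return
    parent = leaf_set(tree)
    for child in tree:
        tag[(parent, leaf_set(child))].add(source)  # remember the source tree
        add_tree(tag, child, source)


def parents_of(tag):
    """Map each node to the set of distinct parent nodes pointing at it."""
    parents = defaultdict(set)
    for parent, child in tag:
        parents[child].add(parent)
    return parents


tag = defaultdict(set)
add_tree(tag, ((("A", "B"), "C"), "D"), "tree1")
add_tree(tag, (("A", ("B", "C")), "D"), "tree2")

# Nodes with more than one parent are reticulations, i.e. conflict.
conflicts = {node: ps for node, ps in parents_of(tag).items() if len(ps) > 1}
for node, ps in conflicts.items():
    print(sorted(node), "has", len(ps), "parents")
```

Here the two trees agree on the arc from the root to the clade (A, B, C), which is therefore labelled with both sources, but they disagree about the grouping within that clade, so the tips A, B and C each end up with two parents. Counting parents in this way is a simple node-degree summary of the kind mentioned above.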


The authors provide two empirical examples of applications. The one shown here involves 100 bootstrap trees for 640 species representing the majority of known lineages from the Angiosperm Tree of Life dataset (chloroplast, mitochondrial, and ribosomal data). The TAG is shown lightly in the background. Superimposed on this, the nodes are coloured to represent the effective number of parent nodes, and their size represents node bootstrap support. Highly supported nodes with a low number of effective parents (large blue nodes) are frequently recovered and confidently placed in the source trees, while highly supported nodes with a high number of effective parents (large and pink or orange) are frequently resolved in the source trees but their placement varies among bootstrap replicates. So, the three largest problem areas as illustrated in the TAG correspond to the Malpighiales, Lamiales and Ericales.

For comparison, a NeighborNet analysis of the same data is shown in the blog post When is there support for a large phylogeny? This simply shows an unresolved blob.

August 3, 2014

16:30

Cheese making is about 8,000 years old, and there are now about 1,000 distinct types of cheese throughout the world. As with most ancient crafts, the art of making cheese is to get the microbes to do most of the work for you.

To this end, there has been much interest in the microbial communities that occur in cheese rinds (the bit around the outside). Different communities are expected to be associated with different styles of cheese, since the production process can be quite different. This is shown in the first figure, which emphasizes that much of the difference between cheeses is due to different maturation procedures.

From Wolfe et al. (2014).
Recently, Wolfe BE, Button JE, Santarelli M, and Dutton R (2014. Cheese rind communities provide tractable systems for in situ and in vitro studies of microbial diversity. Cell 158: 422-433) had a look at the dominant genera of bacteria and microfungi in the rind communities of 137 different types of cheese. They don't actually tell us much about which cheeses these were, merely claiming:
We attempted to evenly sample across rind type (24 bloomy rind cheeses, 52 washed rind cheeses, and 61 natural rind cheeses) and geographic regions (87 European cheeses across 9 countries; 50 American cheeses across 13 states from the West Coast to the east Coast). We also attempted to sample across different milk types (77 cow milk, 34 goat milk, 21 sheep milk, and 5 mixed milk) and milk treatments (99 raw milk, 38 pasteurized).

Based on sequencing the bacterial 16S and fungal ITS loci, the authors identified 14 bacterial and 10 fungal genera (moulds and yeasts) that occurred with an average abundance of >1%, as shown in the next figure.

The 137 rind samples with their bacterial (middle row) and fungal (bottom row) genera indicated
by different colours. The order of the samples was determined by UPGMA clustering (top row).
The authors also used shotgun metagenomic sequencing to identify a range of genes in the microorganisms. They present a phylogeny of one particular gene (shown in the next figure) that shows a close relationship between some of the cheese microbes and marine bacteria:
The widespread distribution and high abundance of marine-associated gamma-Proteobacteria, enriched in both washed and bloomy rind cheeses, was an unexpected finding in our survey of taxonomic diversity ... One possible source of these marine microbes is the sea salt used in cheese production.

[Note: the other cheese rind bacterium shown in the phylogeny, Brevibacterium linens, is the one responsible for the unbelievable smell of washed-rind cheeses such as Epoisses, Münster and Limburger. It is also responsible for personal-hygiene issues such as foot odour. You can imagine how it first got into cheese making!]


However, Ropars J, Cruaud C, Lacoste S, and Dupont J (2012. A taxonomic and ecological overview of cheese fungi. International Journal of Food Microbiology 155: 199-210), in a related study, have pointed out the usual problem with microbial phylogenies: gene trees are frequently incongruent. So, the gene phylogeny shown above is not likely to be the species phylogeny. It would thus be of great interest to investigate the full microbial network, rather than looking at a single tree.

July 29, 2014

22:30

This post is just to let everyone know that Dan Gusfield's long-awaited book on the interface between phylogenetics and population genetics is now available.


The book, ReCombinatorics, is targeted at mathematically inclined readers. It has a few contributions from Charles H. Langley, Yun S. Song and Yufeng Wu. The title is described as "a portmanteau word derived from the single-crossover recombination of the words 'recombination' and 'combinatorics'."

Hardcover 448 pp; ISBN: 9780262027526; $60.00 £30.95
More information is available from The MIT Press.

This new book joins these previous contributions to the genre:

Image from Celine Scornavacca.

July 27, 2014

16:30

On pages 72-73 of the book Guide to Urban Moonshining: How to Make and Drink Whiskey (written by Colin Spoelman and edited by David Haskell, 2013, published by Harry N. Abrams), there is an illustration of something called the "American whiskeys family tree". This is reproduced in the article The Bourbon Family Tree for GQ magazine, from where I sourced the copy here.


The author describes it as follows:
This chart shows the major distilleries operating in Kentucky, Tennessee, and Indiana, grouped horizontally by corporate owner, then subdivided by distillery. Each tree shows the type of whiskey made, and the various expressions of each style of whiskey or mash bill, in the case of bourbons. For instance, Basil Hayden's is a longer-aged version of Old Grand-Dad, and both are made at the Jim Beam Distillery.

So, while the vertical axis is indeed a time scale, the trees are only marginally family trees in the genealogical sense. This is much more an attempt to illustrate the corporate ownership of American whiskey, which is made principally from corn (and thus is generically called bourbon, although in Tennessee they seem to rarely use this word). The main distinctions among the brands are (i) whether the non-corn part is made from rye, a little bit more rye, or wheat, and (ii) the length of time it is aged between distillation and sale.

The reticulations among the trees apparently refer to blends. The ghost lineages at the right are described thus:
Willett, formerly only a bottler as Kentucky Bourbon Distillers, has been distilling its own product for about a year; I include the brands that it bottles from other sources for reference.

July 22, 2014

22:30

I have written before about the expected genetic problems associated with inbreeding, including consanguinity and incest (relationships between people who are first cousins or closer). Conventionally, the evolutionary advantage of sexual over non-sexual reproduction is considered to be the creation of genetic diversity through heterozygosity. Inbreeding, by reducing heterozygosity, then seems to negate the advantages of sexual reproduction — it leads to the propagation of deleterious recessive alleles and thus inbreeding depression. So, there is a clear evolutionary dimension to the fact that incest avoidance is nearly universal in humans.

The best known exceptions to this situation are among royalty, including the family "trees" of the ancient Egyptian 18th Dynasty (see Tutankhamun and extreme consanguinity) and the Egyptian Ptolemaic dynasty (see Cleopatra, ambition and family networks), which were hybridization networks rather than conventional trees. The presence of consanguinity and incest among royal families then requires a biological explanation. As noted by van den Berghe & Mesher (1980):
Royal incest is best explained in terms of the general sociobiological paradigm of inclusive fitness ... Royal incest (mostly brother-sister; less commonly father-daughter) represents the logical extreme of hypergyny. Women in stratified societies maximize fitness by marrying up; the higher the status of a woman, the narrower her range of prospective husbands. This leads to a direct association between high status and inbreeding.

The benefits of inclusive fitness refer to the increased number of offspring in future generations that result from increasing the reproductive success of close relatives. This is achieved via choice of mate. In other words, close relatives share genes, and the success of any relative in leaving offspring is a success for all relatives. Therefore, evolutionary fitness is a combination of individual fitness plus the fitness of close relatives. Inbreeding may reduce individual fitness but can increase inclusive fitness, as noted by Puurtinen (2011):
Theoretical work has shown that inclusive fitness benefits can favor close inbreeding even when this results in substantial reduction in offspring fitness. These models have identified the boundary level of inbreeding depression limiting the evolution of inbreeding among first-order relatives, that is, between full siblings, or between parents and offspring.

So, there is a stable level of inbreeding in those populations that practice mate choice for optimal inbreeding. For example, the genetic risks of close inbreeding can be more than accounted for by the production of a highly related heir who has access to a wide choice of mates. Nevertheless:
For a wide range of realistic inbreeding depression strengths, mating with intermediately related individuals maximizes inclusive fitness.

In other words, mating with very close relatives is unlikely to evolve via natural selection because it is not an optimal strategy; and we must thus look to a sociological component to incest (such as retaining wealth within the family), as well as a biological one.


In this context, it is interesting to note exceptions to the usual restriction of incest to the aristocracy. The society of Graeco-Roman Egypt (from c. 300 BCE to 300 CE) provides the best-documented case (eg. see Hopkins 1980; Shaw 1992; Parker 1996; Scheidel 1997; Huebner 2007; Remijsen & Clarysse 2008). [This era starts with the Ptolemaic dynasty, which marks the collapse of Egyptian rule of Egypt.] During this time a significant proportion of all marriages noted in official Roman census declarations were between full brothers and sisters. That is, the Roman-era Egyptians did not limit this type of inbreeding to any small group, but spread it across several social classes (mainly Greek settlers rather than native Egyptians).

As noted by Scheidel (1997):
According to official census returns from Roman Egypt (first to third centuries CE) preserved on papyrus, 23·5% of all documented marriages in the Arsinoites district in the Fayum (n=102) were between brothers and sisters. In the second century CE, the rates were 37% in the city of Arsinoe and 18·9% in the surrounding villages. Documented pedigrees suggest a minimum mean level of inbreeding equivalent to a coefficient of inbreeding of 0·0975 in second century CE Arsinoe. Undocumented sources of inbreeding and an estimate based on the frequency of close-kin unions indicate a mean coefficient of inbreeding of F=0·15-0·20 in Arsinoe and of F=0·10-0·15 in the villages at the end of the second century CE. These values are several times as high as any other documented levels of inbreeding.

For comparison, the inbreeding F values for these family relationships are:

Relationship                           F
self                                   0.500
parent-offspring = siblings            0.250
uncle-niece = double first cousins     0.125
first cousins                          0.063
first cousins once removed             0.031
second cousins                         0.016
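These F values can be reproduced with the standard path-counting calculation. The following is a minimal sketch, assuming that none of the common ancestors is itself inbred. Each shared ancestor then contributes (1/2)^(n1 + n2 + 1), where n1 and n2 are the numbers of generations from the two mates back to that ancestor. The function name and the dictionary layout are illustrative only.

```python
from fractions import Fraction


def inbreeding_coefficient(paths):
    """F of the offspring, given one (n1, n2) pair per shared ancestor.

    n1 and n2 count the generations from each mate back to the ancestor;
    the ancestors themselves are assumed not to be inbred.
    """
    return sum(Fraction(1, 2) ** (n1 + n2 + 1) for n1, n2 in paths)


RELATIONSHIPS = {
    "self": [(0, 0)],
    "parent-offspring": [(0, 1)],
    "siblings": [(1, 1)] * 2,
    "uncle-niece": [(1, 2)] * 2,
    "double first cousins": [(2, 2)] * 4,
    "first cousins": [(2, 2)] * 2,
    "first cousins once removed": [(2, 3)] * 2,
    "second cousins": [(3, 3)] * 2,
}

F = {name: inbreeding_coefficient(p) for name, p in RELATIONSHIPS.items()}
for name, value in F.items():
    print(f"{name}: {value} = {float(value)}")
```

The exact fractions (1/2, 1/4, 1/8, 1/16, 1/32 and 1/64) match the listed values once these are rounded to three decimal places.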
However, inbreeding depression seems not to have been a notable problem during this historical time. As noted by John Hawkes:
There is not a single mention in the evidence that links sibling marriage to negative genetic effects or unhappy marriages.

This does not mean that there were no problems, but merely that any problems were not documented, as noted by Scheidel (1997):
Even in the absence of explicit references to inbreeding depression from Roman Egypt, there is no compelling reason to assume that brother–sister marriage could have remained entirely without negative consequences for the Arsinoites. It is however possible that, due to a low incidence of lethal recessives, such effects were considerably weaker than in some western samples. The census returns do not suggest lower levels of fertility or smaller numbers of children among sibling couples ...

The practice seems to have stopped solely because it was contrary to Roman Law:
Before a.d. 212 the Romans had accepted discrepancies between their own legal practice and prevailing local customs and traditions in the Eastern provinces. Papyri from Roman Egypt, the Talmud, and the Romano-Syrian law book indeed reveal legal procedures which differed significantly from Roman law in matters such as marriage, guardianship, paternal authority, sales, and debts. The Constitutio Antoniana, however, made all free men and women of the Roman Empire into Roman citizens, and so Roman law became applicable to all inhabitants of Egypt. Brother-sister marriages cease to be documented in our Roman census returns from the early third century on. Our last [incest] testimony dates to a.d. 229.
References

Hopkins K (1980) Brother-sister marriage in Roman Egypt. Comparative Studies in Society and History 22: 303-354.

Huebner SR (2007) "Brother-sister" marriage in Roman Egypt: a curiosity of humankind or a widespread family strategy? Journal of Roman Studies 97: 21-49.

Parker S (1996) Full brother-sister marriage in Roman Egypt: Another look. Cultural Anthropology 11: 362-376.

Puurtinen M (2011) Mate choice for optimal (k)inbreeding. Evolution 65: 1501-1505.

Remijsen S, Clarysse W (2008) Incest or adoption? Brother-sister marriage in Roman Egypt revisited. Journal of Roman Studies 98: 53-61.

Scheidel W (1997) Brother-sister marriage in Roman Egypt. Journal of Biosocial Science 29: 361-371.

Shaw BD (1992) Explaining incest: brother-sister marriage in Graeco-Roman Egypt. Man 27: 267-299.

July 20, 2014

16:30

I have commented before on the fact that the general public associates an inappropriate "March of Progress" image with the concept of "evolution" (see Haeckel and the March of Progress, and especially Tattoo Monday VIII - the March of Progress). It therefore seems worthwhile to gather a few examples together in the one place. Most of these are abbreviated versions of the image in the book Early Man by Francis C. Howell (1965. Time-Life International, New York). There were originally 14 images (see the version here), but the modern versions have half as many images or fewer.














July 16, 2014

01:14
We all worked hard during the workshop. Here is our fearless leader, in deep thought:

While some of the younger participants enjoyed drawing on the walls:

Professor Whitfield has come up with a great new model of evolution: phylogenetic windmills:


There was not only work, but also time to relax and enjoy the beautiful Dutch summer weather:

And not to forget the delicious Dutch food:

But really, most of the time we were busy touching the data, which you can find on this website:


For more photos, see the Touching the Data website.

July 11, 2014

14:33

We have now completed the workshop.

Since the first report, we have had three more talks. First, Mukul Bansal outlined the relationship between phylogenetic networks and reconciliation analysis, and the way in which the latter can be used to construct the former. Starting from an estimated species tree, the tree for each locus is optimized for fit to the species tree, which helps locate any areas of extensive gene flow (ie. reticulation). This can be done using a large number of loci and an even larger number of taxa.

Celine Scornavacca provided details of some of the fundamental limitations of network analysis. The most important of these is unidentifiability of network topologies -- there are classes of network topologies that cannot be distinguished based on the information that is currently used, so that we cannot guarantee that a unique optimal network will be found during an analysis. Branch lengths may help with this situation, but are not guaranteed to resolve it.

Jim Whitfield covered the advantages and potential problems of using genomic-scale data for phylogenetic analysis. The basic problem is the increased scope for error in moving to the genome data (genome assembly problems, gene homology issues, alignment difficulties), although the potential advantages are extensive.

Most importantly, we spent two days "touching" some data. The participants broke into smaller groups of continuously varying size, each of which focussed on a particular dataset (as supplied by some of the participants). These data were evaluated in many different ways, to assess the characteristics of the data as well as to evaluate the data-analysis methods. This not only allowed us to identify the current state of the art with respect to phylogenetic networks, but it also allowed computationalists to improve their understanding of biological data and how biologists proceed to analyze it, as well as allowing biologists to obtain immediate feedback with respect to their data-analysis issues.

Production of phylogenetic networks seems to have come a long way in the past few years, although there is still no single "one-stop shopping" software tool to use. Practical issues in getting programs to run on all computer types were identified, along with data-format issues. Nevertheless, all of the participants seemed to find that this was a very valuable exercise, as a means of focussing interactions among themselves.

Finally, we considered both European and U.S. funding for network research, in the latter case assisted by David Mindell (from the N.S.F.). In particular, we identified sources of funding for future workshops (either in the south of France or the north-eastern U.S.A.).

The canal-boat cruise turned out well, in spite of the somewhat uncooperative weather. The football, of course, has turned out to be rather disappointing for the hosts, although they have one more game to play.

July 8, 2014

13:15

We have now completed two days of the workshop. We have had a relaxed approach to progress, and are thus currently running behind the nominal schedule. Nevertheless, we are progressing splendidly.

We had three talks on the first day and one today. I tried to kick things off by asking a series of what I consider to be unanswered questions from observing practitioners and computationalists in action, although apparently several members of the audience already had their own answers to some of these. The bottom line is that phylogenetic analysis focuses on data patterns while interpretation focuses on processes / mechanisms, and this constitutes a large part of the apparent separation of practitioners and computationalists.

Steven Kelk and Luay Nakhleh introduced the diversity of computational approaches that we already have. These presentations neatly complemented each other, providing a valuable summary of the field as well as an overview of current limitations and future prospects. This topic was taken up later by various members of the audience, as one of the inherent problems for practitioners is how to navigate through the methods to choose a suitable one -- there are methods based on parsimony, likelihood and Bayesian analysis, and methods that tackle de novo network construction, gene tree / species tree reconciliation, gene tree scoring, and network presentation.

This topic was followed up today by presentations introducing some of the currently available software. Some of these have progressed significantly in recent years, notably PhyloNet and Dendroscope, and there are some relatively new ones, as well as even newer ones in the pipeline. Based on the literature, these programs are being dramatically under-used compared to their actual usefulness.

This morning Scot Kelchner introduced us to the application of Zen Buddhism to science in general and phylogenetics in particular. This went down much better than he seemed to be expecting -- there were apparently a lot of "Zen" people in the room. The basic idea is not to get trapped by preconceived expectations, especially arbitrary categorical notions, when interpreting the output of a phylogenetic analysis. You can consult The Nine-Headed Dragon River, by Peter Matthiessen, if you would like further information.

Finally, we got to the topic implied by the workshop's title: Touching the Data. We had a brief run-through of the pre-existing datasets stored with this blog (see the upper right-hand corner), which cover some of the diversity of what practitioners have provided to date in the way of usable datasets with "known" phylogenetic patterns.

By far the most interesting, however, was the presentation of some recent datasets made available by members of the workshop, notably Axel Janke (bear species), Scot Kelchner (bamboo species) and Mattis List (Indo-European languages) (Jim Whitfield will present his datasets tomorrow morning). These datasets generated much interest, as they provide a diversity of different possible applications for phylogenetic networks. The idea from here on in the workshop is to address what can currently be done with these datasets and what we might like to do with them if the tools were available. This will help focus the participants on specific practical issues, which should lead to the progress that we hope to achieve.

It has rained most of the day, which is actually unusual -- intermittent rain is more common in this climate. We are currently waiting for the football to start: Germany versus Brazil. Tomorrow will be the Netherlands versus Argentina. It is risky being in this country this week! The current local betting is for an all-European final, an assessment that involves no cultural bias whatsoever.