The Genealogical World of Phylogenetic Networks
Biology, computational science, and networks in phylogenetic analysis
December 11, 2013
The question has been asked as to which of the current general books about phylogenetics actually cover phylogenetic networks. There are collections of essays where networks are covered, and there are specialist books, of course, but the question here is about general introductory books. While a number of books mention tree incongruence, and that this phenomenon could be represented using a reticulating graph, there appear to be only two books that specifically cover the topic of phylogenetic networks.
Barry G. Hall (2011) Phylogenetic Trees Made Easy: A How-To Manual, Fourth Edition. Sinauer Associates, Sunderland MA.
The first three editions (2001, 2004, 2008) discussed trees only, but the fourth edition adds a chapter on networks. Chapter 15 (pp. 219-248) explicitly notes that "The material presented here is drawn almost entirely from the new book Phylogenetic Networks: Concepts Algorithms and Applications", which the author also notes was "made available to me in manuscript prior to its publication."
There are five sections in the chapter:
Why Trees Are Not Always Sufficient
Unrooted and Rooted Phylogenetic Networks
Learn More about Phylogenetic Networks
Using SplitsTree to Estimate Unrooted Phylogenetic Networks
Using Dendroscope to Estimate Rooted Networks from Rooted Trees
The first three sections are theoretical introductions to the topic, and the final two sections proceed through a worked example (a different one each).
The book provides a basic introduction to phylogenetics, which is its intent. So, the network topics are presented in a straightforward manner, which makes them easy to grasp. The worked examples are cookbook style, intended solely to get you started using the two chosen computer programs.
The author is to be congratulated for producing not only the first, but so far the only, general book that covers evolutionary networks.
Philippe Lemey, Marco Salemi, Anne-Mieke Vandamme (editors) (2009) The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing, Second Edition. Cambridge University Press, Cambridge.
The first edition (2003) had a chapter on SplitsTree by Vincent Moulton, and this was revised in the current edition to Split Networks: a Tool for Exploring Complex Evolutionary Relationships in Molecular Data, Chapter 21 (pp. 631-653), by Vincent Moulton and Katharina Huber.
The chapter provides a general introduction to the theory of splits graphs and their uses; and the practical exercises use SplitsTree. This was the first general book on phylogenetics to include networks, although evolutionary networks are not covered.
The coverage of networks is the final topic in the book in both cases, so it can hardly be claimed to have an important place. Nevertheless, these books are at least one step ahead of their competitors.
All of these books are examples of the contemporary focus on congruent tree patterns in evolution, with reticulate relationships being almost an afterthought. There is nothing in the word "phylogeny" that specifies a shape for evolutionary history — it comes from the Greek phylon "race" + geneia "origin". Evolutionary groups may arise by either vertical or horizontal processes, and so evolution may be tree-like or it may not. The current focus almost exclusively on trees is therefore somewhat misplaced.
December 8, 2013
In 2008, Michael Barton conducted a Bioinformatics Career Survey. Since then, various groups have updated some of that information by conducting polls of their own. Below, I have included some of the more recent results, for your edification.
This first one comes from the Bioinformatics Organization, in response to the question: What is your undergraduate degree in? It is interesting to note that more bioinformaticians are biologists by training, rather than computational people.
The next one is actually an ongoing poll at BioCode's Notes, in response to the question: Which are the best programming languages for a bioinformatician? R is an interesting choice as the most useful language, given the more "traditional" use of Perl and Python.
That leads logically to another of the Bioinformatics Organization's questions: Which computer language are you most interested in learning (next) for bioinformatics R&D? I guess that if you already know R, then either Python or Perl is a useful thing to learn next.
Furthermore, the Bioinformatics Organization also asked: Which math / statistics language / application do you most frequently use? The choice of R here is more obvious, given that it is free, which most of the others are not. I wonder what the answer "none of the above" refers to.
December 3, 2013
A couple of weeks ago we received an unexpected influx of visitors to this blog, being directed here by an article at the NBC News site. This article cited one of our blog posts (Network analysis of Genesis 1:3) as an example of the use of phylogenetic analysis in stemmatology (the discipline that attempts to reconstruct the transmission history of a written text). The NBC article itself is about a recently published paper that applies these same techniques to an oral tradition instead: the tale of Little Red Riding Hood. This paper has generated much interest on the internet, being reported in many blog posts, on many news sites, and in many Twitter tweets. After all, the young lady in red has been known for centuries throughout the Old World.
Needless to say, I had a look at this paper (Jamshid J. Tehrani. 2013. The phylogeny of Little Red Riding Hood. PLoS One 8: e78871). The author collated data on various characteristics of 58 versions of several folk tales, such as plot elements and physical features of the participants. These tales included Little Red Riding Hood (known as Aarne-Thompson-Uther tale ATU 333), which has long been recorded in European oral traditions, along with variants from other regions, including Africa and East Asia (where it is known as The Tiger Grandmother), as well as another widespread international folk tale, The Wolf and the Kids (ATU 123), which has been popular throughout Europe and the Middle East. As the author notes: "since folk tales are mainly transmitted via oral rather than written means, reconstructing their history and development across cultures has proven to be a complex challenge."
He produced phylogenetic trees from both parsimony and Bayesian analyses, along with a neighbor-net network. He concluded: "The results demonstrate that ... it is possible to identify ATU 333 and ATU 123 as distinct international types. They further suggest that most of the African tales can be classified as variants of ATU 123, while the East Asian tales probably evolved by blending together elements of both ATU 333 and ATU 123." His network is reproduced here.
There is one major problem with this analysis: all three graphs are unrooted, and you can't determine a history from an unrooted graph. A phylogeny needs a root, in order to determine the time direction of history. Without time, you can't distinguish an ancestor from a descendant — the one becomes the other if the time direction is reversed. Unfortunately, the author makes no reference to a root, at all.
So, his recognition of three main "clusters" in his graphs is unproblematic (ATU 333; East Asian; and ATU 123 + African), although the relationship of these clusters to the "India" sample is not clear (as shown in the network). On the other hand, his conclusions about the relationships among these three groups are not actually justified in the paper itself.
Rooting the trees
So, the thing to do is put a root on each of the graphs. We cannot do this for the network, but we can root the two trees, and we can take the nearest tree to the network and root that, instead.
There are several recognized ways to root a tree in phylogenetics (Huelsenbeck et al. 2002; Boykin et al. 2010):
I therefore did the following:
Having the East Asian samples as the sister to the other tales does not match what would be expected for the historical scenario suggested by the original author from his unrooted graphs — that the East Asian tales "evolved by blending together elements of both ATU 333 and ATU 123".
Instead, this placement exactly matches an alternative theory that the author explicitly rejects: "One intriguing possibility raised in the literature on this topic ... is that the East Asian tales represent a sister lineage that diverged from ATU 333 and ATU 123 before they evolved into two distinct groups. Thus, ... the East Asian tradition represents a crucial 'missing link' between ATU 333 and ATU 123 that has retained features from their original archetype ... Although it is tempting to interpret the results of the analyses in this light, there are several problems with this theory."
The UPGMA root, on the other hand, would be consistent with the blending theory for the origin of the East Asian tales. However, this tree actually presents the African tales as distinct from ATU 123, rather than being a subset of it.
Anyway, the bottom line is that you shouldn't present scenarios without a time direction. History goes from the past towards the present, and you therefore need to know which part of your graph is the oldest part. A family tree isn't a tree unless it has a root.
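To make the point about rooting concrete, here is a minimal Python sketch. The four-tip tree, its labels (A to D, x, y) and the helper functions are all hypothetical, chosen only for illustration; they are not taken from the paper's data. The sketch places the root on two different branches of the same unrooted tree and lists the groups (clades) that each rooting implies.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical unrooted tree with four tips (A-D) and two internal nodes (x, y):
# A and B attach to x; C and D attach to y; x and y are joined by one branch.
UNROOTED: Dict[str, List[str]] = {
    "x": ["A", "B", "y"],
    "y": ["C", "D", "x"],
    "A": ["x"], "B": ["x"], "C": ["y"], "D": ["y"],
}


def collect_clades(adj: Dict[str, List[str]], node: str, parent: str,
                   out: Set[FrozenSet[str]]) -> FrozenSet[str]:
    """Return the tips below `node` (walking away from `parent`),
    recording the tip set of every internal node in `out`."""
    children = [n for n in adj[node] if n != parent]
    if not children:  # a tip has no neighbours other than its parent
        return frozenset([node])
    tips = frozenset().union(
        *(collect_clades(adj, child, node, out) for child in children)
    )
    out.add(tips)
    return tips


def clades_rooted_on(adj: Dict[str, List[str]], u: str, v: str) -> Set[FrozenSet[str]]:
    """Place the root on the branch between u and v and list the implied clades."""
    out: Set[FrozenSet[str]] = set()
    collect_clades(adj, u, v, out)
    collect_clades(adj, v, u, out)
    return out


root_internal = clades_rooted_on(UNROOTED, "x", "y")  # root on the central branch
root_on_c = clades_rooted_on(UNROOTED, "C", "y")      # root on the branch leading to C

print("root on x-y branch:", sorted(sorted(c) for c in root_internal))
print("root on C branch:  ", sorted(sorted(c) for c in root_on_c))
```

With the root on the central branch, C and D form a group. With the root on the branch leading to C, that group disappears and C becomes the sister to everything else. The unrooted graph is identical in both cases, so it cannot by itself tell us which history is the right one.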
Boykin LM, Kubatko LS, Lowrey TK (2010) Comparison of methods for rooting phylogenetic trees: a case study using Orcuttieae (Poaceae: Chloridoideae). Molecular Phylogenetics & Evolution 54: 687-700.
Huelsenbeck J, Bollback J, Levine A (2002) Inferring the root of a phylogenetic tree. Systematic Biology 51: 32-43.
December 1, 2013
The physical sciences have long had preprint archives, notably the arXiv (founded in 1991), which is managed by Cornell University Library. Bioinformaticians have been active users of these archives, at least partly because getting mathematical papers published can take up to 2 years (see Backlog of mathematics research journals). Bioinformatics moves faster than that. There have been more general preprint services, as well, such as Nature Precedings, which operated from 2007 to 2012.
There have recently been moves afoot to provide similar services specifically for biologists; and the beta version of the bioRxiv has now come online:
bioRxiv (pronounced "bio-archive") is a free online archive and distribution service for unpublished preprints in the life sciences. It is operated by Cold Spring Harbor Laboratory, a not-for-profit research and educational institution. By posting preprints on bioRxiv, authors are able to make their findings immediately available to the scientific community and receive feedback on draft manuscripts before they are submitted to journals.
Many research journals, including all Cold Spring Harbor Laboratory Press titles, EMBO Journal, Nature journals, Science, eLife, and all PLOS journals allow posting on preprint servers such as bioRxiv prior to publication. A few journals will not consider articles that have been posted to preprint servers.
Preprint policies are summarized here: List of academic journals by preprint policy.
Many people seem to see archives such as this as having their principal role in bridging the publication delay caused by the peer-review process (see The case for open preprints in biology for a summary of the argument). Indeed, much of the online discussion of preprints in biology seems to be about why biologists have not taken to preprints like ducks to water, asking the rhetorical question: "What are biologists afraid of?" This question pre-supposes that everyone should use preprints unless there is a good reason not to, rather than the more obvious assumption that no-one will use them unless there is a good reason to do so. On the whole, shortening the peer-review process by a few months (as is typical in biology) hardly seems like a sufficient incentive for mass usage of preprints.
However, there does seem to be a possible incentive beyond break-neck speed. An equally important point is that archives act as a powerful means of making unpublished work available online. Even if a particular manuscript is ultimately never published in a journal or book, it will still be available in the archive in its final draft form, since the archives are intended to be permanent repositories. That is, the archives are not only for pre-prints.
There are many reasons why some work never gets formally published, including incompleteness of the data, negative results, lack of perceived profundity, and being out of synch with current trends. If there is nothing inherently faulty about a manuscript, then there is no reason for it to remain unavailable to interested readers. We are no longer beholden to the publishers (or to the referees) for disseminating our data and/or ideas, although we may still prefer formal publication as the primary conduit.
For example, I started using the arXiv after it added a section on "Quantitative Biology" in 2003. I have several manuscripts in the arXiv that, for one reason or another, have not (yet) made it into print:
So, preprint archives are a valuable tool for academics, especially when those pesky referees are not being co-operative.
PS. This is post number 200 for this blog.
November 26, 2013
In this blog we have consistently championed the idea that within-species relationships are better represented by a network than by a tree. We have done this for humans and their relatives:
Networks and human inter-population variation
Human races, networks and fuzzy clusters
Why do we still use trees for the Neandertal genealogy?
and for other species as well:
Are phylogenetic trees useful for domesticated organisms?
Why do we still use trees for the dog genealogy?
Network of apple cultivars
Genetically, a within-species network is a haplotype network. Also, when dealing with individuals in a sexually reproducing species it is a hybridization network, as I have noted:
Family trees, pedigrees and hybridization networks
Charles Darwin's family pedigree network
Toulouse-Lautrec: family trees and networks
We are not the only blog to emphasize intra-species networks, of course. As far as humans are concerned, one of the more vocal blogs has been Gene Expression, run by Razib Khan over at Discover magazine. For example, when discussing phylogenetic trees (Burning down the trees in historical population genetics), Khan notes:
These sorts of trees range from Ernst Haeckel's classical attempt, depicting relationships which biologists derived from intuition within the framework of a grand evolutionary scheme, all the way down to modern methods implemented in software packages such as Mr. Bayes, which many frankly utilize in a "turnkey" manner. These trees are abstractions, in that they reduce down a wide range of phenomena into schematic representations which impart aspects of particular interest in a stylized form. This is important, because the actual nature of the phenomena being represented may be more complex than is being represented.
Phylogenetic analysis involving distinct species has its own problems, but they are dwarfed by what must confront those who attempt to parse out relatedness of populations within species. Because of the ubiquity of gene flow across populations within species, attempts to generate a tree of relationships of populations is always bound to be a gross simplification. Instead of a sequence of bifurcations the true relationship of putative populations is more accurately represented by a networked graph.
When discussing alternative evolutionary models (Unveiling the genealogical lattice), Khan notes:
It seems that the bifurcating model of the tree must now be strongly tinted by the shades of reticulation. In a stylized sense inter-specific phylogenies, which assume the approximate truth of the biological species concept (i.e., little gene flow across lineages), mislead us when we think of the phylogeny of species on the microevolutionary scale of population genetics. On an intra-specific scale gene flow is not just a nuisance parameter in the model, it is an essential phenomenon which must be accommodated into the framework.
And here the takeaway for me is that we may need to rethink our whole conception of pure ancestral populations, and imagine a human phylogenetic tree as a series of lattices in eternal flux, with admixed nodes periodically expanding so as to generate the artifice of a diversifying tree. The closer we look, the more likely it seems that most of the populations which have undergone demographic expansion in the past 10,000 years are also the products of admixture. Any story of the past 10,000 years, and likely the past 100,000 years, must give space at the center of the narrative arc to lateral gene flow across populations.
Mind you, the network and lattice metaphors are not the only ones he has up his sleeve (When trees turn into brambles):
With the expansion of genomics from humans to a wide range of species I suspect that we’ll see a lot more blurring of distinctions between species on the margins. This will be particularly true of those lineages with wide and continuous distributions. It will also be most salient and surprising for mammalian populations, where our prejudices about the primacy of a biological species concept are most strongly developed. In a phylogenetic sense when you shift the grain of analysis to a finer scale the tree of life becomes much more of a bramble in many cases.
Indeed.
November 24, 2013
In a previous blog post (Charles Darwin's family pedigree network), I mentioned several well-known people who were involved in a consanguineous marriage, which is defined as the union of two people who are related as second cousins or closer. In that post I discussed in detail Charles Darwin (who married his first cousin); and in this post I discuss the artist Henri Toulouse-Lautrec, who was the offspring of a marriage between first cousins.
I thought that this would be a simple post, because there must be people who have studied the Toulouse-Lautrec-Montfa genealogy, given Henri's fame as a Post-Impressionist artist, and the widespread knowledge that his physical disabilities were genetic. But it turned out not to be so: there is no broad family tree that I could find, and no detailed discussion of inbreeding. The main information easily available is the direct lineage of inheritance of the various noble titles to which Henri would have been heir (had he survived his father, the Comte de Toulouse-Lautrec-Montfa), which can be traced back for more than 1000 years (see Vizegrafschaft Lautrec). However, the main interest for biology lies in his genetic relationship with his cousins, as we shall see below.
So, I sat down for a day to compile it for myself. The resulting genealogy is incomplete, but all of the relevant people are in it. I could not find all of the details about some of these people, either, which are apparently not available on the web, and some of the actual dates are inconsistent across different sources. In general, I have followed Dupic (2012).
When genealogical trees become networks
The point of this post is that marriages within a family turn the family tree into a network. So, a pedigree can be tree-like or not. In the latter case it is an example of a hybridization network.
This first genealogy shows a standard family tree for a single individual, looking backward in time from the bottom. So, this person is #1, the parents are #2 (father) and #3 (mother), and so on back through the generations, always with the male parent on the left (as is the convention). This example covers six generations, showing that without inbreeding everyone has 32 great-great-great grand-parents. These 32 people's genes are mixed more-or-less randomly (depending on recombination and assortment) to produce person #1. This is a good thing, evolutionarily, because there is then genetic diversity within #1.
However, with inbreeding some part of the ancestry disappears (when looking backward in time), because another part of the ancestry is duplicated in its place.
The second genealogy shows what happens when person #7 is the daughter of someone else in the same pedigree. If she is the daughter of #10 and #11, for example, then #5 and #7 would be sisters, and #2 and #3 would be first cousins. Now, person #1 has only 24 great-great-great grand-parents, and some of them are contributing to their descendants twice, rather than once (ie. #40–#47). This means that the genetic diversity in person #1 is less than it would be without the inbreeding. More to the point, any recessive alleles that exist in the ancestry have an increased probability of being homozygous in #1, and thus being expressed in the phenotype.
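The arithmetic in these two genealogies can be checked with a short Python sketch. It assumes the numbering described above (person #1, father #2, mother #3, and in general the father of #n is #2n and the mother is #2n+1). The function names and the `SAME_PERSON` mapping are hypothetical helpers introduced here for illustration. The mapping encodes the example in the text: if #7 is the daughter of #10 and #11, then her parents' slots (#14 and #15) are filled by the same people as #10 and #11.

```python
from collections import Counter
from typing import Dict, List


def generation_slots(generation: int) -> List[int]:
    """Pedigree numbers in a generation, counting person #1 as generation 1."""
    start = 2 ** (generation - 1)
    return list(range(start, 2 * start))


def resolve(slot: int, same_person: Dict[int, int]) -> int:
    """Map a pedigree slot to the number under which that person is first listed.

    If an ancestor on the path to `slot` is the same person as another slot,
    the whole sub-pedigree above that ancestor is shared as well.
    Assumes `same_person` contains no cycles.
    """
    changed = True
    while changed:
        changed = False
        for shift in range(slot.bit_length()):
            prefix = slot >> shift               # ancestor `shift` steps nearer to #1
            if prefix in same_person:
                suffix = slot & ((1 << shift) - 1)
                slot = (same_person[prefix] << shift) | suffix
                changed = True
                break
    return slot


# Assumption from the example: #7's parents are #10 and #11.
SAME_PERSON = {14: 10, 15: 11}

slots = generation_slots(6)                       # great-great-great grand-parents
people = Counter(resolve(s, SAME_PERSON) for s in slots)

print("slots in generation 6:", len(slots))       # 32
print("distinct people:", len(people))            # 24
print("counted twice:", sorted(p for p, n in people.items() if n == 2))  # 40 to 47
```

Passing an empty mapping instead of `SAME_PERSON` gives 32 distinct people, which is the tree-like case in the first genealogy.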
This is, unfortunately, exactly what happened to Henri Toulouse-Lautrec, whose pedigree network is shown in the next figure. It is complete for six generations, plus an important part of the seventh. It is difficult to be complete beyond this generation, as the information becomes sparse, particularly about the female family members.
As shown, Henri's parents were first cousins, because their mothers were sisters. In addition, his maternal grandfather (#6) also had recent inbreeding in his history, because his mother (#13) was the daughter of a first-cousin marriage. This is not nearly as much inbreeding as has been implied by most commentators about Henri's life, but it is enough to potentially create genetic problems.
Note that it was Henri's mother's side of the family that was involved in the recent inbreeding, but the de Toulouse-Lautrec-Montfa side was prone to the same thing, as are most titled families. As noted above, Henri died before inheriting his title. The title Comte de Toulouse-Lautrec-Monfa passed from Henri's father, Alphonse, to Alphonse's next brother, Charles (1840-1917), who had no children, and thence to the next brother, Odon (1842-1937), and finally to Odon's son, Robert (1887-1972), who also had no children. The Internet seems to be silent about what happened to it after that.
Consequences of inbreeding
For Henri, life was tragic because he ended up with two copies of one particular recessive allele. The medical profession has been interested in this ever since his death, and much information is therefore available about his condition (eg. Albury & Weisz 2013; Leigh 2013).
Albury & Weisz (2013) note:
The condition from which he probably suffered was first described in 1954 by the French physician Robert Weissman-Netter. It was named pycnodysostosis in 1962 by Marateaux and Lamy and was soon attributed to this artist as the "Toulouse-Lautrec Syndrome" ... Pycnodysostosis is a hereditary autosomal recessive dysplasia caused by an enzyme deficiency, namely of cathepsin K (cysteine protease deficiency in osteoclasts), reducing the normal bone resorption and leaving an incomplete matrix decomposition ... Toulouse Lautrec had a short stature with shortened legs, a large head due to a lack of closure of the fontanellae (which he usually covered with a hat), a shortened mandible with an obtuse angle (covered with a thick beard), dental deformities that required several surgical interventions, a large tongue, thick lips, profuse salivation, and a sinus obstruction with post-nasal drip. With fractures of the long bones during childhood, later on of the clavicle, with progressive hearing problems and cranio-facial deformities, Lautrec’s condition would complete the diagnosis of pycnodysostosis.
It seems to be widely recognized that Henri threw himself into his art at least partly to compensate for the psychological damage produced by his physical condition (he also became an alcoholic). As Leigh (2013) notes, his mother's side of the family had money (his father's side had a title but little money), and so Henri was financially free to do what he liked. He worked at a prodigious rate, and produced a life-time's worth of art in just 15 years, perhaps most famously his flamboyant lithograph posters (still as popular today as they were in his own time), but also oil paintings, watercolours, sculptures, ceramics and stained glass. He died at his mother's Château Malromé at age 36, after a stroke, but ultimately probably from tuberculosis (Albury & Weisz 2013).
Further inbreeding in the family
I noted in my previous post about Charles Darwin that, not only did he marry his cousin, his own sister married his wife's brother, thus literally keeping things in the family. In Henri Toulouse-Lautrec's case, the same thing happened: his paternal aunt married his maternal uncle, as shown in the next figure. This pedigree shows some more information about Henri's closest relatives, emphasizing the pair of consanguineous marriages.
There are 14 people shown in Henri's generation, all born to first-cousin marriages. (There may have been two more children in the Alix–Amédée marriage, but I have been unable to find any direct reference to them.) Of these people, six seem to have had disabilities similar to Henri's: Henri himself; his brother, who died the day before his first birthday; Madeleine, who died as a teenager; Geneviève; Béatrix; and Fides. The latter was so small that apparently she lived her entire life in a baby carriage (Rosenhek 2009). The photo below shows Henri with most of the Tapié de Céleyran family. It was taken in the summer of 1896 at Château du Bosc, where Henri had been born.
The two elderly women in the middle are Gabrielle (left) and Louise (right), the maternal and paternal grandmothers (they were sisters, remember). The father, Amédée, is at the rear centre (sticking his tongue out at the photographer), and the mother, Alix, is standing at the far right. Standing next to her is the oldest son, Raoul; and his wife, Elisabeth, is seated at the far left. The next two sons, Gabriel and Odon, are absent, along with their wives. The next son, Emmanuel, is standing at the back left; and his wife, Marie-Thérèse, is seated next to the pram (middle right). The youngest sons are sitting on the ground at the front centre, with Alexis on the left and Olivier on the right. The first-born daughter, Madeleine, was already dead when the photo was taken. The next three daughters are sitting at the middle left, with Germaine sitting on Elisabeth's lap, Geneviève in front of her, and then Marie seated on the ground. Béatrix is at the middle right, sitting next to Marie-Thérèse, and Fides is in her pram. Henri himself is seated on the ground at the far left. His brother, Richard, had also died before the photo was taken. The remaining four people (standing either side of Amédée) are other relatives.
Nevertheless, this large family did manage to survive the effects of inbreeding, unlike Henri's own family. At least seven of the children survived to have children of their own (~19 grand-children):
Elisabeth DAUDÉ de LAVALETTE (1870-1956)
Anne de TOULOUSE-LAUTREC (1873-1944)
Marguerite TAILLEFER de LAPORTALIÈRE (1878-1958)
Marie-Thérèse des CORDES
Alexandre d'ANSELME (1876-1912)
Adrien de RODAT d'OLEMPS (1806-1884)
Anne Marie de MALVIN de MONTAZET (1885-1974)
4 children
Note that Gabriel and Anne were third cousins, since they had great-grand-fathers who were brothers; nevertheless, they had 3 female children, at least one of whom also had 3 children. One of Alexis' sons (ie. Henri's second cousin once removed) was the well-known art critic Michel Tapié de Céleyran (1909-1987), who married and had seven children, two of whom died in infancy.
Inbreeding increases the probability that recessive alleles will be expressed, but it does not make this inevitable. In Henri's case, two disabled children in succession seems to have dissuaded his parents, and they separated, whereas his aunt and uncle had a healthy child the second time, and so they continued producing a family. However, these days it is not recommended that you marry any of your first cousins.
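The phrase "increases the probability, but does not make this inevitable" can be put into numbers with the standard population-genetics formula for a recessive allele of frequency q in a person with inbreeding coefficient F: the probability of carrying two copies is q² + F·q·(1 − q). The following Python sketch is illustrative only. The value F = 1/16 is the textbook inbreeding coefficient for the child of first cousins, while the allele frequency of 0.001 is a made-up value, not an estimate for any real allele.

```python
def homozygous_recessive_probability(q: float, inbreeding_f: float) -> float:
    """P(two copies of a recessive allele) = q^2 + F * q * (1 - q)."""
    if not (0.0 <= q <= 1.0 and 0.0 <= inbreeding_f <= 1.0):
        raise ValueError("q and F must both lie between 0 and 1")
    return q * q + inbreeding_f * q * (1.0 - q)


F_FIRST_COUSINS = 1 / 16   # inbreeding coefficient for offspring of first cousins
q = 0.001                  # hypothetical allele frequency, for illustration only

outbred = homozygous_recessive_probability(q, 0.0)
inbred = homozygous_recessive_probability(q, F_FIRST_COUSINS)
relative_risk = inbred / outbred

print(f"unrelated parents:    {outbred:.7f}")
print(f"first-cousin parents: {inbred:.7f}")
print(f"relative risk:        {relative_risk:.1f}")
```

Under these assumptions the risk for a child of first cousins is roughly 63 times that for a child of unrelated parents, yet the absolute probability is still far below 1. The rarer the allele, the larger this relative increase becomes.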
Evolution is about biodiversity at all hierarchical levels, not just between or within species, but within individuals as well. Average intra-individual genetic diversity reaches a maximum when the ancestry is tree-like, and reduces with each instance of inbreeding, which turns the tree into a network of increasingly greater complexity.
I have discussed an even more extreme example of consanguinity in a previous post (Family trees, pedigrees and hybridization networks), in which the inbreeding became so severe that the royal family lineage actually came to an end.
Albury WR, Weisz GM (2013) Toulouse-Lautrec and medicine: a triumph over infirmity. Hektoen International 5: 3.
Dupic S. (2012) Toulouse-Lautrec - Généalogie 87 le site de référence de la généalogie de la haute-vienne.
Leigh FW (2013) Henri Marie Raymond de Toulouse-Lautrec-Montfa (1864-1901): artistic genius and medical curiosity. Journal of Medical Biography 21: 19-25.
Rosenhek J (2009) Picture imperfect: tiny Henri de Toulouse-Lautrec’s talent – and troubles – were larger than life. Doctor's Review Oct 2009.
November 19, 2013
Bioinformatics as a term dates back to the 1970s, usually credited to Paulien Hogeweg, of the Bioinformatics group at Utrecht University, in The Netherlands, although it apparently did not make it into print until 1988 (Paulien Hogeweg. 1988. MIRROR beyond MIRROR, puddles of Life. In: Artificial Life, C. Langton, ed. Addison Wesley, pp. 297-315.).
In the 1990s the field expanded rapidly and became recognized as a discipline of its own, as a subset of computational science. However, Christos A. Ouzounis (2012. Rise and demise of bioinformatics? Promise and progress. PLoS Computational Biology 8: e1002487) has noted a distinct decrease in the use of the term itself, as shown by this graph.
Ouzounis recognizes three (admittedly artificial) periods in the history: Infancy (1996-2001), Adolescence (2002-2006) and Adulthood (2007-2011). Along the way, the practice of bioinformatics has received a lot of criticism. I have noted some of this before, in previous blog posts:
Archiving of bioinformatics software
What is perhaps most important is that much of this criticism comes from bioinformaticians themselves, rather than from biologists. Moreover, this criticism does not seem to have had much effect on how bioinformatics is practiced, given the length of time over which it has been made.
For example, Carole Goble (2007. The seven deadly sins of bioinformatics. Keynote talk at the Bioinformatics Open Source Conference Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007) produced this list of what she called "intractable problems in bioinformatics":
1. Parochialism and insularity.
3. Autonomy or death!
4. Vanity: pride and narcissism.
5. Monolith megalomania.
6. Scientific method sloth.
7. Instant gratification.
More recently, Manuel Corpas, Segun Fatumo & Reinhard Schneider (2012. How not to be a bioinformatician. Source Code for Biology and Medicine 7: 3) pointed out what they call "a series of disastrous practices in the bioinformatics field", which look very similar:
1. Stay low level at every level.
2. Be open source without being open.
3. Make tools that make no sense to biologists.
4. Do not provide a graphical user interface: command line is always more effective.
5. Make sure the output of your application is unreadable, unparseable and does not comply to any known standards.
6. Be unreachable and isolated.
7. Never maintain your databases, web services or any information that you may provide at any time.
8. Blindly believe in the predictions given, P-values or statistics.
9. Do not ever share your results and do not reuse.
10. Make your algorithm or analysis method irreproducible.
You can peruse the originals to check out the details of these problems, and whether they sound uncomfortably familiar.
November 17, 2013
Native speakers of any language will judge the "difficulty" of another language by how much it differs from their own. For example, the Foreign Service Institute (FSI) of the U.S. Department of State lists five categories of increasing time taken for native English speakers to acquire "General Professional Proficiency" in other languages. This refers to an average, of course, and anyone may personally find one language or another easier or more difficult than others.
FSI Category I (the least time needed) includes most of the Germanic and Romance languages, since English was originally a Germanic language that received a huge Romance input after the Normans turned up in Britain in 1066. The exception is German itself, which is alone in Category II (needing longer), because of its more complex grammar. Category V (the longest time needed for proficiency) consists of Arabic, Cantonese, Japanese, Korean and Mandarin, with Japanese being considered the most difficult.
Most languages are in Category IV, including the rest of the Indo-European languages. The recognizably tougher ones in that group are the Uralic languages (Estonian, Finnish and Hungarian), because of their countless noun cases. Interestingly, Category III (easier than IV) consists of Indonesian, Malaysian and Swahili, which have no known historical connection to English — they just happen to have fewer linguistic differences than do the other languages.
And that is the point of this post — linguistic similarities don't necessarily reflect the evolutionary history of the languages. There are trees allegedly showing the genealogy of languages, because there is vertical transfer of information in the history of languages (generation to generation), but horizontal transfer has also been a powerful evolutionary force, as cultures come in contact with each other. The history of English, as noted above, shows both vertical (Germanic) and horizontal (Romance) influences. Language history is a reticulating network, not an evolutionary tree.
Just as importantly, though, languages can have coincidental similarities. There are, after all, not that many different ways of constructing a language, and there are reported to be ~6,900 distinct languages on this planet. So, chance similarities must abound — what in biology we would call parallelisms and convergences. This makes constructing the evolutionary history of languages difficult.
The complexity created by coincidences has led some people to wonder about how "unusual" any one language might be. This can be defined as how many of its characteristics occur commonly in other languages, and how many of them occur more rarely. The most unusual languages will be those that have lots of the rare features; and we might call them linguistic outliers. The Idibon blog has already had a look at this topic (The weirdest languages), and here I reconsider their data in the light of a phylogenetic network.
The original data come from the World Atlas of Language Structures, which describes itself as "a large database of structural (phonological, grammatical, lexical) properties of languages gathered by a team of 55 authors". There are apparently 2,676 different languages in the database, coded for 192 linguistic features. Sadly, the database is very sparse, so that most languages have not yet been coded for most of the features (there are 5–1,519 languages coded for each feature).
So, the Idibon people selected a subset of the data: 1,693 languages and 21 features. These features were chosen to be an uncorrelated subset of those 165 features that have at least 100 languages coded; and the selected languages each have at least 10 features coded.
The features are certainly an eclectic collection, which you can read about on the WALS site:
Order of Object and Verb
Order of Adjective and Noun
Order of Negative Morpheme and Verb
Minor Morphological Means of Signaling Negation
Position of Tense-Aspect Affixes
Position of Pronominal Possessive Affixes
Expression of Pronominal Subjects
Hand and Arm
Finger and Hand
Gender Distinctions in Independent Personal Pronouns
Fixed Stress Locations
The Velar Nasal
Nonperiphrastic Causative Constructions
Nominal and Verbal Conjunction
'Want' Complement Subjects
Presence of Uncommon Consonants
From the subset of languages, I chose all of those languages with at least 12 of these features coded, plus Icelandic (10 features), and Cornish and Gaelic (Scots) (11 features).
I then tried to fill in some of the missing data, to get as many languages as easily possible up to having 14 features coded (i.e. two-thirds of the features). For the phonology features (6A, 9A, 19A), the relevant information can be looked up on the web, particularly in Wikipedia and the Native American Language Net. For the word features (129A, 130A), I used the LEXILOGOS Online Translation.
In the process, I found that Idibon has at least one feature mis-coded compared to the WALS web site: for feature 14A, some of the languages that should be coded "Second" have been coded as "Antepenultimate", and all of the others that should be coded "Second" have missing data.
I also found a few contradictions between the WALS coding and the information elsewhere on the web. In some of these cases I re-coded the WALS data.
My final spreadsheet is available online. There are 280 languages coded for at least 14 of the 21 features, compared to 239 such languages in the Idibon analysis. Overall, 19% of the data are still missing, varying from 0–53% across the 21 features.
My network is intended as an exploratory data analysis, rather than some attempt at an evolutionary diagram. Thus, the network simply displays the apparent similarity among the languages. That is, languages that are closely connected in the network are similar to each other based on their linguistic features, and those that are further apart are progressively more different from each other.
First, I recoded the multivariate linguistic data as 59 binary characters. Then the similarity among the 280 languages was calculated for each pair of languages using the Gower similarity index, which can accommodate missing data (by ignoring features that are missing for each pairwise comparison). A Neighbor-net analysis was then used to display the between-language similarities as a phylogenetic network.
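The similarity calculation can be illustrated with a minimal Python sketch of a Gower-type index for binary characters. It assumes that missing values are coded as None and that a character counts only when it is coded in both languages (simple matching, so shared absences count as matches). The function name and the toy data are hypothetical, and the original analysis may have treated the characters differently.

```python
from typing import Optional, Sequence

def gower_similarity(a: Sequence[Optional[int]],
                     b: Sequence[Optional[int]]) -> Optional[float]:
    """Gower-type similarity for binary characters, skipping missing values.

    Assumption: None marks a missing value, and a character contributes to
    the comparison only when it is coded in both languages.
    """
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # ignore characters missing in either language
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return None  # no shared characters, so the similarity is undefined
    return matches / compared

# Hypothetical toy data: three languages, five binary characters each.
lang_a = [1, 0, 1, None, 0]
lang_b = [1, 1, 1, 0, None]
lang_c = [0, 0, None, 0, 0]

similarity_ab = gower_similarity(lang_a, lang_b)
distance_ab = 1 - similarity_ab  # a dissimilarity that a network method could take as input
```

Because each pair of languages is compared over a different subset of characters, pairs with few shared characters produce less reliable similarities than pairs with many.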
The network is not very tree-like, is it? A few tentative groups can be recognized, as indicated by my colouring, but that is all. These groups do not correspond to any known language groups, meaning that the language features chosen do not reveal a traditional tree-like genealogy. Whether this reflects horizontal transfer of linguistic features, coincidence, or simply inadequate data, is not necessarily clear.
However, it seems most likely that much of the complexity represents coincidence. In the study of language evolution, parallelism and convergence are not nuisances, which is the way they are treated when constructing phylogenies of organisms. Coincidental similarities are a fundamental part of language history, but they are not necessarily the product of processes like natural selection, as they often are in biology.
If we look at some of the details, the nature of the complexity becomes clearer, as shown in the next figure. Here, I have colour-coded the Indo-European family of languages by their so-called "genus", plus the other languages that occur in Europe (the Uralic group, and Basque):
Albanian - pale brown
Armenian - dark brown
Baltic - orange
Celtic - pale blue
Germanic - black
Greek - pale green
Indic - pink
Iranian - blue
Romance - purple
Slavic - green
Uralic - red
Basque - grey
Note that the seven Germanic languages are clustered in a single location, as are the two Baltic languages. The others appear in either two (Celtic, Romance, Iranian) or four (Indic, Slavic, Uralic) locations. This implies considerable linguistic variation within most of what are considered to be closely related languages (that is why they are called language genera). A larger collection of features might change the pattern, of course, but I still reckon that there is a large component of non-vertical transmission here. This is either coincidence or horizontal transmission. For the Indo-European languages, the latter is perhaps quite likely; but it is equally likely that it is simply coincidence, even at this relatively fine scale.
The weirdest languages
The Idibon blog tried to reduce the multivariate data down to a single number for each language (scaled 0–1), representing its "weirdness" in terms of how many uncommon features it has. So, I have performed the same calculation for my expanded dataset.
The complete list is in the spreadsheet, but here are the top and bottom most-unusual languages:
[Table: top 20 and bottom 20 languages by weirdness score. Only a fragment survives: "Diegueño (Mesa Grande)" and the score 0.8247.]
My results differ from those of the Idibon blog for two reasons: more languages, and more data for some of the languages. Some of my added languages make it to the top of the weirdness list, including Seri, Danish and Swedish; and some of the other languages considerably change their score. For example, Hebrew, Welsh, Portuguese and Chechen are now near the top of the list, and Quechua, Basque, Saami and Cornish are no longer near the bottom. All of the big changes are increases in weirdness, suggesting that the missing data are important for this calculation.
Nevertheless, it is worth noting that five of the seven Germanic languages are in the top 15 (plus English is at 40 and Icelandic 47). Unusually, most of the Germanic languages still use cases (modifications to words that show how they relate to other words in a sentence). This means that you have to memorize a lot of different versions of each noun, just as you do in Latin. Moreover, these languages change the word order when asking a question as opposed to making a statement, whereas most languages add a particle instead. (In the most unusual language, Mixtec, a native language from Mexico, there is apparently no difference between a question and statement!)
English has a lower score than other Germanic languages presumably because of the French influence mentioned above (French is ranked 42). For example, English now has very few of the cases found in the other Germanic languages (only for some pronouns), and instead it uses a fairly strict word order to express grammatical relationships. (You will note that two of the English-speaking authors of this blog now live in countries with other Germanic languages, and so we know just how big a pain it is to learn illogical case endings.)
English does have one really odd feature, though, which is the use of the sound "th" (which is part of feature 19A). There are two forms of this sound, voiced (as in "the") and unvoiced (as in "thing"). These sounds do not exist in most languages, and they are rare even among the other Indo-European languages. That is why you often hear non-native speakers say "dis" and "zis" instead of "this" — "th" is a sound that they have no experience making.
Actually, the Indo-European languages are very diverse in their weirdness. Many of them are at the top of the list, but there are also some at the bottom, including Hindi which is dead last. Notably, three of the Romance languages are at the top (Spanish, Portuguese, French) and two are at the bottom (Romanian, Italian). This seems unlikely, given the overall similarity of Spanish and Italian, for example; and so it probably reflects the specific choice of linguistic features.
The data are also potentially sensitive to some of the feature coding. One notable example is for feature 19A in Arabic. WALS codes Arabic as having pharyngeals but not "th", while Wikipedia says that the pharyngeals are doubtful, but that Arabic has "th". So, the possible codings of Arabic, and their resulting weirdness, are:
[Table: codings of Arabic for feature 19A ("Th" sounds only; Pharyngeals and "th") with their weirdness scores. The only surviving score is 0.9245.]
So, this feature alone can potentially change Arabic from "normal" to "very weird", depending on how it is coded.
Languages do not have a tree-like evolutionary history. Even the relatively small dataset presented here seems to show the influence of horizontal evolution. But, more importantly, we should not underestimate the coincidental occurrence of language features (parallelism and convergence). These have usually been treated as a nuisance in phylogenetic studies of organisms, but they are likely to be important for the study of languages. I have discussed this further in a previous post (False analogies between anthropology and biology).
November 12, 2013
I have noted before that taxonomic groups that are represented in any tree-like parts of a phylogeny can be considered to be monophyletic, but those that consist of hybrids cannot, unless we hypothesize a single hybrid origin for each group (How should we treat hybrids in a taxonomic scheme?). This issue arises from the concept that monophyletic groups must share an exclusive Most Recent Common Ancestor (MRCA), and this concept is not straightforward for a network compared to a tree.
This topic has been tackled mathematically a couple of times (see Huson and Rupp 2008; Fischer and Huson 2010), resulting in the recognition that for a network there are three main types of MRCA: conservative MRCA (or stable MRCA), Lowest Common Ancestor (or minimal common ancestor), and Fuzzy MRCA (see Networks and most recent common ancestors). These have definitions based on the Least Upper Bound and Greatest Lower Bound of mathematical lattices.
Unfortunately, there has been very little discussion of the topic in the biological literature. However, recently Wheeler (2013) has made a start. There is no reference to the mathematical work on MRCAs, but he considers what to do about the concepts of monophyly, paraphyly and polyphyly with respect to networks.
Basically, he suggests three new types of phyletic group: periphyletic, epiphyletic, and anaphyletic. He provides algorithmic definitions of these groups, relating them to the previous algorithmic definitions of monophyly, paraphyly and polyphyly. These new types concern groups that are monophyletic on a tree, but have additional gains or losses of members from network edges — that is, they lie somewhere between monophyletic and paraphyletic.
For example, an epiphyletic group would be one that is otherwise monophyletic but also contains one or more hybrids that have one of their parents from outside the group, while a periphyletic group would be monophyletic but has contributed as a parent to at least one hybrid outside the group. An anaphyletic group would have done both of these things. For clarification, Wheeler provides the following empirical example, based on Indo-European languages (where English is recognized as a "hybrid" of Germanic and Romance languages).
Reproduced from Wheeler (2013).
In terms of MRCA, it seems to me that all three new group types use the Lowest Common Ancestor model, which is the shared ancestor that is furthest from the root along any path (i.e. the LCA is not an ancestor of any other common ancestor of the taxa concerned). However, this is only clear when we consider hybrids, in which the two (or more) parents contribute equally to the hybrid offspring. When dealing with introgression or horizontal gene transfer, where the parentage is unequal, we approach the Fuzzy MRCA model, in which only a specified proportion of the paths (representing some proportion of the genomes) needs to be accommodated by the MRCA, thus keeping the MRCA close to the main collection of descendants.
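The Lowest Common Ancestor definition given above can be sketched in a few lines of Python. This is only an illustration: the toy network, its node names, and the function names are hypothetical, with English coded as a hybrid of a Germanic and a Romance parent in the spirit of Wheeler's example rather than as a copy of his figure.

```python
from typing import Dict, List, Set

# Hypothetical toy network: each node maps to its list of parents.
# "English" has two parents, making it a hybrid (reticulation) node.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "Germanic": ["root"],
    "Romance": ["root"],
    "German": ["Germanic"],
    "French": ["Romance"],
    "English": ["Germanic", "Romance"],
}

def ancestors_or_self(node: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the node together with all of its ancestors in the network."""
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def lowest_common_ancestors(taxa: List[str],
                            parents: Dict[str, List[str]]) -> Set[str]:
    """Common ancestors that are not an ancestor of any other common ancestor."""
    common = set.intersection(*(ancestors_or_self(t, parents) for t in taxa))
    lowest: Set[str] = set()
    for candidate in common:
        is_above_another = any(
            candidate != other and candidate in ancestors_or_self(other, parents)
            for other in common
        )
        if not is_above_another:
            lowest.add(candidate)
    return lowest

print(lowest_common_ancestors(["German", "English"], PARENTS))  # {'Germanic'}
print(lowest_common_ancestors(["German", "French"], PARENTS))   # {'root'}
```

The function returns a set because, unlike in a tree, a network can have more than one lowest common ancestor for the same group of taxa (for example, two hybrids that share both parents).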
What is not yet clear is whether we would want to recognize any of these new group types in a taxonomic scheme. I guess that this is something that the PhyloCode will have to think about, since it is based strictly on clades (although they are allowed to overlap).
Fischer J, Huson DH (2010) New common ancestor problems in trees and directed acyclic graphs. Information Processing Letters 110: 331–335.
Huson DH, Rupp R (2008) Summarizing multiple gene trees using cluster networks. Lecture Notes in Bioinformatics 5251: 296–305.
Wheeler WC (2013) Phyletic groups on networks. Cladistics (online early).
November 10, 2013
One common problem when presenting results using a data-display network is the complexity of the relationships among the samples, especially when there is a large number of them. It is often the case that the relationships among closely related samples are impossible to see clearly.
A recent paper (El Baidouri F, Diancourt L, Berry V, Chevenet F, Pratlong F, Marty P, Ravel C (2013) Genetic structure and evolution of the Leishmania genus in Africa and Eurasia: what does MLSA tell us. PLoS Neglected Tropical Diseases 7: e2255) presents an interesting solution to this problem. Basically, the idea is to present a series of graphs, with the main graph showing the overall relationships and a collection of small graphs showing the details of different parts of the network.
This takes longer, of course, as it involves doing a series of analyses, one for each subset of the data, but this is easy enough to do in programs like SplitsTree. It seems to be an idea worth considering.
November 5, 2013
The following two problems will be familiar to researchers working on evolutionary phylogenetic networks.
1) The severe computational intractability associated with globally optimizing most objective functions over the space of phylogenetic networks.
2) The fact that within the space of potential solutions, there are typically very many that an end-user biologist will want to exclude from consideration, for context-specific biological reasons that the software does not know about. This hidden information often only becomes available at the end of the analysis. It is not unusual to receive comments such as: "Thanks for the networks but they can't be good, because experimentalists strongly believe that taxon X is a hybrid of taxa Y and Z, and we also think that taxon group C should be monophyletic ... this is not visible in your networks."
In a recent opinion piece added to the arXiv ("Fighting network space: it is time for an SQL-type language to filter phylogenetic networks"), David, Simone Linz and I pose the question of whether it might be possible to address both these problems at the same time, using constraint-based modelling. The core of the idea is that, via some kind of comparatively easy-to-use modelling language (e.g. something with an SQL flavour), the end-user biologist should be able to specify characteristics that all candidate solutions must (or must not) have.
The win-win scenario would be that this (a) tempers the intractability of the search problem, by cutting out large swathes of irrelevant networks in the vast search space and (b) invites biologists to incorporate their context-specific knowledge "upstream", reducing the risk that the networks generated by the software are mis-interpreted. In the context of phylogenetic trees, the idea is not new: in 1986 Constantinescu and Sankoff showed that the use of a constrained tree indeed reduces the search space remarkably.
It seems a natural idea to do this for networks, but the question of course is how feasible all this is. Constraint-based pruning of intractable search spaces is seductive but technically challenging for all kinds of reasons. Depending on the constraints used it might help a lot or a little, and it is certainly no silver bullet. We might nevertheless hope that in many cases end-user biologists have so much implicit knowledge that the search space is massively shrunk. The question of the modelling language is also tricky because we need to decide upon a set of network constraints that biologists want and need: the dominant topological feature of trees, the clade, is no longer sufficient to describe (or constrain) the topologically richer space of phylogenetic networks. Furthermore, the constraints themselves should not become a new source of intractability.
In the opinion piece we make a few basic suggestions for atomic network constraints and how they might be combined via an SQL-style language. This, of course, is only the starting point for what we hope will be a wider discussion.
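To give a feel for the idea (without reproducing the syntax proposed in the opinion piece, which is not shown here), the following Python sketch treats each atomic constraint as a yes/no test on a network and combines the tests with a logical AND, much like the WHERE clause of an SQL query. All names, the network representation, and the two constraints are hypothetical simplifications.

```python
from typing import Callable, Dict, FrozenSet, List, Set

Network = Dict[str, List[str]]          # node -> list of parent nodes
Constraint = Callable[[Network], bool]  # an atomic yes/no test on a network

def ancestors(node: str, net: Network) -> Set[str]:
    """All proper ancestors of a node."""
    found: Set[str] = set()
    stack = list(net[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(net[current])
    return found

def is_hybrid_of(x: str, y: str, z: str) -> Constraint:
    """x has exactly two parents, one on the lineage of y and one on that of z."""
    def check(net: Network) -> bool:
        if len(net[x]) != 2:
            return False
        p, q = net[x]
        anc_y, anc_z = ancestors(y, net), ancestors(z, net)
        return (p in anc_y and q in anc_z) or (p in anc_z and q in anc_y)
    return check

def has_cluster(taxa: FrozenSet[str]) -> Constraint:
    """Some node has exactly this set of leaves below it (a clade-like group)."""
    def check(net: Network) -> bool:
        leaves = set(net) - {p for ps in net.values() for p in ps}
        below: Dict[str, Set[str]] = {node: set() for node in net}
        for leaf in leaves:
            for anc in ancestors(leaf, net):
                below[anc].add(leaf)
        return any(group == taxa for group in below.values())
    return check

def select(networks: List[Network], where: List[Constraint]) -> List[Network]:
    """Keep the networks that satisfy every constraint (WHERE ... AND ...)."""
    return [net for net in networks if all(test(net) for test in where)]

# Two hypothetical candidate solutions on the taxa X, Y and Z.
tree: Network = {"root": [], "a": ["root"], "Z": ["root"], "X": ["a"], "Y": ["a"]}
hybrid: Network = {"root": [], "a": ["root"], "b": ["root"],
                   "Y": ["a"], "Z": ["b"], "X": ["a", "b"]}

kept = select([tree, hybrid], where=[is_hybrid_of("X", "Y", "Z")])
```

Note that this sketch only filters candidates after they have been generated. The harder goal discussed above is to push the constraints into the search itself, so that excluded networks are never constructed.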
We're very keen to hear your thoughts about this!
November 3, 2013
This week, for Monday we have a phylogenetic tree constructed using examples of the organisms whose relationships are being represented.
In 1867, Franz Martin Hilgendorf published the tree shown in the first figure, which is illustrated with pictures of the fossil snails being discussed. This may well have been the first time that this form of illustrated phylogeny was produced.
From Hilgendorf (1867).
In 2000, Hilgendorf's papers and original fossil materials were re-discovered in the Palaeontological Collections of the Natural History Museum, Berlin, to which they had been donated by Hilgendorf's heirs. Among these was a series of cards to which snails had been glued, illustrating the morphological transitions within and between the taxa, as described in the original paper. One of these cards illustrates the phylogeny, as shown in the next figure.
From Glaubrecht (2012).
This is not the only example of this type produced by Hilgendorf. Another one had previously been discovered in the State Museum of Natural History, Stuttgart, as shown in the final figure. This copy was apparently produced much later.
From Rasser (2006).
Glaubrecht M. (2012) Franz Hilgendorf's dissertation "Beiträge zur Kenntnis des Süßwasserkalks von Steinheim" from 1863: transcription and description of the first Darwinian interpretation of transmutation. Zoosystematics & Evolution 88: 231-259.
Hilgendorf F. (1867) Über Planorbis multiformis im Steinheimer Süsswasserkalk. Monatsberichte der Königliche Preussischen Akademie der Wissenschaften zu Berlin 1866: 474-504.
Rasser M.W. (2006) 140 Jahre Steinheimer Schnecken-Stammbaum: der älteste fossile Stammbaum aus heutiger Sicht. Geologica et Palaeontologica 40: 195-199.
October 29, 2013
I have recently been doing a course (along with a bunch of postgraduate students) on Massively Parallel Sequencing, also known as Next Generation Sequencing (NGS). This was a partially successful attempt to teach an old dog some new tricks. More to the point, it has prompted me to think about NGS in relation to phylogenetic networks. Most of the published discussions have focussed on trees, rather than networks.
NGS can potentially provide a fast and cost-effective means of generating multilocus sequence data for phylogenetics (Rannala & Yang 2008; McCormack et al. 2013; Moriarty Lemmon & Lemmon 2013). Unfortunately, the cost for the number of samples typically employed in phylogenetics is currently still beyond the reach of most researchers.
NGS and phylogenetics
Nevertheless, we are sometimes told things like: "The fields of phylogenetics and phylogeography are on the cusp of a revolution, enabled by the rapid expansion of genomic resources and explosion of new genome sequencing technologies." This is probably over-stating the case, as noted by McCormack et al. (2013):
Despite this obvious potential, NGS has been slow to take root in phylogeography and phylogenetics compared to other fields like metagenomics and disease genetics. We suggest that this lag has been caused by four specific aspects of phylogeographic and phylogenetic research: the predominant focus on non-model organisms, the need for sequencing large numbers of samples per species, the lack of consensus regarding library preparation protocols for particular research questions, and the transitional state of the technology (whole-genome data are still neither cost-effective, nor even desirable for phylogeography and phylogenetics, but are paradoxically easier to collect).
Another issue is the historical importance of utilizing gene trees in phylogeography and phylogenetics. Gene trees are most robustly inferred from loci with high information content, for example, a non-recombining locus containing a series of linked SNPs. Individual SNPs, on the other hand, have low information content on a per-locus basis and have been used predominately with classification methods such as Structure and Principal components analysis ... While distance-based genealogies and phylogenies can be built from unlinked SNPs, this ignores models of molecular substitution and probabilistic tree-searching algorithms that have led to more robust phylogenetic inference in the last several decades.
Furthermore, no-one has yet shown that many of the questions currently being asked by phylogeneticists will actually benefit from genomic data. We may well be able to answer some new questions, but that is quite a different thing from a revolution. The essence here is that in science the questions must come first. Collecting data for the sake of it is usually unproductive. So, we need a clear demonstration that genomics is actually needed in phylogenetics (as opposed to other disciplines, where it may indeed be very useful).
If increased volume of data will solve a phylogenetic problem then that is good, but there is no necessary reason to expect that it will happen. Statistically, the extra data can lead to improved precision but not necessarily improved accuracy. In science, targeted data collection has always been the most productive approach to any clearly stated experimental question.
For example, the estimated relationships among humans, chimpanzees, and gorillas did not change as a result of genome sampling (Galtier and Daubin 2008), nor did those of malaria species (Kuo et al. 2008), nor those of mammal superorders (Hallström and Janke 2010). (I have discussed the mammal example in a previous blog post: Why are there conflicting placental roots?). In all three cases, the relationships were just as complex after the genome sequencing as before — the resolution of controversial branches in our trees did not occur as a result of increased access to character data.
In this sense, a small sample of representative gene sequences should reveal just as much of the genealogical truth as will a genome-wide sample. A more recent empirical example is presented by O'Neill et al. (2013), who found that including less informative loci added so much noise to the phylogenetic signal that the analysis eventually broke down. The issue here is that as data volume increases so does the potential occurrence of systematic bias due to model mis-specification.
This sort of problem can easily be visualized using phylogenetic networks, in which genome-scale data frequently produce unresolved bushes rather than tree-like phylogenies. I have provided a couple of examples in a previous post (When is there support for a large phylogeny?). Another example is provided by Beiko (2011), which I have reproduced below.
This all suggests that we will need to think carefully about how to apply phylogenetic networks to genome-scale data. Much of the lack of resolution may very well come from the nature of NGS, rather than from the actual evolutionary history.
NGS and networks
There are a number of potential problems with NGS. These may not matter so much for tree-building algorithms, but it is a different matter for networks.
 Increased homoplasy due to sequencing errors
An error rate of even 0.01% is considered good in NGS (e.g. Roche 454: 1%; Illumina HiSeq: 0.1%; Life SOLiD: 0.01%), but when this is extrapolated to the genome scale it results in thousands of errors. Networks are sensitive to this magnitude of stochastic error. Indeed, I have already written about the use of phylogenetic networks specifically to identify data errors (Checking data errors with phylogenetic networks).
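The extrapolation is simple arithmetic: the expected number of erroneous base calls is the per-base error rate multiplied by the number of bases sequenced. The short Python sketch below uses the rates quoted above. The figure of 100 million bases is a hypothetical amount of sequence chosen for illustration, and the calculation assumes independent errors with no correction from overlapping reads.

```python
# Per-base error rates quoted in the text, as fractions rather than percentages.
ERROR_RATES = {"Roche 454": 0.01, "Illumina HiSeq": 0.001, "Life SOLiD": 0.0001}

def expected_errors(error_rate: float, bases_sequenced: int) -> float:
    """Expected number of erroneous base calls, assuming independent errors."""
    return error_rate * bases_sequenced

# Hypothetical example: 100 million bases of sequence.
BASES = 100_000_000
for platform, rate in ERROR_RATES.items():
    print(f"{platform}: about {expected_errors(rate, BASES):,.0f} expected errors")
```

Even at the lowest of these rates, 0.01% of 100 million bases is on the order of ten thousand expected errors.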
 Increased homoplasy due to intra-gene processes
These include substitutions, deletions, duplications (especially tandem repeats), inversions, and translocations. These processes can potentially reveal evolutionary history, but we have little idea about how best to process the data in a way that will reveal that history. Currently, we deal with this by lumping most of the processes together as "indels".
 Increased homoplasy due to inter-gene processes
The most common processes known to confound attempts to identify reticulate evolution are incomplete lineage sorting and gene duplication–loss. There are several methods available for addressing these issues in the context of estimating phylogenetic trees, but their applicability to networks is still being assessed.
 Increased homoplasy in non-coding regions
Sanger-style sequencing is usually targeted towards gene-coding regions or their introns, but genome-scale data can include what is currently called "junk DNA". The evolutionary processes in these regions are unknown, as is their applicability to phylogenetic analysis.
 Inadequacies due to data-processing methods
The analysis of NGS data is often a black art — each paper seems to provide its own way of processing the data. This has been a cause of concern expressed in the literature (e.g. Check Hayden 2012; Editorial 2012a, 2012b; MacArthur 2012), especially in the light of the poor documentation and archiving of bioinformatics programs. I have discussed this issue in some previous posts (Poor bioinformatics?; Archiving of bioinformatics software). Perhaps the most talked-about problem is ascertainment bias — there is a brief discussion of this at the end of this post.
Network analysis of NGS data
All of this might make the application of networks to phylogenomics problematic in many cases, because we already have enough challenges dealing with the data from Sanger-style sequencing, without having them be orders of magnitude worse. It will therefore be very interesting to see what emerges from the current attempts to apply phylogenetic networks to NGS data.
There have been a few applications of EDA (exploratory data analysis) programs such as SplitsTree, mostly involving bacteria and viruses, and often in the context of detecting recombination. Not all of these studies have produced networks that look bushy, as shown by the example below (from Söderlund et al. 2013). SplitsTree is mostly limited by the number of samples, not by the number of characters, so that genomic data are not a particular analysis issue for algorithms such as neighbor-net. However, you might like to calculate your inter-sample distances outside the program, unless you want the simple p-distance. (Popular genome-scale alternatives include Fst.)
There have also been programs developed for the study of admixture (a.k.a. introgression) in human genomes, such as TreeMix, AdmixTools, and MixMapper, and these might repay wider exploration. I have discussed some of these programs in a previous post (Admixture graphs – evolutionary networks for population biology). Essentially, they first construct a tree and then add reticulations based on various criteria. As is usual with this approach, there is the problem of constructing the initial tree in the presence of reticulation processes, and there seems to be no clear criterion about when to stop adding reticulations — optimization criteria always increase as reticulations are added, so that increasingly complex networks will always be preferred mathematically.
Note — a common data-processing problem
The following explanation of one type of ascertainment bias is adapted from the Fluxus Engineering web site:
For each DNA sample, a large number of short sequences are generated by the NGS sampling. Genomic variants are estimated from the consensus of these NGS sequences, after filtering the sequences for artifacts. Variant lists are never complete: the greater the sequence length, the greater the fraction of the genome that can be sequenced, but there are always uncharted regions which vary from sample to sample. The sampled genome sequences are then compared to a reference genome. NGS software usually reports SNP variants only if they do not match the reference genotype, and if there is sufficient evidence that they are non-reference. Non-reported variants do not necessarily match the reference genotype; they can just as well be sequencing failures, or coverage gaps, or insufficient evidence for a non-reference variant. Networks generated from such data are likely to consist largely of artifacts.
References
Beiko RG (2011) Telling the whole story in a 10,000-genome world. Biology Direct 6: 34.
Check Hayden E (2012) RNA studies under fire. Nature 484: 428.
Editorial (2012a) Must try harder. Nature 483: 509.
Editorial (2012b) Error prone. Nature 487: 406.
Galtier N, Daubin V (2008) Dealing with incongruence in phylogenomic analyses. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences 363: 4023-4029.
Hallström BM, Janke A (2010) Mammalian evolution may not be strictly bifurcating. Molecular Biology & Evolution 27: 2804-2816.
Kuo C-H, Wares JP, Kissinger JC (2008) The Apicomplexan whole-genome phylogeny: an analysis of incongruence among gene trees. Molecular Biology & Evolution 25: 2689-2698.
Moriarty Lemmon E, Lemmon AR (2013) High-throughput genomic data in systematics and phylogenetics. Annual Review of Ecology, Evolution & Systematics 44: 19.1-19.23.
MacArthur D (2012) Face up to false positives. Nature 487: 427-428.
McCormack JE, Hird SM, Zellmer AJ, Carstens BC, Brumfield RT (2013) Applications of next-generation sequencing to phylogeography and phylogenetics. Molecular Phylogenetics and Evolution 66: 526-538.
O'Neill EM, Schwartz R, Bullock CT, Williams JS, Shaffer HB, Aguilar-Miguel X, Parra-Olea G, Weisrock DW (2013) Parallel tagged amplicon sequencing reveals major lineages and phylogenetic structure in the North American tiger salamander (Ambystoma tigrinum) species complex. Molecular Ecology 22: 111-129.
Rannala B, Yang Z (2008) Phylogenetic inference using whole genomes. Annual Review of Genomics and Human Genetics 9: 217-231.
Söderlund R, Jernberg C, Källman C, Hedenström I, Eriksson E, Bongcam-Rudloff E, Aspán A (2013) Rapid whole genome sequencing investigation of a familial outbreak of E. coli O121:H19 with a sheep farm as the suspected source. EMBnet Journal 19 suppl.A: 89-90.
October 27, 2013
The more sports-minded of you will know that Canada and Russia have at least one thing in common — ice hockey. Indeed, Canada dominated the sport at the international level from 1930–1953, and the Soviet Union from 1963–1976, with these two teams being equal rivals during the intervening decade.
The McGill University ice-hockey team in 1881 at the Crystal Palace Rink in Montreal.
Ice hockey is considered to have originated in the eastern parts of Canada, with the first informal rules appearing in 1873. The first organized game of hockey was apparently played on March 3 1875, at the Victoria Skating Rink in Montreal. The first Stanley Cup games were played in 1893; and the National Hockey League (NHL) was formed in 1917.
The first ice hockey games in Europe were played in 1902 at the Prince's Skating Club in Knightsbridge, London. On March 4 1905, Belgium and France played two international games in Brussels. Three years later, the Ligue Internationale de Hockey sur Glace (LIHG) was founded in Paris, with representatives from Belgium, France, Great Britain and Switzerland, and later the same year also from Bohemia (now the Czech Republic). The first LIHG-organized games were played in Berlin, on November 3-5 1908, at which stage Germany also joined.
The 1920 Olympic Summer Games in Antwerp, Belgium, hosted the first international ice hockey tournament with North American participation, and it is from this date that World Championship ice hockey is considered to originate. The first World Championship outside the Olympics took place in 1930, although the Winter Olympics continued to host the Championships until 1972.
The LIHG became the International Ice Hockey Federation (IIHF) in 1954; and it currently has 52 full members, 18 associate members and 2 affiliate members. Only 48 of these members currently compete in the World Championships. It seems worthwhile to explore some of the Championship data, to look at the relative success of the different teams.
There have been 77 World Championships between 1930 and 2013, inclusive. The number of teams participating has varied dramatically, with as few as four, due to financial crises, political boycotts, and disputes over professional versus amateur status of the players. For this reason, I have restricted myself solely to the data concerning medal winners (i.e. the top three teams).
The data are from Wikipedia. I scored Gold, Silver and Bronze medals as 3, 2 and 1 points, respectively, with 0 points for all other participants. So, the network applies only to those 14 teams that have won at least one medal over the years. I have kept the various teams separate, which means that Czechoslovakia appears along with both Slovakia and the Czech Republic, the Soviet Union appears along with Russia, and both Germany and West Germany are listed.
The network analysis method follows what I have previously used for the FIFA World Cup (soccer). The similarity among the 77 scores for each pair of teams was calculated using the Manhattan distance. A Neighbor-net analysis was then used to display the between-team similarities as a phylogenetic network. Thus, teams that are closely connected in the network are similar to each other based on their overall World Championship results, and those that are further apart are progressively more different from each other.
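The scoring and distance steps can be sketched as follows. The team names and results are hypothetical (four championships rather than 77), and the Neighbor-net step itself is not shown, since it is carried out in a separate network program.

```python
# Medal scores as described in the text: Gold 3, Silver 2, Bronze 1, otherwise 0
POINTS = {"G": 3, "S": 2, "B": 1, "-": 0}

# Hypothetical results for four championships (not the real data)
results = {
    "Team A": "GGS-",
    "Team B": "SBGG",
    "Team C": "--B-",
}


def scores(medals):
    """Convert a string of medal codes into a list of point scores."""
    return [POINTS[m] for m in medals]


def manhattan(u, v):
    """Manhattan distance: the sum of absolute differences between two score lists."""
    return sum(abs(a - b) for a, b in zip(u, v))


teams = sorted(results)
vectors = {t: scores(results[t]) for t in teams}
distances = {(s, t): manhattan(vectors[s], vectors[t]) for s in teams for t in teams}

for s in teams:
    print(s, [distances[(s, t)] for t in teams])
```

With the Manhattan distance, two teams are close only if they won similar medals in the same years, not merely a similar number of medals overall.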
The network shows the four most successful teams on the left and the less successful teams on the right.
Canada has won 46 medals over the 77 Championships, the Czech Republic (plus Czechoslovakia) has been involved in 12+34=46 medals, Sweden has won 44 medals, and Russia (plus the Soviet Union) has been involved in 8+34=42 medals. So, these four teams have won 178 of the 231 medals (77%). The next best teams are the United States (17 medals), Finland (11) and Switzerland (10). (Note: Slovakia technically has 4+34=38 medals, but the IIHF officially attributes all Czechoslovakian medals to the Czech Republic alone.)
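The arithmetic behind these totals can be checked with a few lines, using only the medal counts quoted in the paragraph above.

```python
championships = 77
total_medals = championships * 3  # one gold, one silver and one bronze per championship

# Medal counts as quoted in the text
top_four = {
    "Canada": 46,
    "Czech Republic + Czechoslovakia": 12 + 34,
    "Sweden": 44,
    "Russia + Soviet Union": 8 + 34,
}

top_four_total = sum(top_four.values())
share = top_four_total / total_medals

print(top_four_total, total_medals, f"{share:.0%}")
```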
Great Britain won 5 medals in the first 12 Championships, but has won nothing since 1938. The remaining foundation members, Belgium and France, have never won a medal. However, France is still ranked among the 16 teams in the Championship Division, although Great Britain is currently (2013) among the 12 teams in Division I (it was relegated in 1995), and Belgium is among the 12 teams in Division II (relegated in 2005). The other teams currently in the Championship Division that have never won medals are: Denmark, Italy and Norway, plus Belarus, Kazakhstan and Latvia from the former Soviet Union. Austria is the only other medal-winning team not currently in the Championship Division (it was relegated to Division I earlier this year).
The IIHF has provided a World Ranking for 50 of the teams every year since 2003. This provides a more detailed look at the recent history of the various teams (i.e. over the past 11 years). The annual ranking is based on the success of the teams in the previous three World Championships plus the most recent Winter Olympics, with each competition being assigned a set number of points and the teams sharing these points based on their finishing position.
I have analyzed these data in the same way as above, except that the data are the actual ranking points awarded to each team each year. I excluded Armenia, Bosnia & Herzegovina, Georgia, Greece, Mongolia and the United Arab Emirates because they were not ranked in all 11 of the years.
The network shows a simple gradient from the most successful teams at the top-left to the least successful teams at the bottom-right. This network arrangement implies that the relative rankings of the teams are very consistent from year to year.
As before, the same four teams have dominated across the past 11 years as they did for all 77 of the Championships (Canada, Sweden, Russia and the Czech Republic) but now also including Finland. These teams are followed by Slovakia, the United States and Switzerland. Only three of these teams have been ranked first: Canada and Russia in four years each, and Sweden for three years. However, Sweden is the only team to have been ranked in the top four every year. These same eight teams dominate the current IIHF rankings (2013), with a clear points gap between the eighth and ninth ranked teams.
Note that Switzerland should currently be included in the upper echelon, even though the other teams have been referred to as the "Big Seven". Sadly, in the 2013 World Championships Switzerland won every one of their games except the final, even beating the host nation (Sweden) in their first game; but it is a bit hard to beat the Swedes on their home ice twice in one tournament.
October 22, 2013
The term "DNA barcoding" is a metaphor, and like all metaphors it is helpful only to the extent that it provides insight into the topic at hand. The metaphor concerns commercial barcodes, which were developed to provide a means of storing and retrieving information about manufactured products. Once a product exists we can create a barcode that uniquely identifies that product. At any future time we can invert this chain of logic, by reading the barcode and thus retrieving information about the product.
Does this metaphor apply in the biological world? Well, partly. Whenever biological variation is discontinuous then we could treat the delimited entities as analogous to products, and some part of the DNA must be unique and could be used as a unique identifier. However, much biological variation is more or less continuous, and at best delimits fuzzy (i.e. overlapping) clusters rather than discrete entities; so even the theoretical idea that we could know about biodiversity by reading barcodes is not a foregone conclusion.
Just as importantly, however, barcodes apply to one part of the genome, while biodiversity is about whole organisms and their relationships. Barcodes do not apply to either genomes or organisms; they apply to genes. How many barcodes does a genome need before it is uniquely characterized? A product needs only one, but that is because we defined the product first and then applied the barcode to it. But in biology we read the barcode first and then try to work out what it might apply to.
Furthermore, does barcoding a genome also barcode the organisms? Not that we know of. Each organism is a phenotype, which is a genotype interacting with its environment (in the broadest sense). There is much more to biodiversity than merely a collection of genomes. So, even if we do have a DNA barcode, we don't really know what this tells us about biodiversity.
So, a DNA barcode provides information but not necessarily knowledge, whereas a product barcode provides both. Therein lies the major weakness of the metaphor.
DNA barcoding seems to have started as a means of identifying DNA in foodstuffs, and in this application the metaphor seems to have some use, because the weakness does not have much effect. After all, we are mainly trying to identify DNA that is foreign to the alleged ingredients, which merely asks the question: Is there more to this food item than meets the eye? Since the ingredients are all distinct entities, and we know about them beforehand, all we are doing is identifying the entities by examining their barcodes.
However, DNA barcoding is now being used to help create a catalogue of life, which is a completely different thing. In this application, we are trying to delimit entities based on their alleged barcodes — if they have different barcodes then they thus must be different entities. We are counting barcodes but we are not necessarily counting meaningful biological entities. Here, the metaphorical weakness seems like a major handicap, potentially leading to mis-interpretation of what DNA barcoding can and cannot achieve.
DNA barcoding is a viable technology for helping to quantify DNA diversity, which is what it is used for when examining foodstuffs. But the metaphor should not lead us to the conclusion that information about DNA diversity automatically provides much knowledge about biodiversity as a whole. We would end up with a catalogue, but we would not necessarily know what it refers to. This would be a data-base but it would not be a knowledge-base.
What does this have to do with phylogenetic networks? Well, the criterion for defining entities and identifying them based on DNA barcodes is usually a phylogenetic tree. We create a phylogenetic tree of the known barcodes, and the closest barcode in the tree is then used as the best "identification" of any newly discovered barcode. Remember, product barcodes are unique by definition, and we know what they refer to. But DNA barcodes are not unique unless we decide that they are; and we have no prior idea what they refer to. We make both decisions with reference to clades on a phylogenetic tree.
But a phylogenetic tree imposes a hierarchical structure on the data, irrespective of whether there actually is such a structure underlying the data. A phylogenetic network might reveal a very different pattern. In particular, when the data are forced into a tree then many of the shared characters become parallelisms and reversals, whereas the network can actually display them as shared characters.
To illustrate this, we can look at some of the data from the first published paper about DNA barcodes:
Hebert PD, Cywinska A, Ball SL, deWaard JR (2003) Biological identifications through DNA barcodes. Proceedings of the Royal Society of London B: Biological Sciences 270: 313-321.
The authors evaluated the usefulness of cytochrome c oxidase I (COI or Cox1) sequences as a barcode. They analyzed sequences 223 amino-acids long from 100 members of the Bilateria. The original analysis was based on Poisson-corrected p-distances and the Neighbour-joining algorithm, chosen because of "its strong track record in the analysis of large species assemblages [and] the additional advantage of generating results much more quickly than alternatives." The tree was shown as rooted on the Platyhelminthes but without explanation (the other two analyses in the same paper had clearly specified outgroups). The tree itself looks like it might have a mid-point root.
No measure of branch support was provided, but the authors concluded that their analysis:
showed good resolution of the major taxonomic groups. Monophyletic assemblages were recovered for three phyla (Annelida, Echinodermata, Platyhelminthes) and the chordate lineages formed a cohesive group. Members of the Nematoda were separated into three groups, but each corresponded to one of the three subclasses that comprise this phylum. Twenty-three out of the 25 arthropods formed a monophyletic group, but the sole representatives of two crustacean classes (Cephalocarida, Maxillopoda) fell outside this group. Twelve out of the 25 molluscan lineages formed a monophyletic assemblage allied to the annelids, but the others were separated into groups that showed marked genetic divergence. One group consisted solely of cephalopods, a second was largely pulmonates and the rest were bivalves.
I have tried to reconstruct the data (it is not available online), and re-analyzed it using Neighbor-Net (the closest network equivalent of Neighbour-joining) and uncorrected p-distances.
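For readers unfamiliar with the two distance measures, here is a small sketch of the difference between them. It assumes the standard Poisson correction for amino-acid data, d = -ln(1 - p); the two sequence fragments are hypothetical and are not taken from the Hebert et al. data.

```python
import math


def uncorrected_p(seq_a, seq_b):
    """Uncorrected p-distance: the proportion of aligned sites that differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def poisson_corrected(p):
    """Poisson-corrected distance, d = -ln(1 - p), allowing for multiple hits per site."""
    if not 0 <= p < 1:
        raise ValueError("p must lie in the interval [0, 1)")
    return -math.log(1.0 - p)


# Hypothetical aligned amino-acid fragments (10 sites, 2 differences)
frag_1 = "MAHAAQVGLQ"
frag_2 = "MAHPAQLGLQ"

p = uncorrected_p(frag_1, frag_2)
d = poisson_corrected(p)
print(f"p = {p:.3f}, Poisson-corrected d = {d:.3f}")
```

The correction matters little for similar sequences but grows rapidly as p increases, so the two measures can stretch the long edges of a tree or network quite differently.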
Some of the recognized taxonomic groups are, indeed, characterized by splits in the network, notably the Echinodermata, the Annelida, the Pulmonata (Mollusca), and the various parts of the Nematoda. However, the other groups are ambiguously defined. In particular, the Chordata, Arthropoda and most of the Mollusca are indistinct based on the gene sequence being analyzed, and there is no split supporting the Bivalvia (Mollusca). There is a split supporting the Platyhelminthes, but it has strong reticulate relationships with parts of the Nematoda — this is unfortunate since this is allegedly the root. Removing sample PL1 from the analysis makes the root a bit less ambiguous, and the network then unites most of the Nematoda as a single group.
This network does not really support the methodology used by the original authors. The authors tested the viability of DNA barcoding by adding a series of "test" sequences, one at a time, to the tree-based analysis, to see whether these sequences clustered with the "correct" group in the tree. However, most of the sequences don't form clear groups in the network, so it is not obvious how one would unambiguously decide which alleged group each test sequence clusters with.
The barcode metaphor looks very poor in this network. I wonder whether DNA barcoding would have taken off if the authors had presented this network rather than their original tree?
October 20, 2013
Some time ago I published a blog post on Faux phylogenies in which I included a phylogeny of cartoon animals by Mike Keesey. In this phylogeny, SpongeBob SquarePants was the outgroup. However, SpongeBob goes much further than this.
Importantly, the main characters in the cartoon include representatives of several phyla (although, notably, not the Cnidaria). Indeed, the List of SpongeBob SquarePants characters at Wikipedia makes this very clear. This opens up the possibility that they could be a means of using modern culture to introduce phylogenetics. This idea has been independently discovered at least twice.
Perhaps the best known usage is by Paul Arriola, produced for his freshman biology students, as shown in the first figure.
This has been reproduced in several places on the web, including Pinterest (e.g. here and here), Facebook (e.g. here and here), and academia (here).
Another, apparently independent, usage is by Rita Chen of the sister artists known as The Hurricanes.
Note that a few "extra" characters have been added (the planarian, ragworm and roundworm), and that the names are not all quite correct.
By the way, did you know that there is a species of sponge-like fungus (in the Boletaceae) called Spongiforma squarepantsii, and named after the character? If not, then see Wikipedia.
October 15, 2013
These days, there are many unrooted affinity-type networks used to display conflicting phylogenetic signals. There are many different methods available, although the various forms of splits graphs seem to dominate, especially NeighborNet and Consensus Networks (for species-level data), and Reduced Median Networks and Median Joining Networks (for population-level data). However, phylogeneticists are interested in genealogies, not just data displays.
Unfortunately, rooted evolutionary networks are not so well off. There is a great need for such networks in phylogenetics, but there are very few automated methods available for constructing them. These networks are needed whenever a genealogy involves reticulation processes rather than solely divergence. The latter produces a tree-like evolutionary history but the former do not, and these thus require network methods.
Due to the lack of obvious methods, most current research papers still do not illustrate reticulate evolution with a genealogy. A collection of ad hoc methods is usually applied to the data, and the evolutionary processes are then inferred from this. However, the use of a network to illustrate the inferred genealogy is rather rare.
Indeed, for species-level studies most papers simply present a set of incongruent gene trees, although some of them also illustrate either (i) the tree derived from the combined data, or (ii) a consensus tree with or without the conflicting relationships, or (iii) a pair of cophylogeny trees. Occasionally, the hybrid origin of some of the species, for example, is illustrated, but the putative parents are not connected in a phylogeny.
Population-level studies often present unrooted haplotype networks, illustrating processes such as hybridization and introgression between closely related species, or the evolution of domesticated species.
However, these ad hoc methods do not mean that evolutionary networks do not appear in the literature. In this blog post I include a representative sample of rooted networks that are intended to illustrate inferred genealogies. They are grouped according to the evolutionary processes being studied (see Reticulation patterns and processes in phylogenetic networks). I have also briefly indicated how the networks were constructed.
Hybridization is commonly studied in the literature, and phylogenetic networks appear not infrequently. This first example was constructed by the unreleased program HyperPars.
Dickerman AW (1998) Generalizing phylogenetic parsimony from the tree to the forest. Systematic Biology 47: 414-426.
This next example was constructed by program SplitsTree. Note that the root of the network is not clearly indicated.
Pirie MD, Humphreys AM, Barker NP, Linder HP (2009) Reticulation, data combination, and inferring evolutionary history: an example from Danthonioideae (Poaceae). Systematic Biology 58: 612-628.
This example was constructed manually from a set of gene trees. Note that it is drawn in a rather unusual style for indicating hybridization.
Sang T, Crawford D, Stuessy T (1997) Chloroplast DNA phylogeny, reticulate evolution, and biogeography of Paeonia (Paeoniaceae). American Journal of Botany 84: 1120-1136.
Polyploid hybridization is probably the most likely type of study to have a phylogenetic network. This is at least partly because there is a computer program, Padre, to automate much of the work. This program was used to construct this first network.
Marcussen T, Jakobsen KS, Danihelka J, Ballard HE, Blaxland K, Brysting AK, Oxelman B (2012) Inferring species networks from gene trees in high-polyploid North American and Hawaiian violets (Viola, Violaceae). Systematic Biology 61: 107-126.
This next example was also constructed by program Padre.
Sessa EB, Zimmer EA, Givnish TJ (2012) Unraveling reticulate evolution in North American Dryopteris (Dryopteridaceae). BMC Evolutionary Biology 12: 104.
This example was constructed manually from a gene tree.
Marhold K, Lihová J (2006) Polyploidy, hybridization and reticulate evolution: lessons from the Brassicaceae. Plant Systematics and Evolution 259: 143-174.
Introgression is a widely studied phenomenon. However, rooted evolutionary networks are rarely presented. This first one was constructed manually from a set of gene trees.
Koblmüller S, Duftner N, Sefc KM, Aibara M, Stipacek M, Blanc M, Egger B, Sturmbauer C (2007) Reticulate phylogeny of gastropod-shell-breeding cichlids from Lake Tanganyika — the result of repeated introgressive hybridization. BMC Evolutionary Biology 7: 7.
The next example was also constructed manually from a set of gene trees.
Morgan DR (2003) nrDNA external transcribed spacer (ETS) sequence data, reticulate evolution, and the systematics of Machaeranthera (Asteraceae). Systematic Botany 28: 179-190.
This example was constructed by program SplitsTree.
Labate JA, Robertson LD (2012) Evidence of cryptic introgression in tomato (Solanum lycopersicum L.) based on wild tomato species alleles. BMC Plant Biology 12: 133.
Horizontal Gene Transfer
HGT is a hot topic these days, both among prokaryotes and among eukaryotes, although most papers do not present a phylogenetic network. The first example was constructed by program Sprit from the species tree and a gene tree.
Walsh AM, Kortschak RD, Gardner MG, Bertozzi T, Adelson DL (2013) Widespread horizontal transfer of retrotransposons. Proceedings of the National Academy of Sciences USA 110: 1012-1016.
This next example was constructed manually from a gene tree.
Delwiche CF, Palmer JD (1996) Rampant horizontal transfer and duplication of rubisco genes in eubacteria and plastids. Molecular Biology and Evolution 13: 873-882.
This example was constructed manually from incongruence among a series of gene trees.
Richards TA, Soanes DM, Foster PG, Leonard G, Thornton CR, Talbot NJ (2009) Phylogenomic analysis demonstrates a pattern of rare and ancient horizontal gene transfer between plants and fungi. The Plant Cell 21: 1897-1911.
Intra-genic recombination is often studied without reference to a network. Nevertheless, several programs exist, and this particular network was constructed by program Kwarg.
Jenkins PA, Song YS, Brem RB (2012) Genealogy-based methods for inference of historical recombination and gene flow and their application in Saccharomyces cerevisiae. PLoS One 7: e46947.
Chromosomal rearrangements are studied rather rarely. This network was constructed manually from a phylogenetic tree. Note that the root of the network is not clearly indicated.
Rumpler Y, Hauwy M, Fausser JL, Roos C, Zaramody A, Andriaholinirina N, Zinner D (2011) Comparing chromosomal and mitochondrial phylogenies of the Indriidae (Primates, Lemuriformes). Chromosome Research 19: 209-224.
Reassortment of segmented viruses produces very complex networks. This one is a partial network, constructed manually from a series of phylogenetic analyses.
Smith GJ, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, Pybus OG, Ma SK, Cheung CL, Raghwani J, Bhatt S, Peiris JS, Guan Y, Rambaut A (2009) Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459(7250): 1122-1125.
This is a difficult topic to study. As is almost always done, this network was constructed manually from a phylogenetic tree.
Thiergart T, Landan G, Schenk M, Dagan T, Martin WF (2012) An evolutionary network of genes present in the eukaryote common ancestor polls genomes on eukaryotic and mitochondrial origin. Genome Biology and Evolution 4: 466-485.
This topic rarely involves networks. This network was constructed manually from the output of program SplitsTree.
Dyer RJ, Savolainen V, Schneider H (2012) Apomixis and reticulate evolution in the Asplenium monanthes fern complex. Annals of Botany 110: 1515-1529.
This is an unusual use of a network, but the author notes that "the use of reticulations clarifies the phylogeny by factoring out apparent convergence, even though there is no reason to think that actual hybridization or introgression has occurred." The network was constructed by an unreleased program.
Alroy J (1995) Continuous track analysis: a new phylogenetic and biogeographic method. Systematic Biology 44: 152-178.
October 13, 2013
Some time ago I published a blog post in which I used Google's Ngram Viewer to explore some of the history of phylogenetic networks (Ngrams and phylogenetics). Today I use Google Trends to look at the worldwide popularity of some phylogenetic terms in Google's web searches.
The data start in January 2004 and end in September 2013. According to Google, the vertical axis "numbers represent search interest relative to the highest point on the chart. If, at most, 10% of searches for the given region and time frame were for "pizza", then we'd consider this 100."
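The scaling that Google describes can be sketched as follows. This is only an illustration of the quoted description, not Google's actual procedure, and the monthly shares are hypothetical.

```python
def relative_interest(share_of_searches):
    """Rescale search shares so that the highest point on the chart becomes 100."""
    peak = max(share_of_searches)
    if peak <= 0:
        raise ValueError("at least one value must be positive")
    return [round(100 * s / peak) for s in share_of_searches]


# Hypothetical share of all searches in each of four months
monthly_share = [0.10, 0.05, 0.025, 0.08]
index = relative_interest(monthly_share)
print(index)
```

Because every chart is rescaled to its own peak, the vertical axes of charts for different search terms cannot be compared directly.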
The first search term is for "Phylogenetics", which shows a depressing trend.
The next term is "Phylogeny", which shows the same trend.
The final term is "Phylogenetic Tree", which looks somewhat better.
Either people have lost interest in phylogenetics, or they already know about it so they no longer need to do web searches to find out about it.
October 8, 2013
I have written before about the interpretation of splits graphs, and provided a simple worked example (How to interpret splits graphs). However, it seems to be worth re-emphasizing the issue here, as I have recently had a paper drawn to my attention that incorrectly infers "groups" of genes from a series of splits graphs.
The essential point to understand is that splits graphs are separation networks. That is, the edges in the graph represent separation between two clusters of nodes in the network; or, they split the graph in two. Formally, each edge (or set of parallel edges) represents a bipartition (or split) of the taxa/genes based on one or more characteristics.
Therefore, the only groups of nodes that are "supported" by a network are those that are represented by splits in the graph, or by some unique combination of splits.
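To make this concrete, here is a minimal sketch that stores each split as one side of a bipartition and asks whether a proposed group corresponds to a single split. The taxa and splits are hypothetical, and the function deliberately checks single splits only, not the combinations of splits that can also define a group.

```python
def supported_by_split(group, splits, taxa):
    """True if some split separates exactly `group` from all of the other taxa."""
    group = frozenset(group)
    complement = frozenset(taxa) - group
    for side in splits:
        side = frozenset(side)
        # A split is a bipartition, so either side of it may match the group
        if side == group or side == complement:
            return True
    return False


# Hypothetical taxon set and splits; the other side of each split is implied
taxa = {"a", "b", "c", "d", "e"}
splits = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}]

print(supported_by_split({"a", "b"}, splits, taxa))  # matches the first split directly
print(supported_by_split({"d", "e"}, splits, taxa))  # matches the complement of {a, b, c}
print(supported_by_split({"b", "c"}, splits, taxa))  # no split separates b and c from the rest
```

Drawing a line around nodes that happen to sit near each other in the graph, as in the third case, does not make them a supported group.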
I will illustrate this using the paper already mentioned:
Marz M, Kirsten T, Stadler PF (2008) Evolution of spliceosomal snRNA genes in metazoan animals. Journal of Molecular Evolution 67: 594-607.
The authors describe their analyses thus:
We use split decomposition and the neighbor net algorithm (as implemented as part of the SplitsTree4 package) to construct phylogenetic networks rather than phylogenetic trees. The advantage of these method is that they are very conservative and that the reconstructed networks provide an easy-to-grasp representation of the considerable noise in the sequence data.
Unfortunately, it is not clear which network algorithm was used for the networks actually presented in the paper. However, this does not affect the interpretation of the graphs (only the number of splits shown).
For Figure 1, the authors claim:
A phylogenetic analysis of the individual snRNA families, nevertheless, does not show widely separated paralogue groups that are stable throughout larger clades. Figure 1, for example, shows that the U5 variants described in Chen et al. (2005) do not form clear paralogue groups beyond the closest relatives of Drosophila melanogaster. On the other hand, there is some evidence for distinguishable paralogues outside the melanogaster subgroup.
This interpretation of Figure 1 seems to be quite reasonable.
However, for Figure 2 they claim:
The situation is much clearer for the drosophilid U4 snRNAs, where three paralogue groups can be distinguished (see Fig. 2). One group is well separated from the other two and internally rather diverse. The other two groups are very clearly distinguishable for the melanogaster and obscura group (see Drosophila 12 Genomes Consortium 2007). For D. virilis, D. mojavensis, D. grimshawi, and D. willistoni we have two nearly identical copies instead of two different groups of genes.
In Figure 2 (which is labelled as a "phylogenetic tree"), only the recognition of "group 1" is very well supported by a split in the network (i.e. there is a long set of edges separating the "group 1" genes from the rest of the genes). The distinction between "group 2" and "group 3" does not correspond to any split in the network, although there are a few splits in the network shown that could be used to recognize groups (notably the "wi" genes).
Furthermore, for Figure 3 the authors claim:
In teleost fish, we find clearly recognizable paralogue groups for U2, U4, and U5 snRNAs. Surprisingly, the medaka Oryzias latipes has only a single group of closely related sequences, despite the fact that for U4, the split of the paralogues appear to predate the last common ancestor of zebrafish and fugu (Fig. 3).
However, in Figure 3: the left-hand network shows three lines that allegedly define groups, only two of which are supported by splits; the middle network shows three lines that define groups, only one of which is supported by a split; and the right-hand network shows two lines that define groups, neither of which is supported by a split. Once again, there are splits in these networks that do form groupings. For example, in the third network, one of the largest splits supports a grouping of the "bfl" genes, while the other supports a grouping of "bfl" + "pma".
Thus, it seems that the authors' recognition of various paralogue groups is not well supported by their network analyses. Nevertheless, there are reasonably well-supported splits in the networks shown, which therefore could be used to recognize groups, if desired.
October 6, 2013
I hate heights. This is a well-known syndrome (acrophobia), and so I am not alone. However, it does mean that I dislike being in airplanes, especially small ones. In turn, this means that I am interested in air disasters, because it gives me a very good reason to feel that I should dislike being in planes.
Airplane crashes are publicly documented in a way that car crashes, for example, are not. The latter are all too common, sadly, and so you will not find any lists online detailing them. You will, however, find plenty of information about airplane crashes, including a lot of details that you might be better off not knowing about. I will skip most of these morbid details, since this is a family blog, but in this post I will be looking at some of the actual data.
One of the few airlines never to have been involved in a fatal accident.
One internet site that you might like to peruse is the Aviation Safety Network, which has a database with details of all known aviation incidents worldwide. From 2 August 1919 to 1 October 2013, there were 16,844 recorded incidents, including 13,785 accidents, 1,045 hijackings, and 708 other criminal occurrences.
The information is mostly taken from the report that arises from the official investigation (if there was one). If you read some of the descriptions, not only will you never fly again, you will never even set foot in another airborne conveyance, even while it is still on the ground. What this database does is itemize every single thing that could possibly go wrong with a plane, and what effect this has on the people in it.
There is a long-standing rumour that the most dangerous parts of a flight are take off and landing. However, the data make it clear that this is complete nonsense. Consider, for example, the circumstances of the 40 worst accidents in terms of number of fatalities per plane (excluding ground fatalities):
Take off phase
Initial climb phase
En route phase
Landing phase 1
3
If all planes ever did was take off and then immediately land, the passengers and crew would all be much better off.
However, the worst double-accident did occur while one plane was taxiing and another was taking off. Both were Boeing 747s, and their collision killed 335 of the 396 people in the taxiing plane and all 248 people in the plane that was taking off. This was in the Canary Islands in 1977.
The worst accident involving a single plane occurred in Japan in 1985, when 520 of 524 people died. The 747 plane had previously suffered damage, which apparently was not repaired properly, and the plane therefore ruptured in mid-air. [Note: 747s on domestic routes in Japan are configured to carry close to the maximum number of passengers.] The next worst accident (all 346 people died) occurred when the luggage compartment of a DC-10 opened shortly after taking off in France in 1974.
And so the list goes on, usually involving the failure of some part of the aircraft systems. However, an all too common cause of fatal airplane accidents is what is euphemistically called Controlled Flight Into Terrain (CFIT), which means that the pilot was in control of an undamaged plane at the time of the crash. This problem has been partly addressed by the introduction of Ground Proximity Warning Systems (GPWS) and Minimum Safe Altitude Warning (MSAW) devices.
Actually, when you look at it, 35 of the top 40 accidents occurred from 1972 to 1999, inclusive. Only 5 of them occurred after that, and none of them after June 2009. So, air safety is officially considered to be improving continuously through time. Prior to 1970, when the Boeing 747 was introduced, accidents involved fewer people because there were far fewer passengers per plane. [The original 747 had 2.5 times the capacity of the previous Boeing 707.] So, only two accidents from the 1960s make it even into the top 100 list, and neither of them was prior to 1962.
The Aviation Safety Network site also provides summary lists concerning some of the accident situations, and it is this summary information that we are considering here. In particular, we are interested in the information regarding the nature of the flights. The data are the number of fatal hull-loss accidents (fuselage written off, damaged beyond repair) per year and the number of associated fatalities. The data cover the years 1942 to 2012, inclusive (71 years).
The flight categories include: Training flight (total of 136 accidents and 553 casualties over the 71 years), Ferry / positioning activity (137 accidents, 523 casualties), Cargo flight (712; 2,986), International scheduled passenger flight (367; 17,597), and Domestic scheduled passenger flight (1,299; 39,403). For the Training flights, Ferrying, and Cargo flights, this is an average of ~4 casualties per accident; for the Domestic passenger flights it is ~30 casualties per accident; and for the International passenger flights it is ~48 casualties per accident. [The summary data for both the International and Domestic Non-Scheduled Passenger flights are missing from the web site.]
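The per-accident averages quoted above can be checked directly from the totals. The following is a minimal sketch in Python, using only the figures given in the text; the dictionary and function names are my own and are purely illustrative.

```python
# Totals quoted in the text for 1942-2012: (accidents, casualties).
totals = {
    "Training": (136, 553),
    "Ferry / positioning": (137, 523),
    "Cargo": (712, 2986),
    "International scheduled passenger": (367, 17597),
    "Domestic scheduled passenger": (1299, 39403),
}


def casualties_per_accident(accidents, casualties):
    """Average number of casualties per fatal hull-loss accident."""
    return casualties / accidents


# Compute the average for every flight category.
rates = {
    name: casualties_per_accident(accidents, casualties)
    for name, (accidents, casualties) in totals.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} casualties per accident")
```

The first three categories all round to about 4, while the Domestic and International passenger flights round to about 30 and 48, respectively.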
I have analyzed the annual accident data using a phylogenetic network as a tool for exploratory data analysis. To create the network, I first calculated the similarity of the years using the Manhattan distance, and a Neighbor-net analysis was then used to display the between-year similarities as a phylogenetic network. So, years that are closely connected in the network are similar to each other based on the number and severity of the aircraft accidents, and those that are further apart are progressively more different from each other.
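For readers unfamiliar with the Manhattan distance, the first step can be sketched as follows. This is only an illustration in Python, not the actual analysis: the three "years" and their values are hypothetical placeholders (not taken from the Aviation Safety Network), the function names are my own, and I am assuming that each year is represented by a simple vector of accident and casualty counts, with no rescaling of the variables. The resulting matrix is what would then be handed to a Neighbor-net implementation, for example in a program such as SplitsTree.

```python
def manhattan(u, v):
    """Manhattan (city-block) distance: the sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(profiles):
    """Return the sorted labels and the full pairwise distance matrix."""
    labels = sorted(profiles)
    matrix = [
        [manhattan(profiles[row], profiles[col]) for col in labels]
        for row in labels
    ]
    return labels, matrix


# HYPOTHETICAL placeholder profiles, for illustration only. Each vector
# stands for per-category counts of accidents and casualties in one year.
toy_profiles = {
    "year_A": [2, 10, 1, 5],
    "year_B": [3, 40, 2, 90],
    "year_C": [2, 12, 1, 9],
}

labels, matrix = distance_matrix(toy_profiles)
for label, row in zip(labels, matrix):
    print(label, row)
```

In this toy example, year_A and year_C have similar counts and therefore a small distance, so they would sit close together in the network, while year_B would be placed far from both.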
Basically, the accidents increase in number and severity from top to bottom in the network, with 1985 (see above) and 1972 being the worst years. The worst period was 1969-1980 (shown in purple), with one exception (1975). Note that 1977 was overall not a particularly bad year, in spite of the Canary Islands incident (see above).
Perhaps the most important message in the network is that the years 2000-2012 (in red) are generally clustered with the 1940s (green) and 1950s (blue). So, in spite of the massively greater volume of air traffic in this century, the number of fatal accidents is currently not much greater than it was before the advent of the jet age.
The years 2000, 2001 and 2009 are clustered away from the others (at the top left) because there were still a few bad accidents involving International flights even though the number of accidents involving Domestic flights was low. Indeed, 2000-2001 were the first years to return to the Domestic accident levels of 1942-1945 (which are clustered at the top right).
I presume that I should take great comfort from this overall trend. The reason for it is not hard to fathom, and indeed it is the purpose of the Aviation Safety Network. Every time there is a reported aviation incident it is investigated, and any lessons learned are disseminated. So, if the circumstances leading to an incident are avoidable, either by improving the technology or by changing the human operating procedures, then efforts are made by the authorities to implement those changes, so that the incidents will not be repeated. Safety must therefore increase, at least until radically different modes of transport are introduced. This is why the 1970s were so dangerous — the aviation authorities were suddenly confronted with the consequences of having jumbo jets in the skies.
This is the fundamental difference from car accidents, of course. On the ground, we insist on repeating the same types of incidents over and over again, with only improvements in technology to help us. The human operating procedures remain essentially the same, and fallible. I guess that is why car manufacturers have been developing Advanced Driver Assistance Systems (ADAS), as a step towards semi- or fully autonomous vehicles.