The Genealogical World of Phylogenetic Networks

Biology, computational science, and networks in phylogenetic analysis

URL: http://phylonetworks.blogspot.com/

May 21, 2013

22:30

When looking at the population genetics literature I have noticed that many papers still present very traditional phylogenetic analyses, particularly in what can broadly be called agricultural studies. For instance, genetic distances might be calculated between the samples and a "tree of genetic relationships" presented based on UPGMA clustering.

The problem with this sort of approach to genotype data analysis is that it forces the data into an ultrametric tree, which has long been shown to be inappropriate as a model for evolutionary relationships. Furthermore, there is no indication of the robustness of this tree, nor even whether a tree model is appropriate in the first place.
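To make the ultrametric constraint concrete: a distance matrix is ultrametric only if, for every triple of samples, the two largest of the three pairwise distances are equal. UPGMA always returns a tree with this property, whether or not the input distances have it. The check below is a minimal sketch; the two small matrices are invented for illustration and are not taken from any of the studies discussed.

```python
from itertools import combinations

def is_ultrametric(dist, tol=1e-9):
    """Return True if every triple satisfies the three-point condition.

    For an ultrametric, the two largest of the three pairwise
    distances in any triple must be equal (within tol).
    """
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted(
            (dist[i][j], dist[i][k], dist[j][k])
        )
        if abs(largest - middle) > tol:
            return False
    return True

# Hypothetical distances: samples 0 and 1 are equally far from sample 2.
clock_like = [
    [0, 2, 6],
    [2, 0, 6],
    [6, 6, 0],
]

# Hypothetical distances: sample 2 is closer to sample 1 than to sample 0.
not_clock_like = [
    [0, 2, 6],
    [2, 0, 5],
    [6, 5, 0],
]

print(is_ultrametric(clock_like))
print(is_ultrametric(not_clock_like))
```

Real genotype distances will rarely pass such a check, which is one way of seeing how much an ultrametric tree can distort them.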

As a specific example, we can look at the microsatellite data presented by Carimi et al. (2010) for various Sicilian grape cultivars. For grape varieties, where hybridization among cultivars has been the historical norm, an ultrametric tree seems singularly inappropriate.

Wine grapes have been grown on Sicily for more than 2,000 years, and at least 120 grape-vine cultivar names are known in the literature. The authors sampled 82 of the cultivars from the Institute of Plant Genetics (Palermo) germplasm collection, with 1-5 clones sampled per cultivar. They assessed six polymorphic microsatellite loci, producing diploid (co-dominant) data. Only 70 distinct genotypes were detected, which were then subjected to data analysis.

The authors used the "Simple Matching coefficient for co-dominant and multiallelic data" to estimate the genetic distances between samples. Unfortunately, this has been shown to have odd properties for diploid microsatellite data (Kosman and Leonard 2005). Therefore, in my analysis I have instead used the simple metric of Kosman and Leonard (2005), in which genotype distances are calculated from the proportion of shared alleles at each locus (averaged across loci). This was calculated using the mmod R package (Winter 2012).
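As a rough illustration of a shared-allele distance for diploid data, the sketch below takes the distance at each locus to be one minus the proportion of alleles shared between two genotypes, and then averages across loci. This is only one plain reading of the description above; it is not the mmod code, details such as the handling of missing data are ignored, and the allele sizes are invented.

```python
from collections import Counter

def locus_distance(genotype1, genotype2):
    """Distance at one diploid locus: 0, 0.5 or 1.

    Shared alleles are counted as a multiset intersection, so
    (200, 200) and (200, 204) share exactly one allele.
    """
    shared = sum((Counter(genotype1) & Counter(genotype2)).values())
    return 1.0 - shared / 2.0

def genotype_distance(individual1, individual2):
    """Average the per-locus distances across all loci."""
    per_locus = [
        locus_distance(g1, g2) for g1, g2 in zip(individual1, individual2)
    ]
    return sum(per_locus) / len(per_locus)

# Hypothetical microsatellite profiles (allele sizes) at three loci.
cultivar_x = [(120, 124), (200, 200), (98, 102)]
cultivar_y = [(120, 124), (200, 204), (110, 112)]

# Locus distances are 0, 0.5 and 1, so the average is 0.5.
print(genotype_distance(cultivar_x, cultivar_y))
```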

The authors then used the "UPGMA (Unweighted Pair-Group Method with Arithmetical Averages)" clustering method to produce their ultrametric tree from the distance data. This is the most commonly encountered agglomerative hierarchical clustering method to be found in the literature. Instead, I used a NeighborNet network to evaluate whether the data are tree-like, calculated using the SplitsTree program.

The resulting network is shown in the first graph. Cultivars that are closely connected in the network are similar to each other based on their microsatellite profiles, and those that are further apart are progressively more different from each other.


The network shows that there is very little hierarchical structure to the grape-vine microsatellite data. The data do not clearly distinguish "six main groups", as interpreted by the original authors based on their tree (which is shown below). [Note that one of the authors' groups (cluster E) is more heterogeneous than the others, and to be comparable should be divided into either two or three groups.]


Note that the network emphasizes two things: (1) there are no clear groupings of the grape cultivars, and (2) the data are rather "noisy", as microsatellite data often are (e.g. Leroy et al. 2009), with many incompatible signals.

As far as the phylogenetic history is concerned, there is no evidence of "several origins for Sicilian grape-vine germplasm", as interpreted by the authors. Instead, there seems to have been continuous mixing of the genotypes, probably including cultivars from elsewhere in Italy, and even further afield around the Mediterranean. This type of complex genetic history seems to be quite common in domesticated organisms, and a tree-based analysis is therefore unlikely to be appropriate for studying them; see, for example, Decker et al. (2009) for cows, Leroy et al. (2009) for horses, and Kijas et al. (2012) for sheep.

References

Carimi F, Mercati F, Abbate L, Sunseri F (2010) Microsatellite analyses for evaluation of genetic diversity among Sicilian grapevine cultivars. Genetic Resources and Crop Evolution 57: 703–719.

Decker J.E., Pires J.C., Conant G.C., McKay S.D., Heaton M.P., Chen K., Cooper A., Vilkki J., Seabury C.M., Caetano A.R., Johnson G.S., Brenneman R.A., Hanotte O., Eggert L.S., Wiener P., Kim J.-J., Kim K.S., Sonstegard T.S., Van Tassell C.P., Neibergs H.L., McEwan J.C., Brauning R., Coutinho L.L., Babar M.E., Wilson G.A., McClure M.C., Rolf M.M., Kim J., Schnabel R.D., Taylor J.F. (2009) Resolving the evolution of extant and extinct ruminants with high-throughput phylogenomics. Proceedings of the National Academy of Sciences of the U.S.A. 106: 18644-18649.

Kijas J.W., Lenstra J.A., Hayes B., Boitard S., Porto Neto L.R., San Cristobal M., Servin B., McCulloch R., Whan V., Gietzen K., Paiva S., Barendse W., Ciani E., Raadsma H., McEwan J., Dalrymple B., other members of the International Sheep Genomics Consortium (2012) Genome-wide analysis of the world's sheep breeds reveals high levels of historic mixture and strong recent selection. PLoS Biology 10: e1001258.

Kosman E, Leonard KJ (2005) Similarity coefficients for molecular markers in studies of genetic relationships between individuals for haploid, diploid, and polyploid species. Molecular Ecology 14: 415–424.

Leroy G., Callède L., Verrier E., Mériaux J.C., Ricard A., Danchin-Burge C., Rognon X. (2009) Genetic diversity of a large set of horse breeds raised in France assessed by microsatellite polymorphism. Genetics Selection Evolution 41: 5.

Winter DJ (2012) mmod: an R library for the calculation of population differentiation statistics. Molecular Ecology Resources 12: 1158–1160.

May 19, 2013

16:30

In my previous blog post (Resistance to network thinking) I noted that a phylogenetic network is a generalization of a phylogenetic tree because "a network simplifies to a tree if there are no incompatible phylogenetic signals". Given this, to me it has often seemed somewhat odd that so many of the people who are interested in generalizing the Tree of Life into a Network of Life use metaphors suggesting that the tree first needs to be destroyed.

This approach was popularized by Ford Doolittle, who entitled his 2000 Scientific American [282(2): 90–95] article "Uprooting the Tree of Life", although this particular metaphor had previously been used by, for example, Elizabeth Pennisi [Science 284: 1305-1307].

This approach reached its apogee with the ridiculous cover of New Scientist in January 2009. The cover accompanied an article by Graham Lawton now mildly entitled: "Why Darwin was wrong about the Tree of Life" [201(2692): 34-39], although the editor (Roger Highfield) originally called it "Axing Darwin's tree".


As was noted at the time, this cover was "a misdirected and entirely inappropriate piece of sensationalism", which did no one any good (least of all the editor). A subsequent Letter to the Editor [by Dennett, Coyne, Dawkins and Myers] noted: "Nothing in the article showed that the concept of the Tree of Life is unsound; only that it is more complicated than was realised before the advent of molecular genetics."

So, it seems likely that the tree needs to be neither axed nor uprooted, nor "trashed" [Laura Franklin-Hall], nor even "politely buried" [Michael Rose]. In many cases all that is needed is some osculations between the branches. Indeed, most of the scientific discussion is about how many osculations there are, and how we can best detect where they are, rather than about destroying the tree itself. A network is more general than a tree, rather than being a fundamentally different structure. Nevertheless, some people, such as Michael Syvanen, have been quoted as saying: "We've just annihilated the Tree of Life", when referring to their new network.

May 14, 2013

22:30

Phylogeneticists are used to the idea of tree thinking, in which evolutionary history is seen as a branching tree-like pattern. Clearly, for many phylogeneticists this has not yet been extended to network thinking, in which evolutionary history can also be seen as a reticulating network. Indeed, I have recently come across several people who have actively insisted that "trees are still central" to phylogenetics (to quote one of my correspondents). As Mindell (2013) has claimed, the Tree of Life is still a useful metaphor, model and heuristic device.

So, there is not just indifference to networks but there seems also to be some resistance to them. This is somewhat unexpected, as a network simplifies to a tree if there are no incompatible phylogenetic signals, and so there is no intrinsic reason to restrict phylogenies to being tree-like.
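The idea of "incompatible phylogenetic signals" can be made concrete with splits (bipartitions of the taxa). Two splits A|B and C|D are compatible if at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty, and a collection of splits can be drawn on a single tree exactly when every pair is compatible. The sketch below checks this; the taxon labels and splits are invented for illustration.

```python
from itertools import combinations

def compatible(split1, split2, taxa):
    """Return True if two splits can sit on the same tree.

    Each split is given by one of its two sides; the other side
    is the complement with respect to the full taxon set.
    """
    taxa = frozenset(taxa)
    a, c = frozenset(split1), frozenset(split2)
    b, d = taxa - a, taxa - c
    return any(not (x & y) for x, y in ((a, c), (a, d), (b, c), (b, d)))

def fits_a_tree(splits, taxa):
    """A set of splits is tree-like if all pairs are compatible."""
    return all(compatible(s, t, taxa) for s, t in combinations(splits, 2))

taxa = {"A", "B", "C", "D", "E"}

# Nested groupings: AB | CDE and ABC | DE agree with one tree.
tree_like = [{"A", "B"}, {"A", "B", "C"}]

# Conflicting groupings: AB | CDE and BC | ADE cannot both be on a tree.
conflicting = [{"A", "B"}, {"B", "C"}]

print(fits_a_tree(tree_like, taxa))
print(fits_a_tree(conflicting, taxa))
```

When all splits are compatible a network drawn from them is simply a tree; it is only the incompatible pairs that produce reticulations.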

As a typical example from the literature, Losos et al. (2012) have recently commented:
Although molecular data have rarely changed our understanding of the major multicellular groups of the evolutionary tree of life, they have suggested changes in the relationships within many groups, such as the evolutionary position of whales in the clade of even-toed ungulates. Further investigation has usually resolved conflicts, often by revealing inadequacies in previous morphological studies. This has led to a presumption by many in favor of molecular data.

Needless to say this is a biased point of view, because conflicts can also be resolved by revealing inadequacies in molecular studies. For example, molecular analyses involve many subjective decisions about substitution models and rates of molecular change, and any one of the underlying assumptions may be violated. There is no theoretical justification for favouring one source of data over another.

Similarly, there is no theoretical justification for trying to resolve conflicts by preferring one hypothesis over another. Phylogenetic conflicts can also be "resolved" by recognizing that evolutionary history is not necessarily tree-like. Losos et al. do not even consider this possibility:
When two phylogenies are fundamentally discordant, at least one data set must be misleading.

In fact, the only misleading thing here is the word "must", because both datasets may be perfectly correct but are simply the product of two different evolutionary histories.

This point is perhaps most obvious when comparing molecular datasets. The evolutionary history revealed by between-gene evolutionary processes (e.g. recombination, hybridization, horizontal gene transfer) often conflicts with that from within-gene processes (e.g. nucleotide substitutions and insertions / deletions), and this leads to a reticulating evolutionary history.

Indeed, the more we learn about genomes the less tree-like does the evolutionary history of species seem to be. There are long-standing controversies regarding the evolutionary history of many taxonomic groups, and it has been hoped that genome-scale data would resolve these controversies. However, to date none of these controversies has been satisfactorily resolved into an unambiguous tree-like genealogical history using genome data. They all apparently involve reticulate evolutionary processes.

For example, the estimated relationships among humans, chimpanzees and gorillas did not change as a result of genome sampling (Galtier and Daubin 2008), nor did those of malaria species (Kuo et al. 2008) nor those of placental superorders (Hallström and Janke 2012). In all three cases the estimated relationships were just as complex after the genome sequencing as before. The resolution of controversial branches in our trees has not occurred as a result of increased access to character data or improved data analyses, but our recognition of reticulating relationships certainly has occurred.

There are many other examples where increased character sampling is yet to resolve long-standing controversies about branching patterns, and where reticulation may also be the true explanation. Birds seem to provide many of these examples (eg. Smith et al. 2013), but insects are a rich source as well (eg. Thomas et al. 2013), and sometimes even plants (eg. Goremykin et al. 2013).

Clearly, when two or more phylogenies are fundamentally discordant, none of the datasets needs to be misleading, because a reticulating history may be involved. Network thinking should thus be a standard tool in the arsenal of every phylogeneticist. Tree thinking excludes networks but network thinking does not exclude trees, and so the more general model will always be the more useful one.

References

Galtier N, Daubin V (2008) Dealing with incongruence in phylogenomic analyses. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences 363: 4023-4029.

Goremykin VV, Nikiforova SV, Biggs PJ, Zhong B, Delange P, Martin W, Woetzel S, Atherton RA, McLenachan PA, Lockhart PJ (2013) The evolutionary root of flowering plants. Systematic Biology 62: 50-61.

Hallström BM, Janke A (2012) Mammalian evolution may not be strictly bifurcating. Molecular Biology and Evolution 27: 2804-2816.

Kuo C-H, Wares JP, Kissinger JC (2008) The Apicomplexan whole-genome phylogeny: an analysis of incongruence among gene trees. Molecular Biology and Evolution 25: 2689-2698.

Losos JB, Hillis DM, Greene HW (2012) Who speaks with a forked tongue? Science 338: 1428-1429.

Mindell DP (2013) The Tree of Life: metaphor, model, and heuristic device. Systematic Biology 62: 479-489.

Smith JV, Braun EL, Kimball RT (2013) Ratite nonmonophyly: independent evidence from 40 novel loci. Systematic Biology 62: 35-49.

Thomas JA, Trueman JW, Rambaut A, Welch JJ (2013) Relaxed phylogenetics and the Palaeoptera problem: resolving deep ancestral splits in the insect phylogeny. Systematic Biology 62: 285-297.

May 12, 2013

16:30

Some time ago I blogged about The mysterious rankings in Forbes' Celebrity 100. I noted at the time that "There are some other things that we can learn from an analysis of the Celebrity 100 list, but they have nothing to do with networks, so I will not cover them here." I will, however, cover them now.

Each year since 1999 Forbes magazine has produced a list called the Celebrity 100, which purports "to list the 100 most powerful celebrities of the year" within the USA. The list is based on entertainment-related earnings plus media visibility (exposure in print, television, radio, and online). The 2012 list generated plenty of negative comments around the web, and my network analysis of the data showed that there is little apparent mathematical logic to some of the rankings.

However, the data do also reveal interesting patterns about the perception of celebrity in the media, provided that we accept the quality of Forbes' data (even if we find fault with what Forbes did with those data). In the graphs below I have simply used the information provided by Forbes in order to take a look at some of the features that Forbes did not comment upon.

The first graph plots the celebrity ranking by sex and "profession" (each dot represents one celebrity). You will note that the data are not randomly distributed among the groups.


The graph shows that one third of the celebrities are female, and they dominate the top 10 and the bottom 30. So, to get a very high ranking it is best to be female, but further down the list it becomes a handicap.

The other groupings are based on the Forbes description of each celebrity's principal claim to fame. Clearly, in terms of celebrity status: being a musician is better than being an athlete, which is better than being an actor, which is better than being an actress. Being a TV or radio personality is not bad, either. Note that this explains the bi-modal distribution of females: the music females are in the top 10 while the acting females are in the bottom 30.

For the rest, if you are a male, then being a producer/director is marginally better than being an author, which is marginally better than being a comedian. If you are female, then being a model is much worse than being a singer or an actress. Being an entrepreneur works only if you are Donald Trump.

The second graph compares each celebrity's money ranking (based on an estimate of their earnings) with their overall ranking. This is an attempt to see who is financially benefitting from their celebrity status (or vice versa). The two lines on the graph show that for most celebrities (those between the lines) their financial status closely follows their celebrity status.


However, for those at the top-left of the graph their celebrity standing is greater than their pay would suggest. (They are ranked in the top 30 on overall celebrity status but are not in the top 25 money earners.) This means that their manager is "not getting them what they are worth". These people are, from top to bottom on the graph:
  • Jennifer Aniston – actress
  • Kim Kardashian – television personality
  • Angelina Jolie – actress
  • Brad Pitt – actor
  • Adele Adkins – singer
  • Beyoncé Knowles – singer
  • Katy Perry – singer
  • Jennifer Lopez – actress
  • Stefani Germanotta (Lady Gaga) – singer
  • Rihanna Fenty – singer
  • Justin Bieber – singer

You will note that there are nine females but only two males in this list. Note, also, the number of singers in the list, indicating that being a singer will get you more celebrity than money.

For those at the bottom-right of the graph their celebrity standing is less than their monetary worth. (They are in the top 25 money earners but are not ranked in the top 25 on overall celebrity status.) This means that their publicity agent is not doing their job (or not being asked to!). These people are, from right to left on the graph:
  • Mark Burnett – television producer
  • Kenny Chesney – country music singer
  • Toby Keith – country music singer
  • Jerry Bruckheimer – film and television producer
  • James Patterson – author
  • George Lucas – film director and producer
  • Michael Bay – film director and producer
  • Howard Stern – radio personality

These people are all male, so these males have more money than celebrity. Most of these men do not work directly in the public spotlight, or they prefer country music to pop music.

One can perform a similar analysis to compare the celebrities' TV/Radio rank with their Press rank. This produces a very similar graph. It turns out that the people whose TV/Radio rank is poor compared to their Press rank are mostly athletes (David Beckham, Roger Federer, Lionel Messi, Li Na, Cristiano Ronaldo, Maria Sharapova), along with one model (Kate Moss) and one producer/director (Steven Spielberg). The thirteen people whose Press rank is poor compared to their TV/Radio rank are almost all TV/Radio "personalities", as expected.

I am sure that there is more to be found in this dataset, if anyone cares to look.

May 7, 2013

22:30

Many of you will have recently received an email (or two) announcing the impending inaugural issue of the Journal of Phylogenetics & Evolutionary Biology, "an open access, peer-reviewed journal which aims to provide the most rapid and reliable source of information on current developments in the field of phylogenetics and evolutionary biology."

The journal promotional material notes that: "The emphasis will be on publishing quality papers [that will] help establish its high standard and facilitate the journal to be indexed by prestigious ISI and PubMed". Sadly, the journal's flyer indicates that the journal is unlikely to achieve any of these aims, because the people in charge have very little idea of what phylogenetics is:


Only one of these images explicitly relates to a rooted evolutionary history (and it even has reticulations!), while the other images vary from irrelevant to downright wrong.

Publishing "quality papers" will get them nowhere, since we cannot tell whether they will be high quality or low quality, good quality or poor quality. I am sure they will have some sort of quality, because even a used car has that. Caveat emptor. Moreover, perpetuating the transformational view of evolution will not attract the favourable attention of either ISI or PubMed, although this particular viewpoint might be appropriate for the evolution of scientific publishing:


May 5, 2013

16:30

Manhattan has been described as one of the most real estate obsessed neighborhoods on earth (after Monaco); and another thing it is especially obsessed about is prestige. So, a comparison of the most prestigious apartment (Co-operative and Condominium) buildings is of especial interest, I guess.

In Manhattan, prestige seems to result from such things as the building's overall architecture, the scale and layout of the apartments, the notoriety of its current and past residents, the sheer cost of buying any of the apartments, and the requirement that a purchaser be able to stomach the exorbitant monthly maintenance fees.

However, these are not readily quantifiable attributes, except for the monetary ones, which change from year to year.


CityRealty (a New York City apartment search and resource site) has addressed this conundrum by evaluating the best-known apartment buildings based on a consistent set of non-monetary criteria: CityRealty's New York City Condos & Co-ops. They note:
We rate each building based on its architecture, location and features, using the same scoring methods and criteria for all buildings. The maximum number of points for Architecture is 44, Location 36, and Features 39. However, it is virtually impossible for any building to get the full amount of points for any category.

There are 18 criteria for Architecture (each scoring 1-8 points), 14 criteria for Location (1-5 points), and 22 criteria for Features (1-5 points). CityRealty list 3,085 apartment buildings, but only 1,943 of them have ratings.

I have concerned myself only with those buildings that have ratings ≥ 88 (the ratings vary from 30–99), which is 95 buildings. These buildings differ very little in the top-scoring criteria, which is to be expected if they are considered to be the top-rated ones. These criteria include: Distinction of exterior (8 points), Retail quality (5), Street ambience (5), Distance to business district (5), Distinction of lobby (5), and Number of units per floor (5).

However, these buildings do differ considerably on the other criteria, which include presence or absence of all sorts of "desirable" characteristics, such as: gargoyles, illumination, water element, recreational roof, garage, maid's room, elevator person, and external air conditioners. This makes a network analysis possible, which would summarize the similarities among the various buildings.

The analysis

I compiled a list of 57 of the buildings for analysis, including all of the top buildings as rated by CityRealty (37 buildings; scores 92–99), plus a selection of others (20 buildings; scores 88–91) that appear to be noteworthy as indicated by various internet lists (eg. based on architecture, prestige, history, cost of apartments). I then collated the data provided by CityRealty. (See the Postscript for a comment on the data.) There are 29 Co-operatives and 28 Condominiums in the analysis. However, two buildings have identical scores, because the Time Warner Center and the Mandarin Oriental are two towers in the same development — the apartments in the south tower have a One Central Park address (Time Warner Center) and those in the north tower are The Residences at the Mandarin Oriental.

As usual for my data analyses, I have compared the buildings based on what is (appropriately!) called the manhattan distance and then calculated a NeighborNet network. Buildings that are closely connected in the network are similar to each other based on the various criteria used by CityRealty, and those that are further apart are progressively more different from each other.
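For readers unfamiliar with it, the manhattan distance between two buildings is just the sum of the absolute differences in their scores across all criteria. The sketch below computes a full distance matrix of this kind; the building names and score vectors are placeholders invented for illustration, not CityRealty data, and the resulting matrix is the sort of input that a NeighborNet program would then take.

```python
def manhattan(scores1, scores2):
    """Sum of absolute differences across all criteria."""
    if len(scores1) != len(scores2):
        raise ValueError("both buildings need the same criteria")
    return sum(abs(a - b) for a, b in zip(scores1, scores2))

def distance_matrix(profiles):
    """Return sorted building names and their pairwise distances."""
    names = sorted(profiles)
    matrix = [
        [manhattan(profiles[p], profiles[q]) for q in names] for p in names
    ]
    return names, matrix

# Hypothetical scores on five criteria for three placeholder buildings.
profiles = {
    "building_1": [8, 5, 3, 0, 1],
    "building_2": [8, 4, 3, 1, 1],
    "building_3": [6, 5, 0, 1, 0],
}

names, matrix = distance_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, row)
```

Because each criterion contributes its raw point difference, criteria scored out of 8 can weigh more heavily in the distance than those scored out of 3 or 5.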


The network shows seven clusters, which I have color-coded. These clusters represent buildings that have many characteristics in common. Notably, these buildings are also clustered in space, as shown in the map below, which is color-coded to match the network. (Note that yellow is a bit hard to represent in the network.) In particular, the colors occur as follows:
  • light blue – around the fringe of the Upper East Side
  • red – Upper East Side next to Central Park
  • yellow – along the west side of Central Park
  • pink – Upper West Side and south-west Central Park
  • purple – south-east corner of Central Park + 2 Upper East Side + 1 Financial District
  • blue – mostly around the southern and eastern sides of Central Park + 2 Midtown East + 1 Financial District
  • green – Downtown + 2 west of Central Park


The strong geographical clustering of the different types of buildings within Manhattan is not unexpected, since many of the areas were developed at the same time and in a similar architectural style. (This geographical result is not because of the importance of Location in the CityRealty scores, since most of the buildings score very well on all of the Location criteria, except Traffic noise). Important differentiating Architecture criteria include: presence of a Plaza or Atrium, Water Element, Illumination, Non Rectilinear Form, Ceiling Height, and Balconies.

There are also perceived differences in the desirability of the various areas, which means that nearby buildings often provide similar Features (such as Recreational Roof, Elevator Person, Maids Room, Garage, or Catering). This is further related to the non-random distribution of the two apartment types: all green are condominiums; all light blue and yellow are co-operatives; all except one of the red are co-ops; all except two of the blues are condos; and only the pink and purple are mixtures of the two types. Co-operative buildings tend to provide a range of more expensive Features than do the condominiums (most of the co-ops are at the bottom-left of the network graph).

The purple group are all similar in ambience, in that they are buildings that include both a hotel and apartments (usually, the lower part is the hotel and the tower is a co-op or condo). The exception is the Cipriani building, which is part of a world-wide chain. The pink group were all built at a similar time (1902-1908, except the Dakota 1882) and in a similarly opulent style, and they are now designated as historic landmarks.

CityRealty also provides lists of the Top 10 Most Prestigious Co-ops and the Top 10 Most Prestigious Condos. These are indicated with numbers and letters, respectively, in the network diagram. These buildings are somewhat clustered in the network, but it is clear that "prestige" is not directly related to the criteria used by CityRealty in their ratings (if it was, then the buildings would be much more clustered in the network). Furthermore, buildings such as River House are not necessarily as prestigious as they once were (see here and here), and so their places in these lists might be contested.

It is also worth noting that not all of the most expensive buildings are necessarily in the list analyzed. For example, in 2012 very expensive apartments were also sold in the co-op buildings at 785 Fifth Avenue, 884 Fifth Avenue, and 1030 Fifth Avenue, which I did not include in the network analysis.

Finally, not all of the apartment buildings discussed here are necessarily lived in by their owners, particularly those in condominium buildings. For example, the New York Times has noted:
In a large swath of the East Side bounded by Fifth and Park Avenues and East 49th and 70th Streets, about 30 percent of the more than 5,000 apartments are routinely vacant more than 10 months a year because their owners or renters have permanent homes elsewhere, according to the Census Bureau’s latest American Community Survey.

This is particularly true of the most expensive condo apartments:
Pieds-à-terre exist throughout the New York City condo market, a separate little world of vacation homes and investment properties. But the higher the price, the higher the concentration is likely to be of owners who spend only a few months, a few weeks or even just a few days each year in their apartments. This very costly form of desolation means that some of the city’s most expensive residential buildings stand mostly dark, lonesome and empty on the inside.

Postscript

I should point out, in passing, that CityRealty have not been as consistent in their ratings as might be hoped. For example, they note that: "On occasion, we may add (or subtract) a few points based on our subjective view of the building, so if the numbers don't add up exactly as you expect, that's why." I have ignored these extra subjective points.

What I have not been able to ignore is some of the other inconsistencies. For example, "The Collection" building stands out like a sore thumb in the Sutton Place area (it is glass while its near neighbors are brick), as does "40 Bond Street" (it is bright green while its neighbors are brick or stone). Nevertheless, CityRealty have coded them: "Contextual Design: No, but Very Good", and scored both buildings 3 out of 3 on this criterion. Clearly, this confounds two criteria, Distinction of Exterior and Contextual Design, as CityRealty are allowing their claim that each building is an "Architectural Masterpiece" (and thus they score 8 out of 8 on Distinction of Exterior) to cloud their decision about whether each building also has "Contextual Design" (where CityRealty admit that each building should score 0 out of 3). Even more oddly, they also code "The Gainsborough" building exactly the same way ("Contextual Design: No, but Very Good") and yet, in the photo they show, this building seems to fit perfectly into its context. Indeed, "The Collection" might also fit in, if it was in a more modern location than Sutton Place, but there seems to be no such hope for "40 Bond Street".

May 1, 2013

09:00

One approach that I have taken in this blog to popularizing the use of networks in phylogenetic analysis has been to investigate published data using network techniques. However, this is often difficult because the data have not been publicly made available (eg. Phylogenetic position of turtles: a network view).

I am not the only person to find fault with the failure to release phylogenetic data, although there are recognized reasons why data sometimes cannot be released. Razib Khan at the Gene Expression blog recently had this to say (Why not release data for phylogenetic papers?):
Last month I noted that a paper on speculative inferences as to the phylogenetic origins of Australian Aborigines was hampered in its force of conclusions by the fact that the authors didn't release the data to the public (more accurately, peers). There are likely political reasons for this in regards to Australian Aborigine data sets, so I don’t begrudge them this (Well, at least too much. I’d probably accept the result more myself if I could test drive the data set, but I doubt they could control the fact that the data had to be private). This is why when a new paper on a novel phylogenetic inference comes out I immediately control-f to see if they released their data. In regards to genome-wide association studies on medical population panels I can somewhat understand the need for closed data (even though anonymization obviates much of this), but I don’t see this rationale as relevant at all for phylogenetic data (if concerned one can remove particular functional SNPs). Yesterday I noticed PLoS Genetics published a paper on the genomics of Middle Eastern populations ... The results were moderately interesting, but bravo to the authors for putting their new data set online. The reason is simple: reading the paper I wanted to see an explicit phylogenetic tree/graph to go along with their figures (e.g., with TreeMix). Now that I have their data I can do that.

In this particular case the data were made available on the homepage of one of the authors, which is better than nothing but is clearly less than ideal. There are a number of formal repositories for phylogenetic data, all of which should have greater longevity than any personal homepage, including:
  • TreeBASE
  • Dryad

The first of these databases has a long history of storing phylogenetic trees and their associated datasets. It has not yet lived up to its full potential, but people like Rod Page are pushing for it to do so eventually.

Dryad is a more general data repository (ie. not just for phylogenetic data), and its use is now encouraged by many of the leading journals — Systematic Biology, for example, makes its use mandatory, at least for data during the submission process, and also for "data files and/or other supplementary information related to the paper" for the published version.

Phylogeny databases are not without their skeptics, however. For example, Rod Page (Data matters but do data sets?) has noted:
How much re-use do data sets get? I suspect the answer is "not much". I think there are two clear use cases, repeatability of a study, and benchmarks. Repeatability is a worthy goal, but difficult to achieve given the complexity of many analyses and the constant problem of "bit rot" as software becomes harder to run the older it gets. Furthermore, despite the growing availability of cheap cloud computing, it simply may not be feasible to repeat some analyses. Methodological fields often rely on benchmarks to evaluate new methods, and this is an obvious case where a dataset may get reused ("I ran my new method on your dataset, and my method is the business — yours, not so much"). But I suspect the real issue here is granularity. Take DNA sequences, for example. New studies rarely reuse (or cite) previous data sets, such as a TreeBASE alignment or a GenBank Popset. Instead they cite individual sequences by accession number. I think in part this is because the rate of accumulation of new sequences is so great that any subsequent study would needs to add these new sequences to be taken seriously. Similarly, in taxonomic work the citable data unit is often a single museum specimen, rather than a data set made up of specimens.

However, all of this begs the question that seems to me to be central to science. Science is unique in being based primarily on evidence rather than expert opinion, and therefore the core of science must be direct access to the original evidence, rather than some statistical summary of it or someone's opinion about it. How can I evaluate evidence if I don't have access to it? How can I verify it, explore it, or re-analyze it? Being given the raw data (eg. the sequences) is one thing, but being given the data you actually analyzed and based your conclusions on (eg. the aligned sequences) is another thing entirely.

In short, if you won't openly give me your dataset then I don't see how you can call yourself a serious scientist.

April 28, 2013

16:30

In April last year I noted that there are not many images of phylogenetic networks on the internet, suitable for use when an icon or symbol is required, so I provided one (Network road sign). This year I thought that I might point out a far more arty image of "The Tree of Life" that ends up looking more like a network.


The original painting is by Reneé Womack, and prints are available from Fine Art America. You can also have versions with a green or a blue background, but I prefer this one, possibly because it reminds me of the cover of the "Tree Thinking" book by David Baum and Stacey Smith.


April 23, 2013

22:30

I have previously noted that splits graphs are a logical way to present the results of Bayesian analyses (We should present bayesian phylogenetic analyses using networks). Bayesian analyses are concerned with estimating a whole probability distribution, rather than producing a single estimate of the maximum probability. Thus, the result of a Bayesian phylogenetic analysis should not be presented as a single tree (the so-called MAP tree or maximum a posteriori probability tree), but should instead show the probability distribution of all of the sampled trees. This can easily be done with a consensus network, as illustrated by example in the previous blog post.
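To make the idea concrete, here is a minimal sketch of the bookkeeping that sits behind a consensus network: count how often each split (bipartition) appears among the sampled trees, and keep every split above some frequency threshold. The four taxa, the ten-tree sample and the function names are all hypothetical, and real programs such as SplitsTree do a great deal more than this (including drawing the graph).

```python
from collections import Counter

# Hypothetical set of four taxa; each sampled tree is reduced to its
# non-trivial splits, each written as one side of the bipartition.
TAXA = frozenset("ABCD")

def canonical(side, taxa=TAXA, reference="A"):
    """Return the side of the split that does not contain the reference taxon,
    so that {A,B} and {C,D} are recognised as the same split."""
    side = frozenset(side)
    return taxa - side if reference in side else side

def split_frequencies(trees):
    """Proportion of sampled trees that contain each split."""
    counts = Counter()
    for splits in trees:
        counts.update({canonical(s) for s in splits})
    return {split: n / len(trees) for split, n in counts.items()}

def consensus_splits(trees, threshold):
    """Splits whose frequency in the sample is at least the threshold."""
    return {s: f for s, f in split_frequencies(trees).items() if f >= threshold}

# Illustrative sample: 7 trees with AB|CD, 2 with AC|BD, 1 with AD|BC.
sample = [[{"A", "B"}]] * 7 + [[{"A", "C"}]] * 2 + [[{"A", "D"}]] * 1
freqs = split_frequencies(sample)
kept = consensus_splits(sample, 0.15)
```

With a threshold of 0.15 two incompatible splits are retained, which is exactly the situation where a single tree cannot show the result but a network can.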

An interesting alternative way of visualizing the probability distribution of trees is what has been called a Cloudogram, an idea introduced by Remco R. Bouckaert (2010, DensiTree: making sense of sets of phylogenetic trees. Bioinformatics 26: 1372-1373). This diagram superimposes the set of all trees arising from an analysis. Dark areas in such a diagram will be those parts where many of the trees agree on the topology, while lighter areas will indicate disagreement. This idea can be best illustrated by a few published examples.

The first cloudogram is from Figure 4 of Chaves JA, Smith TB (2011) Evolutionary patterns of diversification in the Andean hummingbird genus Adelomyia. Molecular Phylogenetics and Evolution 60: 207-218.

In this case the MAP tree has been superimposed on the cloudogram.

Species-tree with the highest posterior probability (PP > 80) superimposed upon a cloudogram of the entire posterior distribution of species-trees recovered in BEAST. Areas where the majority of trees agree in topology and branch length are shown as darker areas (well-supported clades), while areas with little agreement as webs.

The next one is from Figure 2 of Pabijan M, Crottini A, Reckwell D, Irisarri I, Hauswaldt JS, Vences M (2012) A multigene species tree for Western Mediterranean painted frogs (Discoglossus). Molecular Phylogenetics and Evolution 64: 690-696.

Posterior density of 2700 species trees ("cloudogram") representing the entire posterior distribution of species trees (270,000 trees post-burnin) from the BEAST analysis based on seven nuclear loci and 4 mitochondrial gene fragments. The species tree with the highest posterior probability is nested within the set; values indicate posterior probabilities associated with this consensus tree. Areas where many species trees agree on topology and/or branch lengths are densely colored.

The next one is from Figure 1 of Lerner HR, Meyer M, James HF, Hofreiter M, Fleischer RC (2011) Multilocus resolution of phylogeny and timescale in the extant adaptive radiation of Hawaiian honeycreepers. Current Biology 21: 1838-1844.

In this case the data are more tree-like than the previous two examples.

Cloudogram showing all trees resulting from a Bayesian analysis of whole mitogenomes (19,601 trees; 14,449 bps). Variation in timing of divergences is shown as variation (i.e., fuzziness) along the x axis. Darker branches represent a greater proportion of corresponding trees. All nodes have support values >0.99.

The final one is from Figure 2 of McCormack JE, Faircloth BC, Crawford NG, Gowaty PA, Brumfield RT, Glenn TC (2012) Ultraconserved elements are novel phylogenomic markers that resolve placental mammal phylogeny when combined with species-tree analysis. Genome Research 22: 746-754.

This analysis involves bootstraps rather than Bayesian samples, showing that the same principle applies.

Evolutionary history of placental mammals resolved from conflicting gene histories. Widespread consensus among 1000 species-tree bootstrap replicates of the same 183-locus data set. STEAC trees are depicted because the branch lengths allow for better visualization of branching patterns, but STAR results supported the same topology. Cones emanating from terminal tips of species trees (red arrows) indicate disagreement among bootstrap replicates.

It would be nice to illustrate this further by direct comparison with a splits graph of the same dataset that I used in the previous blog post. Unfortunately, the computer program available (DensiTree) has the same practical limitation as the SplitsTree program (as mentioned in the previous post): it does not read the MrBayes ".trprobs" file because it ignores the tree weights. This means that one has to enter the entire treefile (with thousands of trees), and I have not yet done that. Moreover, the program relies very much on having branch lengths for each tree; the output is really quite odd without them, with the taxa appearing in a series of steps rather than connected by straight branches. My previous analysis did not use branch lengths, as they are not needed for the consensus network, in which edge lengths represent support rather than character evolution.

April 21, 2013

16:30

As usual at the beginning of the week, this blog presents something in a lighter vein. However, this week we depart from phylogenetic networks in the strict sense, and delve into the broader statistical life of biologists.

Statistics is a curious thing, which allows scientists to make probability errors of two types: Type I (also known as false positives) and Type II (also known as false negatives). Importantly, these errors can accumulate in any one experiment, so that we can also recognize an Experimentwise Error Rate, which is approximately the sum of the individual error rates associated with each experimental hypothesis test.
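For those who like to see the arithmetic, here is a small sketch of how Type I errors accumulate. It assumes that the tests are independent and that each uses the same significance level, neither of which is stated above; the function names are mine. Under those assumptions the chance of at least one false positive in k tests is 1 - (1 - alpha)^k, and the simple sum k × alpha is an upper bound on it.

```python
def experimentwise_error_rate(alpha, k):
    """Probability of at least one Type I error in k independent tests,
    each conducted at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** k

def summed_error_bound(alpha, k):
    """Simple sum of the individual error rates, capped at 1.
    This is an upper bound on the experimentwise rate."""
    return min(1.0, alpha * k)

# Ten tests at the conventional 5% level (illustrative numbers only).
exact = experimentwise_error_rate(0.05, 10)
bound = summed_error_bound(0.05, 10)
```

The two values are close when there are few tests, and they drift apart as the number of tests grows.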

However, what is not widely recognized is that these errors apply in life, as well. In particular, biologists accumulate statistical errors throughout their lives, so that we all have a Personal Lifetime Error Rate.

I once wrote a tongue-in-cheek article about the accumulation of Type I errors throughout the working life of a biological scientist, and the consequences for the experiments conducted by that scientist. This article appeared in 1991 in the Bulletin of the Ecological Society of Australia 21(3): 49–53, which means that I used an ecologist as my specific example of a biologist. The principle applies to all biologists, however.

Since this issue of the Bulletin is not online, presumably no-one has read this article since 1991, although it has recently been referenced on the web (see the sixth comment on this blog post).** You, too, should read it, and so I have linked to a PDF copy [1.7 MB] of the paper:
Personal Type I error rates in the ecological sciences


** Note that I am alternately referred to as an "inveterate mischief maker" and "a very wise man"!

April 16, 2013

22:30

I have noted before today that many people seem to treat non-biological phylogenetic attributes as being analogous to genotypes whereas most such data are much more similar to phenotypes (eg. False analogies between anthropology and biology; The Music Genome Project is no such thing). This inappropriate analogy can lead to problems, such as incorrect conclusions regarding familial relationships.

In a similar vein, another problem is the appropriation of the word "phylogeny" to refer to non-evolutionary types of tree. A web search for phylogeny will lead you to many sites where the tree structure being referenced is very unlike an evolutionary history.

Systematists have long dealt with this issue as manifest in the confusion between classification and phylogeny. Biological classification is usually treated as most informative (eg. explanatory, predictive) when based on a phylogeny, but a phylogeny is not automatically a classification, and a classification is not automatically a phylogeny.

The best known example is the NCBI Taxonomy, as used by the GenBank database. This is one of the most commonly used classification schemes today, but in bioinformatics it is frequently used as a phylogeny as well as a classification. This is in spite of the fact that NCBI offers the following disclaimer:
The NCBI taxonomy database is not a primary source for taxonomic or phylogenetic information. Furthermore, the database does not follow a single taxonomic treatise but rather attempts to incorporate phylogenetic and taxonomic knowledge from a variety of sources, including the published literature, web-based databases, and the advice of sequence submitters and outside taxonomy experts. Consequently, the NCBI taxonomy database is not a phylogenetic or taxonomic authority and should not be cited as such.

The issue here is that the classification is hierarchical and can therefore be expressed as a tree, and the same can be said of the nested relationships in a phylogeny. However, not all trees are phylogenies, and the NCBI Taxonomy is a classification that is not necessarily phylogenetic.

More recently, the word phylogeny has been adopted by the computational world to refer to many hierarchical clustering patterns. For example, consider this definition from FreeBase:
The phylogeny pattern is a major pattern within ontology / schema modelling, and is prevalent in many schemas in Freebase. Commonly related are the parent-child pattern and the containment pattern.

In other words, parent-child patterns are phylogenetic, which is literally true as far as it goes, but a two-level hierarchy fits this pattern without being anything more than a trivial phylogeny in the biological sense. An example is the Wikipedia music entries (eg. Rock music), which have a genre and several subgenres, along with fusion genres; this produces a shallow but broad "tree". Indeed, FreeBase has this to say about their own attempt to implement this idea:

One issue is that the some of the data in the music genre hierarchy in Freebase seems to attempt to show a genealogy of genres, rather than family groupings, which is counter to the way that parent and child Media genres are defined.

This seems to be a rather confused set of analogies involving families and genealogies. The false analogy between a tree and a phylogeny seems to have created this confusion. A genealogy expresses family groups (as does a phylogeny), but not all of those potential groups need be expressed in a classification.

It seems to me that it would be simpler for the computational world to refer to a hierarchy rather than a phylogeny.

April 14, 2013

16:30

Every decade or so a record company releases a compilation album from the best-selling musical duo of Paul Frederic Simon and Arthur Ira Garfunkel. There are currently five such albums that have been given a worldwide release:
  • Simon and Garfunkel's Greatest Hits (1972)
  • The Simon and Garfunkel Collection (1981)
  • The Concert in Central Park (1982)
  • The Definitive Simon & Garfunkel (1992)
  • The Essential Simon & Garfunkel (2003)


This is not bad considering that the duo released only 5 original albums in the first place (Wednesday Morning, 3 A.M.; Sounds of Silence; Parsley, Sage, Rosemary and Thyme; Bookends; Bridge Over Troubled Water), plus one album shared with Dave Grusin (The Graduate). It also means that we are overdue for another compilation.

Each of these compilation albums has been released in a number of different countries, where they have had a greater or lesser success in terms of sales. Some of the relevant information about the resulting chart positions for these 5 albums is available from Wikipedia. This means that we could examine how the different countries compare in their enthusiasm for Simon and Garfunkel's songs.

Unfortunately, not all of the albums were released in all of the countries for which information has been compiled. For example, the USA has data for only 3 of the 5 albums, and Finland and Norway each has data for only 4 of them. Nevertheless, there is complete information for 8 countries, for which I can perform an exploratory data analysis using a network.

As usual, I have used the Manhattan distance and a NeighborNet network to produce the graph. The proximity of the countries in the network shows how similarly the records sold: countries near each other had similar sales of the records, while distant countries had different sales patterns.
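As a rough sketch of the first step, the Manhattan distance between two countries is just the sum of the absolute differences in chart position across the five albums. The country names and chart positions below are placeholders for illustration, not the Wikipedia data, and the NeighborNet step itself (done in SplitsTree) is not reproduced here.

```python
from itertools import combinations

def manhattan(a, b):
    """Sum of absolute differences between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))

def distance_matrix(profiles):
    """Pairwise Manhattan distances, keyed by (name1, name2)."""
    return {
        (p, q): manhattan(profiles[p], profiles[q])
        for p, q in combinations(sorted(profiles), 2)
    }

# Hypothetical chart positions for five albums in three made-up countries.
charts = {
    "Country X": (3, 49, 5, 4, 4),
    "Country Y": (2, 10, 1, 30, 20),
    "Country Z": (3, 12, 2, 35, 64),
}
distances = distance_matrix(charts)
```

A single album with very different chart positions can dominate the distance, since each album contributes its raw difference in rank to the total.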


Only 1 of the 5 albums sold well in all 8 countries: The Concert in Central Park (the chart positions were 1, 1, 1, 2, 3, 5, 5, 6).

Sweden holds a unique position in the network because the populace did not buy The Simon and Garfunkel Collection (chart position 49) but did buy all of the other albums (positions 3-5).

The Netherlands and Japan are linked together in the network because they did not support The Essential Simon & Garfunkel (positions 64 & 104, respectively). Indeed, the Dutch did not like The Definitive Simon & Garfunkel (35), either, while the Japanese seemed to like only The Concert in Central Park (2) and Simon and Garfunkel's Greatest Hits (3).

Australia, Germany and the United Kingdom are linked by sharing their ranking of The Essential Simon & Garfunkel (20-25). France and New Zealand are linked by sharing their ranking of both Simon and Garfunkel's Greatest Hits (16, 22) and The Essential Simon & Garfunkel (33, 38).

Note that it is therefore the sales of The Essential Simon & Garfunkel that have the largest effect on the network pattern:
    4 Sweden
20-25 Australia, Germany, UK
33-38 France, New Zealand
   64 Netherlands
  104 Japan

So, it turns out that the popularity of Simon and Garfunkel does, indeed, have a geographical pattern, although probably not an expected one.

April 9, 2013

22:30

A splits graph is interpreted in terms of splits, or bipartitions, which divide the graph into two non-overlapping parts. If one wishes to refer to particular splits in a graph then one needs a way of highlighting those splits.

This can be done in a number of ways, some of them derived from conventions originating for the presentation of rooted phylogenetic trees. These include highlighting the taxa in one of the partitions, which is analogous to highlighting a clade in a rooted phylogenetic tree. Alternatively, we could colour the edges associated with each of the two partitions, as shown in this previous blog post (How to interpret splits graphs); however, this works only for a single split at a time.

Alternatively, it is also possible to label the edges of the splits themselves, as shown in this previous blog post (Representing evolutionary scenarios using splits graphs). Dabert et al. (Dabert M, Witalinski W, Kazmierski A, Olszanowski Z, Dabert J (2010) Molecular phylogeny of acariform mites (Acari, Arachnida): strong conflict between phylogenetic signal and long-branch attraction artifacts. Molecular Phylogenetics and Evolution 56: 222-241) present another possibility, which is to colour only the edges that separate the two partitions of each split, as shown in the figure.


This works very well visually. However, there is still the matter of actually labelling the coloured edges. Unfortunately, Dabert et al. chose to do this using terminology that is more appropriate for a rooted phylogenetic tree than an unrooted data-display network. That is, they refer to "clades", which can be recognized only in a rooted graph. Their diagram is clearly labelled with a root taxon, even though the graph itself is unrooted. The implication here is that interpreting the unrooted graph as a rooted network is straightforward, but it is not. It would be better to use the standard terminology, which refers to "splits" or "partitions", rather than to "clades".

April 7, 2013

16:30

In 2009 and 2010 a group named the Florida Citizens for Science ran what they called a Stick Science Contest, in which the participants contributed stick cartoons that could "be used to educate the general public and especially decision makers about the truth behind one false science argument." In practice, many (but not all) of the contributions concerned mis-interpretations about evolution.

Only the top ten ranked entries for each year were published online. From these finalists, the top three entries received prizes. Here are links to the finalists for 2009 and for 2010.

While the prize-winning entries are good, some of the others are more interesting from the phylogenetic perspective. My favourite one is no. 6 from 2010, by Matthew Bonnan:


I also like Entry F from 2009, by Jan Stephan Lundquist:


This theme was repeated in entry no. 5 from 2010, by Glen Wolfram:


April 2, 2013

22:30

Splits graphs are basically data-display networks, since their intended purpose is to graphically display the patterns of variation in a dataset. These patterns may relate to evolutionary history, or they may not.

A couple of weeks ago I discussed a paper by Myles et al. concerning the genetics of grape cultivars, and this paper provides an interesting example where the patterns of genetic variation seem to be strongly phylogenetic in nature (Myles S, Boyko AR, Owens CL, Brown PJ, Grassi F, Aradhya MK, Prins B, Reynolds A, Chia JM, Ware D, Bustamante CD, Buckler ES. 2011. Genetic structure and domestication history of the grape. Proceedings of the National Academy of Sciences of the USA 108: 3530-3535).

Myles et al. note that: "Archaeological evidence suggests that grape domestication took place in the South Caucasus between the Caspian and Black Seas and that cultivated vinifera then spread south to the western side of the Fertile Crescent, the Jordan Valley, and Egypt by 5,000 y ago." They provide an explicit historical scenario of the evolutionary history of cultivated grapes (Vitis vinifera):
  1. There are two species involved (V.sylvestris, V.vinifera), both distributed along the eastern and northern part of the Mediterranean basin;
  2. V.vinifera was domesticated from V.sylvestris in the eastern part of the distribution;
  3. V.vinifera then spread geographically from east to west;
  4. This spread was followed by introgression of V.sylvestris into V.vinifera in the western part of their joint distribution.
Myles et al. generated genotype data from a custom microarray, which assayed 5,387 SNPs genotyped in 570 V.vinifera samples and 59 V.sylvestris accessions from the US Department of Agriculture (USDA) germplasm collection. Average population-pairwise Fst estimates were then calculated from all 5,387 SNPs weighted by allele frequency, based on species and geographical region.
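For readers unfamiliar with Fst, the following is a minimal sketch of one textbook way to get a population-pairwise value from SNP allele frequencies. It is not the estimator used by Myles et al., whose exact weighting I have not reproduced; it assumes biallelic SNPs, two populations of equal weight, and no sample-size correction, and the function names are hypothetical. Per-SNP heterozygosities are summed before taking the ratio, so that more variable SNPs contribute more to the result.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)

def pairwise_fst(freqs_a, freqs_b):
    """Fst between two populations from per-SNP allele frequencies,
    computed as sum(H_T - H_S) / sum(H_T) across SNPs."""
    numerator = 0.0
    denominator = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        # Mean within-population heterozygosity.
        h_s = (heterozygosity(pa) + heterozygosity(pb)) / 2.0
        # Heterozygosity of the pooled population.
        h_t = heterozygosity((pa + pb) / 2.0)
        numerator += h_t - h_s
        denominator += h_t
    # No variation at any SNP: define Fst as zero.
    return numerator / denominator if denominator > 0 else 0.0

# Illustrative frequencies only: identical populations, then fixed differences.
same = pairwise_fst([0.3, 0.6], [0.3, 0.6])
fixed = pairwise_fst([0.0], [1.0])
```

A matrix of such pairwise values, one for each pair of species-by-region groups, is the kind of input that a distance-based method like NeighborNet requires.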

I constructed a NeighborNet splits graph from these Fst data, as shown in the graph. According to Myles et al., the geographic regions are defined as follows: "east" includes locations east of Istanbul, Turkey; "west" includes locations west of Slovenia, including Austria; and "central" refers to locations between them.

Each of the splits (bipartitions) in the graph represents one of the four steps in the hypothesized scenario, as labelled in the figure. Thus, there is apparently phylogenetic signal remaining from all of these proposed historical events that can be detected in the genetic distances. As the authors note: "Our analyses of relatedness between vinifera and sylvestris populations are consistent with the archaeological data".

Note, however, that one cannot infer the scenario from the splits graph, because the data analysis is not intended for direct evolutionary inference. The graph is undirected, and there are therefore several possible scenarios that could be derived from the graph. For example, the graph shown is also compatible with the domestication of V.vinifera from V.sylvestris in the western part of the distribution.

Thus, a splits graph can be used to suggest scenarios (ie. hypothesis generation) and it can be used to test scenarios (hypothesis testing), but the latter is a weak test because there will always be several phylogenetic scenarios with which it is compatible.

March 31, 2013

16:30

Empedocles (c. 490–430 BCE) and Lucretius (c. 99-55 BCE) have been credited with first articulating the theory of "survival of the fittest" (Sedgley 2003). However, this is of interest only to Darwinian scholars, who focus solely on trees. What is of more interest to scholars of phylogenetic networks is that these same two philosophers have also been credited with first suggesting the doctrine of horizontal gene transfer (Wilkins 2009). Gene transfer is, of course, an important source of reticulate evolution.

Empedocles was a Greek philosopher, a citizen of what is now Agrigento, in Sicily. He is perhaps most famous for first outlining the elemental theory of the physical world (ie. Air, Earth, Fire, Water). Moreover, he identified two fundamental forces, which he called love and strife. Love is the force that brings objects together, while Strife is the force that drives them apart. Empedocles postulated that the universe was once condensed into a tight sphere by the force of love, and strife later exploded this into an expanding mass. This has been seen as a forerunner of modern ideas about the Big Bang and the subsequent expanding universe.

More importantly for our purposes, Empedocles had a physical theory about the random development of living forms. According to this theory, Life first emerged as a collection of disassociated body parts, which wandered about on their own, without the intervention of divine power. These were not parts severed from previously complex beings, but each functioned in its own right as an independent "single-limbed" being. Complex creatures were then created by the accidental combination of these disparate limbs and organs. If the correct parts combined, then the creature would survive and go on to found a species, but if the wrong combination occurred then the creature would perish — only those with the most suitable combinations survived, by a process that we now call natural selection.

Empedocles' hypothesized hybrid creatures were literally mocked by later Greek philosophers, notably Aristoteles (384-322 BCE) and Epicurus (341-270 BCE), and their followers. They derided these monsters as "roll-walking creatures with hands not properly articulated or distinguishable" and as "ox-headed man-creatures". It was Lucretius who resurrected Empedocles' idea, in his only known work (De Rerum Natura), which was about the beliefs of Epicureanism — Lucretius was the first writer to introduce Roman readers to Epicurean philosophy.

Titus Lucretius Carus was a Roman poet and philosopher, apparently resident in Rome itself. He is perhaps most famous for his atomistic view of the physical world (everything is built up from collections of indivisible particles). More importantly for our purposes, Lucretius expounded a similar theory to that of Empedocles, namely that originally a set of randomly composed monsters sprang up, of which only the fittest survived. However, whereas Empedocles described isolated limbs as the starting point, Lucretius described whole organisms with defective combinations of body parts (what we would now call congenital defects), so that his maladapted creatures were formed at the atomic level rather than at the macroscopic level of whole limbs. Also, in Lucretius' theory there was apparently no inter-species mingling of limbs, as there was in Empedocles' version.

These two related theories of zoogony appear to have lain dormant for a couple of thousand years, crushed under the iron fist of Aristotelianism. Even into the 1900s, biology could be best described as being essentially an extension of Aristoteles' philosophical ideas (Mayr 1982). Nevertheless, slowly the idea of natural selection was re-introduced to biology, notably with the work of Étienne Geoffroy Saint-Hilaire (1772-1844), and culminating in the work of Alfred Russel Wallace (1823-1913) and Charles Robert Darwin (1809-1882).

However, even after the introduction of this evolutionary idea, the focus was on the inheritance of morphological modifications, not on the admixture of parts inherited from different organisms; and so only half of Empedocles' ideas were accepted.

It took until the dawn of the 20th century for the Russian lichenologist Constantin Sergeevich Mereschkowsky (1855-1921) to first outline a cellular version of Empedocles' vision. It had recently been shown that lichens involve a symbiotic relationship between fungi and algae, very much along the lines first envisioned more than 2,200 years before. Mereschkowsky extended this idea to the sub-cellular level, with the explicit goal of explaining the evolutionary development of land plants from algae-like forms of life, postulating that chloroplasts originated as symbiotic blue-green algae. The German histologist Richard Altman (1852-1900) had already hinted that what we now call mitochondria (he called them bioblasts) are bacterial symbionts. It was some time later that the American anatomist Ivan Emanuel Wallin (1883-1969) published Symbionticism and the Origin of Species, in which he explicitly suggested that symbiotic bacteria have played a fundamental role in the evolution of species.

This development culminated in the suggestion that genes themselves can be transferred between distant organisms, thus bringing thought down to the atomistic level envisioned by Lucretius. This revealed the hybrid nature of many genomes, even in situations where phenotypic admixture is not manifest. The first description of horizontal gene transfer is usually credited to Victor J. Freeman (in 1951), who demonstrated that the transfer of a viral gene into a bacterium could create a virulent strain from a non-virulent strain. Since then, lateral gene transfer has been widely reported as an important component of prokaryote evolution; and it has increasingly been reported in eukaryotes as well.

We have thus come full circle. Empedocles first introduced the theory of "survival of the fittest", which took nearly 2,300 years to be re-discovered by science, as well as outlining the basic concept of "horizontal gene transfer", which took an extra century for its renaissance.

All of the information presented here is factually correct. However, only on All Fools' Day can such a history actually be told with a straight face.

References

Mayr E. (1982) The Growth of Biological Thought: Diversity, Evolution and Inheritance. Belknap Press, Cambridge MA.

Sedgley D. (2003) Lucretius and the new Empedocles. Leeds International Classical Studies 2.4.

Wilkins J.S. (2009) New work on lateral transfer shows that Darwin was wrong. ScienceBlogs Evolving Thoughts March 31 2009.

March 26, 2013

23:30

The Music Genome Project is a database in which 1 million pieces of music (currently) have been coded for 450 distinct musical characteristics. The main use of the database at the moment is to provide the data from which predictions can be made about which other pieces of music might appeal to listeners of any nominated musical set; this is implemented in the Pandora Radio product.

The use of the word "genome" is an analogy, in which the set of musical characteristics is seen as creating a sort of genetic fingerprint for a song. According to one of the originators, Nolan Gasser:
The basic idea ... was to see if we could approach music from almost a scientific perspective; that's why it's called the Music Genome Project, named not accidentally after the Human Genome Project.
     I've always taken that metaphor very seriously:  biologists have come to understand the human species by identifying all the individual genes in our genome; it's then how each individual gene is manifest or expressed that makes us who we are as individuals — as well as defines how we're related to others: most closely to those in our family, but also indirectly to people who share our same physical attributes or capabilities in sports, and so forth.
     That orientation was paramount to my thinking in designing the Music Genome Project.

There seems to be a major misunderstanding here, since the mere idea of atomizing something does not make the atoms genes. After all, the idea behind the Project is simply one of taking music apart and evaluating it by its acoustic elements.

The first problem is that the study of musical attributes is clearly a study of phenotype not genotype, as Gasser alludes in the quote above; there are no hereditary units in music. Unfortunately, phenotype and genotype are frequently confused in the social sciences, with serious consequences when the wrong analogy is used (see the blog post False analogies between anthropology and biology). As noted by LessWrong user jmmcd:
I think the Music Genome project is misleadingly-named. A genome is generative: there is a mapping from a genome to an organism. There is no reverse mapping. In the case of music, there is a reverse mapping from a piece of music to these 400 odd features, but there's no forward mapping ... Knowledge of a phenotype is not constructive, because there are many ways of constructing that phenotype; a genotype is unique, and is thus constructive.

Equally importantly for the Music Genome Project, the musical attributes themselves cannot easily be related to genes as a metaphor; they are simply observed features of the music. The attributes cover musical ideas such as genre, type of instruments, type of vocals, tempo, etc. Most of these attributes are objective and observable (e.g. vocal duets, acoustic guitar solo, percussion, triple meter style, etc), although there are some that are more nuanced (e.g. driving shuffle feel, wildly complex rhythm, epic buildup / breakdown, etc) and thus involve expert subjective judgment. The attributes are coded on a 10-point scale for the "amount" of each attribute.

Given the quantitative nature of the attributes, the only possible analogy with genetics is that of gene expression, not the genome itself (as Gasser also alludes in the quote above). This is a very different metaphor, at least to a biologist. The power of a metaphor is that if it is a good one then it can give you insights that you might not otherwise have; the danger is that a false metaphor will probably lead you up the garden path. In this case, the genome analogy does seem to lead people astray, because they think that Pandora is picking "related" music in a genealogical sense (a "family resemblance") when it is doing no such thing. After all, trying to construct a phylogeny from gene expression data is not something that biologists have attempted.

Thus, if the Music Genome Project did live up to its name then it would be a very valuable thing for musical anthropologists, because then it would be possible to reconstruct a phylogeny of music. Indeed, such a thing has been proposed for popular music: The Music Phylogeny Project. Furthermore, such phylogenies have already been constructed: A Phylogenetic Tree of Musical Style. In the latter case, the author notes: "Needless to say, the tree is not automatically produced by the raw data itself, but by my own interpretation of the data", which gives you some idea of the technical problems involved.

Finally, I will note that what I have said above applies to the other projects based on a supposed analogy with the Human Genome Project. These include the Book Genome Project and the Game Genome Project. Indeed, the blurb for the Book Genome Project makes it sound even more wildly inappropriate:
The genomic analogy is imperfect but useful nevertheless: we defined the three elements of Language, Story, and Character as the literary equivalent of DNA and RNA classifications. Each gene category contains its own subset of measurements specific to its branch of the book genome structure ... Each individual book produces 32,162 genomic measurements.

March 24, 2013

17:30

Trying to quantify the characteristics of a neighborhood is a tricky business. Part of the problem is trying to define the nebulous idea of "livability" with respect to a geographical area, and part is due to the impracticality of collecting most of the data that might allow us to quantify the various aspects of life in that area.

Nevertheless, in New York magazine Nate Silver had a go at this in 2010, by trying to identify The Most Livable Neighborhoods in New York. He tried this because:
there is a wealth of information to study. The Bloomberg administration gathers reams of data about almost every element of life in the city — from potholes to infant-mortality rates — as do New York University's Furman Center and the U.S. Census Bureau. Sites like Yelp provide a reasonably objective perspective on the popularity of neighborhood bars and restaurants. StreetEasy.com and Zillow.com publish the costs of apartment space per square foot. Ethnic diversity is now broken down in much finer gradients than black and white ... Our goal was to take advantage of this wealth of data and apply a little bit of science to the question. If there was anything that could plausibly affect one's quality of life in a particular neighborhood, we tried to incorporate it.

New York thus provides a unique opportunity to try quantifying the nebulous, and I think that it is worth looking at these data in more detail.

The data

The data were compiled into twelve broad categories, representing different characteristics about the various New York neighborhoods:
 Affordability / Housing Cost (as measured on a price-per-square-foot basis, for both renters and buyers), Housing Quality (historic districts, code violations, cockroaches), Transit and Proximity (commute times to lower Manhattan and midtown, the density of subway coverage), Safety (as measured by violent- and nonviolent-crime rates), Public Schools (test scores and parent satisfaction), Shopping & Services (the number of neighborhood amenities, especially supermarkets), Food & Restaurants (judged by density and quality of options), Nightlife (ditto), Creative Capital (arts venues as well as the number of residents engaged in the arts), Diversity (in terms of both race and income), Green Space (park and waterfront access, street trees), and Health & Environment (noise, air quality, overall cleanliness).

The data were gathered from the stated sources, and are presented in the original magazine article for 50 of the 60 neighborhoods that were assessed. The data for all of the characteristics were then summed for each neighborhood, based on a particular weighting scheme for the 12 categories. This provided "a quantitative index of the 50 most satisfying places to live."

The sum total of the scores is not actually very different among the neighborhoods (score 73–78 / 100), and therefore the choice between them on that basis is (as the author admits) "splitting hairs". More particularly, neighborhoods with very different characteristics can end up with the same total score — they simply get that total by combining the category scores in very different ways (ie. the neighborhoods have different strengths and weaknesses).

So, this is a rather limited approach to assessing the data. Surely we can get more out of the data than this? What would be more useful is a picture showing which neighborhoods are similar to each other based on the way the scores are distributed across the different categories. This will tell us which neighborhoods have the same characteristics, and which are different from each other. This avoids splitting hairs, because it uses all of the data simultaneously, rather than summarizing the data down to a single number for each neighborhood.

The analysis

A phylogenetic network is ideal for doing this sort of thing, as I have emphasized many times in this blog, and so I have constructed one. As my analysis of choice, I have used the Manhattan distance (appropriately enough!) combined with a NeighborNet network. Neighborhoods that are closely connected in the network are similar to each other based on the various characteristics, and those that are further apart are progressively more different from each other.
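The distance step can be illustrated with a minimal sketch. The neighborhood names and category scores below are invented for illustration (they are not the magazine's data), and the helper names are hypothetical. The sketch only shows how the Manhattan distance separates neighborhoods that share the same total score; building the NeighborNet network from the resulting matrix would be done in separate network software.

```python
from itertools import combinations

# Hypothetical category scores for three made-up neighborhoods.
# These are NOT the values from the magazine article.
scores = {
    "A": [9, 2, 7, 6],
    "B": [2, 9, 6, 7],
    "C": [8, 3, 7, 6],
}


def manhattan(u, v):
    """Manhattan distance: the sum of absolute differences across categories."""
    if len(u) != len(v):
        raise ValueError("score vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(table):
    """Return the sorted names and a symmetric dict of pairwise distances."""
    names = sorted(table)
    dist = {(n, n): 0 for n in names}
    for p, q in combinations(names, 2):
        d = manhattan(table[p], table[q])
        dist[(p, q)] = dist[(q, p)] = d
    return names, dist


names, dist = distance_matrix(scores)
totals = {name: sum(row) for name, row in scores.items()}

for name in names:
    print(name, "total =", totals[name])
for p, q in combinations(names, 2):
    print(p, q, "distance =", dist[(p, q)])
```

In this toy example all three neighborhoods have the same total, yet A and C are nearly identical while B gets its total from quite different categories. This is the information that a single summed score throws away.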

I have color-coded the neighborhoods based on their borough, using roughly the same colors as in the map shown above.

I have also placed an asterisk next to the top five neighborhoods based on their total scores. Two of these neighborhoods are near each other in the graph, with two a bit further away, and one is quite distant from the others. This indicates that, even though they have very similar total scores, these neighborhoods are actually quite different.
In general, the network shows a trend from Manhattan (at the right-hand end of the graph) to Queens and the Bronx (at the left-hand end), via Brooklyn (stretching through the middle). This seems to neatly summarize the overall impression of the relationships among the areas of New York, at least as it is usually presented to outsiders. So, I think that the network analysis has been a successful one, in the sense that it provides a useful picture of the relationships between the neighborhoods.

Going deeper, many of the detailed patterns in the network graph are fairly obvious. For example, (at the right-hand end of the graph) the association of the southern Manhattan neighborhoods of Soho, Central Greenwich Village, Tribeca, Battery Park City, and the Financial District should surprise no-one. Similarly, (at the top of the graph) the linking of Manhattan's Inwood with the nearby Bronx neighborhoods of Belmont, Bedford Park and Riverdale is not unexpected. Furthermore, (at the left-hand end of the graph) the connection of Astoria, Woodside, Jackson Heights, and Flushing (in north-western Queens) with Cobble Hill, Boerum Hill, and Bay Ridge (in western Brooklyn) is hardly surprising, even though the two borough areas are geographically separated.

Other patterns are less obvious, and thus more intriguing, such as the apparent similarity of Chinatown (southern Manhattan), Central Harlem (northern Manhattan), Co-op City (the Bronx), and West Brighton and New Dorp (both Staten Island) (at the bottom-left of the graph). This bears looking into, should you be looking for somewhere to live in New York. Perhaps the oddest juxtaposition is that of Chelsea (midtown Manhattan) with Corona Park (Queens) and Washington Heights (northern Manhattan).

Another possible use of the graph is that it makes suggestions for areas that might be suitable as alternatives to any neighborhood that is out of reach on the Affordability / Housing Cost criterion. That is, we might consider areas that are similar based on the other criteria and yet differ in Affordability. For example, Park Slope (northern Brooklyn) differs dramatically in Affordability from the Nolita & Little Italy neighborhood (lower Manhattan), and yet the only other characteristic they differ greatly on is Shopping & Services. Williamsburg, Greenpoint, and Carroll Gardens & Gowanus are indicated in the network as other neighborhoods worth considering.
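The same idea can be expressed directly on the score table, as a rough sketch. Everything here is an assumption made for illustration: the neighborhood names, the category list, the scores, the function name, and the convention that a higher Affordability score means a cheaper neighborhood. The sketch ranks more affordable neighborhoods by their Manhattan distance to a target, with the Affordability category left out of the distance.

```python
# Hypothetical data: category order and scores are illustrative only.
CATEGORIES = ["Affordability", "Transit", "Safety", "Nightlife"]
scores = {
    "Pricey": [2, 9, 7, 8],
    "Similar": [7, 8, 7, 8],
    "Cheap": [9, 3, 5, 2],
}


def alternatives(target, table, categories, skip="Affordability"):
    """Rank neighborhoods that score higher than the target on `skip`
    by their Manhattan distance over all of the other categories."""
    k = categories.index(skip)
    base = table[target]
    ranked = []
    for name, row in table.items():
        # Assumption: a higher score on `skip` means more affordable.
        if name == target or row[k] <= base[k]:
            continue
        d = sum(abs(a - b)
                for i, (a, b) in enumerate(zip(base, row)) if i != k)
        ranked.append((d, name))
    return [name for d, name in sorted(ranked)]


print(alternatives("Pricey", scores, CATEGORIES))
```

Here "Similar" comes first, because it matches "Pricey" closely on everything except cost, while "Cheap" is affordable but differs on most other categories.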

It seems unlikely, however, that anyone looking for a substitute for the Upper East Side of Manhattan (one of the most expensive neighborhoods in the USA) is going to look at Sheepshead Bay, as suggested by the network; the two neighborhoods differ dramatically in Transit and Proximity, since Sheepshead Bay is way down on the Atlantic coastline. Nor are those looking for a replacement for the Upper West Side going to consider Brooklyn's Prospect Heights; these two differ more than somewhat in Housing Quality, for example. So, good though they are, the suggestions made by the network graph are not perfect!

Postscript

There is one other ranking scheme that I know of, at the StreetAdvisor Best Neighborhoods web page [on that page, click on Neighborhoods]. It is described as follows:
Our rankings begin with reviews written by locals. Each review contains certain scoring elements that tell us how good, or how bad a place is. We then combine all the scores and apply a 'fairness' factor that takes into account things such as volume of reviews, age of reviews and the type of person writing a review. We then apply a rank so we can compare and sort locations.

You will find many of the rankings odd, to say the least. For example, it seems doubtful that Country Club (the Bronx) is the "3rd best neighborhood in New York City" (after Carnegie Hill and Gramercy Park).

Not the least of the oddities is that the Upper East Side (7.6) scores much less than neighboring Carnegie Hill (9.4) and Lenox Hill (8.1). Indeed, it scores worse than parts of Brooklyn (Carroll Gardens, Clinton Hill, Brooklyn Heights, Park Slope, Bay Ridge), Queens (Glendale, Richmond Hill, Forest Park), the Bronx (Country Club, Schuylerville) and Staten Island (Huguenot).

You can access the individual neighborhoods within the boroughs at these web pages:
Manhattan
Brooklyn
Queens
Bronx

March 19, 2013

23:30

I have noted before that a pedigree is a network not a tree, and specifically it is a hybridization network (Family trees, pedigrees and hybridization networks). That is, in sexually reproducing species, every offspring is the hybrid of two parents. If we include both parents in the pedigree, plus all of their relatives, then this will form a complex network every time inbreeding occurs.

This situation can be generalized to groups of closely related individuals, such as cultivated plants and domesticated animals, where human-mediated inbreeding has resulted in the formation of new breeds and cultivars with limited genetic diversity. In the extreme case, the network will consist of first-degree relationships, where the branches connect parent-offspring relationships or sibling relationships.

An example of this is provided by the work on the genetics of grape cultivars by Myles et al. (Myles S, Boyko AR, Owens CL, Brown PJ, Grassi F, Aradhya MK, Prins B, Reynolds A, Chia JM, Ware D, Bustamante CD, Buckler ES. 2011. Genetic structure and domestication history of the grape. Proceedings of the National Academy of Sciences of the USA 108: 3530-3535).


The genotype data were generated from a custom microarray, which assayed 5,387 SNPs genotyped in 583 unique Vitis vinifera samples from the US Department of Agriculture (USDA) germplasm collection. Estimates of identity-by-descent (IBD) were calculated based on linkage analysis for all pairwise comparisons of samples. These IBD values were calibrated based on known pedigree relationships (ie. confirmed parent-offspring relationships), and this was used to differentiate between parent-offspring and other pedigree relationships. For each cultivar that was related to at least two other cultivars by an estimated parent-offspring relationship, the proportion of SNPs consistent with Mendelian inheritance was used to determine the two parents.
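The Mendelian-consistency idea in the last step can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes biallelic SNPs coded as 0, 1 or 2 copies of one allele, treats None as a missing call, and uses invented genotypes and hypothetical function names. A SNP is consistent if the offspring genotype can be formed from one allele contributed by each candidate parent.

```python
# Alleles (0 or 1) that a parent with a given genotype can pass on.
GAMETES = {0: {0}, 1: {0, 1}, 2: {1}}


def is_consistent(child, parent1, parent2):
    """True if the child genotype can arise from one allele per parent."""
    return any(a + b == child
               for a in GAMETES[parent1]
               for b in GAMETES[parent2])


def mendelian_consistency(child, parent1, parent2):
    """Proportion of SNPs (with no missing calls) consistent with the trio."""
    calls = [(c, p, q) for c, p, q in zip(child, parent1, parent2)
             if None not in (c, p, q)]
    if not calls:
        raise ValueError("no SNPs with complete genotype calls")
    consistent = sum(is_consistent(c, p, q) for c, p, q in calls)
    return consistent / len(calls)


# Invented genotypes at five SNPs (None marks a missing call).
offspring = [1, 0, 2, 1, 1]
candidate_a = [0, 0, 2, 2, 1]
candidate_b = [2, 1, 1, 0, None]
unrelated = [2, 2, 0, 0, 1]

print(mendelian_consistency(offspring, candidate_a, candidate_b))
print(mendelian_consistency(offspring, candidate_a, unrelated))
```

With real data, a true trio would be expected to score close to 1 (allowing for genotyping error), so candidate pairs can be compared by this proportion.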

The authors found that 75% of the grape cultivars were related to at least one other cultivar by a first-degree relationship. The first figure (above) shows the frequency histogram of these first-degree relationships, along with the resulting complex pedigree structure, which can be visualized as a set of undirected networks. This set is dominated by a single network with 58% of the cultivars, each related to at least one other cultivar by a first-degree relationship.

Fig. 3. Network of first-degree relationships among common grape cultivars. Solid edges represent likely parent-offspring relationships. Dotted edges represent sibling relationships or equivalent. Arrows point from parents to offspring for the inferred triplets.

The authors inferred that about half of the first-degree relationships were likely to be parent-offspring, with the other half being labeled "sibling or equivalent" (because complex crossing schemes can generate IBD values that are indistinguishable from sibling relationships). By evaluating Mendelian inconsistencies, they assigned parentage for 83 triplets of cultivars. The second figure shows a directed hybridization network of some well-known grape cultivars that includes several resolved triplets.

Note that the hybridization network is only partly directed — quite a few of the edges do not have a uniquely identified direction, based on the SNP data. This is an issue that I have not seen directly addressed in the literature. Practitioners tend to treat phylogenetic networks (and trees) as either directed or undirected, rather than a mixture of both, as this characteristic is determined by the presence or absence of a root node. However, in the grape case there is no root identifiable based on the cultivar SNP data. (There is a scenario for the origin of modern grape cultivars from Vitis sylvestris around the eastern Mediterranean, but even this is complicated by hypothesized later gene flow between V. sylvestris and V. vinifera.)
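One way to make the idea of a partly directed network concrete is a small data structure that stores the two kinds of edge separately. The class, its method names and the node labels below are all hypothetical; this is a sketch of one possible representation, not an established format.

```python
from dataclasses import dataclass, field


@dataclass
class MixedNetwork:
    """A network with both directed and undirected edges."""
    directed: set = field(default_factory=set)    # (parent, offspring) pairs
    undirected: set = field(default_factory=set)  # frozenset({a, b}) pairs

    def add_parent_offspring(self, parent, offspring):
        # Resolving a direction replaces any unresolved edge for the pair.
        self.undirected.discard(frozenset((parent, offspring)))
        self.directed.add((parent, offspring))

    def add_unresolved(self, a, b):
        # Keep the edge undirected only if no direction is already known.
        if (a, b) not in self.directed and (b, a) not in self.directed:
            self.undirected.add(frozenset((a, b)))

    def parents_of(self, node):
        return sorted(p for p, c in self.directed if c == node)

    def unresolved_of(self, node):
        return sorted(n for edge in self.undirected if node in edge
                      for n in edge if n != node)

    def directed_fraction(self):
        total = len(self.directed) + len(self.undirected)
        return len(self.directed) / total if total else 0.0


net = MixedNetwork()
net.add_parent_offspring("P1", "Child")
net.add_parent_offspring("P2", "Child")
net.add_unresolved("Child", "Sib")
net.add_unresolved("Sib", "Other")

print(net.parents_of("Child"))
print(net.unresolved_of("Sib"))
print(net.directed_fraction())
```

Keeping the two edge sets apart means that an edge can later be promoted from unresolved to directed without rebuilding the network, and the directed fraction gives a simple summary of how much of the pedigree has been resolved.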

Perhaps the possibility of partly directed phylogenetic networks needs more consideration.

March 17, 2013

17:30

The "Tree of Life" is an expression that you will find all over the web, usually referring to little more than a phylogenetic tree with only a few species in it, and certainly not all of Life, nor even the major groups of Life. More specifically, however, it seems commonly to refer to any tree that has Homo sapiens in it.


One tree that has intrigued me is found on Wikipedia's page called Tree of life (biology). It is labelled as "Haeckel's Stambaum der Primaten (1860s)", but in the text it is referred to as "the first sketch of the famous Haeckel's Tree of Life in the 1870s which shows 'Pithecanthropus alalus' as the ancestor of Homo sapiens."

The original JPEG file of the tree, dated 18 February 2009, has a compromise between these two contradictory statements: "The first sketch of the famous Heackel's Tree of Life which shows 'Pithecanthropus alalus' as the ancestor of Homo sapiens. Date: 1860s." No source is given for the picture.

Ernst Haeckel was the most famous popularizer of phylogenetic trees in the 19th century (he called them Stammbaum, literally "stem tree"). However, the illustration itself is not in the style of Haeckel's trees from the 1860s, which were drawn as realistic trees (see Who published the first phylogenetic tree?), nor is it in the style of his most famous tree from the 1870s, which is drawn as an oak tree (see Evolutionary trees: old wine in new bottles?). So, I decided to trace the history of this tree.

Haeckel published a slightly modified version of the sketch, with a different title, in:
Ernst Haeckel (1899)
Ueber Unsere Gegenwärtige Kenntniss vom Ursprung des Menschen.
[About Our Current Knowledge of Human Origins]
Emil Strauss, Bonn.
It is worth noting that all of the names along the central axis are hypothetical, except for Homo sapiens. Pithecanthropus alalus, however, came to be associated with what is colloquially called Java Man, now named Homo erectus.

As far as I can determine, the hand-drawn version of the illustration (ie. the one in Wikipedia) first appeared in:
Herbert Wendt (1954)
Ich Suchte Adam: Roman einer Wissenschaft.
[In Search of Adam: a Science Novel]
Zweite, erweiterte Auflage. [Second, enlarged edition]
Grote Verlag, Hamm.

It did not appear in the first edition of the book, published the year before. It appears as Figure 22 (page 310):
Abb. 22:
Haeckels klassisch gewordener Menschenstammbaum, eigenhändig als erste Skizze entworfen – ein historisches Dokument. [Haeckel's classic people pedigree, designed by hand as the first sketch – a historical document.]

Note that, contrary to the claim, this is not the first pedigree from Haeckel, nor is it even the first primate pedigree from him. The source of the document is noted on page 581 as:
Verzeichnis der Textabbildungen
Abb. 22: Haeckels Stammbaum des Menschen. (Prof. Dr. Heberer, Göttingen)

Gerhard Heberer was a German anthropologist and phylogeneticist, who studied Haeckel's work closely. He apparently passed a copy of the illustration to Herbert Wendt when the latter was expanding his book with many more illustrations. This book was a best-seller from the start, going through five German editions, before re-appearing in 1961 as Ich Suchte Adam: Die Entdeckung des Menschen [The Discovery of Humans], whence it went through another seven editions. It was translated into English, appearing in 1955 as I Looked for Adam, in 1956 as In Search of Adam: The Story of Man's Quest for the Truth About His Earliest Ancestors, and in 1972 as From Ape to Adam: Search for the Evolution of Man. It was also translated into several other languages (including Swedish in 1955 as I Urmänniskornas Spår: Förhistoriens Forskaräventyr).

The hand-drawn tree has appeared in print at least twice since its first appearance in Wendt's book. The most important of these is in:
Thomas Junker and Uwe Hoßfeld (2001)
Die Entdeckung der Evolution: Eine revolutionäre Theorie und ihre Geschichte.
[The Discovery of Evolution: a Revolutionary Theory and its History]
Wissenschaftliche Buchgesellschaft, Darmstadt.

The picture appears as Figure 18 on page 125:
Abb. 18: Handschriftlicher Entwurf des Stammbaums der Primaten von Ernst Haeckel (1895). (Bildmaterial im Nachlass Heberer; im Besitz von Uwe Hoßfeld).
[Hand-drawn sketch of the family tree of primates by Ernst Haeckel (1895). (Artwork in the estate of Heberer; in the possession of Uwe Hoßfeld).]

Uwe Hoßfeld has told me: "I got the whole archive material from the Heberer family in 1990 and found the photo in his diaries." This explains the later history of the sketch, although not how Gerhard Heberer acquired the photo in the first place.

The date 1895 makes much more sense than do the 1860s and 1870s dates (as given in Wikipedia), especially given the publication date of the printed version.

The other appearance of the hand-drawn tree is in this book:
Winfried Henke and Ian Tattersall (editors) (2007)
Handbook of Paleoanthropology, Volume 1.
Springer‐Verlag, Berlin.

The tree appears as Figure 1.4 on page 16 (in the chapter by Winfried Henke 'Historical overview of paleoanthropological research'), and is labelled: "First pedigree designed by Ernst Haeckel". As noted above, it is not the first pedigree from Haeckel, nor the first primate pedigree from him. Winfried Henke has told me that he got the sketch from Wendt's book.

As a final point, Haeckel's hand-drawn trees usually seem to match the published versions rather more closely than the one above does. For example, here is the hand-drawn original of his famous oak tree. Perhaps the more stick-like tree was not treated as being a "real" picture.


Thanks to Winfried Henke and Uwe Hoßfeld for their email correspondence regarding my quest.