There are currently 0 users and 36 guests online.
April 14, 2015
This is a guest blog post by:
Johann-Mattis ListCentre des Recherches Linguistiques sur l'Asie Orientale, Paris, France
What we know, what we know we can now, and what we know we cannot know:Ontological facts and epistemological reality in historical linguistics and evolutionary biology
In a recent blog post (Multiple sequence alignment), David wrote about some theoretical issues regarding the concept of homology in evolutionary biology, and specifically its impact on the design of sequence alignment programs. In that post, he mentioned a recently published paper, where he discusses algorithms for sequence alignment and notes that "there is no known objective function for identifying homology" (Morrison 2015: 14).
This statement triggered my interest, since I was immediately reminded of problems that have been occupying historical linguists for a long time now. These problems arise from the fact that in historical disciplines, such as evolutionary biology or historical linguistics (but also in general history or some parts of geology), scholars are not trying to infer general laws of nature, but rather use knowledge of general laws to infer unique events.
The tasks of scholars working in these disciplines is similar to the task of a crime investigator or a doctor: Detectives use the evidence from a crime scene to infer the individual events that led to the crime (and arrest the culprit), and doctors use the symptoms of patients to identify their individual diseases (and then look for a way to cure them). Similarly, evolutionary biologists and historical linguists try to identify the evolutionary events that lead to the observed diversity of life and languages, respectively.
What unites all these disciplines is the specific mode of reasoning that they employ. Charles Sanders Peirce (1839-1914) was among the first to investigate this reasoning mode in detail (Peirce 1931/1958: 7.202). He called it abduction, and contrasted it with induction and deduction, the traditional modes of logical reasoning. Induction is used to infer a currently unknown general rule from an initial state and its result state, while deduction infers the result state of an initial state and a general rule. On the other hand, abduction seeks to infer initial states from result states by employing a general rule.
What further complicates the task of evolutionary biologists and historical linguists is that we have only limited means to verify or falsify a given hypothesis, since, in contrast to detectives and doctors, our research objects usually do not confess, nor do they give positive feedback when we propose the right hypothesis. We never know whether we found the true murderer or whether we proposed the right cure.
Historical linguistics and the limits of knowledge
In historical linguistics, discussions regarding the limits of our knowledge have been centered around the question of the "nature of the proto-language". Using comparative techniques, in the second half of the 19th century linguists started to reconstruct ancestral words of languages that are not attested in any written source. Thus, linguists would first try to identify cognate (homologous) words in Indo-European languages, and then infer how these words were pronounced in the Indo-European language which was spoken some 8,000 years ago. This technique, which was originally introduced by August Schleicher (1821-1868) in 1861, became very popular, and has remained the standard way of knowledge representation in historical linguistics. Whenever linguists propose such a reconstructed form, based on various pieces of evidence, they use an asterisk symbol * to indicate that the word has been inferred, and that there is no written source that would confirm its existence.
As an example, consider some of the words for "sun" in Indo-European languages (discussed in detail in List 2014: 136):
These techniques are generally thought to be quite reliable, and they provided concrete help in the decipherment of many ancient languages (including the Egyptian hieroglyphes, Linear B, and Hittite). The status of the reconstructions that scholars produced was, however, controversially debated. While some scholars claimed that there was a high probability that the proposed reconstructions would come close to the original pronunciation, others would classify them as a pure fiction (Schmidt 1872).
While it is obvious that reconstructions represent hypotheses and not indisputable truths, it is less clear how they relate to the actual historical facts. First of all, we know for sure that our hypotheses are not stable over time. As our knowledge of the evidence increases, as we include more languages in our comparison, or get deeper insights into the major processes underlying language history, our hypotheses will also constantly be changed and refined. This is nicely reflected in August Schleicher's Fable (a short parable called "The Sheep and the Horses"), a text that he wrote in his reconstructed version of Proto-Indo-European, in order to illustrate what was by then known about the origin of the Indo-European language. When looking at the many later versions, written by scholars in order to illustrate how our knowledge of Indo-European had changed since then, the differences in the pronunciations are really striking (see this summary in Wikipedia), but so are the similarities.
Judging from the degree to which these reconstruction hypotheses evolved over about 150 years, we can reach an important, apparently paradoxical, conclusion: While our reconstructions in historical linguistics are far from being realistic (in the sense of representing actual pronunciations of an Indo-European people), they are by no means fictions, as Johannes Schmidt claimed long ago. The reconstructions are not (and never will be) realistic, since they will always be preliminary, depending on our currently available data and the theoretical development in our field. On the other hand, the reconstructions are also not necessarily unrealistic, since they reflect scientific hypotheses that have been constantly refined and independently developed using the best knowledge we have at that moment. So, although we know that our hypotheses do not truly reflect what really happened, we have good reasons to assume that they come much closer to the real story than any random hypothesis.
As reflected in David's aforementioned statement regarding the lack of an objective function for homology identification in evolutionary biology, the problem of assessing the realism of our hypotheses is not unique to historical linguistics. In a similar way to that with which we discuss the realism of our reconstructed forms in historical linguistics, one may discuss the realism behind any multiple sequence alignment in evolutionary biology. The objects of investigation in historical linguistics and evolutionary biology are not directly accessible to the researchers, but can only be inferred by tests and theories.
Interestingly, this problem also occurs in the social sciences. In psychology, for example, such attributes of people as "intelligence" cannot be directly observed, but have to be inferred by measuring what they provoke or how they are "reflected in test performance" (Cronbach and Meehl 1955: 178). What is inferred by psychological tests is usually called a construct, and is strictly separated from the underlying quality that scholars originally wanted to measure. The construct is thereby understood as the "fiction or story put forward by a theorist to make sense of a phenomenon" (Statt 1981 : 67). As in the case of reconstruction in linguistics or homology assessment in biology, it is not the "real" object or process.
What can we conclude from this? Or, to put it differently, why should we care about constructs or the degree of fiction behind our claims in historical linguistics and evolutionary biology? I see two important reasons to do so.
First, we can avoid confusion in our fields by strictly separating ontological facts and epistemological reality. In evolutionary biology, this would help to avoid the confusion that often arises when scholars talk about homologous genes, when in practice what they mean is that they applied some similarity threshold and some cluster procedure to cluster genes in sets of presumed homologs. In historical linguistics, on the other hand, it would help us to get rid of the tiresome debate between formalists (who emphasize that reconstructed forms are simple formulas) and realists (who take reconstructed forms as realistic representations) in reconstruction.
Second, from a broader viewpoint, as scientists, we should always try to be explicit in our claims, and we should also always try to be honest about what we know, what we know we can know, and what we know we cannot know.
Cronbach LJ, Meehl PE (1955) Construct validity in psychological tests. Psychological Bulletin 52: 281-302.
List J-M (2014) Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.
Morrison DA (2015) Is multiple sequence alignment an art or a science? Systematic Botany 40: 14-26.
Peirce CS (1931/1958) Collected papers of Charles Sanders Peirce. Ed. by C Hartshorne and P Weiss. Cont. by AW Burke. 8 vols. Cambridge MA: Harvard University Press.
Schleicher A (1861) Compendium der vergleichenden Grammatik der indogermanischen Sprache. Vol. 1: Kurzer Abriss einer Lautlehre der indogermanischen Ursprache. Weimar: Böhlau.
Schmidt J (1872) Die Verwantschaftsverhältnisse der indogermanischen Sprachen. Weimar: Hermann Böhlau.
Statt DA, comp. (1981 ) Concise Dictionary of Psychology, 3rd ed. London and New York: Routledge.
Phylogeny and evolution of plant macrophage migration inhibitory factor/D-dopachrome tautomerase-like proteins
Background: The human (Homo sapiens) chemokine-like protein macrophage migration inhibitory factor (HsMIF) is a pivotal mediator of inflammatory, infectious and immune diseases including septic shock, colitis, malaria, rheumatoid arthritis, and atherosclerosis, as well as tumorigenesis. HsMIF has been found to exhibit several sequential and three-dimensional sequence motifs that in addition to its receptor binding sites include catalytic sites for oxidoreductase and tautomerase activity, which provide this 12.5 kDa protein with a remarkable functional complexity. A human MIF paralog, D-dopachrome tautomerase (HsDDT), has been identified, but its physiological relevance is incompletely understood. MIF/DDT-like proteins have been described in animals, protists and bacteria. Although based on sequence data banks the presence of MIF/DDT-like proteins has also been recognized in the model plant species Arabidopsis thaliana, details on these plant proteins have not been reported. Results: To broaden the understanding of the biological role of these proteins across kingdoms we performed a comprehensive in silico analysis of plant MIF/DDT-like (MDL) genes/proteins. We found that the A. thaliana genome harbors three MDL genes, of which two are chiefly constitutively expressed in aerial plant organs, while the third gene shows stress-inducible transcript accumulation. The product of the latter gene likely localizes to peroxisomes. Structure prediction suggests that all three Arabidopsis proteins resemble the secondary and tertiary structure of human MIF. MIF-like proteins are found in all species across the plant kingdom, with an increasing family complexity towards evolutionarily advanced plant taxa. Plant MDL proteins are predicted to lack oxidoreductase activity, but possibly share tautomerase activity with human MIF/DDT. Conclusions: Peroxisome localization seems to be a specific feature of a subset of MIF/DDT orthologs found in dicotyledonous plant species, which together with its stress-inducible gene expression might point to convergent evolution in higher plants and vertebrates towards neofunctionalization of MIF/MDL proteins in stress response pathways including innate immunity.
Source: BMC Evolutionary Biology
Here we introduce a general class of multiple calibration birth–death tree priors for use in Bayesian phylogenetic inference. All tree priors in this class separate ancestral node heights into a set of "calibrated nodes" and "uncalibrated nodes" such that the marginal distribution of the calibrated nodes is user-specified whereas the density ratio of the birth–death prior is retained for trees with equal values for the calibrated nodes. We describe two formulations, one in which the calibration information informs the prior on ranked tree topologies, through the (conditional) prior, and the other which factorizes the prior on divergence times and ranked topologies, thus allowing uniform, or any arbitrary prior distribution on ranked topologies. Although the first of these formulations has some attractive properties, the algorithm we present for computing its prior density is computationally intensive. However, the second formulation is always faster and computationally efficient for up to six calibrations. We demonstrate the utility of the new class of multiple-calibration tree priors using both small simulations and a real-world analysis and compare the results to existing schemes. The two new calibrated tree priors described in this article offer greater flexibility and control of prior specification in calibrated time-tree inference and divergence time dating, and will remove the need for indirect approaches to the assessment of the combined effect of calibration densities and tree priors in Bayesian phylogenetic inference.
Mollusks are the most morphologically disparate living animal phylum, they have diversified into all habitats, and have a deep fossil record. Monophyly and identity of their eight living classes is undisputed, but relationships between these groups and patterns of their early radiation have remained elusive. Arguments about traditional morphological phylogeny focus on a small number of topological concepts but often without regard to proximity of the individual classes. In contrast, molecular studies have proposed a number of radically different, inherently contradictory, and controversial sister relationships. Here, we assembled a data set of 42 unique published trees describing molluscan interrelationships. We used these data to ask several questions about the state of resolution of molluscan phylogeny compared with a null model of the variation possible in random trees constructed from a monophyletic assemblage of eight terminals. Although 27 different unique trees have been proposed from morphological inference, the majority of these are not statistically different from each other. Within the available molecular topologies, only four studies to date have included the deep sea class Monoplacophora; but 36.4% of all trees are not significantly different. We also present supertrees derived from two data partitions and three methods, including all available molecular molluscan phylogenies, which will form the basis for future hypothesis testing. The supertrees presented here were not constructed to provide yet another hypothesis of molluscan relationships, but rather to algorithmically evaluate the relationships present in the disparate published topologies. Based on the totality of available evidence, certain patterns of relatedness among constituent taxa become clear. The internodal distance is consistently short between a few taxon pairs, particularly supporting the relatedness of Monoplacophora and the chitons, Polyplacophora. Other taxon pairs are rarely or never found in close proximity, such as the vermiform Caudofoveata and Bivalvia. Our results have specific utility for guiding constructive research planning to better test relationships in Mollusca as well as other problematic groups. Taxa with consistently proximate relationships should be the focus of a combined approach in a concerted assessment of potential genetic and anatomical homology, whereas unequivocally distant taxa will make the most constructive choices for exemplar selection in higher level phylogenomic analyses.
A major concern in molecular clock dating is how to use information from the fossil record to calibrate genetic distances from DNA sequences. Here we apply three Bayesian dating methods that differ in how calibration is achieved—"node dating" (ND) in BEAST, "total evidence" (TE) dating in MrBayes, and the "fossilized birth–death" (FBD) in FDPPDiv—to infer divergence times in the royal ferns. Osmundaceae have 16–17 species in four genera, two mainly in the Northern Hemisphere and two in South Africa and Australasia; they are the sister clade to the remaining leptosporangiate ferns. Their fossil record consists of at least 150 species in ~17 genera. For ND, we used the five oldest fossils, whereas for TE and FBD dating, which do not require forcing fossils to nodes and thus can use more fossils, we included up to 36 rhizomes and frond compression/impression fossils, which for TE dating were scored for 33 morphological characters. We also subsampled 10%, 25%, and 50% of the 36 fossils to assess model sensitivity. FBD-derived divergence ages were generally greater than those inferred from ND; two of seven TE-derived ages agreed with FBD-obtained ages, the others were much younger or much older than ND or FBD ages. We prefer the FBD-derived ages because they best fit the Osmundales fossil record (including Triassic fossils not used in our study). Under the preferred model, the clade encompassing extant Osmundaceae (and many fossils) dates to the latest Paleozoic to Early Triassic; divergences of the extant species occurred during the Neogene. Under the assumption of constant speciation and extinction rates, the FBD approach yielded speciation and extinction rates that overlapped those obtained from just neontological data. However, FBD estimates of speciation and extinction are sensitive to violations in the assumption of continuous fossil sampling; therefore, these estimates should be treated with caution.
Taxon-Rich Phylogenomic Analyses Resolve the Eukaryotic Tree of Life and Reveal the Power of Subsampling by Sites
Most eukaryotic lineages are microbial, and many have only recently been sampled for phylogenetic studies or remain in the "dark area" of the tree of life where there are no molecular data. To assess relationships among eukaryotic lineages, we perform a taxon-rich phylogenomic analysis including 232 eukaryotes selected to maximize taxonomic diversity and up to 1554 genes chosen as vertically inherited based on their broad distribution among eukaryotes. We also include sequences from 486 bacteria and 84 archaea to assess the impact of endosymbiotic gene transfer (EGT) from plastids and to detect contamination. Overall, our analyses are consistent with other less taxon-rich estimates of the eukaryotic tree of life, and we recover strong support for five major clades: Amoebozoa, Excavata (without the genus Malawimonas), Opisthokonta, Archaeplastida, and SAR (Stramenopila, Alveolata, and Rhizaria). Our analyses also highlight the existence of "orphan" lineages, lineages that lack robust placement in the eukaryotic tree of life, and indicate the possibility of as yet undiscovered diversity. In analyses including bacteria and archaea, we find that approximately 10% of the 1554 genes, which we choose because they are found in four or five of the five major eukaryotic clades and hence may be more likely to be inherited vertically, appear to have been acquired from cyanobacteria through EGT in photosynthetic lineages. Removing these EGT genes places the green algae as sister to the glaucophytes instead of the red algae, suggesting that unknowingly including genes of plastid origin, and combining them with genes of nuclear origin, may mislead phylogenetic estimates. Finally, the large size of our data set allows comparative analyses of subsets of data; alignments built from randomly sampled sites provide greater support, particularly for deep relationships, than do equivalent-sized data sets built from randomly sampled genes.
Despite an increasingly vast literature on cophylogenetic reconstructions for studying host–parasite associations, understanding the common evolutionary history of such systems remains a problem that is far from being solved. Most algorithms for host–parasite reconciliation use an event-based model, where the events include in general (a subset of) cospeciation, duplication, loss, and host switch. All known parsimonious event-based methods then assign a cost to each type of event in order to find a reconstruction of minimum cost. The main problem with this approach is that the cost of the events strongly influences the reconciliation obtained. Some earlier approaches attempt to avoid this problem by finding a Pareto set of solutions and hence by considering event costs under some minimization constraints. To deal with this problem, we developed an algorithm, called Coala, for estimating the frequency of the events based on an approximate Bayesian computation approach. The benefits of this method are 2-fold: (i) it provides more confidence in the set of costs to be used in a reconciliation, and (ii) it allows estimation of the frequency of the events in cases where the data set consists of trees with a large number of taxa. We evaluate our method on simulated and on biological data sets. We show that in both cases, for the same pair of host and parasite trees, different sets of frequencies for the events lead to equally probable solutions. Moreover, often these solutions differ greatly in terms of the number of inferred events. It appears crucial to take this into account before attempting any further biological interpretation of such reconciliations. More generally, we also show that the set of frequencies can vary widely depending on the input host and parasite trees. Indiscriminately applying a standard vector of costs may thus not be a good strategy.
Tens of thousands of phylogenetic trees, describing the evolutionary relationships between hundreds of thousands of taxa, are readily obtainable from various databases. From such trees, inferences can be made about the underlying macroevolutionary processes, yet remarkably these processes are still poorly understood. Simple and widely used evolutionary null models are problematic: Empirical trees show very different imbalance between the sizes of the daughter clades of ancestral taxa compared to what models predict. Obtaining a simple evolutionary model that is both biologically plausible and produces the imbalance seen in empirical trees is a challenging problem, to which none of the existing models provide a satisfying answer. Here we propose a simple, biologically plausible macroevolutionary model in which the rate of speciation decreases with species age, whereas extinction rates can vary quite generally. We show that this model provides a remarkable fit to the thousands of trees stored in the online database TreeBase. The biological motivation for the identified age-dependent speciation process may be that recently evolved taxa often colonize new regions or niches and may initially experience little competition. These new taxa are thus more likely to give rise to further new taxa than a taxon that has remained largely unchanged and is, therefore, well adapted to its niche. We show that age-dependent speciation may also be the result of different within-species populations following the same laws of lineage splitting to produce new species. As the fit of our model to the tree database shows, this simple biological motivation provides an explanation for a long standing problem in macroevolution.
Prior distributions can have a strong effect on the results of Bayesian analyses. However, no general consensus exists for how priors should be set in all circumstances. Branch-length priors are of particular interest for phylogenetics, because they affect many parameters and biologically relevant inferences have been shown to be sensitive to the chosen prior distribution. Here, we explore the use of outside information to set informed branch-length priors and compare inferences from these informed analyses to those using default settings. For both the commonly used exponential and the newly proposed compound Dirichlet prior distributions, the incorporation of relevant outside information improves inferences for data sets that have produced problematic branch- and tree-length estimates under default settings. We suggest that informed priors are worthy of further exploration for phylogenetics.
Assignment of Homoeologs to Parental Genomes in Allopolyploids for Species Tree Inference, with an Example from Fumaria (Papaveraceae)
There is a rising awareness that species trees are best inferred from multiple loci while taking into account processes affecting individual gene trees, such as substitution model error (failure of the model to account for the complexity of the data) and coalescent stochasticity (presence of incomplete lineage sorting [ILS]). Although most studies have been carried out in the context of dichotomous species trees, these processes operate also in more complex evolutionary histories involving multiple hybridizations and polyploidy. Recently, methods have been developed that accurately handle ILS in allopolyploids, but they are thus far restricted to networks of diploids and tetraploids. We propose a procedure that improves on this limitation by designing a workflow that assigns homoeologs to hypothetical diploid ancestral genomes prior to genome tree construction. Conflicting assignment hypotheses are evaluated against substitution model error and coalescent stochasticity. Incongruence that cannot be explained by stochastic mechanisms needs to be explained by other processes (e.g., homoploid hybridization or paralogy). The data can then be filtered to build multilabeled genome phylogenies using inference methods that can recover species trees, either in the face of substitution model error and coalescent stochasticity alone, or while simultaneously accounting for hybridization. Methods are already available for folding the resulting multilabeled genome phylogeny into a network. We apply the workflow to the reconstruction of the reticulate phylogeny of the plant genus Fumaria (Papaveraceae) with ploidal levels ranging from 2$$x$$ to 14$$x$$. We describe the challenges in recovering nuclear NRPB2 homoeologs in high ploidy species while combining in vivo cloning and direct sequencing techniques. Using parametric bootstrapping simulations we assign nuclear homoeologs and chloroplast sequences (four concatenated loci) to their common hypothetical diploid ancestral genomes. As these assignments hinge on effective population size assumptions, we investigate how varying these assumptions impacts the recovered multilabeled genome phylogeny.
In order to gain an understanding of the effectiveness of phylogenetic Markov chain Monte Carlo (MCMC), it is important to understand how quickly the empirical distribution of the MCMC converges to the posterior distribution. In this article, we investigate this problem on phylogenetic tree topologies with a metric that is especially well suited to the task: the subtree prune-and-regraft (SPR) metric. This metric directly corresponds to the minimum number of MCMC rearrangements required to move between trees in common phylogenetic MCMC implementations. We develop a novel graph-based approach to analyze tree posteriors and find that the SPR metric is much more informative than simpler metrics that are unrelated to MCMC moves. In doing so, we show conclusively that topological peaks do occur in Bayesian phylogenetic posteriors from real data sets as sampled with standard MCMC approaches, investigate the efficiency of Metropolis-coupled MCMC (MCMCMC) in traversing the valleys between peaks, and show that conditional clade distribution (CCD) can have systematic problems when there are multiple peaks.
Phylogenetic methods typically rely on an appropriate model of how data evolved in order to infer an accurate phylogenetic tree. For molecular data, standard statistical methods have provided an effective strategy for extracting phylogenetic information from aligned sequence data when each site (character) is subject to a common process. However, for other types of data (e.g., morphological data), characters can be too ambiguous, homoplastic, or saturated to develop models that are effective at capturing the underlying process of change. To address this, we examine the properties of a classic but neglected method for inferring splits in an underlying tree, namely, maximum compatibility. By adopting a simple and extreme model in which each character either fits perfectly on some tree, or is entirely random (but it is not known which class any character belongs to) we are able to derive exact and explicit formulae regarding the performance of maximum compatibility. We show that this method is able to identify a set of non-trivial homoplasy-free characters, when the number $$n$$ of taxa is large, even when the number of random characters is large. In contrast, we show that a method that makes more uniform use of all the data—maximum parsimony—can provably estimate trees in which none of the original homoplasy-free characters support splits.
Müllerian mimicry among Neotropical Heliconiini butterflies is an excellent example of natural selection, associated with the diversification of a large continental-scale radiation. Some of the processes driving the evolution of mimicry rings are likely to generate incongruent phylogenetic signals across the assemblage, and thus pose a challenge for systematics. We use a data set of 22 mitochondrial and nuclear markers from 92% of species in the tribe, obtained by Sanger sequencing and de novo assembly of short read data, to re-examine the phylogeny of Heliconiini with both supermatrix and multispecies coalescent approaches, characterize the patterns of conflicting signal, and compare the performance of various methodological approaches to reflect the heterogeneity across the data. Despite the large extent of reticulate signal and strong conflict between markers, nearly identical topologies are consistently recovered by most of the analyses, although the supermatrix approach failed to reflect the underlying variation in the history of individual loci. However, the supermatrix represents a useful approximation where multiple rare species represented by short sequences can be incorporated easily. The first comprehensive, time-calibrated phylogeny of this group is used to test the hypotheses of a diversification rate increase driven by the dramatic environmental changes in the Neotropics over the past 23 myr, or changes caused by diversity-dependent effects on the rate of diversification. We find that the rate of diversification has increased on the branch leading to the presently most species-rich genus Heliconius, but the change occurred gradually and cannot be unequivocally attributed to a specific environmental driver. Our study provides comprehensive comparison of philosophically distinct species tree reconstruction methods and provides insights into the diversification of an important insect radiation in the most biodiverse region of the planet.
Phycas is open source, freely available Bayesian phylogenetics software written primarily in C++ but with a Python interface. Phycas specializes in Bayesian model selection for nucleotide sequence data, particularly the estimation of marginal likelihoods, central to computing Bayes Factors. Marginal likelihoods can be estimated using newer methods (Thermodynamic Integration and Generalized Steppingstone) that are more accurate than the widely used Harmonic Mean estimator. In addition, Phycas supports two posterior predictive approaches to model selection: Gelfand–Ghosh and Conditional Predictive Ordinates. The General Time Reversible family of substitution models, as well as a codon model, are available, and data can be partitioned with all parameters unlinked except tree topology and edge lengths. Phycas provides for analyses in which the prior on tree topologies allows polytomous trees as well as fully resolved trees, and provides for several choices for edge length priors, including a hierarchical model as well as the recently described compound Dirichlet prior, which helps avoid overly informative induced priors on tree length.
Virtually all models for reconstructing ancestral states for discrete characters make the crucial assumption that the trait of interest evolves at a uniform rate across the entire tree. However, this assumption is unlikely to hold in many situations, particularly as ancestral state reconstructions are being performed on increasingly large phylogenies. Here, we show how failure to account for such variable evolutionary rates can cause highly anomalous (and likely incorrect) results, while three methods that accommodate rate variability yield the opposite, more plausible, and more robust reconstructions. The random local clock method, implemented in BEAST, estimates the position and magnitude of rate changes on the tree; split BiSSE estimates separate rate parameters for pre-specified clades; and the hidden rates model partitions each character state into a number of rate categories. Simulations show the inadequacy of traditional models when characters evolve with both asymmetry (different rates of change between states within a character) and heterotachy (different rates of character evolution across different clades). The importance of accounting for rate heterogeneity in ancestral state reconstruction is highlighted empirically with a new analysis of the evolution of viviparity in squamate reptiles, which reveal a predominance of forward (oviparous-viviparous) transitions and very few reversals.
The Genealogical World of Phylogenetic Networks
BMC Evolutionary Biology