Systematic Biology - RSS feed of current issue
April 14, 2014
Robust Regression and Posterior Predictive Simulation Increase Power to Detect Early Bursts of Trait Evolution
A central prediction of much theory on adaptive radiations is that traits should evolve rapidly during the early stages of a clade's history and subsequently slow down in rate as niches become saturated, a so-called "Early Burst." Although early bursts are a common pattern in the fossil record, evidence for them in phylogenetic comparative data has been equivocal at best. We show here that this may not necessarily be due to the absence of this pattern in nature. Rather, commonly used methods to infer its presence perform poorly when the strength of the burst (the rate at which phenotypic evolution declines) is small, and when some morphological convergence is present within the clade. We present two modifications to existing comparative methods that allow greater power to detect early bursts in simulated datasets. First, we develop posterior predictive simulation approaches and show that they outperform maximum likelihood approaches at identifying early bursts at moderate strength. Second, we use a robust regression procedure that allows for the identification and down-weighting of convergent taxa, leading to moderate increases in method performance. We demonstrate the utility and power of these approaches by investigating the evolution of body size in cetaceans. Model fitting using maximum likelihood is equivocal with regard to the mode of cetacean body size evolution. However, posterior predictive simulation combined with a robust node height test returns low support for Brownian motion or rate shift models, but not the early burst model. While the jury is still out on whether early bursts are actually common in nature, our approach will hopefully facilitate more robust testing of this hypothesis. We advocate the adoption of similar posterior predictive approaches to improve the fit and to assess the adequacy of macroevolutionary models in general. [Adaptive Radiations, Early Burst, Posterior Predictive Simulations, Quantitative Characters]
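The "strength of the burst" in the abstract above can be made concrete with a common parameterization in which the Brownian rate decays exponentially through time, sigma^2(t) = sigma0^2 * exp(r * t) with r < 0. The abstract gives no formula, so this parameterization, the function name, and the argument names are assumptions made for illustration. Under it, the trait covariance of two tips that share a path of length s from the root is the integral of the rate over that path, sigma0^2 * (exp(r * s) - 1) / r. A minimal sketch:

```python
import math


def early_burst_covariance(shared_time, sigma2_0, r):
    """Covariance of two tips under an exponentially changing Brownian rate.

    shared_time: length of the path the two tips share from the root (assumed >= 0)
    sigma2_0:    rate at the root (illustrative name)
    r:           rate-change parameter; r < 0 is an early burst, r = 0 is
                 constant-rate Brownian motion
    """
    if shared_time < 0:
        raise ValueError("shared_time must be non-negative")
    if r == 0.0:
        # Limit of the expression below as r -> 0: plain Brownian motion.
        return sigma2_0 * shared_time
    # expm1 keeps precision when r * shared_time is close to zero.
    return sigma2_0 * math.expm1(r * shared_time) / r
```

When r is close to zero the result is nearly indistinguishable from the Brownian value, which is one way to see why weak bursts are hard to detect.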
We present two distinctly different posterior predictive approaches to Bayesian phylogenetic model selection and illustrate these methods using examples from green algal protein-coding cpDNA sequences and flowering plant rDNA sequences. The Gelfand–Ghosh (GG) approach allows dissection of an overall measure of model fit into components due to posterior predictive variance (GGp) and goodness-of-fit (GGg), which distinguishes this method from the posterior predictive P-value approach. The conditional predictive ordinate (CPO) method provides a site-specific measure of model fit useful for exploratory analyses and can be combined over sites yielding the log pseudomarginal likelihood (LPML) which is useful as an overall measure of model fit. CPO provides a useful cross-validation approach that is computationally efficient, requiring only a sample from the posterior distribution (no additional simulation is required). Both GG and CPO add new perspectives to Bayesian phylogenetic model selection based on the predictive abilities of models and complement the perspective provided by the marginal likelihood (including Bayes Factor comparisons) based solely on the fit of competing models to observed data. [Bayesian; conditional predictive ordinate; CPO; L-measure; LPML; model selection; phylogenetics; posterior predictive.]
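As the abstract notes, the CPO needs only a posterior sample. A standard Monte Carlo estimate takes the harmonic mean of each site's likelihood across the sampled parameter values, and the LPML is the sum of the log CPO values over sites. The sketch below assumes the per-site log-likelihoods have already been computed for each posterior sample. The function names and the input layout are illustrative and do not come from the paper.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def log_cpo(site_logliks):
    """log CPO for one site.

    site_logliks holds log p(y_i | theta_t) for each posterior sample t.
    CPO_i is estimated as the harmonic mean of the site likelihoods:
    CPO_i = 1 / mean_t(1 / p(y_i | theta_t)).
    """
    return -log_mean_exp([-ll for ll in site_logliks])


def lpml(loglik_matrix):
    """Log pseudomarginal likelihood: sum of log CPO over sites.

    loglik_matrix[t][i] is the log-likelihood of site i under sample t.
    """
    n_sites = len(loglik_matrix[0])
    return sum(
        log_cpo([row[i] for row in loglik_matrix]) for i in range(n_sites)
    )
```

Working in log space avoids underflow, since site likelihoods can be very small.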
Model checking is a critical part of Bayesian data analysis, yet it remains largely unused in systematic studies. Although phylogeny estimation has recently moved into an era of increasingly complex models that simultaneously account for multiple evolutionary processes, the statistical fit of these models to the data has rarely been tested. Here we develop a posterior predictive simulation-based model check for a commonly used multispecies coalescent model, implemented in *BEAST, and apply it to 25 published data sets. We show that poor model fit is detectable in the majority of data sets; that this poor fit can mislead phylogenetic estimation; and that in some cases it stems from processes of inherent interest to systematists. We suggest that as systematists scale up to phylogenomic data sets, which will be subject to a heterogeneous array of evolutionary processes, critically evaluating the fit of models to data is an analytical step that can no longer be ignored. [Gene duplication and extinction; gene tree; hybridization; model fit; multispecies coalescent; next-generation sequencing; posterior predictive simulation; species delimitation; species tree.]
Systematic phylogenetic error caused by the simplifying assumptions made in models of molecular evolution may be impossible to avoid entirely when attempting to model evolution across massive, diverse data sets. However, not all deficiencies of inference models result in unreliable phylogenetic estimates. The field of phylogenetics lacks a direct method to identify cases where model specification adversely affects inferences. Posterior predictive simulation is a flexible and intuitive approach for assessing goodness-of-fit of the assumed model and priors in a Bayesian phylogenetic analysis. Here, I propose new test statistics for use in posterior predictive assessment of model fit. These test statistics compare phylogenetic inferences from posterior predictive data sets to inferences from the original data. A simulation study demonstrates the utility of these new statistics. The new tests reject the plausibility of inferred tree lengths or topologies more often when data/model combinations produce biased inferences. I also apply this approach to exemplar empirical data sets, highlighting the value of the novel assessments. [Bayesian; Markov chain Monte Carlo; model fit; phylogenetic; posterior predictive distribution; sequence evolution; simulation.]
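The comparison step in posterior predictive assessment is simple once a test statistic has been computed for the observed data and for each simulated data set: the observed value is located within the predictive distribution. The sketch below shows only that generic step. It does not implement the specific statistics proposed in the abstract, and the function names are illustrative.

```python
def posterior_predictive_p(observed_stat, simulated_stats):
    """Fraction of simulated statistics at least as large as the observed one."""
    if not simulated_stats:
        raise ValueError("need at least one simulated statistic")
    hits = sum(1 for s in simulated_stats if s >= observed_stat)
    return hits / len(simulated_stats)


def two_sided_p(observed_stat, simulated_stats):
    """Two-sided version: twice the smaller tail fraction, capped at 1."""
    upper = posterior_predictive_p(observed_stat, simulated_stats)
    lower = sum(1 for s in simulated_stats if s <= observed_stat) / len(
        simulated_stats
    )
    return min(1.0, 2.0 * min(upper, lower))
```

A value near 0 (or, for the one-sided version, near 1) indicates that the model rarely generates data resembling the observations with respect to the chosen statistic.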
The temporal dynamics of species diversity are shaped by variations in the rates of speciation and extinction, and there is a long history of inferring these rates using first and last appearances of taxa in the fossil record. Understanding diversity dynamics critically depends on unbiased estimates of the unobserved times of speciation and extinction for all lineages, but the inference of these parameters is challenging due to the complex nature of the available data. Here, we present a new probabilistic framework to jointly estimate species-specific times of speciation and extinction and the rates of the underlying birth–death process based on the fossil record. The rates are allowed to vary through time independently of each other, and the probability of preservation and sampling is explicitly incorporated in the model to estimate the true lifespan of each lineage. We implement a Bayesian algorithm to assess the presence of rate shifts by exploring alternative diversification models. Tests on a range of simulated data sets reveal the accuracy and robustness of our approach against violations of the underlying assumptions and various degrees of data incompleteness. Finally, we demonstrate the application of our method with the diversification of the mammal family Rhinocerotidae and reveal a complex history of repeated and independent temporal shifts of both speciation and extinction rates, leading to the expansion and subsequent decline of the group. The estimated parameters of the birth–death process implemented here are directly comparable with those obtained from dated molecular phylogenies. Thus, our model represents a step towards integrating phylogenetic and fossil information to infer macroevolutionary processes. [BDMCMC; biodiversity trends; Birth–death process; incomplete fossil sampling; macroevolution; species rise and fall.]
Since the advent of molecular phylogenetics more than 25 years ago, a major goal of plant systematists has been to discern the root of the angiosperms. Although most studies indicate that Amborella trichopoda is sister to all remaining extant flowering plants, support for this position has varied with respect to both the sequence data sets and analyses employed. Recently, Goremykin et al. (2013) questioned the "Amborella-sister hypothesis" using a "noise-reduction" approach and reported a topology with Amborella + Nymphaeales (water lilies) sister to all remaining angiosperms. Through a series of analyses of both plastid genomes and mitochondrial genes, we continue to find mostly strong support for the Amborella-sister hypothesis and offer a rebuttal of Goremykin et al. (2013). The major tenet of Goremykin et al. is that the Amborella-sister position is determined by noisy data, that is, characters with high rates of change and lacking true phylogenetic signal. To investigate the signal in these noisy data further, we analyzed the discarded characters from their noise-reduced alignments. We recovered a tree identical to that of the currently accepted angiosperm framework, including the position of Amborella as sister to all other angiosperms, as well as all other major clades. Thus, the signal in the "noisy" data is consistent with that of our complete data sets, arguing against the use of their noise-reduction approach. We also determined that one of the alignments presented by Goremykin et al. yields results at odds with their central claim: their data set actually supports Amborella as sister to all other angiosperms, as do larger plastid data sets we present here that possess more complete taxon sampling both within the monocots and for angiosperms in general. Previous unpartitioned, multilocus analyses of mitochondrial DNA (mtDNA) data have provided the strongest support for Amborella + Nymphaeales as sister to other angiosperms. However, our analysis of third codon positions from mtDNA sequence data also supports the Amborella-sister hypothesis. Finally, we challenge the conclusion of Goremykin et al. that the first flowering plants were aquatic and herbaceous, reasserting that even if Amborella + water lilies, or water lilies alone, are sister to the rest of the angiosperms, the earliest angiosperms were not necessarily aquatic and/or herbaceous. [Angiosperms; Amborella; Nymphaeales; plastid genome; water lilies.]
Split networks are a type of phylogenetic network that allows visualization of conflict in evolutionary data. We present a new method for constructing such networks called FlatNetJoining (FlatNJ). A key feature of FlatNJ is that it produces networks that can be drawn in the plane in which labels may appear inside of the network. For complex data sets that involve, for example, non-neutral molecular markers, this can allow additional detail to be visualized as compared to previous methods such as split decomposition and NeighborNet. We illustrate the application of FlatNJ by applying it to whole HIV genome sequences, where recombination has taken place, fluorescent proteins in corals, where ancestral sequences are present, and mitochondrial DNA sequences from gall wasps, where biogeographical relationships are of interest. We find that the networks generated by FlatNJ can facilitate the study of genetic variation in the underlying molecular sequence data and, in particular, may help to investigate processes such as intra-locus recombination. FlatNJ has been implemented in Java and is freely available at www.uea.ac.uk/computing/software/flatnj. [flat split system; NeighborNet; Phylogenetic network; QNet; split; split network.]
We developed a linear-time algorithm applicable to a large class of trait evolution models, for efficient likelihood calculations and parameter inference on very large trees. Our algorithm solves the traditional computational burden associated with two key terms, namely the determinant of the phylogenetic covariance matrix V and quadratic products involving the inverse of V. Applications include Gaussian models such as Brownian motion-derived models like Pagel's lambda, kappa, delta, and the early-burst model; Ornstein-Uhlenbeck models to account for natural selection with possibly varying selection parameters along the tree; as well as non-Gaussian models such as phylogenetic logistic regression, phylogenetic Poisson regression, and phylogenetic generalized linear mixed models. Outside of phylogenetic regression, our algorithm also applies to phylogenetic principal component analysis, phylogenetic discriminant analysis or phylogenetic prediction. The computational gain opens up new avenues for complex models or extensive resampling procedures on very large trees. We identify the class of models that our algorithm can handle as all models whose covariance matrix has a 3-point structure. We further show that this structure uniquely identifies a rooted tree whose branch lengths parametrize the trait covariance matrix, which acts as a similarity matrix. The new algorithm is implemented in the R package phylolm, including functions for phylogenetic linear regression and phylogenetic logistic regression.
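The two terms named in the abstract, the determinant of V and the quadratic product with the inverse of V, are exactly what a multivariate normal log-likelihood requires. The sketch below is the naive dense computation via a Cholesky factorization, which costs cubic time in the number of tips. It is shown only to make those two terms explicit and is not the linear-time algorithm described above. The function names and the convention that the rate is already folded into V are assumptions.

```python
import math


def cholesky(V):
    """Lower-triangular L with L * L^T = V, for a symmetric positive-definite V."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = V[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def gaussian_loglik(y, mean, V):
    """Log-likelihood of tip values y under N(mean * 1, V)."""
    n = len(y)
    L = cholesky(V)
    # Term 1: log det V = 2 * sum(log of the Cholesky diagonal).
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    # Term 2: (y - mean)^T V^{-1} (y - mean), via forward substitution L z = y - mean.
    z = []
    for i in range(n):
        resid = y[i] - mean - sum(L[i][k] * z[k] for k in range(i))
        z.append(resid / L[i][i])
    quad = sum(v * v for v in z)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)
```

For example, under unit-rate Brownian motion on the tree ((A:1,B:1):1,C:2), V holds the shared path lengths from the root: [[2, 1, 0], [1, 2, 0], [0, 0, 2]].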
Lateral gene transfer (LGT)—which transfers DNA between two non-vertically related individuals belonging to the same or different species—is recognized as a major force in prokaryotic evolution, and evidence of its impact on eukaryotic evolution is ever increasing. LGT has attracted much public attention for its potential to transfer pathogenic elements and antibiotic resistance in bacteria, and to transfer pesticide resistance from genetically modified crops to other plants. In a wider perspective, there is a growing body of studies highlighting the role of LGT in enabling organisms to occupy new niches or adapt to environmental changes. The challenge LGT poses to the standard tree-based conception of evolution is also being debated. Studies of LGT have, however, been severely limited by a lack of computational tools. The best currently available LGT algorithms are parsimony-based phylogenetic methods, which require a pre-computed gene tree and cannot choose between sometimes wildly differing most parsimonious solutions. Moreover, in many studies, simple heuristics are applied that can only handle putative orthologs and completely disregard gene duplications (GDs). Consequently, proposed LGT among specific gene families, and the rate of LGT in general, remain debated. We present a Bayesian Markov-chain Monte Carlo-based method that integrates GD, gene loss, LGT, and sequence evolution, and apply the method in a genome-wide analysis of two groups of bacteria: Mollicutes and Cyanobacteria. Our analyses show that although the LGT rate between distant species is high, the net combined rate of duplication and close-species LGT is on average higher. We also show that the common practice of disregarding reconcilability in gene tree inference overestimates the number of LGT and duplication events. [Bayesian; gene duplication; gene loss; horizontal gene transfer; lateral gene transfer; MCMC; phylogenetics.]
Predicting the Ancestral Character Changes in a Tree is Typically Easier than Predicting the Root State
Predicting the ancestral sequences of a group of homologous sequences related by a phylogenetic tree has been the subject of many studies, and numerous methods have been proposed for this purpose. Theoretical results are available that show that when the substitution rates become too large, reconstructing the ancestral state at the tree root is no longer feasible. Here, we also study the reconstruction of the ancestral changes that occurred along the tree edges. We show that, depending on the tree and branch length distribution, reconstructing these changes (i.e., reconstructing the ancestral state of all internal nodes in the tree) may be easier or harder than reconstructing the ancestral root state. However, results from information theory indicate that for the standard Yule tree, the task of reconstructing internal node states remains feasible, even for very high substitution rates. Moreover, computer simulations demonstrate that for more complex trees and scenarios, this result still holds. For a large variety of counting, parsimony- and likelihood-based methods, the predictive accuracy of a randomly selected internal node in the tree is indeed much higher than the accuracy of the same method when applied to the tree root. Moreover, parsimony- and likelihood-based methods appear to be remarkably robust to sampling bias and model mis-specification. [Ancestral state prediction; character evolution; majority rule; Markov model; maximum likelihood; parsimony; phylogenetic tree.]