Latest issue

Systematic Biology - RSS feed of current issue

URL

http://sysbio.oxfordjournals.org

February 15, 2015

22:09

Paleontological systematics relies heavily on morphological data that have undergone decay and fossilization. Here, we apply a heuristic means to assess how a fossil's incompleteness detracts from inferring its phylogenetic relationships. We compiled a phylogenetic matrix for primates and simulated the extinction of living species by deleting an extant taxon's molecular data and keeping only those morphological characters present in actual fossils. The choice of characters present in a given living taxon (the subject) was defined by those present in a given fossil (the template). By measuring congruence between a well-corroborated phylogeny and those incorporating artificial fossils, and by comparing real vs. random character distributions and states, we tested the information content of paleontological datasets and determined whether extinction of a living species leads to bias in phylogeny reconstruction. We found a positive correlation between fossil completeness and topological congruence. Real fossil templates sampled for 36 or more of the 360 available morphological characters (including dental) performed significantly better than similarly complete templates with random states. Templates dominated by only one partition performed worse than templates with randomly sampled characters across partitions. The template based on the Eocene primate Darwinius masillae performs better than most other templates with a similar number of sampled characters, likely due to preservation of data across multiple partitions. Our results support the interpretation that Darwinius is strepsirhine, not haplorhine, and suggest that paleontological datasets are reliable in primate phylogeny reconstruction.
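The subject/template masking described above can be illustrated with a minimal sketch. It assumes each taxon is a list of character states with "?" marking an unscored character; the function names and the toy rows are hypothetical and are not taken from the study.

```python
MISSING = "?"

def apply_template(subject, template):
    """Keep the subject's state only where the template fossil is scored."""
    if len(subject) != len(template):
        raise ValueError("subject and template must have the same number of characters")
    return [s if t != MISSING else MISSING for s, t in zip(subject, template)]

def completeness(row):
    """Fraction of characters that are scored (not missing)."""
    return sum(1 for state in row if state != MISSING) / len(row)

# Toy data: a fully scored living taxon and a fossil scored for half its characters.
subject = ["0", "1", "1", "2", "0", "1"]
template = ["1", "?", "?", "0", "?", "1"]

artificial_fossil = apply_template(subject, template)
print(artificial_fossil, completeness(artificial_fossil))
```

The artificial fossil keeps the living taxon's own states but inherits the fossil's pattern of missing data, which is what lets completeness be compared against topological congruence.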

The unique ability of modern turtles to retract their head and neck into the shell through a side-necked (pleurodiran) or hidden-necked (cryptodiran) motion is thought to have evolved independently in crown turtles. The anatomical changes that led to the vertebral shapes of modern turtles, however, are still poorly understood. Here we present comprehensive geometric morphometric analyses that trace turtle vertebral evolution and reconstruct disparity across phylogeny. Disparity of vertebral shape was high at the dawn of turtle evolution and decreased after the modern groups evolved, reflecting a stabilization of morphotypes that correspond to the two retraction modes. Stem turtles, which had a very simple mode of retraction, the lateral head tuck, show increasing flexibility of the neck through evolution towards a pleurodiran-like morphotype. The latter was the precondition for evolving pleurodiran and cryptodiran vertebrae. There is no correlation between the construction of formed articulations in the cervical centra and neck mobility. An increasing mobility between vertebrae, associated with changes in vertebral shape, resulted in a more advanced ability to retract the neck. In this regard, we hypothesize that the lateral tucking retraction of stem turtles was the precondition not only for pleurodiran but also for cryptodiran retraction. For the former, a kink in the middle third of the neck needed to be acquired, whereas for the latter, modification was necessary between the eighth cervical vertebra and the first thoracic vertebra. Our paper highlights the utility of 3D shape data, analyzed in a phylogenetic framework, to examine the magnitude and mode of evolutionary modifications to vertebral morphology. By reconstructing and visualizing ancestral anatomical shapes, we provide insight into the anatomical features underlying neck retraction mode, which is a salient component of extant turtle classification.

The phylogenetic literature contains numerous measures for assessing differences between two phylogenetic trees. Individual measures have been criticized on various grounds, but little is known about their comparative performance in typical applications. We evaluate the performance of nine tree distance measures on two tasks: 1) distinguishing trees separated by lesser versus greater numbers of recombinations, and 2) distinguishing trees inferred with lower versus higher quality data. We find that when the trees being compared are similar, measures that make use of branch lengths are superior, with the branch-length version of the Robinson–Foulds metric performing best. In contrast, for dissimilar trees topology-only measures are superior, with the Alignment metric of Nye et al. performing best. We also apply the measures to a mammalian dataset and observe that the best metric depends on whether branch-length information is of interest. We give practical recommendations for choosing a tree distance metric in different applications.
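For orientation, the two Robinson–Foulds variants mentioned above can be sketched as follows. This is a minimal illustration, not the evaluation code from the study: it assumes each unrooted tree has already been reduced to a dictionary mapping one side of every bipartition (split) to its branch length, and the function names and toy trees are hypothetical.

```python
def canonical(split, taxa):
    """Represent a bipartition by the side that excludes a fixed reference taxon."""
    side = frozenset(split)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side

def rf_distance(tree1, tree2, taxa):
    """Topology-only Robinson-Foulds distance: number of splits found in only one tree."""
    splits1 = {canonical(s, taxa) for s in tree1}
    splits2 = {canonical(s, taxa) for s in tree2}
    return len(splits1 ^ splits2)

def branch_length_rf(tree1, tree2, taxa):
    """Branch-length version: sum of absolute length differences over all splits.
    A split missing from one tree counts as having length 0 there."""
    lengths1 = {canonical(s, taxa): length for s, length in tree1.items()}
    lengths2 = {canonical(s, taxa): length for s, length in tree2.items()}
    all_splits = lengths1.keys() | lengths2.keys()
    return sum(abs(lengths1.get(s, 0.0) - lengths2.get(s, 0.0)) for s in all_splits)

# Toy five-taxon trees, internal branches only (terminal branches could be added
# to the dictionaries in the same way).
taxa = {"A", "B", "C", "D", "E"}
tree1 = {frozenset("AB"): 0.5, frozenset("ABC"): 0.3}
tree2 = {frozenset("AC"): 0.5, frozenset("ABC"): 0.1}

print(rf_distance(tree1, tree2, taxa))
print(branch_length_rf(tree1, tree2, taxa))
```

The topology-only count ignores how long the conflicting branches are, whereas the branch-length version also penalizes length differences on shared splits, which is consistent with it being more sensitive when trees are similar.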

In disciplines such as macroevolution that are not amenable to experimentation, scientists usually rely on current observations to test hypotheses about historical events, assuming that "the present is the key to the past." Biogeographers, for example, have used this assumption to reconstruct ancestral ranges from the distribution of extant species. Yet, under scenarios of high extinction rates, the biodiversity we observe today might not be representative of the historical diversity and this could result in incorrect biogeographic reconstructions. Here, we introduce a new approach to incorporate into biogeographic inference the temporal, spatial, and environmental information provided by the fossil record, as direct evidence of the extinct biodiversity fraction. First, inferences of ancestral ranges for those nodes in the phylogeny calibrated with the fossil record are constrained to include the geographic distribution of the fossil. Second, we use fossil distribution and past climate data to reconstruct the climatic preferences and potential distribution of ancestral lineages over time, and use this information to build a biogeographic model that takes into account "ecological connectivity" through time. To show the power of this approach, we reconstruct the biogeographic history of the large angiosperm genus Hypericum, which has a fossil record extending back to the Early Cenozoic. Unlike previous reconstructions based on extant species distributions, our results reveal that Hypericum stem lineages were already distributed in the Holarctic before diversification of its crown-group, and that the geographic distribution of the genus has been relatively stable throughout the climatic oscillations of the Cenozoic. Geographical movement was mediated by the existence of climatic corridors, like Beringia, whereas the equatorial tropical belt acted as a climatic barrier, preventing Hypericum lineages from reaching the southern temperate regions. Our study shows that an integrative approach to historical biogeography, one that combines sources of evidence as diverse as paleontology, ecology, and phylogenetics, could help us obtain more accurate reconstructions of ancient evolutionary history. It also reveals the confounding effect that different rates of extinction across regions have in biogeography, sometimes leading to ancestral areas being erroneously inferred as recent colonization events.

Despite impressive technical and theoretical developments, reconstruction of phylogenetic trees for enormous quantities of molecular data is still a challenging task. A key tool in analyses of large data sets has been the construction of separate trees for subsets (e.g., quartets) of sequences, and subsequent combination of these subtrees into a single tree for the full set (i.e., supertree analysis). Unfortunately, even amalgamating quartets into a supertree remains a computationally daunting task. Assigning weights to quartets to indicate importance or reliability was proposed more than a decade ago, but handling weighted quartets is even more challenging and has scarcely been attempted in the past. In this work, we focus on weighted quartet-based approaches. We propose a scheme to assign weights to quartets coming from weighted trees and devise a tree similarity measure for weighted trees based on weighted quartets. We also extend the quartet MaxCut (QMC) algorithm to handle weighted quartets. We evaluate these tools on simulated and real data. Our simulated data analysis highlights the additional information that is conveyed when using the new weighted tree similarity measure, and shows that extending QMC to a weighted setting improves the quality of tree reconstruction. Our analyses of a cyanobacterial data set with weighted QMC reinforce previous results achieved with other tools.

Previous work on the star-tree paradox has shown that Bayesian methods suffer from a long branch attraction bias. Here, that work is extended to settings involving more taxa and partially resolved trees. The long branch attraction bias is confirmed to arise more broadly and an additional source of bias is found. As a by-product, the analysis yields methods that correct for biases toward particular topologies. The corrections can be easily calculated using existing Bayesian software. Posterior support for a set of two or more trees can thus be supplemented with corrected versions to cross-check or replace results. Simulations show the corrections to be highly effective.

The utility of fossils in evolutionary contexts is dependent on their accurate placement in phylogenetic frameworks, yet intrinsic and widespread missing data make this problematic. The complex taphonomic processes occurring during fossilization can make it difficult to distinguish absence from non-preservation, especially in the case of exceptionally preserved soft-tissue fossils: is a particular morphological character (e.g., appendage, tentacle, or nerve) missing from a fossil because it was never there (phylogenetic absence), or because it simply was not preserved (taphonomic loss)? Missing data have not been tested either in the context of interpretation of non-present anatomy or in the context of directional shifts and biases in affinity. Here, complete taxa, both simulated and empirical, are subjected to data loss through the replacement of present entries (1s) with either missing (?s) or absent (0s) entries. Both cause taxa to drift down trees, from their original position, toward the root. Absolute thresholds at which downshift is significant are extremely low for introduced absences (two entries replaced, 6% of present characters). The opposite threshold in empirical fossil taxa is also found to be low; replacing two absent entries with presences causes fossil taxa to drift up trees. As such, only a few instances of non-preserved characters interpreted as absences will cause fossil organisms to be erroneously interpreted as more primitive than they were in life. This observed sensitivity to coding non-present morphology presents a problem for all evolutionary studies that attempt to use fossils to reconstruct rates of evolution or unlock sequences of morphological change. Stem-ward slippage, whereby fossilization processes cause organisms to appear artificially primitive, appears to be a ubiquitous and problematic phenomenon inherent to missing data, even when no decay biases exist. Absent characters therefore require explicit justification and taphonomic frameworks to support their interpretation.
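The data-loss treatment described above, in which present entries are recoded as missing or absent, can be sketched in a few lines. This is an illustrative assumption about how such a treatment might be coded, not the procedure used in the study; the function name, the random choice of entries, and the toy row are hypothetical.

```python
import random

def degrade(row, n_replace, replacement, seed=0):
    """Replace n_replace randomly chosen present entries ("1") with "?" or "0"."""
    if replacement not in ("?", "0"):
        raise ValueError('replacement must be "?" (missing) or "0" (absent)')
    present = [i for i, state in enumerate(row) if state == "1"]
    if n_replace > len(present):
        raise ValueError("not enough present entries to replace")
    rng = random.Random(seed)  # fixed seed so the same entries are hit in both treatments
    chosen = set(rng.sample(present, n_replace))
    return [replacement if i in chosen else state for i, state in enumerate(row)]

# Toy presence/absence row for one taxon.
row = list("1101101")
as_missing = degrade(row, 2, "?")  # taphonomic loss coded as unknown
as_absent = degrade(row, 2, "0")   # the same loss coded as true absence
print("".join(as_missing), "".join(as_absent))
```

Running both treatments on the same entries isolates the effect of the coding decision itself, which is the contrast the abstract draws between missing and absent entries.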

Genetic sequence data provide information about the distances between species or branch lengths in a phylogeny, but not about the absolute divergence times or the evolutionary rates directly. Bayesian methods for dating species divergences estimate times and rates by assigning priors on them. In particular, the prior on times (node ages on the phylogeny) incorporates information in the fossil record to calibrate the molecular tree. Because times and rates are confounded, our posterior time estimates will not approach point values even if an infinite amount of sequence data is used in the analysis. In a previous study we developed a finite-sites theory to characterize the uncertainty in Bayesian divergence time estimation in analysis of large but finite sequence data sets under a strict molecular clock. As most modern clock dating analyses use more than one locus and are conducted under relaxed clock models, here we extend the theory to the case of relaxed clock analysis of data from multiple loci (site partitions). Uncertainty in posterior time estimates is partitioned into three sources: sampling errors in the estimates of branch lengths in the tree for each locus due to limited sequence length, variation of substitution rates among lineages and among loci, and uncertainty in fossil calibrations. Using a simple but analogous estimation problem involving the multivariate normal distribution, we predict that as the number of loci ($L$) goes to infinity, the variance in posterior time estimates decreases and approaches the infinite-data limit at the rate of $1/L$, and the limit is independent of the number of sites in the sequence alignment. We then confirmed the predictions by using computer simulation on phylogenies of two or three species, and by analyzing a real genomic data set for six primate species. Our results suggest that with the fossil calibrations fixed, analyzing multiple loci or site partitions is the most effective way of improving the precision of posterior time estimation. However, even if a huge amount of sequence data is analyzed, considerable uncertainty will persist in time estimates.
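The predicted scaling with the number of loci can be written schematically as below. The symbols $V_\infty$ (the infinite-data limit of the posterior variance, which does not depend on the number of sites) and $c$ (an unspecified positive constant) are notation introduced here for illustration only; the expression restates the $1/L$ behavior described above and is not the authors' derived result.

$$
\operatorname{Var}(t \mid \text{data}) \;\approx\; V_\infty + \frac{c}{L}, \qquad L \to \infty
$$

Read this way, adding loci shrinks only the second term, so the variance never falls below $V_\infty$ while the fossil calibrations stay fixed.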

The genetic distance between biological sequences is a fundamental quantity in molecular evolution. It pertains to questions of rates of evolution, existence of a molecular clock, and phylogenetic inference. Under the class of continuous-time substitution models, the distance is commonly defined as the expected number of substitutions at any site in the sequence. We eschew the almost ubiquitous assumptions of evolution under stationarity and time-reversible conditions and extend the concept of the expected number of substitutions to nonstationary Markov models where the only remaining constraint is of time homogeneity between nodes in the tree. Our measure of genetic distance reduces to the standard formulation if the data in question are consistent with the stationarity assumption. We apply this general model to samples from across the tree of life to compare distances so obtained with those from the general time-reversible model, with and without rate heterogeneity across sites, and the paralinear distance, an empirical pairwise method explicitly designed to address nonstationarity. We discover that estimates from both variants of the general time-reversible model and the paralinear distance systematically overestimate genetic distance and departure from the molecular clock. The magnitude of the distance bias is proportional to departure from stationarity, which we demonstrate to be associated with longer edge lengths. The marked improvement in consistency between the general nonstationary Markov model and sequence alignments leads us to conclude that analyses of evolutionary rates and phylogenies will be substantively improved by application of this model.
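To make the idea of an expected number of substitutions without the stationarity assumption concrete, here is a minimal sketch for a two-state, time-homogeneous Markov chain. It is a toy illustration under stated assumptions, not the authors' model or estimator: the function name, the rates `a` (state 0 to 1) and `b` (state 1 to 0), and the example values are all hypothetical. The expected count is computed as each exit rate multiplied by the expected time spent in the corresponding state.

```python
import math

def expected_substitutions(p0_start, a, b, t):
    """Expected number of substitutions on a branch of duration t.

    p0_start: probability of being in state 0 at the start of the branch.
    a, b:     substitution rates 0 -> 1 and 1 -> 0 (constant along the branch).
    """
    total = a + b
    p0_stationary = b / total
    # Expected time in state 0: integral over [0, t] of
    # P(state 0 at s) = p0_stationary + (p0_start - p0_stationary) * exp(-total * s).
    time_in_0 = p0_stationary * t + (p0_start - p0_stationary) * (1 - math.exp(-total * t)) / total
    time_in_1 = t - time_in_0
    return a * time_in_0 + b * time_in_1

a, b, t = 1.0, 3.0, 0.5
d_stationary = expected_substitutions(b / (a + b), a, b, t)  # start at equilibrium
d_nonstationary = expected_substitutions(0.0, a, b, t)       # start far from equilibrium
print(d_stationary, d_nonstationary)
```

When the starting distribution equals the stationary one, the result reduces to the familiar 2ab/(a+b) × t, mirroring the statement above that the general measure reduces to the standard formulation under stationarity; any other starting distribution gives a different value.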

Although the use of landmark data to study shape changes along a phylogenetic tree has become a common practice in evolutionary studies, the role of this sort of data for the inference of phylogenetic relationships remains under debate. Theoretical issues aside, the very existence of historical information in landmark data has been challenged, since phylogenetic analyses have often shown little congruence with alternative sources of evidence. However, most analyses conducted in the past were based upon a single landmark configuration, leaving it unsettled whether the incorporation of multiple configurations may improve the rather poor performance of this data source in most previous phylogenetic analyses. Here, we present a phylogenetic analysis of landmark data that combines information from several skeletal structures to derive a phylogenetic tree for musteloids. The analysis includes nine configurations representing different skeletal structures for 24 species. The resulting tree presents several notable concordances with phylogenetic hypotheses derived from molecular data. In particular, Mephitidae, Procyonidae, and Lutrinae plus the genera Martes, Mustela, Galictis, and Procyon were retrieved as monophyletic. In addition, other groupings were in agreement with molecular phylogenies or presented only minor discordances. Complementary analyses have also indicated that the results improve substantially when an increasing number of landmark configurations are included in the analysis. The results presented here thus highlight the importance of combining information from multiple structures to derive phylogenetic hypotheses from landmark data.

Likelihood-based methods are commonplace in phylogenetic systematics. Although much effort has been directed toward likelihood-based models for molecular data, comparatively less work has addressed models for discrete morphological character (DMC) data. Among-character rate variation (ACRV) may confound phylogenetic analysis, but there have been few analyses of the magnitude and distribution of rate heterogeneity among DMCs. Using 76 data sets covering a range of plants, invertebrates, and vertebrates, we used a modified version of MrBayes to test equal, gamma-distributed, and lognormally distributed models of ACRV, integrating across phylogenetic uncertainty using Bayesian model selection. We found that in approximately 80% of data sets, unequal-rates models outperformed equal-rates models, especially among larger data sets. Moreover, although most data sets were equivocal, more data sets favored the lognormal rate distribution relative to the gamma rate distribution, lending some support for more complex character correlations than in molecular data. Parsimony estimation of the underlying rate distributions in several data sets suggests that the lognormal distribution is preferred when there are many slowly evolving characters and fewer quickly evolving characters. The commonly adopted four-rate-category discrete approximation used for molecular data was found to be sufficient to approximate a gamma rate distribution with discrete characters. However, for the two data sets tested that favored a lognormal rate distribution, the continuous distribution was better approximated with at least eight discrete rate categories. Although the effect of rate model on the estimation of topology was difficult to assess across all data sets, it appeared relatively minor between the unequal-rates models for the one data set examined carefully. As in molecular analyses, we argue that researchers should test and adopt the most appropriate model of rate variation for the data set in question. As discrete characters are increasingly used in more sophisticated likelihood-based phylogenetic analyses, it is important that these studies be built on the most appropriate and carefully selected underlying models of evolution.

With the availability of genomic sequence data, there is increasing interest in using genes with a possible history of duplication and loss for species tree inference. Here we assess the performance of both nonprobabilistic and probabilistic species tree inference approaches using gene duplication and loss and coalescence simulations. We evaluated the performance of gene tree parsimony (GTP) based on duplication (Only-dup), duplication and loss (Dup-loss), and deep coalescence (Deep-c) costs, the NJst distance method, the MulRF supertree method, and PHYLDOG, which jointly estimates gene trees and species tree using a hierarchical probabilistic model. We examined the effects of gene tree and species sampling, gene tree error, and duplication and loss rates on the accuracy of phylogenetic estimates. In the 10-taxon duplication and loss simulation experiments, MulRF is more accurate than the other methods when the duplication and loss rates are low, and Dup-loss is generally the most accurate when the duplication and loss rates are high. PHYLDOG performs well in 10-taxon duplication and loss simulations, but its run time is prohibitively long on larger data sets. In the larger duplication and loss simulation experiments, MulRF outperforms all other methods in experiments with at most 100 taxa; however, in the larger simulation, Dup-loss generally performs best. In all duplication and loss simulation experiments with more than 10 taxa, all methods perform better with more gene trees and fewer missing sequences, and they are all affected by gene tree error. Our results also highlight high levels of error in estimates of duplications and losses from GTP methods and demonstrate the usefulness of methods based on generic tree distances for large analyses.

Species richness varies widely across the tree of life, and there is great interest in identifying ecological, geographic, and other factors that affect rates of species proliferation. Recent methods for explicitly modeling the relationships among character states, speciation rates, and extinction rates on phylogenetic trees (BiSSE, QuaSSE, GeoSSE, and related models) have been widely used to test hypotheses about character state-dependent diversification rates. Here, we document the disconcerting ease with which neutral traits are inferred to have statistically significant associations with speciation rate. We first demonstrate this unfortunate effect for a known model assumption violation: shifts in speciation rate associated with a character not included in the model. We further show that for many empirical phylogenies, characters simulated in the absence of state-dependent diversification exhibit an even higher Type I error rate, indicating that the method is susceptible to additional, unknown model inadequacies. For traits that evolve slowly, the root cause appears to be a statistical framework that does not require replicated shifts in character state and diversification. However, spurious associations between character state and speciation rate arise even for traits that lack phylogenetic signal, suggesting that phylogenetic pseudoreplication alone cannot fully explain the problem. The surprising severity of this phenomenon suggests that many trait–diversification relationships reported in the literature may not be real. More generally, we highlight the need for diagnosing and understanding the consequences of model inadequacy in phylogenetic comparative methods.

We introduce the Phylogenetic Likelihood Library (PLL), a highly optimized application programming interface for developing likelihood-based phylogenetic inference and postanalysis software. The PLL implements appropriate data structures and functions that allow users to quickly implement common, error-prone, and labor-intensive tasks, such as likelihood calculations, model parameter as well as branch length optimization, and tree space exploration. The highly optimized and parallelized implementation of the phylogenetic likelihood function and thorough documentation provide a framework for rapid development of scalable parallel phylogenetic software. Using two likelihood-based phylogenetic codes as examples, we show that the PLL improves the sequential performance of current software by a factor of 2–10 while requiring only 1 month of programming time for integration. We show that, when numerical scaling for preventing floating point underflow is enabled, the double precision likelihood calculations in the PLL are up to 1.9 times faster than those in BEAGLE. On an empirical DNA dataset with 2000 taxa the AVX version of PLL is 4 times faster than BEAGLE (scaling enabled and required). The PLL is available at http://www.libpll.org under the GNU General Public License (GPL).