Latest issue

Systematic Biology - RSS feed of current issue

URL

http://sysbio.oxfordjournals.org

August 17, 2015

07:19

A number of methods have been developed for modeling the evolution of a quantitative trait on a phylogeny. These methods have received renewed interest in the context of genome-wide studies of gene expression, in which the expression levels of many genes can be modeled as quantitative traits. Here we develop a new method for joint analyses of quantitative traits within and between species, the Expression Variance and Evolution (EVE) model. The model parameterizes the ratio of population to evolutionary expression variance, facilitating a wide variety of analyses, including a test for lineage-specific shifts in expression level, and a phylogenetic ANOVA that can detect genes with increased or decreased ratios of expression divergence to diversity, analogous to the famous Hudson-Kreitman-Aguadé (HKA) test used to detect selection at the DNA level. We use simulations to explore the properties of these tests under a variety of circumstances and show that the phylogenetic ANOVA is more accurate than the standard ANOVA (no accounting for phylogeny) sometimes used in transcriptomics. We then apply the EVE model to a mammalian phylogeny of 15 species typed for expression levels in liver tissue. We identify genes with high expression divergence between species as candidates for expression level adaptation, and genes with high expression diversity within species as candidates for expression level conservation and/or plasticity. Using the test for lineage-specific expression shifts, we identify several candidate genes for expression level adaptation on the catarrhine and human lineages, including genes putatively related to dietary changes in humans. We compare these results to those reported previously using a model that ignores expression variance within species, uncovering important differences in performance. We demonstrate the necessity for a phylogenetic model in comparative expression studies and show the utility of the EVE model to detect expression divergence, diversity, and branch-specific shifts.

07:19

Terraces are sets of trees with precisely the same likelihood or parsimony score, which can be induced by missing sequences in partitioned multi-locus phylogenetic data matrices. The potentially large set of trees on a terrace can be characterized by enumeration algorithms or consensus methods that exploit the pattern of partial taxon coverage in the data, independent of the sequence data themselves. Terraces can add ambiguity and complexity to phylogenetic inference, particularly in settings where inference is already challenging: data sets with many taxa and relatively few loci. In this article we present five new findings about terraces and their impacts on phylogenetic inference. First, we clarify assumptions about partitioning scheme model parameters that are necessary for the existence of terraces. Second, we explore the dependence of terrace size on partitioning scheme and indicate how to find the partitioning scheme associated with the largest terrace containing a given tree. Third, we highlight the impact of terrace size on bootstrap estimates of confidence limits in clades, and characterize the surprising result that the bootstrap proportion for a clade, as it is usually calculated, can be entirely determined by the frequency of bipartitions on a terrace, with some bipartitions receiving high support even when incorrect. Fourth, we dissect some effects of prior distributions of edge lengths on the computed posterior probabilities of clades on terraces, to understand an example in which long edges "attract" each other in Bayesian inference. Fifth, we describe how assuming relationships between edge lengths of different loci, as an attempt to avoid terraces, can also be problematic when taxon coverage is partial, specifically when heterotachy is present. Finally, we discuss strategies for remediation of some of these problems. One promising approach finds a minimal set of taxa which, when deleted from the data matrix, reduces the size of a terrace to a single tree.

07:19

Phylogenetic relationships in recent, rapid radiations can be difficult to resolve due to incomplete lineage sorting and reliance on genetic markers that evolve slowly relative to the rate of speciation. By incorporating hundreds to thousands of unlinked loci, phylogenomic analyses have the potential to mitigate these difficulties. Here, we attempt to resolve phylogenetic relationships among eight shrew species (genus Crocidura) from the Philippines, a phylogenetic problem that has proven intractable with small (< 10 loci) data sets. We sequenced hundreds of ultraconserved elements and whole mitochondrial genomes in these species and estimated phylogenies using concatenation, summary coalescent, and hierarchical coalescent methods. The concatenated approach recovered a maximally supported and fully resolved tree. In contrast, the coalescent-based approaches produced similar topologies, but each had several poorly supported nodes. Using simulations, we demonstrate that the concatenated tree could be positively misleading. Our simulations also show that the tree shape we tend to infer, which involves a series of short internal branches, is difficult to resolve, even if substitution models are known and multiple individuals per species are sampled. As such, the low support we obtained for backbone relationships in our coalescent-based inferences reflects a real and appropriate lack of certainty. Our results illuminate the challenges of estimating a bifurcating tree in a rapid and recent radiation, providing a rare empirical example of a nearly simultaneous series of speciation events in a terrestrial animal lineage as it spreads across an oceanic archipelago.

07:19

Simulation experiments are used widely throughout evolutionary biology and bioinformatics to compare models, promote methods, and test hypotheses. The biggest practical constraint on simulation experiments is the computational demand, particularly as the number of parameters increases. Given the extraordinary success of Monte Carlo methods for conducting inference in phylogenetics, and indeed throughout the sciences, we investigate ways in which a Monte Carlo framework can be used to carry out simulation experiments more efficiently. The key idea is to sample parameter values for the experiments, rather than iterate through them exhaustively. Exhaustive analyses become completely infeasible when the number of parameters gets too large, whereas sampled approaches can fare better in higher dimensions. We illustrate the framework with applications to phylogenetics and genetic archaeology.
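The contrast between exhaustive and sampled designs can be sketched in a few lines of Python. This is only an illustration of the general idea, not the authors' framework: the function names, the unit interval used for every parameter, and the uniform sampling distribution are all assumptions made for the example.

```python
import itertools
import random


def grid_design(levels_per_param, n_params):
    """Exhaustive design: every combination of evenly spaced levels in [0, 1]."""
    levels = [i / (levels_per_param - 1) for i in range(levels_per_param)]
    return list(itertools.product(levels, repeat=n_params))


def sampled_design(n_points, n_params, seed=0):
    """Sampled design: draw each parameter uniformly from [0, 1) (an assumed prior)."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(n_params)) for _ in range(n_points)]


# With 5 levels and 4 parameters the grid already needs 5**4 simulation runs,
# and that count grows exponentially with the number of parameters.
grid = grid_design(levels_per_param=5, n_params=4)

# The sampled design uses a fixed budget of runs whatever the dimension.
sample = sampled_design(n_points=100, n_params=4)
```

Each tuple would then be passed to whatever simulator is under study; the budget of the sampled design is chosen by the experimenter rather than dictated by the number of parameters.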

07:19

The recent publication of a time-tree for the plant family Solanaceae (nightshades) provides the opportunity to use independent calibrations to test divergence times previously inferred for the diverse Neotropical butterfly tribe Ithomiini. Ithomiini includes clades that are obligate herbivores of Solanaceae, with some genera feeding on only one genus. We used 8 calibrations extracted from the plant tree in a new relaxed molecular-clock analysis to produce an alternative temporal framework for the diversification of ithomiines. We compared the resulting age estimates to: (i) a time-tree obtained using 7 secondary calibrations from the Nymphalidae tree of Wahlberg et al. (2009), and (ii) Wahlberg et al.'s (2009) original age estimates for the same clades. We found that Bayesian clock estimates were rather sensitive to a variety of analytical parameters, including taxon sampling. Regardless of this sensitivity, however, ithomiine divergence times calibrated with the ages of nightshades were always on average half the age of previous estimates. Younger dates for ithomiine clades appear to fit better with factors long suggested to have promoted diversification of the group such as the uplifting of the Andes, in the case of montane genera. Alternatively, if ithomiines are as old as previous estimates suggest, the recent ages inferred for the diversification of Solanaceae seem likely to be seriously underestimated. Our study exemplifies the difficulty of testing hypotheses of divergence times and of choosing between alternative dating scenarios, and shows that age estimates based on seemingly plausible calibrations may be grossly incongruent.

07:19

A binary phylogenetic network may or may not be obtainable from a tree by the addition of directed edges (arcs) between tree arcs. Here, we establish a precise and easily tested criterion (based on "2-SAT") that efficiently determines whether or not any given network can be realized in this way. Moreover, the proof provides a polynomial-time algorithm for finding one or more trees (when they exist) on which the network can be based. A number of interesting consequences are presented as corollaries; these lead to some further relevant questions and observations, which we outline in the conclusion.
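For readers unfamiliar with 2-SAT, the following Python sketch shows a generic linear-time 2-SAT solver based on strongly connected components of the implication graph. It is a textbook technique, not the authors' construction: the abstract does not say how a network is translated into clauses, so that step is left out, and the function name and literal encoding are assumptions made for the example.

```python
def solve_2sat(n_vars, clauses):
    """Return a satisfying assignment (list of bools) or None if unsatisfiable.

    Variables are numbered 1..n_vars; a literal is +v or -v.
    Each clause (a, b) means "a OR b".
    """
    def node(lit):
        # Literal v maps to node 2*(v-1); its negation maps to 2*(v-1)+1.
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    n = 2 * n_vars
    graph = [[] for _ in range(n)]
    reverse = [[] for _ in range(n)]
    for a, b in clauses:
        # (a OR b) is equivalent to (NOT a -> b) and (NOT b -> a).
        for src, dst in ((node(-a), node(b)), (node(-b), node(a))):
            graph[src].append(dst)
            reverse[dst].append(src)

    # Kosaraju pass 1: record nodes in order of DFS finishing time.
    order, seen = [], [False] * n
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, iter(graph[start]))]
        while stack:
            v, neighbours = stack[-1]
            for w in neighbours:
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, iter(graph[w])))
                    break
            else:
                order.append(v)
                stack.pop()

    # Pass 2: label components on the reversed graph in reverse finishing
    # order, so labels follow a topological order of the implication graph.
    comp = [-1] * n
    label = 0
    for start in reversed(order):
        if comp[start] != -1:
            continue
        comp[start] = label
        stack = [start]
        while stack:
            v = stack.pop()
            for w in reverse[v]:
                if comp[w] == -1:
                    comp[w] = label
                    stack.append(w)
        label += 1

    assignment = []
    for v in range(n_vars):
        if comp[2 * v] == comp[2 * v + 1]:
            return None  # v and NOT v imply each other
        # A literal is set true when its component comes later topologically.
        assignment.append(comp[2 * v] > comp[2 * v + 1])
    return assignment
```

For example, `solve_2sat(3, [(1, 2), (-1, 2), (-2, 3)])` returns an assignment satisfying all three clauses, while `solve_2sat(1, [(1, 1), (-1, -1)])` returns `None`.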

07:19

Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms.

07:19

Polyploidization is an important speciation mechanism in the barley genus Hordeum. To analyze evolutionary changes after allopolyploidization, knowledge of parental relationships is essential. One chloroplast and 12 nuclear single-copy loci were amplified by polymerase chain reaction (PCR) in all Hordeum plus six outgroup species. Amplicons from each of 96 individuals were pooled, sheared, labeled with individual-specific barcodes and sequenced in a single run on a 454 platform. Reference sequences were obtained by cloning and Sanger sequencing of all loci for nine supplementary individuals. The 454 reads were assembled into contigs representing the 13 loci and, for polyploids, also homoeologues. Phylogenetic analyses were conducted for all loci separately and for a concatenated data matrix of all loci. For diploid taxa, a Bayesian concordance analysis and a coalescent-based dated species tree were inferred from all gene trees. Chloroplast matK was used to determine the maternal parent in allopolyploid taxa. The relative performance of different multilocus analyses in the presence of incomplete lineage sorting and hybridization was also assessed. The resulting multilocus phylogeny reveals for the first time species phylogeny and progenitor-derivative relationships of all di- and polyploid Hordeum taxa within a single analysis. Our study proves that it is possible to obtain a multilocus species-level phylogeny for di- and polyploid taxa by combining PCR with next-generation sequencing, without cloning and without creating a heavy load of sequence data.

07:19

Genome sequence data contain abundant information about genealogical history, but methods for extracting and interpreting this information are not yet fully developed. We analyzed genome sequences for multiple accessions of the selfing plant, Arabidopsis thaliana, with the goal of better understanding its genealogical history. As expected from accessions of the same species, we found much discordance between nuclear gene trees. Nonetheless, we inferred the optimal population tree under the assumption that all discordance is due to incomplete lineage sorting. To cope with the size of the data (many genes and many taxa), our pipeline is based on parallel computing and divides the problem into four-taxon trees. However, just because a population tree can be estimated does not mean that the assumptions of the multispecies coalescent model hold. Therefore, we implemented a new, nonparametric test to evaluate whether a population tree adequately explains the observed quartet frequencies (the frequencies of gene trees with each resolution of each four-taxon set). This test also considers other models: panmixia and a partially resolved population tree, that is, a tree in which some nodes are collapsed into local panmixia. We found that a partially resolved population tree provides the best fit to the data, providing evidence for tree-like structure within A. thaliana, qualitatively similar to what might be expected between different, closely related species. Further, we show that the pattern of deviation from expectations can be used to identify instances of introgression and detect one clear case of reticulation among ecotypes that have come into contact in the United Kingdom. Our study illustrates how we can use genome sequence data to evaluate whether phylogenetic relationships are strictly tree-like or reticulating.
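The quartet frequencies mentioned above can be illustrated with a short Python sketch that tallies how often each resolution of one four-taxon set appears among gene trees. It assumes the resolution of each gene tree has already been extracted as two pairs of taxa; the function names and input format are hypothetical, and the authors' parallel pipeline and nonparametric test are not reproduced here.

```python
from collections import Counter


def canonical_split(pair_a, pair_b):
    """Order-independent label for a quartet resolution such as A,B|C,D."""
    left, right = sorted([tuple(sorted(pair_a)), tuple(sorted(pair_b))])
    return f"{left[0]},{left[1]}|{right[0]},{right[1]}"


def quartet_frequencies(resolutions):
    """Map each observed resolution to its share of the gene trees."""
    counts = Counter(canonical_split(a, b) for a, b in resolutions)
    total = sum(counts.values())
    return {split: n / total for split, n in counts.items()}


# Four hypothetical gene trees restricted to taxa A, B, C, D.
gene_tree_quartets = [
    (("A", "B"), ("C", "D")),
    (("D", "C"), ("B", "A")),  # same resolution, written in a different order
    (("A", "C"), ("B", "D")),
    (("A", "B"), ("C", "D")),
]
freqs = quartet_frequencies(gene_tree_quartets)
```

Under these made-up inputs three of the four gene trees share the resolution A,B|C,D; observed frequencies like these are what a test of fit would compare against the frequencies expected under a candidate population tree.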

07:19

Topological heterogeneity among gene trees is widely observed in phylogenomic analyses and some of this variation is likely caused by systematic error in gene tree estimation. Systematic error can be mitigated by improving models of sequence evolution to account for all evolutionary processes relevant to each gene or identifying those genes whose evolution best conforms to existing models. However, the best method for identifying such genes is not well established. Here, we ask if filtering genes according to their clock-likeness or posterior predictive effect size (PPES, an inference-based measure of model violation) improves phylogenetic reliability and congruence. We compared these approaches to each other, and to the common practice of filtering based on rate of evolution, using two different metrics. First, we compared gene-tree topologies to accepted reference topologies. Second, we examined topological similarity among gene trees in filtered sets. Our results suggest that filtering genes based on clock-likeness and PPES can yield a collection of genes with more reliable phylogenetic signal. For the two exemplar data sets we explored, from yeast and amniotes, clock-likeness and PPES outperformed rate-based filtering in both congruence and reliability.

07:19

Two characters are stratigraphically compatible if some phylogenies indicate that their combinations (state-pairs) evolved without homoplasy and in an order consistent with the fossil record. Simulations assuming independent character change indicate that we expect approximately 95% of compatible character pairs to also be stratigraphically compatible over a wide range of sampling regimes and general evolutionary models. However, two general models of rate heterogeneity elevate expected stratigraphic incompatibility: "early burst" models, where rates of change are higher among early members of a clade than among later members of that clade, and "integration" models, where the evolution of characters is correlated in some manner. Both models have important theoretical and methodological implications. Therefore, we examine 259 metazoan clades for deviations from expected stratigraphic compatibility. We do so first assuming independent change with equal rates of character change through time. We then repeat the analysis assuming independent change with separate "early" and "late" rates (with "early" = the first third of taxa in a clade), with the early and late rates chosen to maximize the probability of the observed compatibility among the early taxa and then the whole clade. We single out Cambrian trilobites as a possible "control" group because morphometric studies suggest that integration patterns are not conserved among closely related species. Even allowing for early bursts, we see excess stratigraphic incompatibility (i.e., negative deviations) in significantly more clades than expected at 0.50, 0.25, and 0.05 P values. This pattern is particularly strong in chordates, echinoderms, and arthropods. However, stratigraphic compatibility among Cambrian trilobites matches the expectations of integration studies, as they (unlike post-Cambrian trilobites) do not deviate from the expectations of independent change with no early bursts. Thus, these results suggest that processes such as integration strongly affect the data that paleontologists use to study phylogeny, disparity, and rates.
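As background to the definition at the start of this abstract, the Python sketch below checks ordinary (non-stratigraphic) compatibility for a pair of unordered binary characters: the pair is compatible when at most three of the four possible state-pairs are observed. It deliberately omits the stratigraphic ordering step, which needs first-appearance data that the abstract does not specify; the function name and the "0"/"1"/"?" coding are assumptions made for the example.

```python
def pairwise_compatible(char_a, char_b):
    """True if two binary characters show at most three of the four state-pairs.

    Each character is a string of "0"/"1" codes, one per taxon, in the same
    taxon order; any other code (e.g. "?") is treated as missing and skipped.
    """
    states = {"0", "1"}
    observed = {
        (a, b) for a, b in zip(char_a, char_b) if a in states and b in states
    }
    return len(observed) < 4


# Hypothetical codings for four taxa.
incompatible = pairwise_compatible("0011", "0101")  # 00, 01, 10, 11 all occur
compatible = pairwise_compatible("0011", "0001")    # only 00, 10, 11 occur
```

A stratigraphic version would additionally ask whether the observed state-pairs first appear in an order that some homoplasy-free history permits.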

07:19

Fossils provide the principal basis for temporal calibrations, which are critical to the accuracy of divergence dating analyses. Translating fossil data into minimum and maximum bounds for calibrations is the most important—often least appreciated—step of divergence dating. Properly justified calibrations require the synthesis of phylogenetic, paleontological, and geological evidence and can be difficult for nonspecialists to formulate. The dynamic nature of the fossil record (e.g., new discoveries, taxonomic revisions, updates of global or local stratigraphy) requires that calibration data be updated continually lest they become obsolete. Here, we announce the Fossil Calibration Database (http://fossilcalibrations.org), a new open-access resource providing vetted fossil calibrations to the scientific community. Calibrations accessioned into this database are based on individual fossil specimens and follow best practices for phylogenetic justification and geochronological constraint. The associated Fossil Calibration Series, a calibration-themed publication series at Palaeontologia Electronica, will serve as a key pipeline for peer-reviewed calibrations to enter the database.

07:19

Current science evaluation still relies on citation performance, despite criticisms of purely bibliometric research assessments. Biological taxonomy suffers from a drain of knowledge and manpower, with poor citation performance commonly held as one reason for this impediment. But is there really such a citation impediment in taxonomy? We compared the citation numbers of 306 taxonomic and 2291 non-taxonomic research articles (2009–2012) on mosses, orchids, ciliates, ants, and snakes, using Web of Science (WoS) and correcting for journal visibility. For three of the five taxa, significant differences were absent in citation numbers between taxonomic and non-taxonomic papers. This was also true for all taxa combined, although taxonomic papers received more citations than non-taxonomic ones. Our results show that, contrary to common belief, taxonomic contributions do not generally reduce a journal's citation performance and might even increase it. The scope of many journals rarely featuring taxonomy would allow editors to encourage a larger number of taxonomic submissions. Moreover, between 1993 and 2012, taxonomic publications accumulated faster than those from all biological fields. However, fewer than half of the taxonomic studies were published in journals in WoS. Thus, editors of highly visible journals inviting taxonomic contributions could benefit from taxonomy's strong momentum. The taxonomic output could increase even more than at its current growth rate if: (i) taxonomists currently publishing on other topics returned to taxonomy and (ii) non-taxonomists identifying the need for taxonomic acts started publishing these, possibly in collaboration with taxonomists. Finally, considering the high number of taxonomic papers attracted by the journal Zootaxa, we expect that the taxonomic community would indeed use increased chances of publishing in WoS indexed journals. We conclude that taxonomy's standing in the present citation-focused scientific landscape could easily improve if the community becomes aware that there is no citation impediment in taxonomy.

07:19

Dating analyses based on molecular data imply that crown angiosperms existed in the Triassic, long before their undisputed appearance in the fossil record in the Early Cretaceous. Following a re-analysis of the age of angiosperms using updated sequences and fossil calibrations, we use a series of simulations to explore the possibility that the older age estimates are a consequence of (i) major shifts in the rate of sequence evolution near the base of the angiosperms and/or (ii) the representative taxon sampling strategy employed in such studies. We show that both of these factors do tend to yield substantially older age estimates. These analyses do not prove that younger age estimates based on the fossil record are correct, but they do suggest caution in accepting the older age estimates obtained using current relaxed-clock methods. Although we have focused here on the angiosperms, we suspect that these results will shed light on dating discrepancies in other major clades.

07:19

Support for Amborella as the sole survivor of an evolutionary lineage that is sister to all other angiosperms comes from positions in DNA multiple-sequence alignments that have a poor fit to time-reversible substitution models. These sites exhibit significant levels of homoplasy, compositional heterogeneity, and strong heterotachy. We report phylogenetic analyses with observed, randomized, and simulated data which show there is little or no expectation that these sites provide useful information for understanding relationships among basal angiosperms. Their inclusion in phylogenetic analyses leads to a long-branch attraction artifact that favors Amborella as sister to other angiosperms in reconstructed phylogenies. Using parametric simulations, we show that sites in chloroplast sequences that exhibit less homoplasy between angiosperms and gymnosperms provide more reliable information for inferring basal angiosperm relationships. We confirm our earlier findings that the basal angiosperm Amborella is most closely related to aquatic herbs. Our current and previously reported (Goremykin et al. 2013) analyses highlight an essential aspect of the total evidence approach to phylogenetic inference. They suggest that data partitioning aimed at identifying components of the data that better fit evolutionary models is a more reliable approach to phylogeny reconstruction at deep taxonomic levels.