Latest issue

Systematic Biology - RSS feed of current issue

URL

XML feed
http://sysbio.oxfordjournals.org

August 14, 2014

06:04

We describe new methods for characterizing gene tree discordance in phylogenomic data sets, which screen for deviations from neutral expectations, summarize variation in statistical support among gene trees, and allow comparison of the patterns of discordance induced by various analysis choices. Using an exceptionally complete set of genome sequences for the short arm of chromosome 3 in Oryza (rice) species, we applied these methods to identify the causes and consequences of differing patterns of discordance in the sets of gene trees inferred using a panel of 20 distinct analysis pipelines. We found that discordance patterns were strongly affected by aspects of data selection, alignment, and alignment masking. Unusual patterns of discordance evident when using certain pipelines were reduced or eliminated by using alternative pipelines, suggesting that they were the product of methodological biases rather than evolutionary processes. In some cases, once such biases were eliminated, evolutionary processes such as introgression could be implicated. Additionally, patterns of gene tree discordance had significant downstream impacts on species tree inference. For example, inference from supermatrices was positively misleading when pipelines that led to biased gene trees were used. Several results may generalize to other data sets: we found that gene tree and species tree inference gave more reasonable results when intron sequence was included during sequence alignment and tree inference, the alignment software PRANK was used, and detectable "block-shift" alignment artifacts were removed. We discuss our findings in the context of well-established relationships in Oryza and continuing controversies regarding the domestication history of O. sativa. [gene trees; multilocus data; Oryza; phylogenomics; phylogeny reconstruction; species trees.]
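The abstract does not say which neutral expectation the screening uses. As an illustration only, the sketch below assumes one widely used expectation from the multispecies coalescent: for any three taxa, the two minority gene tree topologies should be equally frequent, so a strong asymmetry between them suggests something other than incomplete lineage sorting (for example introgression or a pipeline artifact). The function name and the example counts are hypothetical and are not taken from the paper.

```python
from math import comb


def minor_topology_asymmetry(count_a, count_b):
    """Two-sided exact binomial p-value for equal minority topology counts.

    count_a and count_b are the numbers of gene trees that support each of
    the two minority topologies for one triplet of taxa. The null hypothesis
    (an assumption of this sketch) is that both are equally likely.
    """
    n = count_a + count_b
    if n == 0:
        return 1.0
    k = max(count_a, count_b)
    # Probability of a count at least as extreme as k under Binomial(n, 1/2).
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# Hypothetical counts for one triplet: 40 gene trees versus 18 gene trees.
print(minor_topology_asymmetry(40, 18))
```

A small p-value flags the triplet for closer inspection; it does not by itself distinguish a biological cause from a methodological one, which is the distinction the abstract draws by comparing pipelines.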

06:04

More than a decade of phylogenetic research has yielded a well-sampled, strongly supported hypothesis of relationships within the large ( > 4000 species) plant family Acanthaceae. This hypothesis points to intriguing biogeographic patterns and asymmetries in sister clade diversity but, absent a time-calibrated estimate for this evolutionary history, these patterns have remained unexplored. Here, we reconstruct divergence times within Acanthaceae using fossils as calibration points and experimenting with both fossil selection and effects of invoking a maximum age prior related to the origin of Eudicots. Contrary to earlier reports of a paucity of fossils of Lamiales (an order of ~23,000 species that includes Acanthaceae) and to the expectation that a largely herbaceous to soft-wooded and tropical lineage would have few fossils, we recovered 51 reports of fossil Acanthaceae. Rigorous evaluation of these for accurate identification, quality of age assessment and utility in dating yielded eight fossils judged to merit inclusion in analyses. With nearly 10 kb of DNA sequence data, we used two sets of fossils as constraints to reconstruct divergence times. We demonstrate differences in age estimates depending on fossil selection and that enforcement of maximum age priors substantially alters estimated clade ages, especially in analyses that utilize a smaller rather than larger set of fossils. Our results suggest that long-distance dispersal events explain present-day distributions better than do Gondwanan or northern land bridge hypotheses. This biogeographical conclusion is for the most part robust to alternative calibration schemes. Our data support a minimum of 13 Old World (OW) to New World (NW) dispersal events but, intriguingly, only one in the reverse direction. Eleven of these 13 were among Acanthaceae s.s., which comprises > 90% of species diversity in the family. 
Remarkably, if minimum age estimates approximate true history, these 11 events occurred within the last ~20 myr even though Acanthaceae s.s. is over 3 times as old. A simulation study confirmed that these dispersal events were significantly skewed toward the present and not simply a chance occurrence. Finally, we review reports of fossils that have been assigned to Acanthaceae that are substantially older than the lower Cretaceous estimate for Angiosperms as a whole (i.e., the general consensus that has resulted from several recent dating and fossil-based studies in plants). This is the first study to reconstruct divergence times among clades of Acanthaceae and sets the stage for comparative evolutionary research in this and related families that have until now been thought to have extremely poor fossil resources. [Acanthaceae; BEAST; biogeography; calibration; clade age; comparative; Cretaceous; divergence time estimation; diversification; evolution; fossil; Jurassic; Lamiales; palynology; pollen; simulation; Triassic.]

06:04

Phylogenetic signal is the tendency for closely related species to display similar trait values due to their common ancestry. Several methods have been developed for quantifying phylogenetic signal in univariate traits and for sets of traits treated simultaneously, and the statistical properties of these approaches have been extensively studied. However, methods for assessing phylogenetic signal in high-dimensional multivariate traits like shape are less well developed, and their statistical performance is not well characterized. In this article, I describe a generalization of the K statistic of Blomberg et al. that is useful for quantifying and evaluating phylogenetic signal in highly dimensional multivariate data. The method (Kmult) is found from the equivalency between statistical methods based on covariance matrices and those based on distance matrices. Using computer simulations based on Brownian motion, I demonstrate that the expected value of Kmult remains at 1.0 as trait variation among species is increased or decreased, and as the number of trait dimensions is increased. By contrast, estimates of phylogenetic signal found with a squared-change parsimony procedure for multivariate data change with increasing trait variation among species and with increasing numbers of trait dimensions, confounding biological interpretations. I also evaluate the statistical performance of hypothesis testing procedures based on Kmult and find that the method displays appropriate Type I error and high statistical power for detecting phylogenetic signal in high-dimensional data. Statistical properties of Kmult were consistent for simulations using bifurcating and random phylogenies, for simulations using different numbers of species, for simulations that varied the number of trait dimensions, and for different underlying models of trait covariance structure. 
Overall these findings demonstrate that Kmult provides a useful means of evaluating phylogenetic signal in high-dimensional multivariate traits. Finally, I illustrate the utility of the new approach by evaluating the strength of phylogenetic signal for head shape in a lineage of Plethodon salamanders. [Geometric morphometrics; macroevolution; morphological evolution; phylogenetic comparative method.]
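A minimal sketch of how such a statistic can be computed, assuming it takes the form of the univariate K of Blomberg et al. with the sums of squares summed over all trait dimensions. This is a reading of the abstract, not the author's implementation. The inputs are a species-by-trait matrix and the phylogenetic covariance matrix C expected under Brownian motion; the function names and the example data are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b for a square matrix a (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    m = [list(map(float, a[i])) + list(map(float, b[i])) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def k_mult(traits, cov):
    """Multivariate K: observed over expected ratio of sums of squares.

    traits: n species x p trait values; cov: n x n phylogenetic covariance.
    """
    n, p = len(traits), len(traits[0])
    cinv_ones = solve(cov, [[1.0] for _ in range(n)])
    cinv_y = solve(cov, traits)
    denom = sum(row[0] for row in cinv_ones)  # 1' C^-1 1
    # Phylogenetic mean (estimated root state) for each trait dimension.
    root = [sum(cinv_y[i][j] for i in range(n)) / denom for j in range(p)]
    resid = [[traits[i][j] - root[j] for j in range(p)] for i in range(n)]
    cinv_resid = solve(cov, resid)
    # Sums of squares ignoring and accounting for the phylogeny.
    ss_plain = sum(resid[i][j] ** 2 for i in range(n) for j in range(p))
    ss_phylo = sum(resid[i][j] * cinv_resid[i][j]
                   for i in range(n) for j in range(p))
    observed = ss_plain / ss_phylo
    # Expected ratio under Brownian motion on this phylogeny.
    expected = (sum(cov[i][i] for i in range(n)) - n / denom) / (n - 1)
    return observed / expected


# Hypothetical data: four species on a balanced tree, two trait dimensions.
cov = [[1.0, 0.5, 0.0, 0.0],
       [0.5, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.5],
       [0.0, 0.0, 0.5, 1.0]]
shape = [[1.0, 2.0], [1.1, 2.1], [3.0, 0.0], [3.2, 0.1]]
print(k_mult(shape, cov))
```

Under this formulation the value is unchanged when all trait values are rescaled or shifted by a constant, which is consistent with the abstract's report that the expected value does not depend on the amount of trait variation among species.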

06:04

Patterns of adaptation in response to environmental variation are central to our understanding of biodiversity, but predictions of how and when broad-scale environmental conditions such as climate affect organismal form and function remain incomplete. Succulent plants have evolved in response to arid conditions repeatedly, with various plant organs such as leaves, stems, and roots physically modified to increase water storage. Here, we investigate the role played by climate conditions in shaping the evolution of succulent forms in a plant clade endemic to Madagascar and the surrounding islands, part of the hyper-diverse genus Euphorbia (Euphorbiaceae). We used multivariate ordination of 19 climate variables to identify links between particular climate variables and three major forms of succulence: succulent leaves, cactiform stem succulence, and tubers. We then tested the relationship between climatic conditions and succulence, using comparative methods that account for shared evolutionary history. We confirm that plant water storage is associated with the two components of aridity: temperature and precipitation. Cactiform stem succulence, however, is not prevalent in the driest environments, countering the widely held view of cactiforms as desert icons. Instead, leaf succulence and tubers are significantly associated with the lowest levels of precipitation. Our findings provide a clear link between broad-scale climatic conditions and adaptation in land plants, and new insights into the climatic conditions favoring different forms of succulence. This evidence for adaptation to climate raises concern over the evolutionary future of succulent plants as they, along with other organisms, face anthropogenic climate change. [Adaptation; climate; comparative analysis; Euphorbia; ordination; phylogeny.]

06:04

Public DNA databases are composed of data from many different taxa, although the taxonomic annotation on sequences is not always complete, which impedes the utilization of mined data for species-level applications. There is much ongoing work on species identification and delineation based on the molecular data itself, although applying species clustering to whole databases requires consolidation of results from numerous undefined gene regions, and introduces significant obstacles in data organization and computational load. In the current paper, we demonstrate an approach for species delineation of a sequence database. All DNA sequences for the insects were obtained and processed. After filtration of duplicated data, delineation of the database into species or molecular operational taxonomic units (MOTUs) followed a three-step process in which (i) the genetic loci L are partitioned, (ii) the species S are delineated within each locus, then (iii) species units are matched across loci to form the matrix L x S, a set of global (multilocus) species units. Partitioning the database into a set of homologous gene fragments was achieved by Markov clustering using edge weights calculated from the amount of overlap between pairs of sequences, then delineation of species units and assignment of species names were performed for the set of genes necessary to capture most of the species diversity. The complexity of computing pairwise similarities for species clustering was substantial at the cytochrome oxidase subunit I locus in particular, but made feasible through the development of software that performs pairwise alignments within the taxonomic framework, while accounting for the different ranks at which sequences are labeled with taxonomic information. 
Over 24 different homologs, the unidentified sequences numbered approximately 194,000, containing 41,525 species IDs (98.7% of all found in the insect database), and were grouped into 59,173 single-locus MOTUs by hierarchical clustering under parameters optimized independently for each locus. Species units from different loci were matched using a multipartite matching algorithm to form multilocus species units with minimal incongruence between loci. After matching, the insect database as represented by these 24 loci was found to be composed of 78,091 species units in total. 38,574 of these units contained only species labeled data, 34,891 contained only unlabeled data, leaving 4,626 units composed both of labeled and unlabeled sequences. In addition to giving estimates of species diversity of sequence repositories, the protocol developed here will facilitate species-level applications of modern-day sequence data sets. In particular, the L x S matrix represents a post-taxonomic framework that can be used for species-level organization of metagenomic data, and incorporation of these methods into phylogenetic pipelines will yield matrices more representative of species diversity. [Database partitioning; MOTU; multi-locus clustering; species delineation.]
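Step (ii) of the protocol, delineating species units within a single locus, can be illustrated with a small sketch. It assumes single-linkage clustering of aligned sequences at a fixed uncorrected distance threshold, which is one simple form of hierarchical clustering. The abstract does not give the linkage rule or the per-locus thresholds, so the threshold, the function names, and the toy sequences are all hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an N are skipped.
    """
    compared = differing = 0
    for x, y in zip(a, b):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else 1.0


def cluster_motus(seqs, threshold):
    """Group sequences into MOTUs by single linkage at the given threshold.

    seqs maps a sequence name to its aligned sequence. Returns a list of
    sets of names, ordered by their smallest member.
    """
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Toy alignment for one locus (hypothetical sequences).
locus = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAT",
    "s3": "TTTTACGTAC",
    "s4": "TTTTACGTAA",
}
print(cluster_motus(locus, threshold=0.15))
```

The all-pairs loop is quadratic in the number of sequences, which hints at why the abstract describes pairwise similarity computation as the bottleneck at the cytochrome oxidase subunit I locus and restricts alignments to a taxonomic framework.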

06:04

Molecular phylogenetic studies of homologous sequences of nucleotides often assume that the underlying evolutionary process was globally stationary, reversible, and homogeneous (SRH), and that a model of evolution with one or more site-specific and time-reversible rate matrices (e.g., the GTR rate matrix) is enough to accurately model the evolution of data over the whole tree. However, an increasing body of data suggests that evolution under these conditions is an exception, rather than the norm. To address this issue, several non-SRH models of molecular evolution have been proposed, but they either ignore heterogeneity in the substitution process across sites (HAS) or assume it can be modeled accurately using the Γ distribution. As an alternative to these models of evolution, we introduce a family of mixture models that approximate HAS without the assumption of an underlying predefined statistical distribution. This family of mixture models is combined with non-SRH models of evolution that account for heterogeneity in the substitution process across lineages (HAL). We also present two algorithms for searching model space and identifying an optimal model of evolution that is less likely to over- or underparameterize the data. The performance of the two new algorithms was evaluated using alignments of nucleotides with 10 000 sites simulated under complex non-SRH conditions on a 25-tipped tree. The algorithms were found to be very successful, identifying the correct HAL model with a 75% success rate (the average success rate for assigning rate matrices to the tree's 48 edges was 99.25%) and, for the correct HAL model, identifying the correct HAS model with a 98% success rate. Finally, parameter estimates obtained under the correct HAL-HAS model were found to be accurate and precise. 
The merits of our new algorithms were illustrated with an analysis of 42 337 second codon sites extracted from a concatenation of 106 alignments of orthologous genes encoded by the nuclear genomes of Saccharomyces cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. castellii, S. kluyveri, S. bayanus, and Candida albicans. Our results show that second codon sites in the ancestral genome of these species contained 49.1% invariable sites, 39.6% variable sites belonging to one rate category (V1), and 11.3% variable sites belonging to a second rate category (V2). The ancestral nucleotide content was found to differ markedly across these three sets of sites, and the evolutionary processes operating at the variable sites were found to be non-SRH and best modeled by a combination of eight edge-specific rate matrices (four for V1 and four for V2). The number of substitutions per site at the variable sites also differed markedly, with sites belonging to V1 evolving slower than those belonging to V2 along the lineages separating the seven species of Saccharomyces. Finally, sites belonging to V1 appeared to have ceased evolving along the lineages separating S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus, implying that they might have become so selectively constrained that they could be considered invariable sites in these species. [Evolution; heterotachy; mixture model; non-homogeneous model; phylogeny; rate heterogeneity across sites; rate heterogeneity across lineages; yeast]

06:04

Competition between organisms influences the processes governing the colonization of new habitats. As a consequence, species or populations arriving first at a suitable location may prevent secondary colonization. Although adaptation to environmental variables (e.g., temperature, altitude, etc.) is essential, the presence or absence of certain species at a particular location often depends on whether or not competing species co-occur. For example, competition is thought to play an important role in structuring the assembly of mammalian communities. It can also explain spatial patterns of low genetic diversity following rapid colonization events or the "progression rule" displayed by phylogenies of species found on archipelagos. Despite the potential of competition to maintain populations in isolation, past quantitative analyses have largely ignored it because of the difficulty in designing adequate methods for assessing its impact. We present here a new model that integrates competition and dispersal into a Bayesian phylogeographic framework. Extensive simulations and analysis of real data show that our approach clearly outperforms the traditional Mantel test for detecting correlation between genetic and geographic distances. But most importantly, we demonstrate that competition can be detected with high sensitivity and specificity from the phylogenetic analysis of genetic variation in space. [Competition; dispersal; phylogeography.]
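For readers unfamiliar with the baseline mentioned above, here is a minimal sketch of a one-sided Mantel test. It correlates the off-diagonal entries of a genetic and a geographic distance matrix, then builds a null distribution by permuting the sample labels of one matrix. It illustrates only the traditional test, not the new Bayesian model, and the function names, permutation count, and example matrices are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Off-diagonal entries above the diagonal after relabeling by order."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]


def mantel(genetic, geographic, permutations=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    n = len(genetic)
    identity = list(range(n))
    geo_vec = upper_triangle(geographic, identity)
    observed = pearson(upper_triangle(genetic, identity), geo_vec)
    rng = random.Random(seed)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns together
        if pearson(upper_triangle(genetic, order), geo_vec) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)


# Hypothetical sampling sites along a transect and matching genetic distances.
positions = [0.0, 1.0, 3.0, 7.0, 15.0]
geo = [[abs(a - b) for b in positions] for a in positions]
gen = [[0.01 * d for d in row] for row in geo]
print(mantel(gen, geo))
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations; shuffling individual entries would give an invalid null distribution.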

06:04

Recent years have seen a rapid expansion of the model space explored in statistical phylogenetics, emphasizing the need for new approaches to statistical model representation and software development. Clear communication and representation of the chosen model is crucial for: (i) reproducibility of an analysis, (ii) model development, and (iii) software design. Moreover, a unified, clear and understandable framework for model representation lowers the barrier for beginners and nonspecialists to grasp complex phylogenetic models, including their assumptions and parameter/variable dependencies. Graphical modeling is a unifying framework that has gained in popularity in the statistical literature in recent years. The core idea is to break complex models into conditionally independent distributions. The strength lies in the comprehensibility, flexibility, and adaptability of this formalism, and the large body of computational work based on it. Graphical models are well suited to teaching statistical models, facilitating communication among phylogeneticists, and developing generic software for simulation and statistical inference. Here, we provide an introduction to graphical models for phylogeneticists and extend the standard graphical model representation to the realm of phylogenetics. We introduce a new graphical model component, tree plates, to capture the changing structure of the subgraph corresponding to a phylogenetic tree. We describe a range of phylogenetic models using the graphical model framework and introduce modules to simplify the representation of standard components in large and complex models. Phylogenetic model graphs can be readily used in simulation, maximum likelihood inference, and Bayesian inference using, for example, Metropolis–Hastings or Gibbs sampling of the posterior distribution. [Computation; graphical models; inference; modularization; statistical phylogenetics; tree plate.]
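The core idea of factoring a joint distribution into one conditional density per node can be shown with a deliberately non-phylogenetic toy model. The sketch below assumes a single parameter mu with a Normal(0, 10) prior and a plate of observations drawn from Normal(mu, 1), then samples the posterior with a random-walk Metropolis-Hastings sampler. The model, the names, and the data are hypothetical, and tree plates are not implemented here.

```python
import math
import random


def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return (-0.5 * math.log(2 * math.pi * sd * sd)
            - (x - mean) ** 2 / (2 * sd * sd))


def build_model(data):
    """Return one log-density function per node of the graph."""
    # Stochastic node mu has no parents.
    nodes = [lambda s: normal_logpdf(s["mu"], 0.0, 10.0)]
    # Plate: one observed node per data point, each with parent mu.
    nodes += [lambda s, x=x: normal_logpdf(x, s["mu"], 1.0) for x in data]
    return nodes


def log_joint(nodes, state):
    """Joint log density is the sum over nodes of their conditional terms."""
    return sum(node(state) for node in nodes)


def metropolis_hastings(nodes, steps=20000, step_sd=0.5, seed=1):
    """Random-walk Metropolis-Hastings over mu; returns the sampled values."""
    rng = random.Random(seed)
    state = {"mu": 0.0}
    current = log_joint(nodes, state)
    samples = []
    for _ in range(steps):
        proposal = {"mu": state["mu"] + rng.gauss(0.0, step_sd)}
        candidate = log_joint(nodes, proposal)
        # Symmetric proposal, so the acceptance ratio is the density ratio.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            state, current = proposal, candidate
        samples.append(state["mu"])
    return samples


data = [1.8, 2.1, 2.4, 1.9, 2.3]  # hypothetical observations
draws = metropolis_hastings(build_model(data))[2000:]  # drop burn-in
print(sum(draws) / len(draws))
```

Because the sampler only ever calls log_joint, swapping in a different set of nodes changes the model without touching the inference code, which is the kind of modularity the abstract attributes to the graphical model formalism.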

06:04

The reconstruction of a central tendency "species tree" from a large number of conflicting gene trees is a central problem in systematic biology. Moreover, it becomes particularly problematic when taxon coverage is patchy, so that not all taxa are present in every gene tree. Here, we list four apparently desirable properties that a method for estimating a species tree from gene trees could have (the strongest property states that building a species tree from input gene trees and then pruning leaves gives a tree that is the same as, or more resolved than, the tree obtained by first removing the taxa from the input trees and then building the species tree). We show that although it is technically possible to simultaneously satisfy these properties when taxon coverage is complete, they cannot all be satisfied in the more general supertree setting. In part two, we discuss a concordance-based consensus method based on Baum's "plurality clusters", and an extension to concordance supertrees. [Concordance; consensus tree; phylogenetics; plurality cluster; supertree.]

06:04

Amphibia comprises over 7000 extant species distributed in almost every ecosystem on every continent except Antarctica. Most species also show high specificity for particular habitats, biomes, or climatic niches, seemingly rendering long-distance dispersal unlikely. Indeed, many lineages still seem to show the signature of their Pangaean origin, approximately 300 Ma later. To date, no study has attempted a large-scale historical-biogeographic analysis of the group to understand the distribution of extant lineages. Here, I use an updated chronogram containing 3309 species (~45% of extant diversity) to reconstruct their movement between 12 global ecoregions. I find that Pangaean origin and subsequent Laurasian and Gondwanan fragmentation explain a large proportion of patterns in the distribution of extant species. However, dispersal during the Cenozoic, likely across land bridges or short distances across oceans, has also exerted a strong influence. Finally, there are at least three strongly supported instances of long-distance oceanic dispersal between former Gondwanan landmasses during the Cenozoic. Extinction from intervening areas seems to be a strong factor in shaping present-day distributions. Dispersal and extinction from and between ecoregions are apparently tied to the evolution of extraordinarily adaptive expansion-oriented phenotypes that allow lineages to easily colonize new areas and diversify, or conversely, to extremely specialized phenotypes or heavily relictual climatic niches that result in strong geographic localization and limited diversification. [Amphibians; caecilians; dispersal; frogs; historical biogeography; oceanic dispersal; salamanders; vicariance.]

06:04

The statistical basis of maximum likelihood (ML), its robustness, and the fact that it appears to suffer less from biases have made it one of the most popular methods for tree reconstruction. Despite its popularity, very few analytical solutions for ML exist, so biases suffered by ML are not well understood. One possible bias is long branch attraction (LBA), a regularly cited term generally used to describe a propensity for long branches to be joined together in estimated trees. Although initially mentioned in connection with inconsistency of parsimony, LBA has been claimed to affect all major phylogenetic reconstruction methods, including ML. Despite the widespread use of this term in the literature, exactly what LBA is and what may be causing it is poorly understood, even for simple evolutionary models and small model trees. Studies looking at LBA have focused on the effect of two long branches on tree reconstruction. However, to understand the effect of two long branches it is also important to understand the effect of just one long branch. If ML struggles to reconstruct one long branch, then this may have an impact on LBA. In this study, we look at the effect of one long branch on three-taxon tree reconstruction. We show that, counterintuitively, long branches are preferentially placed at the tips of the tree. This can be understood through the use of analytical solutions to the ML equation and distance matrix methods. We go on to look at the placement of two long branches on four-taxon trees, showing that there is no attraction between long branches, but that for extreme branch lengths long branches are joined together disproportionately often. These results illustrate that even small model trees are still interesting to help understand how ML phylogenetic reconstruction works, and that LBA is a complicated phenomenon that deserves further study. [analytic solutions; long branch attraction; maximum likelihood; simulation.]

06:04

We introduce molecularevolution.org, a publicly available gateway for high-throughput, maximum-likelihood phylogenetic analysis powered by grid computing. The gateway features a GARLI 2.0 web service that enables a user to quickly and easily submit thousands of maximum likelihood tree searches or bootstrap searches that are executed in parallel on distributed computing resources. The GARLI web service allows one to easily specify partitioned substitution models using a graphical interface, and it performs sophisticated post-processing of phylogenetic results. Although the GARLI web service has been used by the research community for over three years, here we formally announce the availability of the service, describe its capabilities, highlight new features and recent improvements, and provide details about how the grid system efficiently delivers high-quality phylogenetic results. [GARLI, gateway, grid computing, maximum likelihood, molecular evolution portal, phylogenetics, web service.]