Latest issue

Systematic Biology - RSS feed of current issue

URL: http://sysbio.oxfordjournals.org

June 14, 2010

10:38

The rich knowledge of morphological variation among organisms reported in the systematic literature has remained in free-text format, impractical for use in large-scale synthetic phylogenetic work. This noncomputable format has also precluded linkage to the large knowledgebase of genomic, genetic, developmental, and phenotype data in model organism databases. We have undertaken an effort to prototype a curated, ontology-based evolutionary morphology database that maps to these genetic databases (http://kb.phenoscape.org) to facilitate investigation into the mechanistic basis and evolution of phenotypic diversity. Among the first requirements in establishing this database was the development of a multispecies anatomy ontology with the goal of capturing anatomical data in a systematic and computable manner. An ontology is a formal representation of a set of concepts with defined relationships between those concepts. Multispecies anatomy ontologies in particular are an efficient way to represent the diversity of morphological structures in a clade of organisms, but they present challenges in their development relative to single-species anatomy ontologies. Here, we describe the Teleost Anatomy Ontology (TAO), a multispecies anatomy ontology for teleost fishes derived from the Zebrafish Anatomical Ontology (ZFA) for the purpose of annotating varying morphological features across species. To facilitate interoperability with other anatomy ontologies, TAO uses the Common Anatomy Reference Ontology as a template for its upper level nodes, and TAO and ZFA are synchronized, with zebrafish terms specified as subtypes of teleost terms. We found that the details of ontology architecture have ramifications for querying, and we present general challenges in developing a multispecies anatomy ontology, including refinement of definitions, taxon-specific relationships among terms, and representation of taxonomically variable developmental pathways.
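
The subtype arrangement described above, with zebrafish terms specified as subtypes of teleost terms, can be illustrated with a minimal sketch. The term identifiers below are hypothetical placeholders, not real TAO or ZFA accessions, and the flat edge list is only an assumption about how such relationships might be stored.

```python
from collections import defaultdict

# Hypothetical (child, parent) is_a edges, for illustration only;
# these are not real TAO or ZFA identifiers.
IS_A = [
    ("TAO:fin", "TAO:anatomical_structure"),
    ("TAO:pectoral_fin", "TAO:fin"),
    ("TAO:dorsal_fin", "TAO:fin"),
    ("ZFA:pectoral_fin", "TAO:pectoral_fin"),  # zebrafish term as subtype of teleost term
    ("ZFA:dorsal_fin", "TAO:dorsal_fin"),
]


def build_children(edges):
    """Index the edges so each parent maps to its direct subtypes."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    return children


def descendants(term, children):
    """Return every term reachable from `term` by following is_a edges downward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


children = build_children(IS_A)
fin_terms = descendants("TAO:fin", children)
```

A query on the teleost-level term then also retrieves the zebrafish-specific subtypes, which is one way in which ontology architecture can affect what a query returns.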

Long branches are potentially problematic in molecular dating because they can encompass a vast number of combinations of substitution rate and time. A long branch is suspected to have biased molecular clock estimates of the age of flowering plants (angiosperms) to be much older than their earliest fossils. This study explores the effect of the long branch subtending angiosperms in molecular dating and how different relaxed clocks react to it. Fossil angiosperm relatives, identified through a combined morphological and molecular phylogenetic analysis for living and fossil seed plants, were used to break the long angiosperm stem branch. Nucleotide sequences of angiosperm fossil relatives were simulated using a phylogeny and model parameters from living taxa and incorporated in molecular dating. Three relaxed clocks, which implement among-lineage rate heterogeneity differently, were used: penalized likelihood (using 2 different rate smoothing optimization criteria), a Bayesian rate-autocorrelated method, and a Bayesian uncorrelated method. Different clocks provided highly correlated ages across the tree. Breaking the angiosperm stem branch did not result in major age differences, except for a few sensitive nodes. Breaking the angiosperm stem branch resulted in a substantially younger age for crown angiosperms only with 1 of the 4 methods, but, nevertheless, the obtained age is considerably older than the oldest angiosperm fossils. The origin of crown angiosperms is estimated between the Upper Triassic and the early Permian. The difficulty in estimating crown angiosperm age probably lies in a combination of intrinsic and extrinsic complicating factors, including substantial molecular rate heterogeneity among lineages and through time. A more adequate molecular dating approach might combine moderate background rate heterogeneity with large changes in rate at particular points in the tree.

Coalescent model–based methods for phylogeny estimation force systematists to confront issues related to the identification of species boundaries. Unlike conventional phylogenetic analysis, where species membership can be assessed qualitatively after the phylogeny is estimated, the phylogenies that are estimated under a coalescent model treat aggregates of individuals as the operational taxonomic units and thus require a priori definition of these sets because the models assume that the alleles in a given lineage are sampled from a single panmictic population. Fortunately, the use of coalescent model–based approaches allows systematists to conduct probabilistic tests of species limits by calculating the probability of competing models of lineage composition. Here, we conduct the first exploration of the issues related to applying such tests to a complex empirical system. Sequence data from multiple loci were used to assess species limits and phylogeny in a clade of North American Myotis bats. After estimating gene trees at each locus, the likelihood of models representing all hierarchical permutations of lineage composition was calculated and Akaike information criterion scores were computed. Metrics borrowed from information theory suggest that there is strong support for several models that include multiple evolutionary lineages within the currently described species Myotis lucifugus and M. evotis. Although these results are preliminary, they illustrate the practical importance of coupled species delimitation and phylogeny estimation.
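
As a rough illustration of the model comparison step, the sketch below computes Akaike information criterion scores and Akaike weights from log-likelihoods. The model names, log-likelihoods, and parameter counts are invented for illustration and are not values from the study; the weights are the standard information-theoretic quantity, which may or may not be the exact metric the authors report.

```python
import math


def aic(log_likelihood, n_params):
    """AIC = 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def akaike_weights(aic_scores):
    """Convert a {model: AIC} mapping into relative model weights that sum to 1."""
    best = min(aic_scores.values())
    relative = {name: math.exp(-0.5 * (score - best))
                for name, score in aic_scores.items()}
    total = sum(relative.values())
    return {name: value / total for name, value in relative.items()}


# Hypothetical models of lineage composition: (log-likelihood, number of parameters).
models = {
    "one_lineage": (-1250.0, 1),
    "two_lineages": (-1240.0, 2),
    "three_lineages": (-1239.5, 3),
}
scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
weights = akaike_weights(scores)
```

Models with lower AIC receive higher weight, so the weights give a simple way to rank competing delimitation models on a common scale.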

Nested clade phylogeographic analysis (NCPA) is a widely used method that aims to identify past demographic events that have shaped the history of a population. In an earlier study, NCPA was fully automated, allowing it to be tested with simulated data sets generated under a null model in which samples simulated from a panmictic population are geographically distributed. It was noted that NCPA was prone to inferring false positives, corroborating earlier findings. The present study aims to evaluate both single-locus and multilocus NCPA under the scenario of restricted gene flow among spatially distributed populations. We have developed a new program, ANeCA-ML, which implements multilocus NCPA. Data were simulated under 3 models of gene flow: a stepping stone model, an island model, and a stepping stone model with some long-distance dispersal. Results indicate that single-locus NCPA tends to give a high frequency of false positives, but, unlike the random-mating scenario presented previously, inferences are not limited to restricted gene flow with isolation by distance or contiguous range expansion. The proportion of single-locus data sets that contained false inferences was 76% for the panmictic case, 87% for the stepping stone model, 79% for the stepping stone model with long-distance dispersal, and more than 99% for the island model. The frequency of inferences is inversely related to the amount of gene flow between demes. We performed multilocus NCPA by grouping the simulated loci into data sets of 5 loci. The false-positive rate was reduced in multilocus NCPA for some inferences but remained high for others. The proportion of multilocus data sets that contained false inferences was 17% for the panmictic case, 30% for the stepping stone model, 4% for the stepping stone model with long-distance dispersal, and 54% for the island model.
Multilocus NCPA reduces the false-positive rate by restricting the sensitivity of the method but does not appear to increase the accuracy of the approach. Three classical tests—the analysis of molecular variance method, Fu's Fs, and the Mantel test—show that there is information in the data that gives rise to explicable results using these standard approaches. In conclusion, for the scenarios that we have examined, our simulation study suggests that the NCPA method is unreliable and its inferences may be misleading. We suggest that the NCPA method should not be used without objective simulation-based testing by independent researchers.
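
Of the classical tests mentioned above, the Mantel test is the simplest to sketch. The version below is a generic one-sided permutation test on two square distance matrices (for example genetic and geographic distances between demes); the function names, the number of permutations, and the toy data are assumptions for illustration and do not reproduce the settings used in the study.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(d1, d2, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value) for two distance matrices."""
    n = len(d1)
    pairs = list(combinations(range(n), 2))  # upper-triangle index pairs
    x = [d1[i][j] for i, j in pairs]
    observed = pearson(x, [d2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # relabel rows and columns of d2 together
        r = pearson(x, [d2[order[i]][order[j]] for i, j in pairs])
        if r >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy example: five demes on a line, with genetic distance tracking geography.
coords = [0.0, 1.0, 2.5, 4.0, 7.0]
geo = [[abs(a - b) for b in coords] for a in coords]
gen = [[0.1 * abs(a - b) ** 1.5 for b in coords] for a in coords]
r_obs, p_value = mantel(geo, gen)
```

Permuting rows and columns together keeps each permuted matrix a valid distance matrix, which is why the test shuffles deme labels rather than individual entries.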

A phylogenetic tree comprising clades with high bootstrap values or other strong measures of statistical support is usually interpreted as providing a good estimate of the true phylogeny. Convergent evolution acting on groups of characters in concert, however, can lead to highly supported but erroneous phylogenies. Identifying such groups of phylogenetically misleading characters is obviously desirable. Here we present a procedure that uses an independent data source to identify sets of characters that have undergone concerted convergent evolution. We examine the problematic case of the cormorants and shags, for which trees constructed using osteological and molecular characters both have strong statistical support and yet are fundamentally incongruent. We find that the osteological characters can be separated into those that fit the phylogenetic history implied by the molecular data set and those that do not. Moreover, these latter nonfitting osteological characters are internally consistent and form groups of mutually compatible characters or "cliques," which are significantly larger than cliques of shuffled characters. We suggest, therefore, that these cliques of characters are the result of similar selective pressures and are a signature of concerted convergence.
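
The notion of a clique of mutually compatible characters can be sketched as follows. This is a simplified illustration and not the authors' procedure: it assumes binary characters, uses the common pairwise criterion that two binary characters are compatible when fewer than four of the state pairs 00, 01, 10, 11 co-occur across taxa, and finds the largest clique by exhaustive search, which is only practical for small character sets. The character matrix is invented.

```python
def compatible(a, b):
    """Two binary characters are compatible if fewer than four state pairs co-occur."""
    return len(set(zip(a, b))) < 4


def largest_clique(characters):
    """Return the names of a largest set of mutually compatible characters."""
    names = list(characters)
    adjacent = {
        n: {m for m in names if m != n and compatible(characters[n], characters[m])}
        for n in names
    }
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = list(clique)
        for name in list(candidates):
            candidates = candidates - {name}  # avoid revisiting the same subset
            expand(clique + [name], candidates & adjacent[name])

    expand([], set(names))
    return sorted(best)


# Hypothetical characters scored as 0/1 strings across five taxa.
characters = {
    "c1": "00011",
    "c2": "00111",
    "c3": "01111",
    "c4": "01010",
}
clique = largest_clique(characters)
```

Comparing the size of such cliques with those obtained after shuffling character states, as the abstract describes, would require an additional randomization step that is not shown here.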

A controversial topic that underlies much of phylogenetic experimental design is the relative utility of increased taxonomic versus character sampling. Conclusions about the relative utility of adding characters or taxa to a current phylogenetic study have subtly hinged upon the appropriateness of the rate of evolution of the characters added for resolution of the phylogeny in question. Clearly, the addition of characters evolving at optimal rates will have much greater impact upon accurate phylogenetic analysis than will the addition of characters with an inappropriate rate of evolution. Development of practical analytical predictions of the asymptotic impact of adding additional taxa would complement computational investigations of the relative utility of these two methods of expanding acquired data. Accordingly, we here formulate a measure of the phylogenetic informativeness of the additional sampling of character states from a new taxon added to the canonical phylogenetic quartet. We derive the optimal rate of evolution for characters assessed in taxa to be sampled and a metric of informativeness based on the rate of evolution of the characters assessed in the new taxon and the distance of the new taxon from the internode of interest. Calculation of the informativeness per base pair of additional character sampling for included taxa versus additional character sampling for novel taxa can be used to estimate cost-effectiveness and optimal efficiency of phylogenetic experimental design. The approach requires estimation of rates of evolution of individual sites based on an alignment of genes orthologous to those to be sequenced, which may be identified in a well-established clade of sister taxa or of related taxa diverging at a deeper phylogenetic scale. Some approximate idea of the potential phylogenetic relationships of taxa to be sequenced is also desirable, such as may be obtained from ribosomal RNA sequence alone. 
The most obvious application is the resolution of recalcitrant unresolved nodes in an otherwise well-known phylogeny. We validate the theory by analysis of its predictions regarding the phylogenetic informativeness for taxon addition of 46 amino acid alignments of 21 fungal taxa. Gene and taxon sampling according to the theory herein and following a "deepest ingroup" heuristic are shown to provide significantly improved resolution of specified deep internodes.

Studies of diversification patterns often find a slowing in lineage accumulation toward the present. This seemingly pervasive pattern of rate downturns has been taken as evidence for adaptive radiations, density-dependent regulation, and metacommunity species interactions. The significance of rate downturns is evaluated with statistical tests (the γ statistic and Monte Carlo constant rates [MCCR] test; birth–death likelihood models and Akaike Information Criterion [AIC] scores) that rely on null distributions, which assume that the included species are a random sample of the entire clade. Sampling in real phylogenies, however, often is nonrandom because systematists try to include early-diverging species or representatives of previous intrataxon classifications. We studied the effects of biased sampling, structured sampling, and random sampling by experimentally pruning simulated trees (60 and 150 species) as well as a completely sampled empirical tree (58 species) and then applying the γ statistic/MCCR test and birth–death likelihood models/AIC scores to assess rate changes. For trees with random species sampling, the true model (i.e., the one fitting the complete phylogenies) could be inferred in most cases. Oversampling deep nodes, however, strongly biases inferences toward downturns, with simulations of structured and biased sampling suggesting that this occurs when sampling percentages drop below 80%. The magnitude of the effect and the sensitivity of diversification rate models are such that a useful rule of thumb may be not to infer rate downturns from real trees unless they have >80% species sampling.
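
The statistic paired with the MCCR test is usually the γ statistic of Pybus and Harvey, which summarizes where internal nodes fall between the root and the tips. The sketch below computes it from internode intervals under that assumption; the input convention (a list starting with the interval during which two lineages exist) and the example values are choices made here for illustration.

```python
import math


def gamma_statistic(internode_intervals):
    """Compute gamma from intervals g_2..g_n, where g_k is the time spent with k lineages."""
    g = list(internode_intervals)
    n = len(g) + 1  # number of tips
    if n < 3:
        raise ValueError("gamma needs at least three tips")
    weighted = [k * g_k for k, g_k in enumerate(g, start=2)]  # k * g_k
    total = sum(weighted)  # T = sum over k = 2..n of k * g_k
    running = 0.0
    partial = 0.0
    for w in weighted[:-1]:  # i = 2 .. n-1
        running += w         # inner sum over k = 2..i
        partial += running
    numerator = partial / (n - 2) - total / 2.0
    denominator = total * math.sqrt(1.0 / (12.0 * (n - 2)))
    return numerator / denominator


# Intervals in which nodes cluster near the root, followed by a long final interval.
early_burst = gamma_statistic([0.1, 0.1, 0.1, 1.0])
```

Negative values indicate that nodes are concentrated toward the root, the pattern usually read as a rate downturn; the abstract's point is that nonrandom pruning of tips can produce such values even when rates were constant.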

A wide range of evolutionary models for species-level (and higher) diversification have been developed. These models can be used to test evolutionary hypotheses and provide comparisons with phylogenetic trees constructed from real data. To carry out these tests and comparisons, it is often necessary to sample, or simulate, trees from the evolutionary models. Sampling trees from these models is more complicated than it may appear at first glance, necessitating careful consideration and mathematical rigor. Seemingly straightforward sampling methods may produce trees that have systematically biased shapes or branch lengths. This is particularly problematic as there is no simple method for determining whether the sampled trees are appropriate. In this paper, we show why a commonly used simple sampling approach (SSA)—simulating trees forward in time until n species are first reached—should only be applied to the simplest pure birth model, the Yule model. We provide an alternative general sampling approach (GSA) that can be applied to most other models. Furthermore, we introduce the constant-rate birth–death model sampling approach, which samples trees very efficiently from a widely used class of models. We explore the bias produced by SSA and identify situations in which this bias is particularly pronounced. We show that using SSA can lead to erroneous conclusions: When using the inappropriate SSA, the variance of a gradually evolving trait does not correlate with the age of the tree; when the correct GSA is used, the trait variance correlates with tree age. The algorithms presented here are available in the Perl Bio::Phylo package, as a stand-alone program TreeSample, and in the R TreeSim package.
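
The simple sampling approach can be sketched in a few lines by tracking only the number of lineages through time. This is an illustrative simplification, not the implementation in Bio::Phylo, TreeSample, or TreeSim: it records the tree age at which n species are first reached rather than building a tree, it starts from a single lineage, and it restarts whenever the process goes extinct. The function name and parameter values are assumptions.

```python
import random


def ssa_tree_age(n, birth, death, rng):
    """Run a birth-death process forward and return the time when n species are first reached."""
    while True:
        lineages, t = 1, 0.0
        while 0 < lineages < n:
            # Waiting time to the next event with the current number of lineages.
            t += rng.expovariate(lineages * (birth + death))
            if rng.random() < birth / (birth + death):
                lineages += 1  # speciation
            else:
                lineages -= 1  # extinction
        if lineages == n:
            return t
        # The process died out before reaching n species; start again.


rng = random.Random(42)
yule_ages = [ssa_tree_age(5, birth=1.0, death=0.0, rng=rng) for _ in range(5000)]
bd_ages = [ssa_tree_age(5, birth=1.0, death=0.5, rng=rng) for _ in range(5000)]
```

With death set to 0 this is the Yule case, where the abstract states the approach is appropriate. With extinction, a process can pass through n species more than once, and stopping at the first visit is the source of the bias that the general sampling approach is designed to avoid.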