July 13, 2014
July 11, 2014
Jaime Huerta-Cepas wrote:
I have a phylogenetic hypothesis that I would like to test statistically. Although the best Bayesian and ML trees support that hypothesis, bootstrap values and posterior probabilities are far from great, so I followed the advice given to me in this forum: test all possible alternatives and see whether I could rule them out statistically.
For this I used CONSEL to evaluate over a thousand alternative constraint topologies. All but 10 of the accepted topologies are compatible with my hypothesis using the AU test (pvalue
July 8, 2014
Brian Foley wrote:
This paper is rather specific to HIV-1 with its very large population size within each infected individual, and rapid evolution rate. It would be interesting to see similar work with other organisms. Human Influenza A virus, for example, has an evolution rate very similar to HIV-1 but a very different transmission rate between infected individuals.
July 3, 2014
Trevor Bedford wrote:
Andreas Wagner has a new paper on analyzing influenza sequence data using a super simple Hamming-distance network-based approach.
www.ncbi.nlm.nih.gov A genotype network reveals homoplastic cycles of convergent evolution in influenza A (H3N2) haemagglutinin. A Wagner, Proceedings. Biological sciences / The Royal Society, Jul 7 2014
Networks of evolving genotypes can be constructed from the worldwide time-resolved genotyping of pathogens like influenza viruses. Such genotype networks are graphs where neighbouring vertices (viral strains) differ in a single nucleotide or amino acid. A rich trove of network analysis methods can help understand the evolutionary dynamics reflected in the structure of these networks. Here, I analyse a genotype network comprising hundreds of influenza A (H3N2) haemagglutinin genes. The network is rife with cycles that reflect non-random parallel or convergent (homoplastic) evolution. These cycles also show patterns of sequence change characteristic for strong and local evolutionary constraints, positive selection and mutation-limited evolution. Such cycles would not be visible on a phylogenetic tree, illustrating that genotype network analysis can complement phylogenetic analyses. The network also shows a distinct modular or community structure that reflects temporal more than spatial proximity of viral strains, where lowly connected bridge strains connect different modules. These and other organizational patterns illustrate that genotype networks can help us study evolution in action at an unprecedented level of resolution.
He ends up with plots like:
Fundamentally non-phylogenetic, this approach doesn't try to reconstruct evolutionary history, but instead shows a simple overview of genetic relationships. Andreas suggests that these graphs make it easy to detect convergent evolution that would not be apparent in the strictly branching tree.
I don't have a good intuition for how these sorts of graphs translate to trees and vice versa. Does this seem like it's a useful addition to constructing a tree or more of a distraction?
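As a rough illustration of the kind of graph described in the abstract, here is a minimal sketch that links aligned sequences differing at exactly one site and then checks whether the resulting graph contains a cycle. The sequence names and toy sequences are invented for the example, and this is not Wagner's code; it only assumes that the input sequences are already aligned to equal length.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def genotype_network(seqs):
    """Build an adjacency dict linking sequences at Hamming distance 1.

    seqs maps a (hypothetical) strain name to its aligned sequence.
    """
    adj = {name: set() for name in seqs}
    for u, v in combinations(seqs, 2):
        if hamming(seqs[u], seqs[v]) == 1:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def has_cycle(adj):
    """An undirected graph has a cycle iff edges > nodes - components."""
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    seen, n_components = set(), 0
    for start in adj:
        if start in seen:
            continue
        n_components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adj[node] - seen)
    return n_edges > len(adj) - n_components


# Toy data: two single-site changes reached in either order form a square.
toy = {"s1": "AAAA", "s2": "AAAT", "s3": "AATT", "s4": "AATA"}
network = genotype_network(toy)
print(network["s1"], has_cycle(network))
```

In the toy data the same two substitutions occur in both orders, which closes a four-node cycle. A strictly bifurcating tree of the same four sequences would have to break one of those edges, which is one way to see why such cycles are not visible on a tree.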
July 2, 2014
Miao Sun wrote:
A big headache has been haunting me recently, and I would appreciate your suggestions:
I have a large matrix of about 12,000 taxa, and the data are formatted as below:
Using information such as the taxon name or GI number, how can I retrieve the corresponding accession number for each taxon in this large matrix from the NCBI website as a batch job?
Any ideas or experience to share?
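One possible route is NCBI's E-utilities `efetch` endpoint, requested in batches. The sketch below is an assumption-laden outline rather than a tested pipeline: it assumes the identifiers are nucleotide GI numbers, that `rettype=acc` returns one accession per line in the same order as the request, and that GI lookups are still served (worth checking against the current E-utilities documentation, as is NCBI's usage policy on request rates). The batch size of 200 is an arbitrary choice.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def efetch_url(gi_batch, db="nuccore"):
    """Build an efetch URL asking for accessions of a batch of GI numbers."""
    params = {
        "db": db,
        "id": ",".join(gi_batch),
        "rettype": "acc",   # assumption: plain accession.version output
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def gi_to_accession(gis, batch_size=200):
    """Map each GI to an accession, one HTTP request per batch.

    Assumes the service returns exactly one accession per requested GI,
    in request order; verify this on a small batch before trusting it.
    """
    mapping = {}
    for batch in chunks(list(gis), batch_size):
        with urlopen(efetch_url(batch)) as handle:
            accessions = handle.read().decode().split()
        if len(accessions) != len(batch):
            raise RuntimeError("unexpected number of accessions returned")
        mapping.update(zip(batch, accessions))
    return mapping


# No network access here: only show the URL that would be requested.
print(efetch_url(["12345", "67890"]))
```

Calling `gi_to_accession` on the full list of roughly 12,000 identifiers would then issue about 60 requests; adding a short pause between batches is a sensible precaution.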
June 30, 2014
Craig Nelson wrote:
This is as much a moral rumination as a call for opinions and guidance. How can we better practically resolve taxa as amplicon surveys grow out of control? Can placement algorithms replace identity binning? Should they?
Sequence reads derived from environmental surveys of phylogenetic marker gene amplicons, such as 16S rRNA, CO1, etc. are typically “clustered” (Uclust, CD-Hit, mothur) to form operational taxonomic units (OTUs) after alignment to a reference database and before subsequent phylogenetic analysis or classification. In the widespread application of these genes (molecular “clocks”), this creates problems of cohesion across studies and sequencing platforms (and even run-to-run) because OTUs are internally defined by local neighbors.
Because ecology is important (right?), identifying ecologically and evolutionarily meaningful OTUs has become important, in the microbial world now often described as finding “Ecotypes” of broad clades of microbes. This is impossible with databases, where huge diverse groups are often lumped with a code derived from a single clone picked decades ago. Nonetheless, many marker genes have robust, curated databases, and the problem becomes one of annotation. Comparing organisms across studies has become a real problem in my field of aquatic microbiology. We have lots of groups independently "naming" clades, and lots of reference libraries for marker genes (especially 16S), but read binning by identity is dataset dependent and it can be hard to maintain continuity through time or quickly determine if two groups are talking about the same organism.
I am particularly interested in stabilizing this trajectory in time-series work by using placement to assign reads to nodes of a reference tree. Frankly, this is now the practical choice computationally, because binning becomes so slow as datasets grow. I'd guess a reference tree should be robustly calculated (ML) with backbone constraints from a curated MSA database and possibly initially expanded using existing sequence libraries from previous work in the ecosystem in question (or analogues) to establish un-curated clades relevant locally. Subsequent amplicon surveys could then “classify” reads according to placements (using pplacer, for example; @ematsen) and nodes could serve as stable, reference-able, visualizable, (expandable?), classifiable taxonomic units.
I’ve been working for the last year to derive a robust reference alignment from databases and structure a classified, constrained tree and workflow for curation of marker gene survey outputs within the pplacer ecosystem (pplacer/guppy/rppr/ and especially taxtastic). Importantly, we wouldn't be progressively adding sequences to a tree. We would just be allowing for stable nodal annotation of reads as a short-term way of detecting ecologically meaningful differential placements. In essence our goal is to "classify" to node rather than database annotations.
One topic of discussion would be nodal “Assignment”. In the context of pplacer, it would be nice to get an alternate "unambiguous" placement for a given pquery, namely the most derived node for which the likelihood weight is above some threshold (say 70%). This seems like a reasonable option to incorporate into pplacer: if a pquery has an unacceptably low pointmass likelihood weight, reassign it to a basal node monophyletic for the placement using an LCA-like algorithm (already in use for your classification algorithms). Giovannoni's group at OSU has taken an approach to this by modifying pplacer placements with the BioPerl script LCA, which basically re-attaches pqueries to common basal nodes when placements are unacceptably "fuzzy" as a single pointmass (their group calls this pipeline "Phylotyper"; Vergin et al. 2013, ISME Journal).
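To make the proposed rule concrete, here is a minimal sketch of "most derived node whose accumulated likelihood weight clears a threshold". It does not use pplacer or guppy and does not read their file formats; the tree is a hypothetical child-to-parent dictionary and the weights are invented likelihood weight ratios for a single pquery.

```python
def subtree_mass(parent, weights):
    """Sum the likelihood weight placed on each node or anywhere below it."""
    mass = {node: 0.0 for node in parent}
    for node, weight in weights.items():
        current = node
        while current is not None:      # push the weight up to the root
            mass[current] += weight
            current = parent[current]
    return mass


def depth(parent, node):
    """Number of edges between a node and the root."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def assign(parent, weights, threshold=0.7):
    """Return the deepest node whose subtree holds at least `threshold` mass.

    A threshold above 0.5 guarantees that qualifying nodes lie on a single
    root-to-tip path, so the deepest one is unambiguous.
    """
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must be in (0.5, 1.0]")
    mass = subtree_mass(parent, weights)
    candidates = [n for n, m in mass.items() if m >= threshold - 1e-12]
    if not candidates:
        return None
    return max(candidates, key=lambda n: depth(parent, n))


# Hypothetical reference tree: root -> (A -> (a1, a2), B)
parent = {"root": None, "A": "root", "B": "root", "a1": "A", "a2": "A"}
fuzzy = {"a1": 0.45, "a2": 0.40, "B": 0.15}   # no single placement reaches 0.7
print(assign(parent, fuzzy))
```

With these invented weights no single placement reaches 0.7, but the clade A holds 0.85 of the mass, so the read is assigned to A rather than to either tip.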
Another discussion point that would be really useful is some criteria for determining if pqueries are likely to belong to a "new clade" that isn't well-represented in the tree. Said differently, it would be nice if post-hoc analyses on a placement mass could suggest whether the tree needs to be expanded with additional reference sequences to resolve/accommodate new subclades in specific regions. Are any of the existing metrics (adcl, edpl) useful for quantifying this likelihood? Would there be a way to make a "new node" in a refpkg during the placement process if some critical mass of sequences was attaching to a basal node in a clade with better likelihood than more derived nodes?
June 26, 2014
Erick Matsen wrote:
Efficient Continuous-Time Markov Chain Estimation
Monir Hajiaghayi, Bonnie Kirkpatrick, Liangliang Wang, Alexandre Bouchard-Côté http://jmlr.org/proceedings/papers/v32/hajiaghayi14.pdf
Many problems of practical interest rely on Continuous-time Markov chains (CTMCs) defined over combinatorial state spaces, rendering the computation of transition probabilities, and hence probabilistic inference, difficult or impossible with existing methods. For problems with countably infinite states, where classical methods such as matrix exponentiation are not applicable, the main alternative has been particle Markov chain Monte Carlo methods imputing both the holding times and sequences of visited states. We propose a particle-based Monte Carlo approach where the holding times are marginalized analytically. We demonstrate that in a range of realistic inferential setups, our scheme dramatically reduces the variance of the Monte Carlo approximation and yields more accurate parameter posterior approximations given a fixed computational budget. These experiments are performed on both synthetic and real datasets, drawing from two important examples of CTMCs having combinatorial state spaces: string-valued mutation models in phylogenetics and nucleic acid folding pathways.
The first important thing is to figure out how to calculate the transition probability of an x to a y given that some change occurs in the case when the state space is very big. String-valued processes fall in this category, for example. They bias things with a potential:
Second, one needs to marginalize out the event (i.e. jump) times. This is done by constructing a CTMC such that the difficult part of the marginalization are the transition probabilities of the CTMC:
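For orientation only, here is the textbook version of "the probability of going from x to y given that some change occurs": the embedded jump chain, where each outgoing rate is divided by the total exit rate of x. This is not the potential-biased proposal from the paper (whose formula is not reproduced above). The string-valued substitution process and its uniform rate are assumptions made up for the example; the point is that neighbours are enumerated on demand rather than stored in a rate matrix.

```python
def substitution_neighbors(x, alphabet="ACGT", rate=1.0):
    """Outgoing rates from string x under a toy point-substitution process.

    Assumption for illustration: every single-character substitution
    happens at the same rate, and there are no insertions or deletions.
    """
    rates = {}
    for i, current in enumerate(x):
        for letter in alphabet:
            if letter != current:
                rates[x[:i] + letter + x[i + 1:]] = rate
    return rates


def jump_probabilities(x, neighbors=substitution_neighbors):
    """P(next state = y | a jump occurs from x) = q_xy / sum_z q_xz."""
    rates = neighbors(x)
    total = sum(rates.values())
    if total <= 0:
        raise ValueError("state has no outgoing transitions")
    return {y: q / total for y, q in rates.items()}


probs = jump_probabilities("AC")
print(len(probs), probs["CC"])
```

For a string of length 2 over four letters there are six possible single substitutions, so each has jump probability 1/6 under this toy model.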
Alexandre Bouchard-Côté does fantastic work. H/T @cmccoy.
Rob Lanfear wrote:
This post is about a recent Drosophila phylogeny published in MPE, a critique of that paper, and whether MPE has done enough by just publishing the critique. Opinions welcome.
The original paper presented new data and a new tree of Drosophilidae. Obviously lots of people care about this tree since it encompasses some of the best-studied model organisms we've got. It's been cited 23 times since 2012 according to google scholar. Here's the original:
Increasing the data size to accurately reconstruct the phylogenetic relationships between nine subgroups of the Drosophila melanogaster species group (Drosophilidae, Diptera). Yang Y, Hou ZC, Qian YH, Kang H, Zeng QT.
A critique has just been published, showing lots of issues with the original analysis (full disclosure - I have published with the first author of this critique, although I had nothing to do with this critique and hadn't read it until today). Here's the critique:
Problems with data quality in the reconstruction of evolutionary relationships in the Drosophila melanogaster species group: Comments on Yang et al 2012. Catullo RA, Oakeshott JG.
In short - they found many issues with the data in the ms (problems with ~150 of the ~800 sequences), and couldn't replicate their results. Most worryingly, they show at least one example where this published tree may have already led to incorrect inferences in a published comparative study that relied on the tree.
What seems odd to me is that although the critique seems fairly damning, nothing has changed on the original paper. My understanding was that this is what the COPE guidelines were for:
While there is no evidence of fraud here, if you take the critique at face value then there were a lot of mistakes in the original article and the validity of the results is certainly in question.
MPE is a premier venue for publishing trees, and it would be nice to think they were committed to their publications being accurate. So I'd be interested to hear others' opinions on this paper and the critique. Specifically, have MPE done enough here by just publishing the critique? Should they issue a correction / expression of concern / or worse of the original article? Or should the original article stand unchanged despite the critique?
June 25, 2014
Jaime Huerta-Cepas wrote:
We have just released the first beta version of ETE-NPR.
The software is intended for Nested Phylogenetic Reconstruction (NPR) and workflow design. It works as a wrapper to all the necessary steps and programs used in common phylogenetic and phylogenomic pipelines, from input parsing to final image generation.
This is still a work in progress and we will be happy to get any feedback.
June 19, 2014
Anyone know the best way to visualize conflicting tree topologies with incomplete overlap in taxon sampling?
What we've got: a complete tree (say, a species tree), and a whack of gene trees which may or may not have complete taxon sampling. We want a figure with a single set of taxon labels that all trees map to. We are ignoring edge lengths, as things get messy very quickly. If a gene tree does not contain taxa in the basal split of the species tree, we don't want its root to start at the species tree root, but instead more tipward; otherwise, relationships get obscured.
DensiTree is something we have explored, but it doesn't seem to work well with uneven sampling across trees. We've also been playing with R code graciously provided by @liamjrevell, and we may be able to get this to do what we want, but I thought I would check with the phylo-timaliids to see if something already exists.
June 17, 2014
Erick Matsen wrote:
New from @bredelings:
www.ncbi.nlm.nih.gov Erasing Errors Due to Alignment Ambiguity When Estimating Positive Selection. B Redelings, Molecular biology and evolution, May 27 2014
Current estimates of diversifying positive selection rely on first having an accurate multiple sequence alignment. Simulation studies have shown that under biologically plausible conditions, relying on a single estimate of the alignment from commonly used alignment software can lead to unacceptably high false positive rates in detecting diversifying positive selection. We present a novel statistical method that eliminates excess false positives resulting from alignment error by jointly estimating the degree of positive selection and the alignment under an evolutionary model. Our model treats both substitutions and insertions/deletions as sequence changes on a tree, and allows site-heterogeneity in the substitution process. We conduct inference starting from unaligned sequence data by integrating over all alignments. This approach naturally accounts for ambiguous alignments without requiring ambiguously aligned sites to be identified and removed prior to analysis. We take a Bayesian approach and conduct inference using MCMC to integrate over all alignments on a fixed evolutionary tree topology. We introduce a Bayesian version of the branch-site test and assess the evidence for positive selection using Bayes factors. We compare two models of differing dimensionality using a simple alternative to reversible-jump methods. We also describe a more accurate method of estimating the Bayes factor using Rao-Blackwellization. We then show using simulated data that jointly estimating the alignment and the presence of positive selection solves the problem with excessive false positives from erroneous alignments, and has nearly the same power to detect positive selection as when the true alignment is known. We also show that samples taken from the posterior alignment distribution using the software BAli-Phy have substantially lower alignment error compared to MUSCLE, MAFFT, PRANK, and FSA alignments.
This figure definitely made me sit up and pay attention:
The sequences were simulated with INDELible.
June 7, 2014
Bojian Zhong wrote:
I am currently doing some phylogenetic analyses using Phylobayes, but I haven't figured out how to measure the compositional heterogeneity of each taxon using Phylobayes. I would really appreciate it if anyone could provide the commands/details of how to do it.
June 3, 2014
Andrew Rambaut wrote:
Firstly - sorry, this is not an announcement but a suggestion/call for a collaborative project. I have been using iPython Notebook for manipulating data and plotting and think it would be a great environment for phylogenetics. For it to work, it would need a coherent library with standardised objects for storing trees, etc., and some visualisation tools for trees, alignments etc. And some embedded tree building/alignment software.
Any thoughts on this? My primary motivation is to replace the various software packages I use for teaching and produce a coherent framework.
Jaime Huerta-Cepas wrote:
Hi, I have been struggling for days with a phylogenetic tree of around 90 short sequences (domain based) whose support values for many branches are really low (
May 29, 2014
I'm really new to all of this and learning as I go. I'm hoping a group like this will make all the difference, so I'm glad I found you! I am using Lagrange and am just wondering how to view the tree from the output file. I tried opening it in FigTree, but it only shows me the ML tree without any geographic range data.
May 28, 2014
Taxon Bytes wrote:
Possibly only peripheral to phylobabble, but posting here because it is a rather open, malleable position.
May 26, 2014
Laura Eme wrote:
I wonder if someone can explain how maxdiff is computed when more than 2 chains are compared?
I am puzzled by this: I have run 4 chains on a dataset, and if I use 'bpcomp' on chains 1, 2 and 3, I get a maxdiff = 0.51, but if I compute it on all four chains, I get a maxdiff = 0.35. I don't understand how that's possible.
Many thanks for any answer to my naive question!
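One way to sharpen the puzzle is to write down a candidate definition and see what it implies. The sketch below assumes, without having checked the PhyloBayes source, that maxdiff is the largest spread (maximum minus minimum) of any bipartition frequency across the chains being compared. The bipartition label and the frequencies are invented.

```python
def maxdiff(chain_freqs):
    """Largest max-minus-min bipartition frequency across chains.

    chain_freqs is a list with one dict per chain, mapping a bipartition
    label to its posterior frequency in that chain. This definition is an
    assumption for illustration, not bpcomp's documented behaviour.
    """
    bipartitions = set().union(*chain_freqs)
    worst = 0.0
    for bipartition in bipartitions:
        values = [freqs.get(bipartition, 0.0) for freqs in chain_freqs]
        worst = max(worst, max(values) - min(values))
    return worst


# Invented frequencies of a single bipartition in four chains.
chains = [{"AB|CD": 0.90}, {"AB|CD": 0.60}, {"AB|CD": 0.39}, {"AB|CD": 0.70}]
three = maxdiff(chains[:3])
four = maxdiff(chains)
print(three, four)
```

Under this definition, adding a chain can only keep the value the same or raise it. A drop from 0.51 to 0.35 would therefore point either to a different definition or to the two runs not summarizing the same set of sampled trees.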
May 19, 2014
Benjamin Redelings wrote:
I was wondering if people have any helpful hints on reading in codon alignments while handling ambiguity codes. I need to rewrite how I do this, because my current approach takes a ton of time and memory when the number of ambiguous patterns (e.g. NGC, or RGC) gets large. As a result, I currently only allow Y, R, W, S, and N in codons, and disallow K, M, B, D, H, V. Any thoughts?
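One option, sketched below, is to expand each ambiguous codon lazily into the set of unambiguous codons it is compatible with, and to cache the result so that each distinct pattern is expanded only once. This covers all of the standard IUPAC nucleotide codes, including K, M, B, D, H and V. It is not BAli-Phy code, and gap characters are deliberately left unhandled.

```python
from functools import lru_cache
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


@lru_cache(maxsize=None)
def expand_codon(codon):
    """Return the frozenset of unambiguous codons compatible with `codon`.

    Raises KeyError on characters outside the IUPAC table (e.g. gaps).
    """
    letters = codon.upper().replace("U", "T")
    if len(letters) != 3:
        raise ValueError("a codon must have exactly three characters")
    options = (IUPAC[letter] for letter in letters)
    return frozenset("".join(combo) for combo in product(*options))


print(sorted(expand_codon("RGC")))
print(len(expand_codon("NGC")), len(expand_codon("NNN")))
```

Because the cache is keyed on the pattern, memory grows with the number of distinct ambiguous codons actually seen in the alignment rather than with the full space of possible patterns.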
May 14, 2014
Guangchuang Yu wrote:
I am new to phylogeny, and I have started to learn by solving problems in ROSALIND. I got stuck on this problem: http://rosalind.info/problems/nwck/
My solution is to define a NEWICK class to store the tree; after setting the current node, the getParentList() method returns all of its ancestor nodes.
So the problem turns out to be finding the most recent common ancestor, after which the distance between two nodes can be calculated.
Source code and sample data can be found at https://github.com/GuangchuangYu/ROSALIND
java/NWCK.java java/tree/Newick.java java/FILE/ReadFile.java DATA/rosalind_nwck.txt
My code returns the correct answer plus 2.
I have spent a lot of time trying to track down this bug, and I have no idea why it happens.
Does anyone have any ideas?
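The approach described in the post (collect each node's ancestors, find the most recent common ancestor, add the two path lengths) can be sketched in a few lines. This is an independent Python illustration, not a port of the linked Java code. It assumes unweighted Newick strings without branch lengths or quoted labels, and it invents ids such as `#0` for unnamed internal nodes.

```python
import re


def parse_newick(text):
    """Parse an unweighted Newick string into a {node: parent} dict."""
    tokens = re.findall(r"[(),;]|[^(),;\s]+", text)
    punctuation = {"(", ")", ",", ";"}
    parent, state = {}, {"pos": 0, "unnamed": 0}

    def read_node():
        children = []
        if tokens[state["pos"]] == "(":
            state["pos"] += 1                      # consume "("
            children.append(read_node())
            while tokens[state["pos"]] == ",":
                state["pos"] += 1                  # consume ","
                children.append(read_node())
            state["pos"] += 1                      # consume ")"
        if state["pos"] < len(tokens) and tokens[state["pos"]] not in punctuation:
            name = tokens[state["pos"]]
            state["pos"] += 1
        else:                                      # unnamed node: invent an id
            name = "#%d" % state["unnamed"]
            state["unnamed"] += 1
        for child in children:
            parent[child] = name
        return name

    parent[read_node()] = None
    return parent


def ancestors(parent, node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def distance(parent, a, b):
    """Edges from a up to the common ancestor plus edges from b up to it."""
    steps_from_b = {n: i for i, n in enumerate(ancestors(parent, b))}
    for steps_from_a, node in enumerate(ancestors(parent, a)):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    raise ValueError("nodes are not in the same tree")


print(distance(parse_newick("(cat)dog;"), "dog", "cat"))
print(distance(parse_newick("(dog,cat);"), "dog", "cat"))
```

Without having checked the linked Java code, one possible source of a constant offset of 2 is counting nodes instead of edges: if each path to the common ancestor is measured as the number of nodes on it (both endpoints included), each path is one too long and the sum is two too large.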
May 13, 2014
The Genealogical World of Phylogenetic Networks
BMC Evolutionary Biology